Distributed Training Frameworks


Tools and libraries for training models across multiple GPUs and machines.

Key Frameworks

  • DeepSpeed — Microsoft's library built around the ZeRO optimizer, which partitions optimizer states, gradients, and parameters to make large-model training memory-efficient (config sketch after this list)
  • FSDP (Fully Sharded Data Parallel) — PyTorch-native sharding of model parameters, gradients, and optimizer states across data-parallel workers (usage sketch after this list)
  • Megatron-LM — NVIDIA's framework providing efficient tensor and pipeline parallelism for training LLMs
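
A minimal sketch of typical DeepSpeed usage, assuming a recent DeepSpeed release; the model, batch size, ZeRO stage, and hyperparameters are illustrative placeholders, not recommendations.

    # Hypothetical DeepSpeed setup sketch: wrap a plain PyTorch model with a
    # ZeRO stage-2 config. Launch with the deepspeed CLI (one process per GPU).
    import torch.nn as nn
    import deepspeed

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

    ds_config = {
        "train_batch_size": 32,                     # global batch size (placeholder)
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
        "zero_optimization": {"stage": 2},          # shard optimizer states and gradients
        "fp16": {"enabled": True},
    }

    # deepspeed.initialize returns an engine that handles backward() and step()
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )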
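
A minimal FSDP sketch, assuming PyTorch 2.x on a single node launched with torchrun (one process per GPU); TinyModel and the dummy loss are placeholders for illustration only.

    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    class TinyModel(nn.Module):                      # placeholder model
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

        def forward(self, x):
            return self.net(x)

    def main():
        dist.init_process_group("nccl")              # torchrun provides rank/world-size env vars
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # FSDP shards parameters, gradients, and optimizer state across ranks
        model = FSDP(TinyModel().cuda())
        optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

        x = torch.randn(8, 1024, device="cuda")      # dummy batch
        loss = model(x).pow(2).mean()                # dummy loss
        loss.backward()
        optim.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()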

deep-learning distributed-training frameworks