Shagun Sodhani - Training large scale models using PyTorch | PyData Global 2023

Training large-scale models using PyTorch: distributed training techniques for increased speed and memory efficiency.

Key takeaways
  • Distributed Data Parallel (DDP): replicate the full model on multiple GPUs, feed each replica a different slice of the batch, and synchronize gradients across replicas, increasing effective batch size and reducing training time (see the DDP sketch after this list).
  • Model Parallelism: split a model's layers across multiple GPUs so that a model too large for one device can still be trained; each GPU holds only part of the model, increasing memory efficiency (see the manual split sketch after this list).
  • Pipeline Parallelism: split the model into sequential stages on different GPUs and stream micro-batches through the stages so the GPUs work concurrently instead of waiting on one another.
  • Sharding: partition parameters (and optionally gradients and optimizer state) across devices so each device stores only a fraction of the model, increasing memory efficiency.
  • Mixed Precision Training: run most operations in lower-precision data types (float16 or bfloat16) to reduce memory consumption and speed up computation (see the mixed precision sketch after this list).
  • Tensor Parallelism: split individual layers (e.g., large weight matrices) across GPUs so each GPU computes only a slice of each layer's output, increasing memory efficiency.
  • Fully Sharded Data Parallel: shard the model's parameters, gradients, and optimizer state across GPUs while still training data-parallel on each rank, increasing both memory efficiency and speed.
  • FSDP: fully sharded data parallelism is exposed through a single PyTorch wrapper API, torch.distributed.fsdp.FullyShardedDataParallel (see the FSDP sketch after this list).
  • PyTorch supports distributed data parallel, model parallelism, pipeline parallelism, and fully sharded data parallelism out of the box.
  • Large Scale Models: models too large to fit on a single GPU; training them requires distributed techniques.
  • Memory Efficiency: reducing per-GPU memory consumption through techniques such as mixed precision training and sharding.
  • Batch Size: increasing the effective batch size by running copies of the model on multiple GPUs with distributed data parallelism.
  • Compression: reducing the size of the data communicated between GPUs (for example, gradients) by applying compression algorithms.
  • Distributed RPC: a more general remote procedure call framework for distributing work across processes when the data-parallel pattern does not fit.
  • Distributed Optimizers: run optimizer steps on the workers that own the corresponding parameters, typically used together with distributed RPC.
  • Communication Library: NCCL (pronounced "nickel") is NVIDIA's collective communication library, used as the backend for GPU-to-GPU communication.
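
The takeaways above map onto short PyTorch sketches. First, a minimal Distributed Data Parallel sketch: each process wraps its own replica of the model in DDP, and gradients are averaged across processes during backward. This assumes a multi-GPU CUDA machine launched with torchrun (which sets RANK, LOCAL_RANK, and WORLD_SIZE); the linear model, batch size, and optimizer are placeholders, not taken from the talk.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process it spawns.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each process holds a full replica of the model on its own GPU.
    model = nn.Linear(1024, 1024).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for _ in range(10):
        # In real training each rank would read a different shard of the dataset.
        inputs = torch.randn(32, 1024, device=local_rank)
        targets = torch.randn(32, 1024, device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launch with `torchrun --nproc_per_node=<num_gpus> ddp_example.py` (the script name is illustrative).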
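
Next, a manual model-parallel split: the two halves of the model live on different GPUs and only activations move between devices. This is a sketch assuming two CUDA devices (cuda:0 and cuda:1); the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Half of the model lives on cuda:0, the other half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Only activations are copied between GPUs; the weights stay put.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(32, 1024))
out.sum().backward()  # autograd routes gradients back through both devices
```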
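
A mixed precision training loop using PyTorch automatic mixed precision (AMP): autocast runs eligible ops in float16 while GradScaler rescales the loss so small gradients do not underflow. The model and hyperparameters here are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

for _ in range(10):
    inputs = torch.randn(32, 1024, device="cuda")
    targets = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    # Ops inside autocast run in float16 where it is numerically safe.
    with torch.cuda.amp.autocast():
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```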
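
Finally, a minimal FSDP sketch: wrapping the model in FullyShardedDataParallel shards parameters, gradients, and optimizer state across ranks, gathering full parameters only when a layer needs them. Same launch assumptions as the DDP sketch (torchrun, NCCL backend); the model and optimizer are placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
    ).cuda(local_rank)

    # Parameters, gradients, and optimizer state are sharded across ranks.
    fsdp_model = FSDP(model)
    # Build the optimizer after wrapping so it sees the sharded parameters.
    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

    for _ in range(10):
        inputs = torch.randn(8, 1024, device=local_rank)
        loss = fsdp_model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```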