Shagun Sodhani - Training large scale models using PyTorch | PyData Global 2023
Training large-scale models using PyTorch: distributed training techniques for increased speed and memory efficiency.
- Distributed Data Parallel (DDP): replicate the model on multiple GPUs, feed each replica a different slice of the data, and synchronize gradients, increasing the effective batch size and reducing training time (see the first sketch after this list).
- Model Parallelism: place different parts of a model on different devices so that a model too large for a single GPU can still be trained (see the second sketch after this list).
- Pipeline Parallelism: split the model into sequential stages on different devices and feed micro-batches through them so the stages compute concurrently, improving device utilization and memory efficiency.
- Sharding: split a model's state into smaller pieces stored across devices so that no single device has to hold everything, increasing memory efficiency.
- Mixed Precision Training: run parts of the computation in lower-precision data types (e.g. float16 or bfloat16) to reduce memory consumption and speed up training (see the third sketch after this list).
- Tensor Parallelism: split individual layers (weight tensors) across devices so each device computes only a slice of each layer, increasing memory efficiency.
- Fully Sharded Data Parallel (FSDP): shard the model's parameters, gradients, and optimizer state across workers and gather them only when needed, increasing memory efficiency and speed; exposed in PyTorch as a single API (see the last sketch after this list).
- PyTorch supports: distributed data parallel, model parallelism, pipeline parallelism, and fully sharded data parallelism.
- Large Scale Models: training models that are too big to fit on a single GPU, requiring distributed training.
- Memory Efficiency: reducing memory consumption through techniques such as mixed precision (lower-precision data types) and sharding.
- Batch Size: increasing the effective batch size by replicating the model across devices with distributed data parallelism.
- Compression: reducing communication volume by compressing gradients before they are exchanged between workers.
- Distributed RPC: an alternative, more general way of distributing workloads that do not fit the data-parallel pattern.
- Distributed Optimizers: optimizers designed for such distributed setups, updating parameters that may live on remote workers.
- Communication Library: NCCL (pronounced "nickel") is NVIDIA's collective communication library, which PyTorch uses as the backend for GPU-to-GPU communication.
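
A minimal DDP sketch of the idea above, assuming the script is launched with `torchrun` (which sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`); the model, data, and hyperparameters are illustrative placeholders, not the speaker's example.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # One process per GPU; NCCL handles GPU-to-GPU communication.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds a full copy of the model; DDP all-reduces gradients.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    data = torch.randn(32, 1024, device=f"cuda:{local_rank}")
    target = torch.randn(32, 1024, device=f"cuda:{local_rank}")

    for _ in range(10):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(data), target)
        loss.backward()   # gradients are synchronized across ranks here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```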
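
A minimal model-parallelism sketch, assuming two GPUs are available; the two-way split of a toy network is a placeholder for placing real layers on different devices.

```python
import torch
import torch.nn as nn


class TwoGPUModel(nn.Module):
    """Toy model whose first half lives on cuda:0 and second half on cuda:1."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations are moved between devices at the split point.
        return self.part2(x.to("cuda:1"))


model = TwoGPUModel()
out = model(torch.randn(32, 1024))
print(out.device)  # cuda:1
```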
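
A minimal mixed-precision sketch using `torch.cuda.amp`; the model and data are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid float16 underflow

data = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    # Ops that are safe in float16 run in float16 inside autocast.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```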
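
A minimal FSDP sketch with default settings, again assuming a `torchrun` launch; real models usually configure an auto-wrap policy, sharding strategy, and mixed precision, all omitted here for brevity.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> train_fsdp.py
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda(local_rank)

# Parameters, gradients, and optimizer state are sharded across ranks;
# full parameters are gathered only while a layer is being computed.
model = FSDP(model)

# Build the optimizer AFTER wrapping so it sees the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

data = torch.randn(8, 1024, device=f"cuda:{local_rank}")
target = torch.randn(8, 1024, device=f"cuda:{local_rank}")

loss = torch.nn.functional.mse_loss(model(data), target)
loss.backward()
optimizer.step()

dist.destroy_process_group()
```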