Eddie - HPC in the cloud | PyData Global 2023
Learn how to scale ML workloads in the cloud using TorchX and Metaflow. Covers distributed training approaches, infrastructure considerations, and developer experience.
-
Cloud computing has become essential for AI/ML workloads, with major companies like OpenAI and Microsoft using massive Kubernetes clusters for model training
-
Three main approaches to distributed ML training:
- Data parallelism: Split the data across GPUs, each holding a full model copy and synchronizing gradients every step (see the DDP sketch after this list)
- Model parallelism: Split the model's layers across GPUs when the model is too large to fit on a single device
- Pipeline parallelism: Make model-parallel training more efficient by splitting each batch into micro-batches so all stages stay busy
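A minimal data-parallel sketch using PyTorch's DistributedDataParallel, the pattern the talk's tooling launches under the hood. The model, data, and hyperparameters below are toy placeholders, not anything from the talk.

```python
# Data-parallel training sketch with PyTorch DDP. Each process owns one GPU,
# holds a full model copy, and gradients are all-reduced during backward().
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda()            # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):                             # stand-in for a real data loader
        x = torch.randn(32, 128, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                                  # gradients sync across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```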
-
TorchX provides high-level APIs for distributed PyTorch training, supporting:
- Local Docker containers
- Kubernetes clusters with the Volcano scheduler
- Cloud platforms like AWS Batch
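A rough sketch of submitting such a job through TorchX's Python runner and its dist.ddp component. The script path, resource values, and scheduler names are illustrative, and parameter names vary between TorchX versions, so treat the exact signatures as assumptions.

```python
# Sketch: submit a distributed PyTorch job via TorchX (API details are
# approximate and version-dependent; train.py and resources are placeholders).
from torchx.runner import get_runner
from torchx.components import dist as dist_components

# Build an AppDef from the built-in dist.ddp component:
# 2 nodes x 4 processes per node, 1 GPU per process.
app = dist_components.ddp(
    script="train.py",   # hypothetical training script (e.g. the DDP sketch above)
    j="2x4",             # nnodes x nproc_per_node
    gpu=1,
)

runner = get_runner()
# Swap the scheduler string to target different backends, e.g. "local_docker",
# "kubernetes" (with Volcano installed), or "aws_batch".
app_handle = runner.run(app, scheduler="local_docker", cfg={})
print(runner.status(app_handle))
```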
-
The Metaflow framework helps:
- Orchestrate ML workflows
- Abstract away infrastructure complexity
- Version experiments and models
- Monitor training jobs and resource utilization
- Integrate with cloud platforms
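A minimal Metaflow flow showing how orchestration and resource requests are expressed as decorators; the step bodies and resource numbers are placeholders, not the speaker's actual workflow.

```python
# Sketch of a Metaflow training flow. Run locally with `python train_flow.py run`;
# add a decorator such as @batch or @kubernetes per step to run in the cloud.
from metaflow import FlowSpec, step, resources

class TrainFlow(FlowSpec):

    @step
    def start(self):
        # Values assigned to self become versioned artifacts of this run.
        self.learning_rate = 1e-3
        self.next(self.train)

    @resources(cpu=8, gpu=1, memory=32000)  # requests honored by cloud backends
    @step
    def train(self):
        # Placeholder for the actual training loop (e.g. the DDP sketch above).
        self.final_loss = 0.0
        self.next(self.end)

    @step
    def end(self):
        print("final loss:", self.final_loss)

if __name__ == "__main__":
    TrainFlow()
```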
-
Key scaling considerations:
- Model size (number of parameters)
- Dataset size (e.g. tokens for language models)
- Available compute capacity
- Resource scheduling and queuing
- Cost optimization
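A back-of-envelope sketch of how model size, dataset size, and memory interact, using common rules of thumb (roughly 16-20 bytes of GPU memory per parameter for mixed-precision Adam training state, and about 20 training tokens per parameter for compute-optimal language models). These constants are heuristics assumed for illustration, not figures from the talk.

```python
# Back-of-envelope sizing for a language-model training run.
# The constants are rough heuristics, not exact values; activations excluded.
def estimate_training_footprint(n_params: float, tokens_per_param: float = 20.0,
                                bytes_per_param: float = 18.0) -> dict:
    """Estimate GPU memory for model state and a compute-optimal token budget."""
    return {
        # weights + gradients + Adam optimizer states in mixed precision
        "model_state_gb": n_params * bytes_per_param / 1e9,
        # Chinchilla-style rule of thumb: ~20 tokens per parameter
        "token_budget": n_params * tokens_per_param,
    }

if __name__ == "__main__":
    for n in (1e9, 7e9, 70e9):   # 1B, 7B, 70B parameter models
        est = estimate_training_footprint(n)
        print(f"{n/1e9:.0f}B params -> ~{est['model_state_gb']:.0f} GB model state, "
              f"~{est['token_budget']/1e9:.0f}B tokens")
```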
-
For organizations building AI systems:
- Need robust compute infrastructure
- Must balance researcher productivity vs operational complexity
- Important to version workflows and experiments
- Should consider managed services vs custom infrastructure
- Focus on good developer/researcher experience
-
Development tools like Minikube allow testing distributed training setups locally before deploying to production clusters