Eddie - HPC in the cloud | PyData Global 2023
Learn how to scale ML workloads in the cloud using TorchX and Metaflow. Covers distributed training approaches, infrastructure considerations, and developer experience.
-
Cloud computing has become essential for AI/ML workloads, with major companies like OpenAI and Microsoft using massive Kubernetes clusters for model training
-
Three main approaches to distributed ML training:
- Data parallelism: Split the data across GPUs, each holding a full model copy and synchronizing gradients every step (see the DDP sketch after this list)
- Model parallelism: Split the model's layers across GPUs when the model is too large to fit on a single device
- Pipeline parallelism: Make model-parallel training more efficient by splitting each batch into micro-batches so all stages stay busy
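A minimal data-parallel sketch using PyTorch's DistributedDataParallel, the pattern the talk's tooling launches under the hood. The model, data, and hyperparameters below are toy placeholders, not anything from the talk.

```python
# Data-parallel training sketch with PyTorch DDP. Each process owns one GPU,
# holds a full model copy, and gradients are all-reduced during backward().
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda()            # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):                             # stand-in for a real data loader
        x = torch.randn(32, 128, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                                  # gradients sync across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```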
-
TorchX provides high-level APIs for distributed PyTorch training, supporting:
- Local Docker containers
- Kubernetes clusters with the Volcano scheduler
- Cloud platforms like AWS Batch
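A rough sketch of submitting such a job through TorchX's Python runner and its dist.ddp component. The script path, resource values, and scheduler names are illustrative, and parameter names vary between TorchX versions, so treat the exact signatures as assumptions.

```python
# Sketch: submit a distributed PyTorch job via TorchX (API details are
# approximate and version-dependent; train.py and resources are placeholders).
from torchx.runner import get_runner
from torchx.components import dist as dist_components

# Build an AppDef from the built-in dist.ddp component:
# 2 nodes x 4 processes per node, 1 GPU per process.
app = dist_components.ddp(
    script="train.py",   # hypothetical training script (e.g. the DDP sketch above)
    j="2x4",             # nnodes x nproc_per_node
    gpu=1,
)

runner = get_runner()
# Swap the scheduler string to target different backends, e.g. "local_docker",
# "kubernetes" (with Volcano installed), or "aws_batch".
app_handle = runner.run(app, scheduler="local_docker", cfg={})
print(runner.status(app_handle))
```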
-
The Metaflow framework helps:
- Orchestrate ML workflows
- Abstract away infrastructure complexity
- Version experiments and models
- Monitor training jobs and resource utilization
- Integrate with cloud platforms
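A minimal Metaflow flow showing how orchestration and resource requests are expressed as decorators; the step bodies and resource numbers are placeholders, not the speaker's actual workflow.

```python
# Sketch of a Metaflow training flow. Run locally with `python train_flow.py run`;
# add a decorator such as @batch or @kubernetes per step to run in the cloud.
from metaflow import FlowSpec, step, resources

class TrainFlow(FlowSpec):

    @step
    def start(self):
        # Values assigned to self become versioned artifacts of this run.
        self.learning_rate = 1e-3
        self.next(self.train)

    @resources(cpu=8, gpu=1, memory=32000)  # requests honored by cloud backends
    @step
    def train(self):
        # Placeholder for the actual training loop (e.g. the DDP sketch above).
        self.final_loss = 0.0
        self.next(self.end)

    @step
    def end(self):
        print("final loss:", self.final_loss)

if __name__ == "__main__":
    TrainFlow()
```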
-
Key scaling considerations:
- Model size (number of parameters)
- Dataset size (e.g. tokens for language models)
- Available compute capacity
- Resource scheduling and queuing
- Cost optimization
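A back-of-envelope sketch of how model size, dataset size, and memory interact, using common rules of thumb (roughly 16-20 bytes of GPU memory per parameter for mixed-precision Adam training state, and about 20 training tokens per parameter for compute-optimal language models). These constants are heuristics assumed for illustration, not figures from the talk.

```python
# Back-of-envelope sizing for a language-model training run.
# The constants are rough heuristics, not exact values; activations excluded.
def estimate_training_footprint(n_params: float, tokens_per_param: float = 20.0,
                                bytes_per_param: float = 18.0) -> dict:
    """Estimate GPU memory for model state and a compute-optimal token budget."""
    return {
        # weights + gradients + Adam optimizer states in mixed precision
        "model_state_gb": n_params * bytes_per_param / 1e9,
        # Chinchilla-style rule of thumb: ~20 tokens per parameter
        "token_budget": n_params * tokens_per_param,
    }

if __name__ == "__main__":
    for n in (1e9, 7e9, 70e9):   # 1B, 7B, 70B parameter models
        est = estimate_training_footprint(n)
        print(f"{n/1e9:.0f}B params -> ~{est['model_state_gb']:.0f} GB model state, "
              f"~{est['token_budget']/1e9:.0f}B tokens")
```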
-
For organizations building AI systems:
- Need robust compute infrastructure
- Must balance researcher productivity vs operational complexity
- Important to version workflows and experiments
- Should consider managed services vs custom infrastructure
- Focus on good developer/researcher experience
-
Development tools like Minikube allow testing distributed training setups locally before deploying to production clusters