Eddie - HPC in the cloud | PyData Global 2023

Learn how to scale ML workloads in the cloud using TorchX and Metaflow. Covers distributed training approaches, infrastructure considerations, and developer experience.

Key takeaways
  • Cloud computing has become essential for AI/ML workloads, with major companies like OpenAI and Microsoft using massive Kubernetes clusters for model training

  • Three main approaches to distributed ML training:

    • Data parallelism: Split the data across GPUs, with each GPU holding a full copy of the model (see the DDP sketch after this list)
    • Model parallelism: Split the model's layers across GPUs when the model is too large to fit on a single device
    • Pipeline parallelism: Keep model-parallel stages busy by splitting each batch into micro-batches
  • TorchX provides high-level APIs for launching distributed PyTorch training jobs (see the launch sketch after this list), supporting:

    • Local Docker containers
    • Kubernetes clusters with Volcano scheduler
    • Cloud platforms like AWS Batch
  • The Metaflow framework (see the flow sketch after this list) helps:

    • Orchestrate ML workflows
    • Abstract away infrastructure complexity
    • Version experiments and models
    • Monitor training jobs and resource utilization
    • Integrate with cloud platforms
  • Key scaling considerations (a back-of-envelope compute estimate follows this list):

    • Model size (number of parameters)
    • Dataset size (e.g. tokens for language models)
    • Available compute capacity
    • Resource scheduling and queuing
    • Cost optimization
  • For organizations building AI systems:

    • Need robust compute infrastructure
    • Must balance researcher productivity against operational complexity
    • Important to version workflows and experiments
    • Should consider managed services vs custom infrastructure
    • Focus on good developer/researcher experience
  • Development tools like Minikube allow testing distributed training setups locally before deploying to production clusters
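
The data-parallel approach above maps directly onto PyTorch's DistributedDataParallel. Below is a minimal, self-contained sketch of that pattern; the model, dataset, and hyperparameters are placeholders, and it assumes a launcher such as torchrun (or TorchX's dist.ddp component) sets the usual RANK/WORLD_SIZE/LOCAL_RANK environment variables.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # The launcher (torchrun / TorchX dist.ddp) sets RANK, WORLD_SIZE, LOCAL_RANK.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # Placeholder model; every rank holds a full copy (data parallelism).
    model = torch.nn.Linear(128, 10).to(device)
    model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)

    # Placeholder dataset; DistributedSampler gives each rank a disjoint shard.
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```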
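
To launch a script like that across the environments listed under TorchX, the built-in dist.ddp component builds a job description that any supported scheduler can run. The sketch below uses the TorchX Python runner; the script name and container image are hypothetical, and the exact keyword arguments of dist.ddp differ between releases, so treat the call signature as an assumption and verify it against your TorchX version (the CLI equivalent is roughly `torchx run -s local_docker dist.ddp --script train_ddp.py -j 2x2`).

```python
from torchx.components import dist as torchx_dist
from torchx.runner import get_runner

# Describe a 2-node x 2-process-per-node DDP job.
# NOTE: keyword names (script, j, image) are assumptions; check the dist.ddp docs.
app = torchx_dist.ddp(
    script="train_ddp.py",             # hypothetical training script (e.g. the DDP sketch above)
    j="2x2",                           # nnodes x nproc_per_node
    image="my-registry/train:latest",  # hypothetical container image
)

runner = get_runner()

# The same job description targets different environments by switching schedulers:
#   "local_docker" - local Docker containers
#   "kubernetes"   - a cluster using the Volcano gang scheduler (e.g. Minikube for local testing)
#   "aws_batch"    - AWS Batch
handle = runner.run(app, scheduler="local_docker")
print(runner.status(handle))
runner.wait(handle)
```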
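
Metaflow's role is to orchestrate and version this kind of training while hiding infrastructure details behind decorators. Below is a minimal flow sketch, assuming Metaflow is configured with a Kubernetes compute backend; the resource numbers, step bodies, and checkpoint path are placeholders.

```python
from metaflow import FlowSpec, step, kubernetes


class TrainFlow(FlowSpec):
    """Sketch of a training workflow; Metaflow versions every run and artifact."""

    @step
    def start(self):
        # Hypothetical hyperparameter; anything assigned to self is stored as a versioned artifact.
        self.learning_rate = 1e-3
        self.next(self.train)

    # Request cluster resources; Metaflow handles scheduling and queuing on the
    # configured backend (swap @kubernetes for @batch to target AWS Batch).
    @kubernetes(cpu=8, memory=32000, gpu=1)
    @step
    def train(self):
        # Placeholder: launch the actual training here, e.g. the TorchX job sketched above.
        self.checkpoint = "s3://my-bucket/checkpoints/latest"  # hypothetical path
        self.next(self.end)

    @step
    def end(self):
        print("finished run with lr =", self.learning_rate)


if __name__ == "__main__":
    TrainFlow()
```

Running `python train_flow.py run` records the run, its code, and its artifacts for later comparison; resource requests live in decorators rather than in the training code, which is how the framework keeps researcher-facing code separate from operational concerns.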
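
For the scaling considerations above, a rough compute estimate ties model size, dataset size, and available capacity together. The sketch below assumes the common ~6 × parameters × tokens FLOPs rule of thumb for dense transformer training plus a hypothetical per-GPU throughput; all numbers are illustrative, not from the talk.

```python
# Back-of-envelope training-compute estimate (all numbers are illustrative).
params = 7e9            # model size: 7B parameters
tokens = 1e12           # dataset size: 1T training tokens
flops_per_gpu = 150e12  # sustained per-GPU throughput, ~150 TFLOP/s (hypothetical)
n_gpus = 256            # available compute capacity

# Rule of thumb for dense transformers: total training compute ~ 6 * N * D FLOPs.
total_flops = 6 * params * tokens

seconds = total_flops / (flops_per_gpu * n_gpus)
print(f"~{total_flops:.1e} FLOPs, roughly {seconds / 86400:.1f} days on {n_gpus} GPUs")
```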