Lu Qiu - Maximize GPU Utilization for Model Training | PyData Global 2023

Learn how to boost GPU utilization from 17% to 93% in ML training with distributed caching. Lu Qiu shares insights on optimizing data loading and storage access patterns.

Key takeaways
  • Direct loading from remote storage (like S3) can result in >80% of training time spent on data loading and <20% GPU utilization

  • Alluxio Data Platform provides a distributed caching layer between compute frameworks and storage systems, improving data access performance

  • Using Alluxio reduced data loader overhead from 82% to 1% and increased GPU utilization from 17% to 93% in benchmarks

  • The PyArrow project handles format translation while Filesystem Spec (fsspec) manages storage connections, providing a uniform Python I/O interface

  • Distributed caching can hold far more data than node-local cache solutions, which is beneficial for large datasets

  • Master/worker architecture can create a performance bottleneck at the master; consistent hashing distributes cache lookups across workers instead

  • System provides model deployment acceleration (up to 10x faster) by optimizing data access and removing unnecessary RPC calls

  • Solution works with multiple clouds and storage systems (S3, Azure, Google Cloud) while preventing vendor lock-in

  • Kubernetes integration allows co-location of training jobs and caching system for optimal performance

  • Platform provides visibility into data usage patterns, cache hit rates, and storage system metrics for optimization
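
The data-loading bottleneck in the first takeaway can be made concrete with a small timing sketch (pure Python, no GPU; `load_batch` and `train_step` are hypothetical stand-ins for the remote read and the GPU compute step):

```python
import time

def fraction_spent_loading(load_batch, train_step, num_batches):
    """Measure what share of wall-clock time a training loop
    spends waiting on data instead of computing."""
    load_time = compute_time = 0.0
    for _ in range(num_batches):
        t0 = time.perf_counter()
        batch = load_batch()          # e.g. fetch + decode from S3
        t1 = time.perf_counter()
        train_step(batch)             # e.g. forward/backward pass
        t2 = time.perf_counter()
        load_time += t1 - t0
        compute_time += t2 - t1
    return load_time / (load_time + compute_time)

# Toy stand-ins: a slow "remote read" vs. a fast training step.
slow_load = lambda: time.sleep(0.02) or [0] * 1024
fast_step = lambda batch: sum(batch)

frac = fraction_spent_loading(slow_load, fast_step, num_batches=5)
print(f"{frac:.0%} of time in data loading")
```

When the loader dominates like this, the GPU idles between steps; caching shrinks `load_time` without touching the model code.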
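
The PyArrow/fsspec bullet describes a uniform I/O layer: the URL scheme selects the storage driver while user code stays unchanged. A minimal sketch of fsspec's interface, using the in-memory backend so it runs anywhere (an S3 path would need the `s3fs` package and credentials):

```python
import fsspec

# fsspec exposes the same file-like API regardless of backend: the URL
# scheme ("s3://", "gs://", "memory://", plain local paths) picks the
# driver. Swapping storage systems means swapping the URL, not the code.
with fsspec.open("memory://datasets/sample.txt", "w") as f:
    f.write("hello from a uniform I/O interface")

with fsspec.open("memory://datasets/sample.txt", "r") as f:
    print(f.read())
```

This scheme-based dispatch is what lets the same training pipeline point at S3, Azure, or Google Cloud without vendor lock-in.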
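
The consistent-hashing takeaway can be sketched as a minimal hash ring in pure Python (an illustrative toy, not Alluxio's actual implementation): each worker owns many virtual nodes on the ring, so a key's owner is found by a local lookup rather than by asking a central master, and adding or removing a worker only remaps a small fraction of keys.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring for distributing cached blocks
    across workers without a master on the lookup path."""

    def __init__(self, workers, vnodes=100):
        self.ring = []  # sorted list of (hash, worker)
        for w in workers:
            for i in range(vnodes):
                h = self._hash(f"{w}#{i}")
                bisect.insort(self.ring, (h, w))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def worker_for(self, key):
        """Clockwise lookup: first virtual node at or after the key's hash."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ""))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["worker-1", "worker-2", "worker-3"])
print(ring.worker_for("s3://bucket/train/shard-00042.parquet"))
```

Because every client computes the same mapping independently, there is no single master to saturate, which is the bottleneck the talk's master/worker bullet refers to.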