Lu Qiu - Maximize GPU Utilization for Model Training | PyData Global 2023

Learn how to boost GPU utilization from 17% to 93% in ML training with distributed caching. Lu Qiu shares insights on optimizing data loading and storage access patterns.

Key takeaways
  • Direct loading from remote storage (like S3) can result in >80% of training time spent on data loading and <20% GPU utilization

  • Alluxio Data Platform provides a distributed caching layer between compute frameworks and storage systems, improving data access performance

  • Using Alluxio reduced data loader overhead from 82% to 1% and increased GPU utilization from 17% to 93% in benchmarks

  • The PyArrow project handles format translation while Filesystem Spec (fsspec) manages storage connections, providing a uniform Python I/O interface

  • Distributed caching can hold far more data than node-local cache solutions, which is beneficial for large datasets

  • Master/worker architecture can create a performance bottleneck at the master; consistent hashing distributes cache lookups across workers instead

  • System provides model deployment acceleration (up to 10x faster) by optimizing data access and removing unnecessary RPC calls

  • Solution works with multiple clouds and storage systems (S3, Azure, Google Cloud) while preventing vendor lock-in

  • Kubernetes integration allows co-location of training jobs and caching system for optimal performance

  • Platform provides visibility into data usage patterns, cache hit rates, and storage system metrics for optimization
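
The data-loading bottleneck in the first takeaway can be made concrete with a small timing sketch (pure Python, no GPU; `load_batch` and `train_step` are hypothetical stand-ins for the remote read and the GPU compute step):

```python
import time

def fraction_spent_loading(load_batch, train_step, num_batches):
    """Measure what share of wall-clock time a training loop
    spends waiting on data instead of computing."""
    load_time = compute_time = 0.0
    for _ in range(num_batches):
        t0 = time.perf_counter()
        batch = load_batch()          # e.g. fetch + decode from S3
        t1 = time.perf_counter()
        train_step(batch)             # e.g. forward/backward pass
        t2 = time.perf_counter()
        load_time += t1 - t0
        compute_time += t2 - t1
    return load_time / (load_time + compute_time)

# Toy stand-ins: a slow "remote read" vs. a fast training step.
slow_load = lambda: time.sleep(0.02) or [0] * 1024
fast_step = lambda batch: sum(batch)

frac = fraction_spent_loading(slow_load, fast_step, num_batches=5)
print(f"{frac:.0%} of time in data loading")
```

When the loader dominates like this, the GPU idles between steps; caching shrinks `load_time` without touching the model code.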
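
The PyArrow/fsspec bullet describes a uniform I/O layer: the URL scheme selects the storage driver while user code stays unchanged. A minimal sketch of fsspec's interface, using the in-memory backend so it runs anywhere (an S3 path would need the `s3fs` package and credentials):

```python
import fsspec

# fsspec exposes the same file-like API regardless of backend: the URL
# scheme ("s3://", "gs://", "memory://", plain local paths) picks the
# driver. Swapping storage systems means swapping the URL, not the code.
with fsspec.open("memory://datasets/sample.txt", "w") as f:
    f.write("hello from a uniform I/O interface")

with fsspec.open("memory://datasets/sample.txt", "r") as f:
    print(f.read())
```

This scheme-based dispatch is what lets the same training pipeline point at S3, Azure, or Google Cloud without vendor lock-in.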
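
The consistent-hashing takeaway can be sketched as a minimal hash ring in pure Python (an illustrative toy, not Alluxio's actual implementation): each worker owns many virtual nodes on the ring, so a key's owner is found by a local lookup rather than by asking a central master, and adding or removing a worker only remaps a small fraction of keys.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring for distributing cached blocks
    across workers without a master on the lookup path."""

    def __init__(self, workers, vnodes=100):
        self.ring = []  # sorted list of (hash, worker)
        for w in workers:
            for i in range(vnodes):
                h = self._hash(f"{w}#{i}")
                bisect.insort(self.ring, (h, w))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def worker_for(self, key):
        """Clockwise lookup: first virtual node at or after the key's hash."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ""))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["worker-1", "worker-2", "worker-3"])
print(ring.worker_for("s3://bucket/train/shard-00042.parquet"))
```

Because every client computes the same mapping independently, there is no single master to saturate, which is the bottleneck the talk's master/worker bullet refers to.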