Lu Qiu - Maximize GPU Utilization for Model Training | PyData Global 2023
Learn how to boost GPU utilization from 17% to 93% in ML training with distributed caching. Lu Qiu shares insights on optimizing data loading and storage access patterns.
- Loading data directly from remote storage (such as S3) can leave more than 80% of training time spent on data loading and GPU utilization below 20% (a simple way to measure this split in a training loop is sketched after this list)
- The Alluxio data platform provides a distributed caching layer between compute frameworks and storage systems, improving data access performance (a single-node analogue of the idea using fsspec's local file cache is sketched after this list)
- In benchmarks, using Alluxio reduced data loader overhead from 82% to 1% and increased GPU utilization from 17% to 93%
- The PyArrow project handles file-format translation while fsspec (Filesystem Spec) manages storage connections, together providing a uniform Python I/O interface (see the fsspec + PyArrow example after this list)
- The distributed caching approach can hold more cached data than local cache solutions, which is beneficial for large datasets
- A master/worker architecture can create performance bottlenecks; this is solved by using consistent hashing to distribute the cache across workers (see the consistent-hashing sketch after this list)
- The system provides model deployment acceleration (up to 10x faster) by optimizing data access and removing unnecessary RPC calls
- The solution works with multiple clouds and storage systems (S3, Azure, Google Cloud) while preventing vendor lock-in
- Kubernetes integration allows co-location of training jobs and the caching system for optimal performance
- The platform provides visibility into data usage patterns, cache hit rates, and storage system metrics for optimization
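
The >80% data-loading figure quoted above comes from profiling where training time goes. Here is a minimal sketch of how that split can be measured in a PyTorch training loop; the model, loss function, and dataloader are placeholders rather than the speaker's actual benchmark code.

```python
import time

import torch


def timed_epoch(dataloader, model, optimizer, loss_fn, device="cuda"):
    """Roughly split wall-clock time between waiting on data and doing GPU work."""
    data_time, compute_time = 0.0, 0.0
    end = time.perf_counter()
    for inputs, labels in dataloader:
        t0 = time.perf_counter()
        data_time += t0 - end                    # time blocked on the data loader
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
        if device == "cuda":
            torch.cuda.synchronize()             # make GPU work visible to the host clock
        end = time.perf_counter()
        compute_time += end - t0
    total = data_time + compute_time
    print(f"data loading: {100 * data_time / total:.1f}%, "
          f"compute: {100 * compute_time / total:.1f}%")
```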
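Alluxio itself is a distributed, cluster-wide cache; as a rough single-node analogue of the "cache between compute and storage" idea, fsspec ships caching filesystems that keep local copies of remote files. This is illustrative only and not Alluxio's API; the bucket path is hypothetical and reading from S3 requires s3fs.

```python
import fsspec

# "filecache" wraps a remote filesystem and keeps whole-file copies on local disk,
# so repeated epochs read from local storage instead of going back to S3.
fs = fsspec.filesystem(
    "filecache",
    target_protocol="s3",
    target_options={"anon": False},
    cache_storage="/tmp/fsspec-cache",
)

with fs.open("my-bucket/train/shard-00000.parquet", "rb") as f:  # hypothetical path
    header = f.read(4)  # later reads of this file are served from /tmp/fsspec-cache
```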
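A minimal sketch of the PyArrow/fsspec division of labor mentioned above: fsspec resolves the storage backend from a protocol name, and PyArrow reads the file format on top of it. The bucket path is hypothetical, and swapping the protocol string (e.g. "gcs" or "abfs") is what gives the multi-cloud portability noted in the list.

```python
import fsspec
import pyarrow.dataset as ds

# fsspec resolves the storage connection from the protocol name; the same code
# works with "s3", "gcs", "abfs", or "file" (the S3 backend requires s3fs).
fs = fsspec.filesystem("s3", anon=False)

# PyArrow handles the file-format side (Parquet here) on top of that filesystem.
dataset = ds.dataset("my-bucket/training-data/", format="parquet", filesystem=fs)
print(dataset.schema)
table = dataset.head(1_000)   # pull a small slice for inspection
```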
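The consistent-hashing bullet refers to a general technique for spreading cache entries across workers without routing every lookup through a master. Below is an illustrative Python sketch of the idea, not Alluxio's implementation; the node names and example key are made up.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Pick the cache worker for a key without asking a central master."""

    def __init__(self, nodes, virtual_nodes=100):
        points = []
        for node in nodes:
            for i in range(virtual_nodes):       # virtual nodes smooth the key distribution
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise on the ring to the first point at or after the key's hash.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]


ring = ConsistentHashRing(["worker-0", "worker-1", "worker-2"])
print(ring.node_for("s3://my-bucket/train/shard-00042.parquet"))
```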