Jay Chia - Blazing fast I/O of data in the cloud with Daft Dataframes | PyData Global 2023

Learn how Daft Dataframes achieves blazing-fast I/O from cloud storage, reading 10,000 CSV files in about a minute and peaking at 9 Gbps of throughput thanks to Rust and an optimized architecture.

Key takeaways
  • Daft is a cloud-native dataframe library optimized for fast I/O from cloud storage, particularly AWS S3

  • Key performance achievements:

    • Reads 10,000 CSV files in ~1 minute at 2.5 Gbps
    • Peaks at 9 Gbps network throughput for Parquet files
    • Processes 13 GB of data in ~20 seconds at a 9 Gbps peak
    • Lists files in ~300 ms vs. minutes with naive approaches
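A quick back-of-envelope check of the quoted figures (assuming "Gbps" means decimal gigabits per second, so 1 GB = 8 Gb; the talk's numbers are approximate):

```python
# Sanity-check arithmetic for the throughput claims above.
def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert gigabits/s to gigabytes/s (1 byte = 8 bits)."""
    return gbps / 8.0

# 10,000 CSVs read in ~60 s at 2.5 Gbps implies roughly this much data:
csv_gb = gbps_to_gbytes_per_s(2.5) * 60          # ≈ 18.75 GB

# 13 GB at the 9 Gbps Parquet peak would take at least ~11.6 s,
# so the observed ~20 s implies the peak is not sustained end to end.
parquet_floor_s = 13 / gbps_to_gbytes_per_s(9)   # ≈ 11.6 s

print(round(csv_gb, 2), round(parquet_floor_s, 1))
```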
  • Technical advantages:

    • Built with Rust for high performance
    • Intelligent file pruning and metadata optimization
    • Native support for complex data types (images, tensors, URLs)
    • Column pruning and projection pushdown capabilities
    • Efficient parallel reads and retry policies for S3
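The column pruning / projection pushdown idea can be sketched in pure Python (a hypothetical illustration only; Daft's actual implementation does this in Rust before bytes are ever parsed into arrays):

```python
import csv
import io

def scan_csv(text: str, columns: list[str]) -> dict[str, list[str]]:
    """Materialize only the requested columns from a CSV scan."""
    reader = csv.DictReader(io.StringIO(text))
    out = {c: [] for c in columns}
    for row in reader:
        for c in columns:
            out[c].append(row[c])  # columns not in the projection are dropped
    return out

data = "a,b,c\n1,2,3\n4,5,6\n"
print(scan_csv(data, ["a", "c"]))  # → {'a': ['1', '4'], 'c': ['3', '6']}
```

For columnar formats like Parquet the payoff is larger still, since pruned columns are never even fetched from storage.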
  • Production-ready features:

    • Distributed processing capability through Ray clusters
    • Support for CSV, Parquet, JSON formats
    • Compatible with AWS S3, Azure Storage (basic support)
    • Cost-based query optimizer
    • Integration with technologies like Apache Iceberg, Delta Lake
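The cost-based optimizer's role can be illustrated with a toy model (hypothetical names and a crude bytes-scanned cost function; the real optimizer estimates costs from file statistics and pushdowns):

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    rows_scanned: int
    bytes_per_row: int

    def cost(self) -> int:
        # Crude cost model: total bytes moved from cloud storage.
        return self.rows_scanned * self.bytes_per_row

full_scan = Plan("full scan", rows_scanned=1_000_000, bytes_per_row=200)
pruned = Plan("pruned scan", rows_scanned=50_000, bytes_per_row=40)

# The optimizer picks the plan with the lowest estimated cost.
best = min([full_scan, pruned], key=Plan.cost)
print(best.name)  # → pruned scan
```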
  • Real-world validation:

    • Used at Amazon for processing petabytes of data daily
    • 38% improvement in workload cost efficiency reported
    • Outperforms PyArrow in certain I/O scenarios
  • Architecture benefits:

    • Sidesteps Python GIL limitations by running I/O and compute in Rust
    • Efficient handling of small files through coalescing
    • Optimized for cloud infrastructure patterns
    • Minimal data movement between nodes
    • Built-in retry policies for cloud storage interactions
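The small-file coalescing point can be sketched as follows (a hypothetical greedy batcher, assuming a target read-task size; Daft's internal heuristics are not specified in the talk summary):

```python
def coalesce(file_sizes: list[int], target: int) -> list[list[int]]:
    """Greedily group small files into read tasks of roughly `target` bytes,
    so each request to object storage is large enough to be efficient."""
    batches: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in file_sizes:
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Six tiny files become three ~80-byte read tasks instead of six requests.
print(coalesce([40, 40, 40, 40, 40, 40], 100))  # → [[40, 40], [40, 40], [40, 40]]
```

Fewer, larger requests matter on S3, where per-request latency dominates when objects are small.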