Jay Chia - Blazing fast I/O of data in the cloud with Daft Dataframes | PyData Global 2023

Learn how Daft Dataframes achieves blazing-fast I/O from cloud storage, reading 10,000 CSV files in about a minute and peaking at 9 Gbps of throughput thanks to Rust and an optimized architecture.

Key takeaways
  • Daft is a cloud-native dataframe library optimized for fast I/O from cloud storage, particularly AWS S3

  • Key performance achievements:

    • Reads 10,000 CSV files in ~1 minute at 2.5 Gbps
    • Peaks at 9 Gbps network throughput for Parquet files
    • Processes 13 GB of data in ~20 seconds at a 9 Gbps peak
    • Lists files in ~300 ms vs. minutes with naive approaches
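A quick back-of-envelope check of the quoted figures (assuming "Gbps" means decimal gigabits per second, so 1 GB = 8 Gb; the talk's numbers are approximate):

```python
# Sanity-check arithmetic for the throughput claims above.
def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert gigabits/s to gigabytes/s (1 byte = 8 bits)."""
    return gbps / 8.0

# 10,000 CSVs read in ~60 s at 2.5 Gbps implies roughly this much data:
csv_gb = gbps_to_gbytes_per_s(2.5) * 60          # ≈ 18.75 GB

# 13 GB at the 9 Gbps Parquet peak would take at least ~11.6 s,
# so the observed ~20 s implies the peak is not sustained end to end.
parquet_floor_s = 13 / gbps_to_gbytes_per_s(9)   # ≈ 11.6 s

print(round(csv_gb, 2), round(parquet_floor_s, 1))
```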
  • Technical advantages:

    • Built with Rust for high performance
    • Intelligent file pruning and metadata optimization
    • Native support for complex data types (images, tensors, URLs)
    • Column pruning and projection pushdown capabilities
    • Efficient parallel reads and retry policies for S3
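The column pruning / projection pushdown idea can be sketched in pure Python (a hypothetical illustration only; Daft's actual implementation does this in Rust before bytes are ever parsed into arrays):

```python
import csv
import io

def scan_csv(text: str, columns: list[str]) -> dict[str, list[str]]:
    """Materialize only the requested columns from a CSV scan."""
    reader = csv.DictReader(io.StringIO(text))
    out = {c: [] for c in columns}
    for row in reader:
        for c in columns:
            out[c].append(row[c])  # columns not in the projection are dropped
    return out

data = "a,b,c\n1,2,3\n4,5,6\n"
print(scan_csv(data, ["a", "c"]))  # → {'a': ['1', '4'], 'c': ['3', '6']}
```

For columnar formats like Parquet the payoff is larger still, since pruned columns are never even fetched from storage.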
  • Production-ready features:

    • Distributed processing capability through Ray clusters
    • Support for CSV, Parquet, JSON formats
    • Compatible with AWS S3, Azure Storage (basic support)
    • Cost-based query optimizer
    • Integration with technologies like Apache Iceberg, Delta Lake
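The cost-based optimizer's role can be illustrated with a toy model (hypothetical names and a crude bytes-scanned cost function; the real optimizer estimates costs from file statistics and pushdowns):

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    rows_scanned: int
    bytes_per_row: int

    def cost(self) -> int:
        # Crude cost model: total bytes moved from cloud storage.
        return self.rows_scanned * self.bytes_per_row

full_scan = Plan("full scan", rows_scanned=1_000_000, bytes_per_row=200)
pruned = Plan("pruned scan", rows_scanned=50_000, bytes_per_row=40)

# The optimizer picks the plan with the lowest estimated cost.
best = min([full_scan, pruned], key=Plan.cost)
print(best.name)  # → pruned scan
```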
  • Real-world validation:

    • Used at Amazon for processing petabytes of data daily
    • 38% improvement in workload cost efficiency reported
    • Outperforms PyArrow in certain I/O scenarios
  • Architecture benefits:

    • Sidesteps Python GIL limitations by running I/O and compute in Rust
    • Efficient handling of small files through coalescing
    • Optimized for cloud infrastructure patterns
    • Minimal data movement between nodes
    • Built-in retry policies for cloud storage interactions
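The small-file coalescing point can be sketched as follows (a hypothetical greedy batcher, assuming a target read-task size; Daft's internal heuristics are not specified in the talk summary):

```python
def coalesce(file_sizes: list[int], target: int) -> list[list[int]]:
    """Greedily group small files into read tasks of roughly `target` bytes,
    so each request to object storage is large enough to be efficient."""
    batches: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in file_sizes:
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Six tiny files become three ~80-byte read tasks instead of six requests.
print(coalesce([40, 40, 40, 40, 40, 40], 100))  # → [[40, 40], [40, 40], [40, 40]]
```

Fewer, larger requests matter on S3, where per-request latency dominates when objects are small.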