Jay Chia - Building Daft: Python + Rust = a better distributed query engine | SciPy 2024

Learn how Daft combines Python's simplicity with Rust's performance to create a distributed query engine, with reported 2-6x speedups from moving computation out of Python and into Rust.

Key takeaways
  • Daft is a Python dataframe library with a Rust core, combining Python’s ease of use with Rust’s performance for distributed query processing

  • Key advantages of using Rust with Python:

    • Avoids Python GIL limitations through Rust multi-threading
    • Provides memory stability and efficient resource utilization
    • Enables high-performance native code execution while keeping a Python interface
  • Daft’s architecture:

    • Core execution happens in Rust, with a thin Python wrapper layer on top
    • Uses a lazy execution model so that query plans can be optimized before they run
    • Leverages Ray for distributed computing capabilities
    • Supports multimodal data (tables, images, unstructured data)
  • Performance improvements demonstrated:

    • 2-6x speedups by moving computation from Python to Rust
    • Efficient multi-threading through Rust while avoiding GIL
    • Memory-efficient data handling for large-scale processing
  • Integration approach:

    • Simple pip install for Python users
    • Incremental adoption possible in existing workflows
    • Python-friendly API despite Rust internals
    • Runs locally on a laptop or distributed in the cloud
  • Target use cases:

    • Analytics and data engineering
    • Machine learning data preprocessing
    • Large-scale distributed computation
    • Processing terabytes to petabytes of data
  • Positioned as an alternative to JVM-based engines (Spark) and local tools (pandas, Polars), with a focus on a Python-first experience backed by Rust’s performance
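
To make the lazy execution takeaway concrete, here is a minimal pure-Python sketch of the general idea: operations are recorded as a plan, the plan can be rewritten, and nothing runs until results are requested. This is not Daft's API or its optimizer. `ToyLazyFrame`, `fuse_filters`, and the sample rows are hypothetical names and data invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]
Step = Tuple[str, object]  # ("filter", predicate) or ("project", column names)


def fuse_filters(plan: List[Step]) -> List[Step]:
    """Toy optimization: merge adjacent filters into a single pass over the rows."""
    fused: List[Step] = []
    for op, arg in plan:
        if op == "filter" and fused and fused[-1][0] == "filter":
            prev = fused[-1][1]
            # Bind both predicates as defaults so each lambda keeps its own pair.
            fused[-1] = ("filter", lambda r, a=prev, b=arg: a(r) and b(r))
        else:
            fused.append((op, arg))
    return fused


class ToyLazyFrame:
    """Hypothetical lazy frame: methods only record steps, collect() executes them."""

    def __init__(self, rows: List[Row], plan: List[Step] = None):
        self.rows = rows
        self.plan = list(plan or [])

    def where(self, predicate: Callable[[Row], bool]) -> "ToyLazyFrame":
        return ToyLazyFrame(self.rows, self.plan + [("filter", predicate)])

    def select(self, *columns: str) -> "ToyLazyFrame":
        return ToyLazyFrame(self.rows, self.plan + [("project", columns)])

    def explain(self) -> List[str]:
        """Return the recorded step names without executing anything."""
        return [op for op, _ in self.plan]

    def collect(self) -> List[Row]:
        """Optimize the recorded plan, then run it over the rows."""
        out = list(self.rows)
        for op, arg in fuse_filters(self.plan):
            if op == "filter":
                out = [r for r in out if arg(r)]
            else:
                out = [{c: r[c] for c in arg} for r in out]
        return out


rows = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}, {"x": 4, "y": 40}]
df = (
    ToyLazyFrame(rows)
    .where(lambda r: r["x"] > 1)
    .where(lambda r: r["x"] < 4)
    .select("x")
)
print(df.explain())   # steps as written by the user
print(df.collect())   # work happens only here
```

Because the whole plan is known before execution, an engine is free to reorder or merge steps. The sketch only merges adjacent filters, but the same property is what allows a real engine to apply more substantial rewrites.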