Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars [PyCon DE & PyData Berlin 2024]

Learn how Dask DataFrame 2.0 compares to Spark, DuckDB & Polars. Discover performance improvements, scaling considerations & practical usage across data sizes.

Key takeaways
  • Dask DataFrame 2.0 is now ~20x faster than previous versions, thanks to improved shuffling algorithms, a more efficient string memory layout, and automatic query optimization

  • Key improvements in Dask 2.0 (see the sketch after this list):

    • More efficient string memory layout using PyArrow
    • Faster shuffling with reduced memory overhead
    • Smarter query optimization with predicate pushdown
    • Auto-repartitioning for better performance
    • Improved column projection and filter optimization
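
To make these improvements concrete, here is a minimal sketch of how the optimizer behaves on a lazy pipeline. The dataset path and column names are hypothetical, and the behavior described in the comments assumes query planning is enabled, which is the default in recent Dask releases.

```python
import dask.dataframe as dd

# Hypothetical Parquet dataset; any partitioned dataset works the same way.
df = dd.read_parquet("s3://example-bucket/transactions/")

# Everything below is lazy. With query planning enabled, the optimizer
# pushes the column projection and the filter down into the Parquet
# reader, so only the two needed columns and the matching row groups
# are read at all.
result = (
    df[df["amount"] > 100]
    [["customer_id", "amount"]]
    .groupby("customer_id")["amount"]
    .sum()
)

print(result.compute())  # triggers the optimized execution
```
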
  • Benchmark comparisons:

    • Dask outperformed Spark on ~50% of the benchmark queries
    • DuckDB beat Dask on some queries at smaller scales
    • Polars performs well for medium-sized data on single machines
    • No clear winner across all scales/scenarios
  • Scale considerations:

    • Dask works best for datasets of 100GB and up (see the cluster sketch after this list)
    • DuckDB/Polars excel at single-machine workloads
    • Spark struggles with smaller-scale data but works well at larger scales
    • 10TB+ workloads require careful resource management
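
As a rough illustration of the resource-management side, here is a minimal sketch of starting a Dask cluster. The worker count and memory limit are placeholder values; for 100GB+ workloads you would typically point the client at a real deployment (dask-kubernetes, dask-jobqueue, Coiled, etc.) rather than a local cluster.

```python
from dask.distributed import Client, LocalCluster

# Local cluster for development; the sizing here is a placeholder.
# For 100GB+ datasets, swap in a deployment backend such as
# dask-kubernetes, dask-jobqueue, or Coiled and size workers to the data.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="8GiB")
client = Client(cluster)

print(client.dashboard_link)  # live view of tasks, memory, and network use
```
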
  • Practical implications:

    • Migration from pandas to Dask is relatively straightforward (sketched after this list)
    • Distributed processing requires more resources but enables scaling
    • Query optimization is now automatic, reducing manual optimization needs
    • Network transfer remains a significant bottleneck (~200 MB/s over the network vs. 5-10 GB/s in memory)
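
To back up the migration point above, a minimal sketch of the pandas-to-Dask transition; the file and column names are hypothetical.

```python
import pandas as pd
import dask.dataframe as dd

# pandas version -- fine while the data fits in memory.
pdf = pd.read_parquet("events.parquet")  # hypothetical file name
mean_pd = pdf.groupby("user_id")["duration"].mean()

# Dask version -- nearly identical API; the frame is partitioned and lazy,
# and .compute() triggers the (automatically optimized) execution.
ddf = dd.read_parquet("events.parquet")
mean_dd = ddf.groupby("user_id")["duration"].mean().compute()
```

The main conceptual change is laziness: Dask builds a task graph and only executes it on .compute(), which is also where the automatic query optimization kicks in.
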
  • Future improvements planned:

    • Further optimization of merge operations
    • Reduction of system overhead
    • Better handling of complex queries
    • Improved out-of-core performance