James Bourbeau - Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars | SciPy 2024

Explore Dask DataFrame 2.0's performance gains, competitor benchmarks, and technical improvements. Learn when to choose Dask vs Spark, DuckDB, or Polars for data processing.

Key takeaways
  • Dask DataFrame 2.0 brings major performance improvements, with up to a 20x speedup on some operations and no code changes required

  • Key technical improvements:

    • PyArrow-backed string data types by default - better GIL handling and lower memory use (first sketch after this list)
    • New peer-to-peer (P2P) shuffling algorithm that reduces memory footprint and task-graph complexity (second sketch)
    • Query optimizer with column projection and predicate pushdown (third sketch)
    • Automatic partition size optimization
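
As a concrete illustration of the string-dtype change, here is a minimal sketch. The conversion is on by default in recent Dask releases; the config key below is how it is toggled explicitly:

```python
# Minimal sketch: PyArrow-backed strings in Dask DataFrame.
import dask
import dask.dataframe as dd
import pandas as pd

dask.config.set({"dataframe.convert-string": True})  # default in recent releases

pdf = pd.DataFrame({"name": ["alice", "bob", "carol"], "value": [1, 2, 3]})
ddf = dd.from_pandas(pdf, npartitions=2)

# The "name" column should report string[pyarrow] rather than object.
# Arrow strings release the GIL during compute and use far less memory
# than Python object arrays.
print(ddf.dtypes)
```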
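
The shuffle algorithm can likewise be selected explicitly. A minimal sketch, assuming a distributed scheduler (P2P shuffling runs on dask.distributed clusters); the Parquet path is hypothetical:

```python
# Minimal sketch: choosing the peer-to-peer ("p2p") shuffle method.
# P2P streams partitions directly between workers, avoiding the
# quadratic task graphs of the older task-based shuffle.
import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local distributed cluster; p2p needs the distributed scheduler

dask.config.set({"dataframe.shuffle.method": "p2p"})  # "tasks" selects the old algorithm

ddf = dd.read_parquet("s3://my-bucket/transactions/")  # hypothetical dataset
# Shuffle-heavy operations (set_index, merge, groupby-apply) now use p2p.
shuffled = ddf.set_index("customer_id")
```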
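
And a sketch of the query optimizer's column projection and predicate pushdown. The dataset path is hypothetical, and explain() is part of the query-planning backend that ships as the default in Dask DataFrame 2.0:

```python
# Minimal sketch: column projection and predicate pushdown.
import dask.dataframe as dd

ddf = dd.read_parquet("s3://my-bucket/taxi/")  # hypothetical dataset

# The optimizer rewrites this so that read_parquet itself only loads the
# passenger_count and tip_amount columns and applies the filter while
# reading, instead of materializing the full dataset first.
tips = ddf[ddf.passenger_count > 1]["tip_amount"]

tips.explain()               # print the optimized expression tree
print(tips.mean().compute())
```
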
  • Benchmarking against competitors using the TPC-H suite (an illustrative query follows this list):

    • DuckDB performs well for simple queries at smaller scales
    • Dask is competitive with, and in some cases outperforms, PySpark while using less hardware
    • Dask shows consistent performance across varying data sizes (100GB-10TB)
    • Polars struggles with larger datasets and complex queries
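
For flavor, here is the kind of query the TPC-H suite exercises: a Dask rendering of a Query 1-style aggregation over the standard lineitem schema. The file location is hypothetical and this is not the actual benchmark harness:

```python
# A TPC-H Query 1-style aggregation in Dask (illustrative only).
import dask.dataframe as dd

lineitem = dd.read_parquet("s3://my-bucket/tpch/lineitem/")  # hypothetical path

result = (
    lineitem[lineitem.l_shipdate <= "1998-09-02"]  # assumes a datetime column
    .groupby(["l_returnflag", "l_linestatus"])
    .agg({
        "l_quantity": "sum",
        "l_extendedprice": "sum",
        "l_discount": "mean",
        "l_orderkey": "count",
    })
    .compute()
)
print(result)
```
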
  • Hardware requirements and scaling (a cluster-sizing sketch follows this list):

    • PySpark needs significant hardware resources to perform well
    • Dask now runs efficiently with roughly half the memory required by previous versions
    • DuckDB works well at the hundreds-of-gigabytes scale and offers a convenient SQL interface
    • Dask excels at 1TB+ scales, especially for complex queries
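
A minimal sketch of making those resource choices explicit on a local cluster; the worker count and memory limit here are illustrative, not a recommendation:

```python
# Minimal sketch: pinning down worker count and per-worker memory.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=2,
    memory_limit="8GB",  # per-worker cap; Dask spills to disk as it fills
)
client = Client(cluster)
print(client.dashboard_link)  # live view of memory use and task activity
```
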
  • Current limitations and future improvements:

    • Query optimizer is still at an early stage, with room for further optimization
    • Similar optimizations are coming to Dask Array
    • Focus on reducing overhead in task scheduling
    • Working on better handling of out-of-core computations
  • Best tool selection depends on use case:

    • DuckDB for SQL interface and smaller datasets
    • Dask for large-scale distributed processing
    • Consider data size, query complexity, and available hardware