James Bourbeau - Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars | SciPy 2024
Explore Dask DataFrame 2.0's performance gains, competitor benchmarks, and technical improvements. Learn when to choose Dask vs Spark, DuckDB, or Polars for data processing.
-
Dask DataFrame 2.0 brings major performance improvements with up to 20x speedup in certain operations without requiring code changes
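As a rough illustration, existing Dask DataFrame code like the sketch below picks up these gains just by upgrading; the dataset path and column names are hypothetical.

```python
import dask.dataframe as dd

# Hypothetical dataset path; any Parquet dataset works the same way.
df = dd.read_parquet("s3://my-bucket/transactions/*.parquet")

# Unmodified user code: the 2.0 query optimizer, PyArrow-backed strings,
# and the new shuffle all apply automatically when this graph executes.
result = df.groupby("customer_id")["amount"].sum().compute()
```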
-
Key technical improvements (configuration sketch after this list):
- PyArrow-backed string data type by default - releases the GIL and uses far less memory than NumPy object strings
- New peer-to-peer shuffling algorithm - keeps memory use roughly constant and shrinks the task graph from quadratic to linear in the number of partitions
- Query optimizer with column projection and predicate pushdown
- Automatic partition size optimization
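A minimal sketch of how these pieces surface through Dask's configuration system; the keys below exist in recent Dask releases (and are the defaults in DataFrame 2.0), but verify them against your installed version. The dataset path and column names are hypothetical.

```python
import dask
import dask.dataframe as dd

# All three behaviors are defaults in Dask DataFrame 2.0; set
# explicitly here only for clarity.
dask.config.set({
    "dataframe.convert-string": True,   # PyArrow-backed strings
    "dataframe.query-planning": True,   # query optimizer
    "dataframe.shuffle.method": "p2p",  # peer-to-peer shuffle
})

df = dd.read_parquet("s3://my-bucket/sales/*.parquet")

# Column projection: only "region" and "amount" are read from disk.
# Predicate pushdown: the filter is moved into the Parquet read.
out = (
    df[df.amount > 100][["region", "amount"]]
    .groupby("region")
    .amount.mean()
    .compute()
)
```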
-
Benchmarking against competitors using the TPC-H suite (a simplified Query 1 sketch follows the list):
- DuckDB performs well for simple queries at smaller scales
- Dask is competitive with or outperforms PySpark while using less hardware
- Dask shows consistent performance across varying data sizes (100GB-10TB)
- Polars struggles with larger datasets and complex queries
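For context, the TPC-H queries are mostly filter-groupby-aggregate pipelines; below is a simplified, Dask-flavored take on Query 1 over the lineitem table. The storage path is hypothetical and the column names follow the standard TPC-H schema.

```python
import dask.dataframe as dd

# Simplified TPC-H Query 1: pricing summary over lineitem.
lineitem = dd.read_parquet("s3://my-bucket/tpch/lineitem/")

result = (
    lineitem[lineitem.l_shipdate <= "1998-09-02"]
    .groupby(["l_returnflag", "l_linestatus"])
    .agg({
        "l_quantity": "sum",
        "l_extendedprice": "sum",
        "l_discount": "mean",
        "l_orderkey": "count",
    })
    .compute()
)
```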
-
Hardware requirements and scaling (cluster-sizing sketch after this list):
- PySpark needs significant hardware resources to perform well
- Dask now runs efficiently with roughly half the memory required by previous versions
- DuckDB works well for hundreds of GB scale with SQL interface
- Dask excels at 1TB+ scales, especially for complex queries
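As a sketch of what "less hardware" looks like in practice, the snippet below pins per-worker memory explicitly using Dask's distributed scheduler; the worker counts and limits are illustrative, not the benchmark configuration from the talk.

```python
from dask.distributed import Client, LocalCluster

# Illustrative sizing only: the halved memory footprint means workers
# can often be provisioned with less RAM than before.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="8GB")
client = Client(cluster)
print(client.dashboard_link)  # watch per-worker memory while queries run
```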
-
Current limitations and future improvements:
- Query optimizer is still at an early stage, with room for further optimization
- Similar optimizations are coming to Dask Array
- Focus on reducing overhead in task scheduling
- Working on better handling of out-of-core computations
-
Best tool selection depends on use case:
- DuckDB for a SQL interface and smaller datasets (up to hundreds of GB)
- Dask for large-scale distributed processing
- Consider data size, query complexity, and available hardware