Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars [PyCon DE & PyData Berlin 2024]
Learn how Dask DataFrame 2.0 compares to Spark, DuckDB & Polars. Discover performance improvements, scaling considerations & practical usage across data sizes.
Dask DataFrame 2.0 is roughly 20x faster than earlier versions, thanks to improved shuffling algorithms, a PyArrow-backed string memory layout, and automatic query optimization.
Key improvements in Dask 2.0:
- More efficient string memory layout using PyArrow
- Faster shuffling with reduced memory overhead
- Smarter query optimization with predicate pushdown
- Auto-repartitioning for better performance
- Improved column projection and filter optimization
Benchmark comparisons:
- Dask outperformed Spark on ~50% of queries
- DuckDB beats Dask on some queries at smaller scales
- Polars performs well for medium-sized data on single machines
- No clear winner across all scales/scenarios
Scale considerations:
- Dask works best for datasets of roughly 100 GB and up
- DuckDB and Polars excel at single-machine workloads
- Spark carries significant overhead on smaller datasets but performs well at larger scales
- 10 TB+ workloads require careful resource management
Practical implications:
- Migration from pandas to Dask is relatively straightforward
- Distributed processing requires more resources but enables scaling
- Query optimization is now automatic, reducing manual optimization needs
- Network transfer remains a significant bottleneck (~200 MB/s over the wire vs. 5-10 GB/s for in-memory access)
Future improvements planned:
- Further optimization of merge operations
- Reduction of system overhead
- Better handling of complex queries
- Improved out-of-core performance