Talks - Patrick Hoefler: Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars
Explore the performance improvements in Dask DataFrame 2.0 and how it compares to alternatives like Spark, DuckDB, and Polars for large-scale data processing tasks.
- Dask's performance has improved significantly over the past year; it is now competitive with Spark and sometimes outperforms it, especially on larger datasets (1-10 TB scale)
- DuckDB shows surprisingly good performance on single-node workloads but struggles once data exceeds available RAM; best suited to workloads that fit in memory on a single large machine
- New query optimizer in Dask provides major performance gains through (see the sketch after this list):
- Column projection (reading only needed columns)
- Predicate pushdown (filtering early)
- Smart merge/join operations
- Better shuffle algorithms
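A minimal sketch of what the optimizer does automatically; the Parquet path and the `key`/`value` column names are hypothetical, for illustration only:

```python
import dask.dataframe as dd

# Hypothetical dataset path and column names, for illustration only.
df = dd.read_parquet("s3://bucket/dataset/")

# Predicate pushdown: the optimizer moves this filter into the Parquet
# read, so non-matching row groups can be skipped instead of loaded.
df = df[df["value"] > 100]

# Column projection: because only "key" and "value" are referenced,
# the optimizer rewrites the read to load just those two columns.
result = df.groupby("key")["value"].mean().compute()
```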
- Memory efficiency improvements (see the sketch after this list):
- New PyArrow-backed strings replace NumPy object strings
- Reduced memory usage by 60-70% on average
- Better handling of out-of-core processing
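A minimal sketch of opting into PyArrow-backed strings; recent Dask releases enable them by default, and the `dataframe.convert-string` flag is the opt-in on older versions. The dataset path is hypothetical:

```python
import dask
import dask.dataframe as dd

# Opt into PyArrow-backed strings (the default in recent Dask releases).
dask.config.set({"dataframe.convert-string": True})

# Hypothetical dataset; string columns now load as string[pyarrow]
# rather than NumPy object dtype, which is far more compact in memory.
df = dd.read_parquet("s3://bucket/dataset/")
print(df.dtypes)
```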
- Dask's partition size recommendations (illustrated after this list):
- Avoid tiny partitions (few rows) due to overhead
- Avoid giant partitions (multiple GB)
- Aim for hundreds of MB per partition
- Auto-repartitioning helps maintain optimal sizes
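A minimal sketch of applying these recommendations with `repartition`, again using a hypothetical dataset path:

```python
import dask.dataframe as dd

df = dd.read_parquet("s3://bucket/dataset/")  # hypothetical path

# Target roughly 256 MB per partition: large enough to amortize per-task
# overhead, small enough to keep workers from holding multi-GB chunks.
df = df.repartition(partition_size="256MB")
```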
- Key architectural improvements (see the sketch after this list):
- New peer-to-peer shuffle algorithm for better scaling
- Smarter merge operations that avoid unnecessary shuffles
- Better handling of distributed joins
- Improved resilience and reliability
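A minimal sketch of a distributed join using the peer-to-peer shuffle; the cluster setup and dataset paths are hypothetical, and recent Dask versions select p2p automatically when running on the distributed scheduler:

```python
import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local test cluster; in practice a real deployment

# Request the peer-to-peer shuffle explicitly (assumed config key;
# recent releases pick it by default on the distributed scheduler).
dask.config.set({"dataframe.shuffle.method": "p2p"})

left = dd.read_parquet("s3://bucket/left/")    # hypothetical paths
right = dd.read_parquet("s3://bucket/right/")

# The join shuffles both inputs; with p2p, partitions stream directly
# between workers instead of piling up on any single node.
joined = left.merge(right, on="key").compute()
```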
- Current tool positioning:
- Dask: Best for distributed workloads >1TB
- DuckDB: Excellent for single-node analysis
- Spark: Still faster on some queries but struggles with large-scale out-of-core processing
- Polars: Good performance but not yet ready for very large scale