Talks - Patrick Hoefler: Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars

Explore the performance improvements in Dask DataFrame 2.0 and how it compares to alternatives like Spark, DuckDB, and Polars for large-scale data processing tasks.

Key takeaways
  • Dask has improved significantly over the past year and is now competitive with Spark, sometimes outperforming it, especially on larger datasets (1-10 TB scale)

  • DuckDB shows surprisingly good performance on single-node workloads but struggles once data exceeds available RAM; it is best suited for workloads that fit in memory on a single large machine

  • The new query optimizer in Dask provides major performance gains (see the sketch after this list) through:

    • Column projection (reading only needed columns)
    • Predicate pushdown (filtering early)
    • Smart merge/join operations
    • Better shuffle algorithms
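
A minimal sketch of what the optimizer does automatically, assuming a Parquet dataset with `category` and `amount` columns (the path is a placeholder):

```python
import dask.dataframe as dd

# Placeholder dataset; assumes Parquet files with 'category' and 'amount' columns.
df = dd.read_parquet("data/transactions.parquet")

# The optimizer rewrites this query before execution:
# - column projection: only 'category' and 'amount' are read from disk
# - predicate pushdown: the amount > 100 filter is applied at the Parquet
#   reader, so row groups that cannot match are skipped
result = df[df["amount"] > 100].groupby("category")["amount"].mean()

print(result.compute())
```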
  • Memory efficiency improvements (example after this list):

    • New PyArrow-backed strings replace NumPy object strings
    • Reduced memory usage by 60-70% on average
    • Better handling of out-of-core processing
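
One way to see the string change, using Dask's documented `dataframe.convert-string` setting (on by default in recent releases; the dataset path is a placeholder):

```python
import dask
import dask.dataframe as dd

# PyArrow-backed strings are the default in recent releases; the switch is
# set explicitly here for clarity. Storing text as Arrow strings instead of
# NumPy object arrays is a large part of the memory reduction.
dask.config.set({"dataframe.convert-string": True})

df = dd.read_parquet("data/transactions.parquet")  # placeholder path
print(df.dtypes)  # text columns appear as string[pyarrow] rather than object
```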
  • Dask’s partition size recommendations (illustrated after this list):

    • Avoid tiny partitions (few rows) due to overhead
    • Avoid giant partitions (multiple GB)
    • Aim for hundreds of MB per partition
    • Auto-repartitioning helps maintain optimal sizes
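
A small sketch of steering partition sizes by hand with the documented `repartition(partition_size=...)` API (the CSV glob is a placeholder):

```python
import dask.dataframe as dd

df = dd.read_csv("data/*.csv")  # placeholder glob; may produce uneven partitions

# Rebalance toward the recommended hundreds-of-MB range: tiny partitions are
# merged and oversized ones are split. Dask samples partition sizes to do
# this, so the call triggers some computation.
df = df.repartition(partition_size="256MB")
print(df.npartitions)
```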
  • Key architectural improvements (sketch after this list):

    • New peer-to-peer shuffle algorithm for better scaling
    • Smarter merge operations that avoid unnecessary shuffles
    • Better handling of distributed joins
    • Improved resilience and reliability
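
A minimal sketch of selecting the shuffle implementation explicitly; the p2p shuffle requires the distributed scheduler and is the default on a cluster in recent releases (paths are placeholders):

```python
import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster; the p2p shuffle needs the distributed scheduler

# 'p2p' streams shuffle data through worker disk instead of holding every
# intermediate partition in memory; 'tasks' selects the older algorithm.
dask.config.set({"dataframe.shuffle.method": "p2p"})

left = dd.read_parquet("data/left.parquet")    # placeholder paths
right = dd.read_parquet("data/right.parquet")
joined = left.merge(right, on="id")            # shuffle-backed distributed join
print(joined.head())
```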
  • Current tool positioning:

    • Dask: Best for distributed workloads >1TB
    • DuckDB: Excellent for single-node analysis
    • Spark: Still faster on some queries but struggles with large-scale out-of-core processing
    • Polars: Good performance, but not yet ready for very large scale