Ian Ozsvald & Giles Weaver - Pandas 2, Dask or Polars? Tackling larger data on a single machine

Explore how Pandas 2, Polars & Dask handle large datasets on single machines. Compare performance, memory usage & use cases to choose the right tool for your data projects.

Key takeaways
  • Pandas 2 introduces Arrow backend and copy-on-write optimizations, offering significant memory improvements (39GB vs 11GB for 82M rows) and better performance for certain operations

  • Polars shows 3-10x faster performance than Pandas for many operations, especially with string operations and large datasets, due to its Rust backend and Arrow-native implementation

  • Dask remains the go-to solution for multi-machine scaling, while Polars excels at single-machine, multi-core processing

  • Copy-on-write in Pandas will be enabled by default in Pandas 3 (coming next year), currently opt-in with pd.options.mode.copy_on_write=True

  • Duck DB emerges as a powerful alternative for SQL-like queries on large datasets, particularly effective with CSV and Parquet files

  • String operations are significantly faster with Arrow backend compared to NumPy backend in Pandas

  • Mixing tools is recommended: use Pandas with Arrow for better memory efficiency, Polars for performance-critical operations, and Dask for distributed computing

  • Query optimization differs between tools: Dask Expressions provides better query planning compared to naive Dask implementation

  • Tool selection considerations:

    • Greenfield projects: Consider Polars
    • Multi-machine requirements: Use Dask
    • Legacy codebase: Gradually adopt Pandas with Arrow backend
  • Data interchange between libraries is seamless due to Arrow format, allowing easy sharing between Pandas, Polars, R, and Julia