Ian Ozsvald & Giles Weaver - Pandas 2, Dask or Polars? Tackling larger data on a single machine

Rust Python

Explore how Pandas 2, Polars & Dask handle large datasets on single machines. Compare performance, memory usage & use cases to choose the right tool for your data projects.

Key takeaways

Pandas 2 introduces Arrow backend and copy-on-write optimizations, offering significant memory improvements (39GB vs 11GB for 82M rows) and better performance for certain operations
Polars shows 3-10x faster performance than Pandas for many operations, especially with string operations and large datasets, due to its Rust backend and Arrow-native implementation
Dask remains the go-to solution for multi-machine scaling, while Polars excels at single-machine, multi-core processing
Copy-on-write in Pandas will be enabled by default in Pandas 3 (coming next year), currently opt-in with pd.options.mode.copy_on_write=True
Duck DB emerges as a powerful alternative for SQL-like queries on large datasets, particularly effective with CSV and Parquet files
String operations are significantly faster with Arrow backend compared to NumPy backend in Pandas
Mixing tools is recommended: use Pandas with Arrow for better memory efficiency, Polars for performance-critical operations, and Dask for distributed computing
Query optimization differs between tools: Dask Expressions provides better query planning compared to naive Dask implementation
Tool selection considerations:
- Greenfield projects: Consider Polars
- Multi-machine requirements: Use Dask
- Legacy codebase: Gradually adopt Pandas with Arrow backend
Data interchange between libraries is seamless due to Arrow format, allowing easy sharing between Pandas, Polars, R, and Julia

Ian Ozsvald & Giles Weaver - Pandas 2, Dask or Polars? Tackling larger data on a single machine

More talks