Ian Ozsvald & Giles Weaver - Pandas 2, Dask or Polars? Tackling larger data | Pydata Global 2023


Learn how Pandas 2, Dask & Polars handle large datasets differently. Compare performance, memory usage & features to choose the right tool for your data science needs.

Key takeaways

Pandas 2 with the Arrow backend shows significant memory savings (82M rows: 39GB with the NumPy backend vs 11GB with Arrow), but performance varies by operation type
Copy-on-write becomes the default in Pandas 3 (planned for 2024), so it is worth testing now - it improves memory usage and performance
Polars shows impressive performance gains over Pandas (3-10x faster for many operations) and handles strings better, but has a different API and some missing functionality
Dask works well for distributed computing and larger-than-RAM datasets, with recent improvements in query optimization through Dask Expressions
For medium-sized data (fits on disk but not in RAM), both Dask and Polars are viable options - Dask has more mature distributed computing, while Polars excels at single-machine performance
Arrow enables cross-platform/cross-library data sharing between Pandas, Polars and other tools that support the Arrow format
Polars forces a more efficient computational approach through its API design, while Pandas offers more flexibility but potential performance pitfalls
DuckDB shows promising performance for SQL queries on large CSV/Parquet files
Missing data handling differs between Polars and Pandas - important to understand the differences when migrating
When benchmarking, results can vary significantly with operation type, data size and backend (NumPy vs Arrow) - test with representative workloads
