Ian Ozsvald & Giles Weaver - Pandas 2, Dask or Polars? Tackling larger data | PyData Global 2023
Learn how Pandas 2, Dask & Polars handle large datasets differently. Compare performance, memory usage & features to choose the right tool for your data science needs.
- Pandas 2 with the Arrow backend shows significant memory savings (82M rows: 39 GB with NumPy vs 11 GB with Arrow), but speed varies by operation type
- Copy-on-write will be the default in Pandas 3 (planned for 2024), so it is worth testing with it enabled now; it improves memory usage and performance
- Polars is 3-10x faster than Pandas for many operations and handles strings better, but it has a different API and some functionality is missing
- Dask works well for distributed computing and larger-than-RAM datasets, and has recently improved query optimization through Dask Expressions
- For medium-sized data (fits on disk but not in RAM), both Dask and Polars are viable: Dask has the more mature distributed computing story, while Polars excels at single-machine performance
- Arrow enables cross-library data sharing between Pandas, Polars and other tools that support the Arrow format
- The Polars API steers users toward efficient computation by design, while Pandas offers more flexibility at the cost of potential performance pitfalls
- DuckDB shows promising performance for SQL queries on large CSV and Parquet files
- Missing-data handling differs between Polars and Pandas, so it is important to understand the differences when migrating
- Benchmark results can vary significantly with operation type, data size and backend (NumPy vs Arrow), so test with representative workloads
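
The benchmarking advice in the last point can be sketched with a small timing harness. This is a minimal illustration using only the Python standard library, not the speakers' benchmark code: `bench`, `make_rows`, `groupby_sum_dict` and `groupby_sum_sorted` are hypothetical names, and the two group-by implementations are stand-ins for the operations you would actually compare (for example the same aggregation on a NumPy-backed and an Arrow-backed DataFrame).

```python
import itertools
import random
import statistics
import timeit
from collections import defaultdict


def make_rows(n, n_keys=100, seed=0):
    """Build n (key, value) rows; a stand-in for loading a representative dataset."""
    rng = random.Random(seed)
    return [(rng.randrange(n_keys), rng.random()) for _ in range(n)]


def groupby_sum_dict(rows):
    """Group-by sum using a dictionary accumulator."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def groupby_sum_sorted(rows):
    """Group-by sum using sort followed by itertools.groupby."""
    ordered = sorted(rows, key=lambda row: row[0])
    return {
        key: sum(value for _, value in group)
        for key, group in itertools.groupby(ordered, key=lambda row: row[0])
    }


def bench(operations, sizes, make_data, repeat=3):
    """Return {(operation name, size): median seconds over `repeat` runs}."""
    results = {}
    for size in sizes:
        data = make_data(size)  # built once per size, outside the timed region
        for name, op in operations.items():
            timings = timeit.repeat(lambda: op(data), number=1, repeat=repeat)
            results[(name, size)] = statistics.median(timings)
    return results


# Hypothetical set of operations to compare; swap in your own workloads.
OPERATIONS = {"dict": groupby_sum_dict, "sorted": groupby_sum_sorted}

if __name__ == "__main__":
    for (name, size), seconds in bench(OPERATIONS, [1_000, 100_000], make_rows).items():
        print(f"{name:>8} n={size:>7}: {seconds:.6f}s")
```

Running each operation at several sizes, rather than one, is the point of the sketch: the faster implementation at one size or for one operation is not necessarily the faster one at another.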