Patrick Hoefler & Matthew Rocklin - Arrow revolution in pandas and Dask | PyData Global 2023

Python

Learn how Apache Arrow integration in pandas and Dask enables massive performance gains through efficient data types, copy-on-write, and improved distributed computing.

Key takeaways

Apache Arrow integration brings major performance improvements to both pandas and Dask through more efficient data types and memory usage, especially for strings
Copy-on-write functionality will be enabled by default in pandas 3.0, eliminating confusing copy warnings and providing more consistent behavior around data frame modifications
New p2p shuffle algorithm in Dask enables constant memory usage during shuffling operations, greatly improving stability for distributed computing
Query optimization in Dask now automatically reorders and simplifies queries, reducing unnecessary operations and improving performance up to 200x in some cases
PyArrow-backed strings are becoming the default in pandas 3.0 (Q2 2024), reducing memory usage by 50%+ compared to object dtype strings
Dask’s scheduler improvements allow processing of large datasets (10+ TB) with minimal memory overhead through better distributed computation
Interoperability between PyData libraries has improved through Arrow adoption, enabling zero-copy transfers between pandas, Dask, and other tools
Method chaining behavior remains unchanged with the new improvements while providing better performance and reduced memory usage
Task-based operations in Dask now scale linearly instead of exponentially with data size due to optimizer improvements
Most improvements are already available in current versions through opt-in flags, allowing users to test new features before they become defaults

Patrick Hoefler & Matthew Rocklin - Arrow revolution in pandas and Dask | PyData Global 2023

More talks