Patrick Hoefler & Matthew Rocklin - Arrow revolution in pandas and Dask | PyData Global 2023
Learn how Apache Arrow integration in pandas and Dask enables massive performance gains through efficient data types, copy-on-write, and improved distributed computing.
- Apache Arrow integration brings major performance improvements to both pandas and Dask through more efficient data types and lower memory usage, especially for strings.
- Copy-on-write will be enabled by default in pandas 3.0, eliminating the confusing SettingWithCopyWarning and making DataFrame modifications behave more consistently.
- The new P2P shuffle algorithm in Dask keeps memory usage constant during shuffling operations, greatly improving stability for distributed computing.
- Query optimization in Dask now automatically reorders and simplifies queries, removing unnecessary operations and improving performance by up to 200x in some cases.
- PyArrow-backed strings are becoming the default in pandas 3.0 (planned for Q2 2024 at the time of the talk), reducing memory usage by 50% or more compared to object-dtype strings.
- Dask’s scheduler improvements allow large datasets (10+ TB) to be processed with minimal memory overhead.
- Interoperability between PyData libraries has improved through Arrow adoption, enabling zero-copy transfers between pandas, Dask, and other tools.
- Method chaining behaves as before under the new defaults, while benefiting from better performance and reduced memory usage.
- Task-based operations in Dask now scale linearly instead of exponentially with data size due to optimizer improvements.
- Most improvements are already available in current versions through opt-in flags, allowing users to test new features before they become defaults.
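
Two of the pandas changes listed above can be tried today. The following is a minimal sketch, not code from the talk: it assumes pandas 2.x (where the `mode.copy_on_write` option is opt-in) and treats pyarrow as an optional install. The sample data and variable names are illustrative only.

```python
import pandas as pd

# --- Copy-on-write (opt-in on pandas 2.x, assumed here) ---
# With copy-on-write, modifying an object derived from a DataFrame
# never silently changes the parent DataFrame.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"city": ["Berlin", "Paris", "Rome"], "visits": [1, 2, 3]})
    visits = df["visits"]   # shares memory with df until it is written to
    visits.iloc[0] = 100    # the write triggers a copy; df is left alone
    parent_unchanged = bool(df.loc[0, "visits"] == 1)

# --- Object strings vs. PyArrow-backed strings ---
# Illustrative data: 30,000 distinct short strings.
values = [f"item-{i}" for i in range(30_000)]

def string_memory(dtype):
    """Return the deep memory usage in bytes of the sample data as `dtype`."""
    return pd.Series(values, dtype=dtype).memory_usage(deep=True)

object_bytes = string_memory(object)
try:
    arrow_bytes = string_memory("string[pyarrow]")
except ImportError:
    arrow_bytes = None  # pyarrow is not installed; skip the comparison

print("parent unchanged after write:", parent_unchanged)
print("object dtype bytes:", object_bytes)
print("string[pyarrow] bytes:", arrow_bytes)
```

The exact savings depend on string lengths and the pandas version, so the printed numbers are best read as a rough indication rather than a benchmark.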