Patrick Hoefler & Matthew Rocklin - Arrow revolution in pandas and Dask | PyData Global 2023

Learn how Apache Arrow integration in pandas and Dask enables massive performance gains through efficient data types, copy-on-write, and improved distributed computing.

Key takeaways
  • Apache Arrow integration brings major performance improvements to both pandas and Dask through more efficient data types and memory usage, especially for strings

  • Copy-on-write functionality will be enabled by default in pandas 3.0, eliminating confusing copy warnings and providing more consistent behavior around data frame modifications

  • New p2p shuffle algorithm in Dask enables constant memory usage during shuffling operations, greatly improving stability for distributed computing

  • Query optimization in Dask now automatically reorders and simplifies queries, reducing unnecessary operations and improving performance up to 200x in some cases

  • PyArrow-backed strings are becoming the default in pandas 3.0 (Q2 2024), reducing memory usage by 50%+ compared to object dtype strings

  • Dask’s scheduler improvements allow processing of large datasets (10+ TB) with minimal memory overhead through better distributed computation

  • Interoperability between PyData libraries has improved through Arrow adoption, enabling zero-copy transfers between pandas, Dask, and other tools

  • Method chaining behavior remains unchanged with the new improvements while providing better performance and reduced memory usage

  • Task-based operations in Dask now scale linearly instead of exponentially with data size due to optimizer improvements

  • Most improvements are already available in current versions through opt-in flags, allowing users to test new features before they become defaults