The PyArrow revolution in Pandas — Reuven M. Lerner

Python

Explore how PyArrow revolutionizes Pandas with 5x faster operations, reduced memory usage, and better data handling. Learn about new file formats and future integration plans.

Key takeaways

PyArrow offers significant performance improvements over the traditional NumPy backend in Pandas, with 5x faster CSV reading and up to 80% lower memory usage
PyArrow introduces nullable integer types that handle missing values without forcing a conversion to floats, solving a common Pandas limitation
Compared with CSV files, the Feather and Parquet file formats provide faster reads and writes, and Parquet adds strong compression
- Feather is faster but uncompressed
- Parquet is slower but highly compressed
Column-based storage in PyArrow (vs row-based in traditional Pandas) enables faster data analysis operations, though row operations become slower
PyArrow provides better type inference and handling of dates, strings, and complex data types out of the box
PyArrow is generally faster, but not always: row operations and certain string operations can be 30% slower
PyArrow integration is becoming a core part of Pandas’ future, with plans to make it the default backend in Pandas 3.0
Workflow recommendation: load the data once with PyArrow, then save it to Parquet or Feather format for faster subsequent access
PyArrow enables better interoperability between different data analysis tools and languages
Current limitations include experimental status of some features and potential issues with custom Python classes or complex data structures
