Jim Dowling – High-Speed Data from the Lakehouse to DataFrames with Apache Arrow | PyData Global 2023

Learn how Apache Arrow and DuckDB enable lightning-fast data transfer from the Lakehouse to DataFrames, speeding up Python data exploration with roughly 30x gains over traditional JDBC/ODBC interfaces.

Key takeaways
  • Apache Arrow provides efficient data transfer between different frameworks through zero-copy reads and a column-oriented in-memory format (see the Arrow sketch after this list)

  • DuckDB acts as a lightweight, in-process analytical database that can be pip-installed, supporting parallel reads and efficient query processing (see the DuckDB sketch after this list)

  • The Hopsworks feature query service combines DuckDB and Arrow Flight to give Python clients fast access to Lakehouse data (see the Arrow Flight sketch after this list)

  • Feature stores typically store data in Lakehouses (columnar format) using table formats like Delta Lake, Apache Hudi, or Apache Iceberg

  • Traditional JDBC/ODBC interfaces create bottlenecks due to data format conversion between columnar and row-oriented formats

  • Temporal joins (point-in-time correct joins) are critical for machine learning feature engineering but complex to express in standard SQL (see the ASOF join sketch after this list)

  • The solution showed a 30x performance improvement over JDBC/ODBC when reading data from Parquet files directly into Apache Arrow

  • DuckDB supports multi-threaded processing and can handle millions of rows per second for feature engineering workloads

  • The architecture enables interactive data exploration for Python developers without requiring complex SQL queries

  • Modern single-machine Python workloads can handle growing data volumes thanks to larger compute resources (32–64 CPU cores, hundreds of GB of memory)
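
The sketches below illustrate the points above in Python; all table names, file paths, endpoints, and schemas are illustrative assumptions, not material from the talk. First, Arrow's columnar model: an Arrow table round-trips through Parquet column by column, and the final conversion to pandas can avoid copies for plain numeric columns.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory Arrow table; values are laid out column by column.
table = pa.table({
    "customer_id": pa.array([1, 2, 3], type=pa.int64()),
    "avg_spend": pa.array([10.5, 22.0, 7.25], type=pa.float64()),
})

# Parquet is also columnar, so the round trip needs no row-by-row
# conversion (hypothetical file name).
pq.write_table(table, "features.parquet")
loaded = pq.read_table("features.parquet")

# Conversion to pandas can be zero-copy for null-free numeric columns.
df = loaded.to_pandas()
print(df.dtypes)
```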
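
Next, a minimal sketch of DuckDB as a pip-installed, in-process engine: it scans Parquet files in parallel and hands the result back as an Arrow table, staying columnar end to end. The file glob, schema, and thread count are assumptions.

```python
import duckdb

con = duckdb.connect()            # in-process: no server to deploy
con.execute("SET threads = 8")    # multi-threaded scans; count is illustrative

# Parallel scan over Parquet files, aggregated inside the engine;
# the result is fetched as an Arrow table, with no row-oriented detour.
features = con.execute(
    "SELECT customer_id, avg(amount) AS avg_spend "
    "FROM read_parquet('transactions/*.parquet') "
    "GROUP BY customer_id"
).arrow()

df = features.to_pandas()
```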
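
A Python client pulling data over Arrow Flight might look roughly like the following; the endpoint URI and ticket payload are hypothetical, and a real service such as Hopsworks defines its own ticket protocol.

```python
import pyarrow.flight as flight

# Hypothetical Flight endpoint for a feature query service.
client = flight.connect("grpc://feature-query-service:5005")

# A ticket identifies the query or dataset on the server side;
# its payload here is purely illustrative.
ticket = flight.Ticket(b"SELECT * FROM transactions_fg")

# do_get streams Arrow record batches over gRPC, so the data never
# passes through a row-oriented serialization step.
reader = client.do_get(ticket)
table = reader.read_all()     # pyarrow.Table
df = table.to_pandas()
```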
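
Finally, the point-in-time correct join. Plain SQL typically needs window functions or correlated subqueries here; DuckDB's ASOF JOIN (available in recent releases) expresses it directly. The data below is made up for illustration.

```python
import duckdb

con = duckdb.connect()

# Hypothetical data: training events with label timestamps, and
# feature values stamped with the time each value became known.
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM (VALUES
        (1, TIMESTAMP '2023-06-01'),
        (1, TIMESTAMP '2023-06-15')
    ) t(customer_id, label_time)
""")
con.execute("""
    CREATE TABLE features AS
    SELECT * FROM (VALUES
        (1, TIMESTAMP '2023-05-20', 100.0),
        (1, TIMESTAMP '2023-06-10', 250.0)
    ) t(customer_id, feature_time, balance)
""")

# For each event, take the latest feature value at or before
# label_time: point-in-time correct, with no future data leakage.
pit = con.execute("""
    SELECT e.customer_id, e.label_time, f.balance
    FROM events e
    ASOF LEFT JOIN features f
      ON e.customer_id = f.customer_id
     AND e.label_time >= f.feature_time
""").fetchdf()
print(pit)
```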