Jim Dowling - High speed data from the Lakehouse to DataFrames with Apache Arrow | PyData Global 2023
Learn how Apache Arrow and DuckDB enable lightning-fast data transfer from Lakehouses to DataFrames, giving Python data exploration roughly 30x faster reads than JDBC/ODBC.
- Apache Arrow provides efficient data transfer between frameworks through zero-copy reads and a column-oriented storage format (see the Arrow IPC sketch after this list)
- DuckDB acts as a lightweight in-process analytical database that can be pip-installed, supporting parallel reads and efficient query processing (see the DuckDB sketch after this list)
- The Hopsworks feature query service combines DuckDB and Arrow Flight to give Python clients fast access to Lakehouse data (see the Arrow Flight sketch after this list)
- Feature stores typically keep data in Lakehouses (columnar format) using table formats such as Delta Lake, Apache Hudi, or Apache Iceberg
- Traditional JDBC/ODBC interfaces create bottlenecks because columnar data must be converted to a row-oriented wire format and back
- Temporal joins (point-in-time correct joins) are critical for machine learning feature engineering but complex to express in standard SQL (see the ASOF JOIN sketch after this list)
- The solution showed a 30x performance improvement over JDBC/ODBC when reading data from Parquet files directly into Apache Arrow
- DuckDB supports multi-threaded processing and can handle millions of rows per second for feature engineering workloads
- The architecture enables interactive data exploration for Python developers without requiring complex SQL queries
- Modern single-machine Python workloads can handle growing data volumes thanks to larger compute resources (32-64 CPUs, hundreds of GB of memory)
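
A minimal sketch of what zero-copy, column-oriented access looks like in practice. The file name and column names are illustrative, not from the talk; the point is that Arrow's IPC format can be memory-mapped, so reading it back requires no deserialization:

```python
import pyarrow as pa

# A small column-oriented table: each column is a contiguous buffer.
table = pa.table({
    "customer_id": [1, 2, 3],
    "avg_spend": [12.5, 33.0, 7.25],
})

# Write the table in Arrow's IPC file format.
with pa.OSFile("features.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map the file and read it back. Because the on-disk layout matches
# the in-memory layout, this is a zero-copy read: no deserialization step,
# and other Arrow-aware frameworks can share the same buffers.
with pa.memory_map("features.arrow", "r") as source:
    loaded = pa.ipc.open_file(source).read_all()

print(loaded.column("avg_spend"))
```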
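
The DuckDB side can be sketched like this; the data, file name, and thread count are stand-in assumptions, not numbers from the talk. The result comes back as an Arrow table, skipping the row-by-row conversion a JDBC/ODBC driver would force:

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in data; in the talk's setting this would be Parquet files in the Lakehouse.
pq.write_table(
    pa.table({"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]}),
    "transactions.parquet",
)

con = duckdb.connect()           # in-process: no server, just `pip install duckdb`
con.execute("SET threads TO 8")  # DuckDB parallelizes scans and aggregations

# read_parquet scans the file directly; .arrow() hands the result to Python
# as a pyarrow.Table, keeping the data columnar end to end.
arrow_table = con.execute("""
    SELECT customer_id, avg(amount) AS avg_spend
    FROM read_parquet('transactions.parquet')
    GROUP BY customer_id
""").arrow()

df = arrow_table.to_pandas()  # columnar hand-off into a DataFrame
```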
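
A hedged sketch of the client side of an Arrow Flight service, the transport the Hopsworks feature query service is built on. The endpoint, port, and ticket payload below are hypothetical placeholders; this summary doesn't specify the actual Hopsworks protocol details:

```python
import pyarrow.flight as flight

# Hypothetical endpoint; a real deployment defines its own host, port, and auth.
client = flight.connect("grpc://feature-query-service:5005")

# Flight identifies a data stream by an opaque ticket; this JSON payload is a
# made-up placeholder for whatever the server actually expects.
ticket = flight.Ticket(b'{"query": "SELECT * FROM transactions_fg"}')

# do_get streams Arrow record batches over gRPC; read_all assembles them
# into a single pyarrow.Table without a row-oriented detour.
reader = client.do_get(ticket)
df = reader.read_all().to_pandas()
```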
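
One way to express a point-in-time correct join is DuckDB's ASOF JOIN; the tables and values below are illustrative, not from the talk. Each label row is matched to the newest feature value at or before its event time, so nothing from the future leaks into a training row:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE features AS SELECT * FROM (VALUES
        (1, TIMESTAMP '2023-01-01', 10.0),
        (1, TIMESTAMP '2023-02-01', 20.0)
    ) AS t(customer_id, feature_time, avg_spend)
""")
con.execute("""
    CREATE TABLE labels AS SELECT * FROM (VALUES
        (1, TIMESTAMP '2023-01-15', TRUE)
    ) AS t(customer_id, event_time, label)
""")

# ASOF LEFT JOIN picks, per label row, the feature row with the greatest
# feature_time that is still <= event_time.
training_df = con.execute("""
    SELECT l.customer_id, l.event_time, l.label, f.avg_spend
    FROM labels l
    ASOF LEFT JOIN features f
      ON l.customer_id = f.customer_id AND l.event_time >= f.feature_time
""").df()

print(training_df)  # avg_spend is 10.0 (the Jan 1 value), not the later 20.0
```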