Jim Dowling – High-Speed Data from the Lakehouse to DataFrames with Apache Arrow | PyData Global 2023

Learn how Apache Arrow and DuckDB enable lightning-fast data transfer from the Lakehouse to DataFrames, speeding up Python data exploration with roughly 30x gains over traditional JDBC/ODBC interfaces.

Key takeaways
  • Apache Arrow provides efficient data transfer between different frameworks through zero-copy reads and a column-oriented in-memory format (see the Arrow sketch after this list)

  • DuckDB acts as a lightweight, in-process analytical database that can be pip-installed, supporting parallel reads and efficient query processing (see the DuckDB sketch after this list)

  • The Hopsworks feature query service combines DuckDB and Arrow Flight to give Python clients fast access to Lakehouse data (see the Arrow Flight sketch after this list)

  • Feature stores typically store data in Lakehouses (columnar format) using table formats like Delta Lake, Apache Hudi, or Apache Iceberg

  • Traditional JDBC/ODBC interfaces create bottlenecks due to data format conversion between columnar and row-oriented formats

  • Temporal joins (point-in-time correct joins) are critical for machine learning feature engineering but complex to express in standard SQL (see the ASOF join sketch after this list)

  • The solution showed a 30x performance improvement over JDBC/ODBC when reading data from Parquet files directly into Apache Arrow

  • DuckDB supports multi-threaded processing and can handle millions of rows per second for feature engineering workloads

  • The architecture enables interactive data exploration for Python developers without requiring complex SQL queries

  • Modern single-machine Python workloads can handle growing data volumes thanks to larger compute resources (32–64 CPU cores, hundreds of GB of memory)
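
The sketches below illustrate the points above in Python; all table names, file paths, endpoints, and schemas are illustrative assumptions, not material from the talk. First, Arrow's columnar model: an Arrow table round-trips through Parquet column by column, and the final conversion to pandas can avoid copies for plain numeric columns.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory Arrow table; values are laid out column by column.
table = pa.table({
    "customer_id": pa.array([1, 2, 3], type=pa.int64()),
    "avg_spend": pa.array([10.5, 22.0, 7.25], type=pa.float64()),
})

# Parquet is also columnar, so the round trip needs no row-by-row
# conversion (hypothetical file name).
pq.write_table(table, "features.parquet")
loaded = pq.read_table("features.parquet")

# Conversion to pandas can be zero-copy for null-free numeric columns.
df = loaded.to_pandas()
print(df.dtypes)
```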
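
Next, a minimal sketch of DuckDB as a pip-installed, in-process engine: it scans Parquet files in parallel and hands the result back as an Arrow table, staying columnar end to end. The file glob, schema, and thread count are assumptions.

```python
import duckdb

con = duckdb.connect()            # in-process: no server to deploy
con.execute("SET threads = 8")    # multi-threaded scans; count is illustrative

# Parallel scan over Parquet files, aggregated inside the engine;
# the result is fetched as an Arrow table, with no row-oriented detour.
features = con.execute(
    "SELECT customer_id, avg(amount) AS avg_spend "
    "FROM read_parquet('transactions/*.parquet') "
    "GROUP BY customer_id"
).arrow()

df = features.to_pandas()
```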
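
A Python client pulling data over Arrow Flight might look roughly like the following; the endpoint URI and ticket payload are hypothetical, and a real service such as Hopsworks defines its own ticket protocol.

```python
import pyarrow.flight as flight

# Hypothetical Flight endpoint for a feature query service.
client = flight.connect("grpc://feature-query-service:5005")

# A ticket identifies the query or dataset on the server side;
# its payload here is purely illustrative.
ticket = flight.Ticket(b"SELECT * FROM transactions_fg")

# do_get streams Arrow record batches over gRPC, so the data never
# passes through a row-oriented serialization step.
reader = client.do_get(ticket)
table = reader.read_all()     # pyarrow.Table
df = table.to_pandas()
```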
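
Finally, the point-in-time correct join. Plain SQL typically needs window functions or correlated subqueries here; DuckDB's ASOF JOIN (available in recent releases) expresses it directly. The data below is made up for illustration.

```python
import duckdb

con = duckdb.connect()

# Hypothetical data: training events with label timestamps, and
# feature values stamped with the time each value became known.
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM (VALUES
        (1, TIMESTAMP '2023-06-01'),
        (1, TIMESTAMP '2023-06-15')
    ) t(customer_id, label_time)
""")
con.execute("""
    CREATE TABLE features AS
    SELECT * FROM (VALUES
        (1, TIMESTAMP '2023-05-20', 100.0),
        (1, TIMESTAMP '2023-06-10', 250.0)
    ) t(customer_id, feature_time, balance)
""")

# For each event, take the latest feature value at or before
# label_time: point-in-time correct, with no future data leakage.
pit = con.execute("""
    SELECT e.customer_id, e.label_time, f.balance
    FROM events e
    ASOF LEFT JOIN features f
      ON e.customer_id = f.customer_id
     AND e.label_time >= f.feature_time
""").fetchdf()
print(pit)
```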