Matt Harrison - An Introduction to Pandas 2, Polars, and DuckDB | PyData Global 2023

Matt Harrison

Learn the key differences between Pandas 2, Polars, and DuckDB data frameworks. Compare features, performance benefits, and best use cases for data analysis projects.

Key takeaways
  • Pandas 2 introduces an optional PyArrow backend, which provides better string handling and memory efficiency (a 50-70% reduction for string data)

  • Polars focuses on small/medium data with intelligent query optimization and parallelization, especially excelling at group by operations across multiple cores

  • DuckDB functions as an in-process analytical database with SQL capabilities and can query Pandas/Polars dataframes directly, without copying the data

  • Key differences between frameworks:

    • Pandas: Eager evaluation, broad API coverage, visualization support
    • Polars: Lazy evaluation (alongside an eager API), query optimization, faster group operations
    • DuckDB: SQL-first approach, advanced query engine, Spark API compatibility
  • Framework selection guidance:

    • Small data (fits in RAM): Pandas/Polars
    • Medium data (fits on disk): Polars/DuckDB
    • Big data (distributed): Spark/other distributed solutions
  • Polars intentionally omits an index (unlike Pandas) because it treats row labels as unnecessary overhead

  • Converting between frameworks is straightforward when using PyArrow as the common backend format

  • Pandas remains recommended for beginners due to extensive documentation, community support, and job market demand

  • Performance benefits of Polars/DuckDB come from their compiled implementations (Rust for Polars, C++ for DuckDB) and intelligent query optimization

  • Copy-on-write in Pandas 2 (opt-in) reduces memory usage during chained operations
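
The copy-on-write takeaway can be illustrated with a minimal sketch. It is not taken from the talk: the column names and values are made up for illustration, and it assumes pandas 2.x, where copy-on-write is enabled through the `mode.copy_on_write` option. With the option on, a selected column shares memory with its parent until one of them is written to, so intermediate steps in a chain avoid eager copies and a write to the derived object never leaks back into the original.

```python
import pandas as pd

# Assumption: pandas 2.x, where copy-on-write is opt-in via this option.
pd.options.mode.copy_on_write = True

# Hypothetical example data, not from the talk.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp_c": [4.0, 19.5, 31.0]}
)

# Selecting a column does not copy the data up front; the new Series
# shares memory with df for as long as neither object is modified.
temps = df["temp_c"]

# The first write triggers the actual copy, so only temps changes.
temps.iloc[0] = -99.0

original_first = df.loc[0, "temp_c"]   # still 4.0
modified_first = temps.iloc[0]         # -99.0
print(original_first, modified_first)
```

Without copy-on-write, whether a write like this propagates to `df` depends on whether pandas happened to return a view or a copy, which is the ambiguity the option is meant to remove.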