Matt Harrison - An Introduction to Pandas 2, Polars, and DuckDB | PyData Global 2023

Learn the key differences between Pandas 2, Polars, and DuckDB data frameworks. Compare features, performance benefits, and best use cases for data analysis projects.

Key takeaways
  • Pandas 2 introduces an optional PyArrow backend, which provides better string handling and memory efficiency (a 50-70% reduction for string data)

  • Polars focuses on small/medium data with intelligent query optimization and parallelization, especially excelling at group by operations across multiple cores

  • DuckDB functions as an in-process analytical database with SQL capabilities and can query Pandas/Polars dataframes directly, without copying the data

  • Key differences between frameworks:

    • Pandas: Eager evaluation, broad API coverage, visualization support
    • Polars: Lazy evaluation (alongside an eager API), query optimization, faster group operations
    • DuckDB: SQL-first approach, advanced query engine, Spark API compatibility
  • Framework selection guidance:

    • Small data (fits in RAM): Pandas/Polars
    • Medium data (fits on disk): Polars/DuckDB
    • Big data (distributed): Spark/other distributed solutions
  • Polars intentionally omits index functionality (unlike Pandas) as it views row labels as unnecessary overhead

  • Converting between frameworks is straightforward when using PyArrow as the common backend format

  • Pandas remains recommended for beginners due to extensive documentation, community support, and job market demand

  • Performance benefits of Polars/DuckDB come from compiled implementations (Rust for Polars, C++ for DuckDB) and intelligent query optimization

  • Copy-on-write functionality in Pandas 2 (opt-in) reduces memory usage during chained operations by avoiding unnecessary intermediate copies
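
The copy-on-write takeaway can be illustrated with a small sketch, assuming a Pandas version where the `mode.copy_on_write` option is available. With the option enabled, a selected column shares memory with its parent until it is modified, and the modification never leaks back into the original frame.

```python
import pandas as pd

# Opt in to copy-on-write behavior.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# No data is copied here; the Series lazily shares memory with df.
subset = df["a"]

# Writing to the Series triggers the copy, so df stays unchanged.
subset.iloc[0] = 100

original_first = int(df["a"].iloc[0])
subset_first = int(subset.iloc[0])
print(original_first, subset_first)
```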