Matt Harrison - An Introduction to Pandas 2, Polars, and DuckDB | PyData Global 2023
Learn the key differences between Pandas 2, Polars, and DuckDB data frameworks. Compare features, performance benefits, and best use cases for data analysis projects.
- Pandas 2 introduces an optional PyArrow backend that provides better string handling and memory efficiency (a 50-70% reduction in memory for string data)
- Polars targets small to medium data, with intelligent query optimization and parallelization; it excels at group-by operations spread across multiple cores
- DuckDB is an in-process analytical database with SQL capabilities that can query Pandas and Polars dataframes directly, without copying the data
- Key differences between frameworks:
  - Pandas: eager evaluation, broad API coverage, visualization support
  - Polars: lazy evaluation, query optimization, faster group operations
  - DuckDB: SQL-first approach, advanced query engine, Spark API compatibility
- Framework selection guidance:
  - Small data (fits in RAM): Pandas or Polars
  - Medium data (larger than RAM but fits on one machine's disk): Polars or DuckDB
  - Big data (needs a distributed system): Spark or other distributed solutions
- Polars intentionally omits an index (unlike Pandas), because it treats row labels as unnecessary overhead
- Converting between frameworks is straightforward when using PyArrow as the common backend format
- Pandas remains the recommendation for beginners because of its extensive documentation, community support, and job market demand
- The performance benefits of Polars and DuckDB come from compiled implementations (Rust for Polars, C++ for DuckDB) and intelligent query optimization
- Copy-on-write in Pandas 2 (an opt-in mode) reduces memory usage during chained operations
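A minimal sketch of the copy-on-write behavior, assuming pandas 2.x, where the mode is switched on through an option. The dataframe is made up for illustration. With the mode enabled, a column taken from a dataframe shares memory with it until one of them is modified, and the modification never leaks back to the parent.

```python
import pandas as pd

# Opt in to copy-on-write (off by default in pandas 2.x).
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# No data is copied here; `col` initially shares memory with `df`.
col = df["a"]

# The write triggers the copy, so only `col` changes.
col.iloc[0] = 100

print(col.tolist())
print(df["a"].tolist())
```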