We can't find the internet
Attempting to reconnect
Something went wrong!
Hang in there while we get back on track
Ian Ozsvald & Giles Weaver - Pandas 2, Dask or Polars? Tackling larger data on a single machine
Explore how Pandas 2, Polars & Dask handle large datasets on single machines. Compare performance, memory usage & use cases to choose the right tool for your data projects.
-
Pandas 2 introduces Arrow backend and copy-on-write optimizations, offering significant memory improvements (39GB vs 11GB for 82M rows) and better performance for certain operations
-
Polars shows 3-10x faster performance than Pandas for many operations, especially with string operations and large datasets, due to its Rust backend and Arrow-native implementation
-
Dask remains the go-to solution for multi-machine scaling, while Polars excels at single-machine, multi-core processing
-
Copy-on-write in Pandas will be enabled by default in Pandas 3 (coming next year), currently opt-in with
pd.options.mode.copy_on_write=True
-
Duck DB emerges as a powerful alternative for SQL-like queries on large datasets, particularly effective with CSV and Parquet files
-
String operations are significantly faster with Arrow backend compared to NumPy backend in Pandas
-
Mixing tools is recommended: use Pandas with Arrow for better memory efficiency, Polars for performance-critical operations, and Dask for distributed computing
-
Query optimization differs between tools: Dask Expressions provides better query planning compared to naive Dask implementation
-
Tool selection considerations:
- Greenfield projects: Consider Polars
- Multi-machine requirements: Use Dask
- Legacy codebase: Gradually adopt Pandas with Arrow backend
-
Data interchange between libraries is seamless due to Arrow format, allowing easy sharing between Pandas, Polars, R, and Julia