Ian Ozsvald & Giles Weaver - Pandas 2, Dask or Polars? Tackling larger data on a single machine

Python

Explore the best libraries for tackling large datasets on a single machine, including Pandas, Dask, Polars, and Arrow. Discover the trade-offs and considerations for each option and learn how to optimize performance.

Key takeaways

Pandas is still the best choice for smaller datasets; consider Dask once the data outgrows it, since Dask partitions the data and runs work in parallel across cores, with the option of scaling out to a cluster.
Polars is a new project that aims to improve performance and memory efficiency, and is worth considering for new projects.
Arrow is a memory-efficient columnar in-memory format that can be used with Pandas, Dask, and Polars.
Copy-on-write in Pandas 2 can improve performance by avoiding unnecessary copies, but it is not enabled by default.
Dask expressions can be used to simplify and optimize code, but require a different mindset.
DuckDB is a newer project that is worth monitoring: it is an in-process SQL engine that can query Pandas DataFrames directly.
Scikit-learn does not support Polars, and existing code may need to be modified.
Arrow, Polars, and Dask are designed for large datasets and scalability, and may not be the right fit for very small datasets.
Pandas is still the best choice for most use cases, but Dask and Polars are worth considering for larger datasets or for specific use cases.
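
As a rough illustration of the copy-on-write takeaway, the sketch below turns the option on for a single block of code. It assumes pandas 2.x, where the `mode.copy_on_write` option exists and is off by default; the function name and the toy data are made up for this example and are not from the talk.

```python
import pandas as pd


def subset_is_independent() -> bool:
    """Return True if writing to a column selection leaves the parent frame unchanged."""
    # Assumption: pandas 2.x, where copy-on-write is opt-in via this option.
    with pd.option_context("mode.copy_on_write", True):
        df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # toy data
        subset = df["a"]      # shares memory with df; nothing is copied yet
        subset.iloc[0] = 100  # the first write triggers the copy
        # Under copy-on-write the parent frame keeps its original value.
        return bool(df.loc[0, "a"] == 1)
```

To enable the behaviour for a whole session rather than one block, `pd.set_option("mode.copy_on_write", True)` can be called once at start-up.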
