Ian Ozsvald & Giles Weaver - Pandas 2, Dask or Polars? Tackling larger data on a single machine

Explore the best libraries for tackling large datasets on a single machine, including Pandas, Dask, Polars, and Arrow. Discover the trade-offs and considerations for each option and learn how to optimize performance.

Key takeaways
  • Consider Dask when a dataset outgrows memory: it splits the data into partitions and processes them in parallel across cores, and the same code can also run on a cluster.
  • Pandas is still the best choice for smaller datasets, but Dask is a good alternative for larger datasets.
  • Polars is a new project that aims to improve performance and memory efficiency, and is worth considering for new projects.
  • Arrow is a memory-efficient columnar in-memory format that can be used with Pandas, Dask, and Polars.
  • Copy-on-write optimization in pandas can improve performance, but is not enabled by default.
  • Dask expressions can be used to simplify and optimize code, but require a different mindset.
  • DuckDB is a new project that is worth monitoring: it is an in-process SQL engine that can query Pandas DataFrames directly.
  • At the time of the talk, scikit-learn did not support Polars, so existing code may need to be modified (for example, by converting to Pandas or NumPy first).
  • Arrow and Polars are designed for use with large datasets, but may not be suitable for smaller datasets.
  • Pandas is still the best choice for most use cases, but Dask and Polars are worth considering for larger datasets or for specific use cases.
  • Polars and Dask are designed for scalability, but may not be suitable for very small datasets.
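
The copy-on-write takeaway can be illustrated with a minimal sketch, assuming pandas 2.x, where the behaviour is opt-in through the `mode.copy_on_write` option. The frame and column names here are made up for illustration and do not come from the talk.

```python
import pandas as pd

# Opt in to copy-on-write (assumption: pandas 2.x, where it is off by default).
pd.options.mode.copy_on_write = True

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Selecting a column is cheap: no data is copied at this point.
s = df["a"]

# The copy happens lazily, on the first write, so the parent frame is untouched.
s.iloc[0] = 100

print(df["a"].tolist())  # [1, 2, 3]
print(s.tolist())        # [100, 2, 3]
```

With the option enabled, a write to a derived object never changes the frame it came from, which lets pandas skip defensive copies until a write actually occurs.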