Ian Ozsvald & Giles Weaver - Pandas 2, Dask or Polars? Tackling larger data on a single machine
Explore the best libraries for tackling large datasets on a single machine, including Pandas, Dask, Polars, and Arrow. Discover the trade-offs and considerations for each option and learn how to optimize performance.
- Consider using Dask for datasets that outgrow memory: it splits the data into partitions and processes them in parallel, on a single machine or across a cluster.
- Pandas is still the best choice for smaller datasets, but Dask is a good alternative for larger datasets.
- Polars is a new project that aims to improve performance and memory efficiency, and is worth considering for new projects.
- Arrow is a memory-efficient columnar in-memory format that can be used with Pandas, Dask, and Polars.
- Copy-on-write in Pandas can improve performance and reduce memory use by avoiding defensive copies, but it is not enabled by default in Pandas 2.
- Dask expressions can be used to simplify and optimize code, but require a different mindset.
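- DuckDB is a new project that is worth monitoring: it is an in-process SQL engine that can query Pandas DataFrames directly.
- Scikit-learn did not natively support Polars at the time of the talk, so existing code may need to be modified, for example by converting data to Pandas or NumPy first.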
- Arrow and Polars are designed for use with large datasets, but may not be suitable for smaller datasets.
- Pandas is still the best choice for most use cases, but Dask and Polars are worth considering for larger datasets or for specific use cases.
- Polars and Dask are designed for scalability, but may not be suitable for very small datasets.
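To make the copy-on-write point concrete, here is a minimal sketch of opting in. It assumes a Pandas version that has the `mode.copy_on_write` option (1.5 or later); the column name and values are made up for illustration.

```python
import pandas as pd

# Opt in explicitly: copy-on-write is not the default in Pandas 2.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Selecting a column does not copy the data up front.
column = df["a"]

# The copy is made only now, when the selection is written to,
# so the parent DataFrame is left untouched.
column.iloc[0] = 100

original_first = int(df.loc[0, "a"])
subset_first = int(column.iloc[0])
print(original_first, subset_first)
```

With the option enabled, a write to a derived object should never silently change its parent, which is also why Pandas can skip many defensive copies.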