Naty Clementi, James Bourbeau, Julia Signell, Charles Blackmon-Luca - Advanced Dask Tutorial | SciPy

Dask tutorial covering advanced concepts and best practices for parallelizing computations, working with large datasets, and optimizing performance with chunking, caching, and customization.

Key takeaways
  • Dask can scale your computations by parallelizing them across threads, processes, and machines.
  • To use Dask effectively, partition your data in a way that maximizes utilization of available resources.
  • Dask arrays are similar to NumPy arrays, but with additional features like chunking and lazy evaluation.
  • Chunking is a way to divide large arrays into smaller, manageable pieces, improving performance and memory usage.
  • Dask supports various file formats, including CSV, Parquet, and HDF5.
  • Persisting keeps intermediate results in memory, reducing the need for re-computation.
  • Dask has a scheduler that manages task execution, and its graph optimizations can be customized.
  • Optimization can be achieved through techniques like caching, delayed computations, and memory management.
  • Data locality and memory usage are important considerations when working with large datasets.
  • Dask provides tools for visualization and debugging, such as the dashboard and task graph visualization.
  • Dask can be used with various infrastructure, including local machines, clusters, and cloud services.
  • Dask DataFrames support optimized data types, such as PyArrow-backed dtypes, for better memory usage.
  • Custom operations can be defined to perform specific tasks, and Dask provides an API for creating and extending operations.
  • Dask provides a way to switch between different array and DataFrame backends (for example, pandas or cuDF) with minimal code changes.
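The chunking idea from the takeaways above can be sketched with a small Dask array example; the array and chunk sizes here are purely illustrative:

```python
import numpy as np
import dask.array as da

# Split a 4000x4000 array into 1000x1000 chunks, so each piece
# fits comfortably in memory and can be processed in parallel.
x = da.from_array(np.ones((4000, 4000)), chunks=(1000, 1000))

# Operations are lazy: this only builds a task graph over the 16 chunks.
y = (x + 1).sum()

# compute() executes the graph with the active scheduler.
result = y.compute()  # 32_000_000.0 (16 million elements, each equal to 2.0)
```

Picking chunk sizes is the key tuning knob: chunks that are too small create scheduler overhead, while chunks that are too large defeat parallelism and can exhaust memory.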
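Delayed computations, mentioned among the optimization techniques above, can be illustrated with `dask.delayed`; the functions here are hypothetical stand-ins for real work:

```python
import dask

# Wrapping ordinary Python functions with dask.delayed makes calls
# build a task graph instead of executing immediately.
@dask.delayed
def inc(i):
    return i + 1

@dask.delayed
def add_all(*parts):
    return sum(parts)

# No work happens here -- this just constructs the graph.
lazy = add_all(*[inc(i) for i in range(4)])

# compute() runs the whole graph, potentially in parallel.
result = lazy.compute()  # 1 + 2 + 3 + 4 = 10
```

Because the graph is built before anything runs, independent tasks (the four `inc` calls) can execute concurrently.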
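Custom operations can be attached to a Dask array with `map_blocks`, which applies a user-defined function to each chunk independently; this is a minimal sketch with an illustrative function:

```python
import numpy as np
import dask.array as da

x = da.arange(8, chunks=4)  # two chunks of four elements each

# The custom function receives a plain NumPy block for each chunk.
def scale_and_shift(block):
    return block * 2 + 1

y = x.map_blocks(scale_and_shift)
result = y.compute()  # array([ 1,  3,  5,  7,  9, 11, 13, 15])
```

For Dask DataFrames, `map_partitions` plays the analogous role, applying a function to each pandas partition.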