We can't find the internet
Attempting to reconnect
Something went wrong!
Hang in there while we get back on track
Naty Clementi, James Bourbeau, Julia Signell, Charles Blackmon-Luca - Advanced Dask Tutorial | SciPy
Dask tutorial covering advanced concepts and best practices for parallelizing computations, working with large datasets, and optimizing performance with chunking, caching, and customization.
- Dask can scale your computations by parallelizing them across multiple processes and machines.
- To use Dask effectively, partition your data in a way that maximizes utilization of available resources.
- Dask arrays are similar to NumPy arrays, but with additional features like chunking and caching.
- Chunking is a way to divide large arrays into smaller, manageable pieces, improving performance and memory usage.
- Dask supports various file formats, including CSV, Parquet, and HDF5.
- Persistence can be used to store intermediate results, reducing the need for re-computation.
- Dask has a scheduler that manages the flow of computations and allows for customization of optimization.
- Optimization can be achieved through techniques like caching, delayed computations, and memory management.
- Data locality and memory usage are important considerations when working with large datasets.
- Dask provides tools for visualization and debugging, such as the dashboard and task graph visualization.
- Dask can be used with various infrastructure, including local machines, clusters, and cloud services.
- Dask arrays support various optimized data types, such as PyArrow and Pandas DataFrames.
- Custom operations can be defined to perform specific tasks, and Dask provides a API for creating and extending operations.
- Dask provides a way to easily switch between different backend storage options, such as Pandas and Parquet.