Eswaramoorthy & Pevey: A practical guide to analysis & interactive visualization of massive datasets
Get hands-on guidance on analyzing and visualizing massive datasets with Dask, a practical tool for parallel computing, data frames, and interactive visualization.
- Dask allows for parallel computing, enabling processing of large datasets even on a single machine.
  - Allows for compute-intensive operations on data that doesn’t fit in memory.
  - Runs tasks in parallel on multiple workers, improving performance.
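As a minimal sketch of the chunked, parallel execution described above, the following uses a Dask array. The array shape and chunk size are arbitrary values chosen for illustration, and the threaded scheduler is one of several local options.

```python
import dask.array as da

# A 2,000 x 2,000 array split into 500 x 500 chunks (sizes are illustrative).
x = da.ones((2000, 2000), chunks=(500, 500))

# Building the expression is lazy: no chunk has been computed yet.
total = (x + 1).sum()

# compute() executes the chunk-level tasks; "threads" runs them in a local thread pool.
result = total.compute(scheduler="threads")
print(result)
```

Because each chunk is processed independently, only a few chunks need to be in memory at once, which is what makes larger-than-memory data workable.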
- Dask data frames are made up of many pandas data frames (partitions) and inherit much of the pandas API.
  - Can often stand in for pandas with minimal code changes, though not every pandas operation is supported.
  - Can read and write CSV, Parquet, and other file formats.
- Persisting intermediate results in memory can significantly speed up computation, because later steps reuse them instead of recomputing them.
  - Can reduce communication overhead between workers.
  - Can reduce memory usage when only the data needed downstream is retained.
- Task graph visualization provides insight into the compute workflow.
  - Shows dependencies and ordering of tasks.
  - Can help identify bottlenecks and optimize the workflow.
  - Calling compute on a local machine runs tasks in parallel on multiple workers.
  - HP plot provides a detailed view of task progress.
  - Cluster map provides a high-level view of task execution.
  - Progress plot shows the completion status of tasks.
  - Workers’ memory usage can be monitored and adjusted during computation.
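To make the idea of a task graph concrete, here is a minimal sketch built with `dask.delayed`. The `inc` and `add` functions are hypothetical stand-ins for real work. Rendering the graph as an image requires the optional graphviz dependency, so that call is left commented out.

```python
import dask


@dask.delayed
def inc(x):
    # Stand-in for an expensive step.
    return x + 1


@dask.delayed
def add(a, b):
    # Depends on the results of two earlier tasks.
    return a + b


# Nothing runs here; Dask only records which task depends on which.
total = add(inc(1), inc(2))

# The recorded graph can be inspected as a mapping from task keys to tasks.
graph = dict(total.__dask_graph__())
for key in graph:
    print(key)

# With graphviz installed, this writes a diagram of the same graph:
# total.visualize(filename="graph.svg")

result = total.compute()
print(result)
```

The two `inc` tasks have no dependency on each other, so a scheduler is free to run them in parallel before `add`.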
- Dask data frames can be combined using the concat function.
  - Dask can handle large datasets by shuffling data and processing in parallel.
- GridView is a visualization tool for Dask.
- Kable is a table formatting tool for Dask.
- Bokeh is a plotting library for interactive visualizations.