Eswaramoorthy & Pevey - A practical guide to analysis & interactive visualization of massive datasets

Get hands-on guidance on analyzing and visualizing massive datasets with Dask, a practical tool for parallel computing, data frames, and interactive visualization.

Key takeaways
  • Dask allows for parallel computing, enabling processing of large datasets on a single machine.
    • Allows for compute-intensive operations on data that doesn’t fit in memory.
    • Dask runs tasks in parallel on multiple workers, improving performance.
  • Dask data frames are built on pandas data frames and inherit many pandas features.
    • Can often be used as a drop-in replacement for pandas, since most common pandas operations are supported.
    • Can read and write CSV, Parquet, and other file formats.
  • Persisting intermediate results in memory can significantly speed up computation, because later steps reuse them instead of recomputing them.
    • Reduces communication overhead between workers.
    • Can reduce memory usage by only retaining necessary data.
  • Task graph visualization provides insight into the compute workflow.
    • Shows dependencies and ordering of tasks.
    • Helps identify bottlenecks so the workflow can be optimized.
  • Distributed compute on a local machine runs tasks in parallel on multiple workers.
  • HP plot provides a detailed view of task progress.
  • Cluster map provides a high-level view of task execution.
  • Progress plot shows the completion status of tasks.
  • Workers’ memory usage can be monitored and adjusted during computation.
  • Dask data frames can be combined using the concat function.
  • Dask can handle large datasets by shuffling data and processing in parallel.
  • GridView is a visualization tool for Dask.
  • Kable is a table formatting tool for Dask.
  • Bokeh is a plotting library for interactive visualizations.
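
Running tasks on multiple local workers might look like the sketch below. It assumes the `dask.distributed` package is installed. The worker count and the squaring task are illustrative choices, not settings from the talk.

```python
from dask.distributed import Client, LocalCluster

# Assumption: two single-threaded workers. processes=False keeps the workers
# inside this Python process, which makes the sketch easy to run in a script.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

# Address of the diagnostic dashboard, where task progress and worker memory
# can be watched (rendering the dashboard requires Bokeh to be installed).
dashboard_url = client.dashboard_link

# Submit ten independent tasks; the scheduler spreads them across the workers.
futures = client.map(lambda x: x * x, range(10))
squares = client.gather(futures)

# Ask the scheduler how many workers it currently knows about.
worker_count = len(client.scheduler_info()["workers"])

client.close()
cluster.close()
```

Opening `dashboard_url` in a browser while work is running shows the cluster's diagnostic views.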
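
For the Bokeh takeaway, a minimal interactive line plot could be built as follows. The data points, title, and axis labels are made up for illustration. In a Dask workflow the values would typically come from a computed (and therefore small) result.

```python
from bokeh.plotting import figure, show

# Hypothetical values standing in for an aggregated result.
x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 6, 7]

# The tools string enables panning, scroll-wheel zooming, and a reset button.
plot = figure(
    title="Example values",
    x_axis_label="x",
    y_axis_label="y",
    tools="pan,wheel_zoom,reset",
)
plot.line(x, y, line_width=2)

# Uncomment to open the interactive plot in a browser:
# show(plot)
```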