Matthew Rocklin - Dask in Production | SciPy 2024

Matthew Rocklin shares practical insights on running Dask in production: reducing cloud costs with ARM and spot instances, avoiding common infrastructure pitfalls, and tips for deployment.

Key takeaways
  • Cloud computing costs can be significantly reduced by:

    • Using ARM instances instead of Intel (about 5% faster and cheaper)
    • Leveraging spot instances when available
    • Turning off resources when not actively in use
    • Running workloads close to where data is stored (see the cluster-configuration sketch after this list)
  • Running Dask in production revealed:

    • The Global Interpreter Lock (GIL) is usually not a bottleneck (only ~25% contention; see the GIL sketch after this list)
    • Most workloads can process 1TB of data in ~5 hours for ~10 cents
    • Scaling is underutilized because people think it’s more expensive than it is
    • Raw cloud architecture (basic EC2 + networking) often works better than complex Kubernetes setups
  • Common cloud infrastructure challenges:

    • Docker wasn’t designed for rapid development cycles
    • Serverless functions (e.g., AWS Lambda) are about 4x more expensive than regular instances
    • Users often leave large VMs running 24/7 unnecessarily
    • Moving data between regions/services is extremely costly
  • Success factors for cloud deployments:

    • Making cloud environments match local development environments
    • Collecting detailed metrics on usage patterns (see the performance-report sketch after this list)
    • Supporting hardware flexibility across regions/instance types
    • Enabling rapid environment synchronization
  • The scientific Python ecosystem is increasingly ARM-compatible:

    • 90-95% of workloads can run on ARM
    • Only specific cases (like MKL-dependent code) require Intel
    • Community should move towards ARM as the default
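
Illustrative sketches

The cost levers above are described at a high level in the talk; the sketch below shows one way they might be combined with Coiled and Dask. The arm, spot_policy, and region parameters follow Coiled's cluster API but are assumptions here and may differ across versions; the bucket path and workload are hypothetical.

```python
# Minimal sketch (not from the talk): applying the cost levers with Coiled + Dask.
import coiled
import dask.dataframe as dd
from dask.distributed import Client

cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-1",                # run next to the data to avoid cross-region egress
    arm=True,                          # ARM (Graviton) workers: cheaper and roughly as fast
    spot_policy="spot_with_fallback",  # prefer spot capacity, fall back to on-demand
)
cluster.adapt(minimum=0, maximum=100)  # scale down to zero when idle instead of running 24/7
client = Client(cluster)

# Hypothetical dataset stored in the same region as the cluster.
df = dd.read_parquet("s3://my-bucket/events/")
print(df.groupby("user_id")["value"].mean().compute())

cluster.shutdown()  # turn the resources off when the job is done
```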
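
On the GIL takeaway: most numeric libraries (NumPy, pandas, Arrow) release the GIL inside their compiled routines, which is the main reason contention stays low in practice. The sketch below (plain Python and NumPy, not from the talk) contrasts a pure-Python loop that holds the GIL with a NumPy reduction that releases it, run across four threads.

```python
# Illustrative only: why numeric workloads rarely fight over the GIL.
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def python_work(n=2_000_000):
    # Pure-Python loop: holds the GIL, so threads cannot run it in parallel.
    total = 0
    for i in range(n):
        total += i * i
    return total

def numpy_work(n=20_000_000):
    # NumPy releases the GIL in its C loops, so threads overlap.
    x = np.arange(n, dtype="float64")
    return float((x * x).sum())

def timed(fn, workers=4):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: fn(), range(workers)))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"pure Python, 4 threads: {timed(python_work):.2f}s")  # largely serialized by the GIL
    print(f"NumPy,       4 threads: {timed(numpy_work):.2f}s")   # threads overlap
```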
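
On collecting metrics: one lightweight way to do this with Dask itself is the performance_report context manager, which saves the task stream, worker profiles, and transfer activity for a computation to a standalone HTML file. The array workload below is arbitrary.

```python
# Minimal sketch: record detailed metrics for one computation with Dask's
# built-in performance_report (writes a standalone HTML file).
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # local cluster here; point at a remote scheduler in production

with performance_report(filename="dask-report.html"):
    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
    (x @ x.T).mean().compute()
```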