Doris Lee - Scaling your data science workflows with Modin | SciPy 2024

Doris Lee

Learn how Modin scales Pandas workflows by changing only the import line, using parallel execution and multiple backends to accelerate data processing while keeping the familiar API and syntax.

Key takeaways
  • Modin is an open-source library that scales Pandas workflows by changing a single import line, maintaining full Pandas API compatibility

  • The default installation uses Ray as the backend, but Modin also supports other backends, including Dask and Snowflake

  • The key advantage is parallel execution: Modin uses all available CPU cores, whereas Pandas largely runs on a single core

  • Preserves the interactive development workflow familiar to data scientists, so scaling up does not require rewriting code

  • Addresses the common pain point of rewriting Pandas code for big data frameworks such as Spark when moving to production

  • Uses an underlying dataframe algebra that maps 600+ Pandas APIs to core operators, enabling optimization and scalability

  • The recent Snowpark Pandas API integration runs Pandas code directly in Snowflake, without moving the data out

  • Delivers speedups of several times even on a laptop, without code changes beyond the import

  • Works both on a single node and, through the Ray or Dask backends, on clusters

  • Particularly valuable for data teams whose datasets exceed memory limits but who want to keep their Pandas workflow
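
The single-import change described in the takeaways might look like the minimal sketch below. It assumes Modin is installed with an engine such as Ray; the fallback to plain pandas, the `sales_by_city` function name, and the sample data are all illustrative assumptions, not material from the talk.

```python
# Assumption: Modin (with an engine such as Ray) is installed.
# If it is not, fall back to plain pandas so this sketch still runs.
try:
    import modin.pandas as pd  # the one-line change; previously `import pandas as pd`
except ImportError:
    import pandas as pd

# Optional, when using Modin: pick the engine explicitly instead of the default.
#   import modin.config as cfg
#   cfg.Engine.put("dask")


def sales_by_city():
    """Ordinary pandas-style code; it is identical under either import."""
    # Made-up sample data, for illustration only.
    df = pd.DataFrame({
        "city": ["Austin", "Boston", "Austin", "Denver", "Boston"],
        "sales": [10, 20, 30, 40, 50],
    })
    totals = df.groupby("city")["sales"].sum()
    return totals.to_dict()


if __name__ == "__main__":
    print(sales_by_city())
```

On a dataset this small, any parallel engine is likely to add overhead rather than save time; the point of the sketch is only that the body of the code does not change.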
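
The idea of mapping a large API surface onto a few core operators can be illustrated with a toy sketch in pure Python. This is not Modin's actual operator set or internals: the three operators (`op_map`, `op_filter`, `op_reduce`) and every function name below are hypothetical, chosen only to show how several user-facing methods can reduce to a small core that an engine could then parallelize or optimize in one place.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]

# --- Hypothetical core operators (the only code that touches the data) ---

def op_map(table: Table, fn: Callable[[Row], Row]) -> Table:
    # Apply fn to a copy of each row, so the input table is not mutated.
    return [fn(dict(row)) for row in table]

def op_filter(table: Table, pred: Callable[[Row], bool]) -> Table:
    return [row for row in table if pred(row)]

def op_reduce(table: Table, fn: Callable[[Any, Row], Any], initial: Any) -> Any:
    return reduce(fn, table, initial)

# --- User-facing methods, each written only in terms of the core operators ---

def with_column(table: Table, name: str, fn: Callable[[Row], Any]) -> Table:
    def add(row: Row) -> Row:
        row[name] = fn(row)
        return row
    return op_map(table, add)

def where_greater(table: Table, column: str, threshold: float) -> Table:
    return op_filter(table, lambda row: row[column] > threshold)

def column_sum(table: Table, column: str) -> float:
    return op_reduce(table, lambda acc, row: acc + row[column], 0)

def row_count(table: Table) -> int:
    return op_reduce(table, lambda acc, _row: acc + 1, 0)

def column_mean(table: Table, column: str) -> float:
    # Composed from two other methods, so it also bottoms out in op_reduce.
    return column_sum(table, column) / row_count(table)

# Made-up sample data, for illustration only.
table = [
    {"city": "Austin", "sales": 10},
    {"city": "Boston", "sales": 20},
    {"city": "Denver", "sales": 40},
]

if __name__ == "__main__":
    big = where_greater(table, "sales", 15)
    print(column_sum(table, "sales"), row_count(big), column_mean(big, "sales"))
```

Because every method bottoms out in the same three operators, a speedup applied to those operators would benefit all of the methods at once, which is the design benefit the takeaway describes.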