Doris Lee - Scaling your data science workflows with Modin | SciPy 2024

Learn how Modin scales Pandas workflows with a one-line import change, using parallel execution and multiple backends to speed up data processing while keeping the familiar Pandas API and syntax.

Key takeaways
  • Modin is an open-source library that scales Pandas workflows by changing a single import line while remaining compatible with the Pandas API

  • Modin uses Ray as its default backend and supports others, including Dask and Snowflake

  • The key advantage is parallel execution: Modin uses all available compute cores, while Pandas is single-threaded

  • Preserves the interactive development workflow familiar to data scientists; scaling up does not require rewriting code

  • Addresses the common pain point of rewriting Pandas code for big data frameworks such as Spark when moving to production

  • Uses an underlying dataframe algebra that maps more than 600 Pandas APIs onto core operators, which enables optimization and scalability

  • The recent Snowpark Pandas API integration runs Pandas code directly on Snowflake, without moving data out of it

  • Delivers significant speedups (several times faster) even on a laptop, without code changes

  • Works on a single node and, through the Ray and Dask backends, on clusters

  • Particularly valuable for data teams whose datasets exceed memory limits but who want to keep their Pandas workflow
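
The single-import-line change from the takeaways can be sketched as follows. This is a minimal illustration, not code from the talk: the sample data, the `summarize` helper, and the `USING_MODIN` flag are hypothetical, and the fallback to plain Pandas is added only so the sketch still runs where Modin is not installed.

```python
# The only Modin-specific line is the import; everything below it is
# ordinary Pandas-style code.
try:
    import modin.pandas as pd  # parallel, Pandas-compatible DataFrame
    USING_MODIN = True         # hypothetical flag, for illustration only
except ImportError:
    import pandas as pd        # fallback (assumption: Modin may be absent)
    USING_MODIN = False


def summarize(frame):
    """Return total revenue per region (hypothetical example workload)."""
    return frame.groupby("region")["revenue"].sum()


# Made-up sample data; a real workload would typically load a large file.
df = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "revenue": [10, 20, 30, 40],
    }
)

totals = summarize(df)
print("Using Modin:", USING_MODIN)
print(totals.to_dict())
```

With Modin installed, the backend can usually be chosen without touching the analysis code, for example through the `MODIN_ENGINE` environment variable or `modin.config.Engine.put("dask")`. Which engines are available depends on how Modin was installed.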