Doris Lee - Scaling your data science workflows with Modin | SciPy 2024

Python

Learn how Modin scales Pandas workflows with a one-line import change, using parallel execution and multiple backends to accelerate data processing while keeping the familiar Pandas API and syntax.

Key takeaways

Modin is an open-source library that scales Pandas workflows by changing a single import line while maintaining compatibility with the Pandas API.
The default installation uses Ray as the backend, and other backends are supported, including Dask and Snowflake.
The key advantage is parallel execution: Modin uses all available compute cores, whereas Pandas is single-threaded.
Modin preserves the interactive development workflow familiar to data scientists, so scaling up does not require rewriting code.
It addresses a common pain point: having to rewrite Pandas code for big data frameworks such as Spark when moving to production.
Modin is built on an underlying dataframe algebra that maps 600+ Pandas APIs to core operators, which enables optimization and scalability.
The recent Snowpark Pandas API integration allows Pandas code to run directly on Snowflake without data movement.
Modin shows significant performance improvements (several times faster) even on a laptop, with no code changes beyond the import.
It works both on a single node and on clusters through the Ray and Dask backends.
It is particularly valuable for data teams whose datasets exceed memory limits but who want to keep their Pandas workflow.
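
The single-import change described in the takeaways can be sketched as follows. This is a minimal illustration, not code from the talk: the toy data and column names are hypothetical, and it assumes Modin is installed with an engine (for example via `pip install "modin[ray]"`). The fallback to plain Pandas is only there so the sketch still runs where Modin is absent.

```python
# Assumption: Modin and an engine such as Ray are installed.
# The try/except is only a convenience for this sketch, not something Modin requires.
try:
    import modin.pandas as pd  # the one-line change: was `import pandas as pd`
except ImportError:
    import pandas as pd  # fallback: the original, single-threaded import

# Hypothetical toy data; real workloads would be far larger.
df = pd.DataFrame(
    {
        "city": ["Austin", "Boston", "Austin", "Denver", "Boston"],
        "sales": [120, 80, 200, 50, 40],
    }
)

# Everything below the import is ordinary Pandas code and stays unchanged.
totals = df.groupby("city")["sales"].sum().sort_index()
print(totals)
```

On a dataset this small the parallel backend is unlikely to help. The point is only that the dataframe code itself is identical under either import.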
