Doris Lee - Scaling your data science workflows with Modin | SciPy 2024
Learn how Modin scales Pandas workflows with zero code changes, using parallel execution and multiple backends to accelerate data processing while maintaining familiar APIs and syntax.
- Modin is an open-source library that scales Pandas workflows by changing a single import line (import pandas as pd becomes import modin.pandas as pd) while maintaining full Pandas API compatibility
- The default installation uses Ray as the backend; other supported backends include Dask and Snowflake
- The key advantage is parallel execution: Modin uses all available compute cores, while Pandas is single-threaded
- Preserves the interactive development workflow familiar to data scientists, with no code rewrites required to scale
- Addresses the common pain point of rewriting Pandas code for big data frameworks like Spark when moving to production
- Built on an underlying dataframe algebra that maps 600+ Pandas APIs to a small set of core operators, which enables optimization and scalability
- The recent Snowpark Pandas API integration allows Pandas code to run directly on Snowflake without data movement
- Runs several times faster than Pandas even on a laptop, without code changes
- Works in both single-node and cluster environments through the Ray and Dask backends
- Particularly valuable for data teams whose datasets exceed memory limits but who want to keep their Pandas workflow
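
The parallel-execution point above can be illustrated with a minimal sketch of the partition, map, and combine pattern. This is not Modin's implementation: the helper names (split_rows, partial_sum, parallel_sum) are hypothetical, the data is a plain Python list rather than a dataframe, and a thread pool from the standard library stands in for a real engine such as Ray or Dask, which would run partitions in separate worker processes across cores.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def split_rows(rows, n_parts):
    """Split a list into at most n_parts contiguous partitions."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sum(partition):
    """Work done independently on one partition."""
    return sum(partition)


def parallel_sum(rows, n_workers=None):
    """Sum rows by aggregating each partition separately, then combining.

    Threads keep this sketch portable; in CPython they will not speed up
    CPU-bound work, so treat this as an illustration of the structure only.
    """
    n_workers = n_workers or os.cpu_count() or 1
    partitions = split_rows(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Map step: one task per partition. Combine step: sum the partials.
        return sum(pool.map(partial_sum, partitions))


if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_sum(data) == sum(data))
```

The caller never sees the partitions: it passes in the whole dataset and gets back one result, which is the same property that lets a library hide parallelism behind a familiar API.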