Hyukjin Kwon - Demystifying pandas with PySpark when scaling out | PyData Vermont 2024
Learn how to effectively scale Pandas code with PySpark, including key differences, debugging tips, dependency management, and performance optimization strategies from Hyukjin Kwon.
- PySpark and pandas differ fundamentally in mutability and evaluation: pandas is mutable and eagerly evaluated, while PySpark is immutable and lazily evaluated
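A minimal sketch of that contrast, assuming a local Spark installation is available:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# pandas: mutable and eager -- the assignment runs immediately and
# modifies the DataFrame in place.
pdf = pd.DataFrame({"a": [1, 2, 3]})
pdf["b"] = pdf["a"] * 2

# PySpark: immutable and lazy -- withColumn returns a new DataFrame,
# and nothing executes until an action such as show() or count().
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
sdf2 = sdf.withColumn("b", F.col("a") * 2)  # no job runs yet; sdf is unchanged
sdf2.show()                                 # the action triggers execution
```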
- Three main options when pandas can't scale:
  - Use a bigger machine
  - Down-sample the data
  - Switch to a distributed framework like PySpark
- Pandas API on Spark provides familiar pandas syntax while leveraging PySpark's distributed capabilities
  - 80-90% API coverage
  - Simply replace the import line to scale pandas code (see the sketch after this list)
  - Introduced in 2019 as Koalas, integrated into PySpark in 2021
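A rough illustration of the import swap; the file path and column names are made up for the example:

```python
# Before: plain pandas, bounded by a single machine's memory
# import pandas as pd
# df = pd.read_csv("data.csv")

# After: pandas API on Spark -- same syntax, distributed execution
import pyspark.pandas as ps

psdf = ps.read_csv("data.csv")                 # hypothetical file
psdf["total"] = psdf["price"] * psdf["qty"]    # hypothetical columns
print(psdf.head())
```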
- Key debugging and performance considerations:
  - Ensure data is properly partitioned and evenly distributed (see the sketch after this list)
  - Verify code works on a single node before scaling
  - Use the Spark UI for debugging performance issues
  - Avoid excessive use of for loops
  - Leverage runtime profilers
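A small sketch of the partitioning check, using classic (non-Connect) PySpark APIs; the dataset and partition count are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(1_000_000)          # stand-in for a real dataset

# Check how many partitions the data is split into; a few huge partitions
# (or thousands of tiny ones) is a common cause of skew and overhead.
print(sdf.rdd.getNumPartitions())

# Spread the work more evenly across executors.
sdf = sdf.repartition(200)

# Label the next job so it is easy to find in the Spark UI.
spark.sparkContext.setJobDescription("repartition sanity check")
print(sdf.count())
```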
- Dependency management improvements in Spark:
  - Session-scoped dependencies through Spark Connect
  - Support for different versions of dependencies across sessions
  - Conda environments can be packed into tar files for distribution (see the sketch after this list)
  - UDF-level dependency support coming in Spark 4.0
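A sketch of the conda-pack approach, loosely following PySpark's documented package-management workflow; the environment name and archive path are illustrative, and the spark.archives mechanism applies to cluster deployments such as YARN or Kubernetes:

```python
# Pack the driver's conda environment once with conda-pack, e.g.:
#   conda pack -f -o pyspark_conda_env.tar.gz
import os
from pyspark.sql import SparkSession

# Point Python workers at the unpacked environment (aliased "environment").
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    # The archive is shipped to and unpacked on each executor.
    .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
    .getOrCreate()
)

# With Spark Connect (Spark 3.5+), dependencies can instead be added per
# session, e.g. via SparkSession.addArtifact(s).
```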
- Three types of index handling in a distributed environment (configurable as shown below):
  - Sequence (single node)
  - Distributed-sequence (triggers a shuffle)
  - Distributed (non-sequential)
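A sketch of choosing the default index type through the pandas-on-Spark options, assuming pyspark.pandas is available:

```python
import pyspark.pandas as ps

# Pick how the implicit index is generated when none is given:
#   "sequence"             -- globally sequential, computed on a single node
#   "distributed-sequence" -- sequential, computed distributively (extra work)
#   "distributed"          -- monotonically increasing but non-sequential
ps.set_option("compute.default_index_type", "distributed")

psdf = ps.range(10)   # picks up the configured default index type
print(psdf.index)
```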
- Pandas UDFs (user-defined functions) provide a way to (see the sketch after this list):
  - Work with native pandas instances
  - Execute vectorized operations
  - Handle batch processing of PySpark DataFrames
  - Maintain type safety through annotations
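A minimal Series-to-Series pandas UDF sketch using Python type hints; the function name and data are illustrative:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def times_two(s: pd.Series) -> pd.Series:
    # Each call receives a batch of rows as a pandas Series and returns a
    # Series of the same length, so the arithmetic stays vectorized.
    return s * 2.0

sdf = spark.range(5)  # single bigint column named "id"
sdf.select(times_two("id").alias("doubled")).show()
```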