Hyukjin Kwon - Demystifying pandas with PySpark when scaling out | PyData Vermont 2024

Learn how to effectively scale Pandas code with PySpark, including key differences, debugging tips, dependency management, and performance optimization strategies from Hyukjin Kwon.

Key takeaways
  • PySpark and Pandas have fundamental differences in mutability and evaluation: Pandas is mutable and eagerly evaluated, while PySpark is immutable and lazily evaluated (eager-vs-lazy sketch below)

  • Three main options when Pandas can’t scale:

    • Use a bigger machine
    • Down-sample the data
    • Switch to a distributed framework like PySpark
  • Pandas API on Spark provides familiar Pandas syntax while leveraging PySpark’s distributed capabilities (import swap sketched below)

    • 80-90% API coverage
    • Simply replace the import line to scale existing Pandas code
    • Introduced in 2019 as Koalas, integrated into PySpark in 2021
  • Key debugging and performance considerations (partition check sketched below):

    • Ensure data is properly partitioned and evenly distributed
    • Verify the code works on a single node before scaling
    • Use Spark UI for debugging performance issues
    • Avoid excessive use of for loops
    • Leverage runtime profilers
  • Dependency management improvements in Spark (conda-pack sketch below):

    • Session-scoped dependencies through Spark Connect
    • Support for different versions of dependencies across sessions
    • Can pack conda environments into tar files for distribution
    • UDF-level dependency support coming in Spark 4.0
  • Three types of index handling in a distributed environment (index option sketch below):

    • Sequence (single node)
    • Distributed sequence (triggers shuffle)
    • Distributed (non-sequential)
  • Pandas UDFs (user-defined functions), with an example sketched below, provide a way to:

    • Work with native Pandas instances
    • Execute vectorized operations
    • Handle batch processing of PySpark DataFrames
    • Maintain type safety through annotations
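
Code sketches
A minimal sketch of the mutability and evaluation difference noted in the first takeaway; the column names and values are illustrative, not from the talk.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# pandas: mutable and eager -- the assignment executes immediately
# and modifies the DataFrame in place.
pdf = pd.DataFrame({"a": [1, 2, 3]})
pdf["b"] = pdf["a"] * 2                      # runs right away

# PySpark: immutable and lazy -- withColumn returns a new DataFrame,
# and nothing executes until an action such as show() or count().
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1,), (2,), (3,)], ["a"])
sdf2 = sdf.withColumn("b", F.col("a") * 2)   # only builds a query plan
sdf2.show()                                  # action: triggers execution
```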
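
The import swap behind the Pandas API on Spark: the pandas-style code stays the same while execution is distributed. The file data.csv and its columns a and b are hypothetical.

```python
# Before: plain pandas
#   import pandas as pd
#   df = pd.read_csv("data.csv")

# After: pandas API on Spark -- same syntax, distributed execution.
import pyspark.pandas as ps

df = ps.read_csv("data.csv")     # "data.csv" with columns a, b is hypothetical
df["c"] = df["a"] + df["b"]      # familiar pandas-style column assignment
print(df.head())
```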
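
A rough way to check partitioning and locate the Spark UI when debugging performance, assuming classic (non-Connect) PySpark where the RDD API and SparkContext are reachable; the partition count of 200 is arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# A few huge partitions (or thousands of tiny ones) is a common
# performance smell; check how the data is split.
print(df.rdd.getNumPartitions())

# Repartition to spread rows more evenly across the cluster.
df = df.repartition(200)

# The Spark UI shows per-task timings, shuffle sizes, and skew.
print(spark.sparkContext.uiWebUrl)
```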
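
One way to ship a packed conda environment with a job, roughly following the approach in the PySpark package-management documentation; the environment and archive names are illustrative.

```python
# Shell steps, run once outside Spark (environment name is illustrative):
#   conda create -y -n pyspark_conda_env -c conda-forge pandas pyarrow conda-pack
#   conda activate pyspark_conda_env
#   conda pack -f -o pyspark_conda_env.tar.gz

import os
from pyspark.sql import SparkSession

# Ship the archive with the job; Spark unpacks it on each node under the
# alias after '#', and PYSPARK_PYTHON points workers at the interpreter
# inside the unpacked environment.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
spark = (
    SparkSession.builder
    .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
    .getOrCreate()
)
```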
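
Choosing among the three index types in the Pandas API on Spark; the compute.default_index_type option accepts "sequence", "distributed-sequence", and "distributed".

```python
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder.getOrCreate()

# How pandas-on-Spark attaches an index when one must be created,
# e.g. when converting a Spark DataFrame:
#   "sequence"             -- 0, 1, 2, ... computed on a single node (can bottleneck)
#   "distributed-sequence" -- 0, 1, 2, ... computed distributedly (extra shuffle/compute)
#   "distributed"          -- monotonically increasing but non-sequential; cheapest
ps.set_option("compute.default_index_type", "distributed")

sdf = spark.range(5)
psdf = sdf.pandas_api()   # Spark DataFrame -> pandas-on-Spark DataFrame
print(psdf.index)
```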
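
A small Pandas UDF showing the annotation-driven, vectorized batch model; the function name and column are made up for illustration.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.getOrCreate()

# The pd.Series -> pd.Series annotations plus the "double" return type
# tell Spark how to batch rows with Arrow and what schema to expect.
@pandas_udf("double")
def times_two(s: pd.Series) -> pd.Series:
    # s is a native pandas Series holding a batch of rows, so the
    # multiplication is vectorized over the whole batch.
    return s * 2.0

df = spark.range(10)
df.select(times_two(col("id")).alias("doubled")).show()
```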