Hyukjin Kwon - Demystifying pandas with PySpark when scaling out | PyData Vermont 2024
Learn how to effectively scale Pandas code with PySpark, including key differences, debugging tips, dependency management, and performance optimization strategies from Hyukjin Kwon.
- PySpark and Pandas differ fundamentally in mutability and evaluation: Pandas is mutable and eagerly evaluated, while PySpark is immutable and lazily evaluated (see the sketch below)
- Three main options when Pandas can’t scale:
    - Use a bigger machine
    - Down-sample the data
    - Switch to a distributed framework like PySpark
 
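A minimal sketch of the mutability/evaluation contrast from the first bullet, assuming a local SparkSession; the data is made up for illustration:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# pandas: eager and mutable -- the new column is computed and stored immediately.
pdf = pd.DataFrame({"a": [1, 2, 3]})
pdf["b"] = pdf["a"] * 2                      # mutates pdf in place

# PySpark: lazy and immutable -- withColumn returns a *new* DataFrame, and
# nothing is computed until an action (show, count, collect, ...) runs.
sdf = spark.createDataFrame([(1,), (2,), (3,)], ["a"])
sdf2 = sdf.withColumn("b", F.col("a") * 2)   # sdf itself is unchanged
sdf2.show()                                  # execution actually happens here
```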
- Pandas API on Spark provides familiar Pandas syntax while leveraging PySpark’s distributed capabilities:
    - 80-90% API coverage
    - Simply replace the import line to scale existing Pandas code (example below)
    - Introduced in 2019 as Koalas, integrated into PySpark in 2021
 
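A hedged sketch of the "replace the import" idea; the file name and column names are made up for illustration:

```python
# Single-machine pandas version:
#   import pandas as pd
#   df = pd.read_csv("sales.csv")
#   df["total"] = df["price"] * df["qty"]

# Pandas API on Spark: same syntax, distributed execution.
import pyspark.pandas as ps

psdf = ps.read_csv("sales.csv")              # pyspark.pandas DataFrame
psdf["total"] = psdf["price"] * psdf["qty"]  # familiar pandas-style assignment
print(psdf.head())
```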
- Key debugging and performance considerations (partitioning sketch below):
    - Ensure data is properly partitioned and evenly distributed
    - Verify the code works on a single node before scaling out
    - Use the Spark UI to debug performance issues
    - Avoid excessive use of for loops
    - Leverage runtime profilers
 
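A small sketch of those partitioning checks using pyspark.pandas; the 1,000,000-row range and the 200-partition target are arbitrary example values:

```python
import pyspark.pandas as ps

psdf = ps.range(1_000_000)

# Check how the data is currently spread across partitions.
sdf = psdf.to_spark()
print("partitions:", sdf.rdd.getNumPartitions())

# Repartition if the data is skewed or packed onto too few partitions.
psdf = sdf.repartition(200).pandas_api()

# Prefer vectorized, columnar operations over Python for loops over rows.
psdf = psdf.assign(doubled=psdf["id"] * 2)

# Inspect the query plan (the Spark UI shows the same jobs and stages).
psdf.spark.explain()
```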
- Dependency management improvements in Spark (packaging sketch below):
    - Session-scoped dependencies through Spark Connect
    - Support for different versions of dependencies across sessions
    - Conda environments can be packed into tar files for distribution
    - UDF-level dependency support coming in Spark 4.0
 
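A sketch of the conda-pack sub-bullet, following the pattern in the PySpark dependency-management docs; the archive and environment names are placeholders:

```python
# Packed beforehand on the client, e.g.: conda pack -f -o pyspark_conda_env.tar.gz
import os
from pyspark.sql import SparkSession

# Point the Python workers at the unpacked environment on each executor.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    # Ship the tarball and unpack it as "environment" on the executors
    # ("spark.yarn.dist.archives" on YARN).
    .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
    .getOrCreate()
)

# With Spark Connect (3.5+), a packed environment can also be attached per
# session, e.g. spark.addArtifacts("pyspark_conda_env.tar.gz", archive=True).
```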
- Three types of index handling in a distributed environment (configuration sketch below):
    - Sequence (single node)
    - Distributed sequence (triggers a shuffle)
    - Distributed (non-sequential)
 
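The default index type is configurable in pyspark.pandas; a small sketch of switching it:

```python
import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# How a default index is attached when the data has none:
#   "sequence"             -- 0, 1, 2, ... computed on a single node (can bottleneck)
#   "distributed-sequence" -- globally sequential, but needs extra distributed work
#   "distributed"          -- monotonically increasing, non-sequential, fully parallel
ps.set_option("compute.default_index_type", "distributed")

sdf = spark.range(5)        # a plain Spark DataFrame has no index
psdf = sdf.pandas_api()     # the default index is attached here
print(psdf.index)
```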
- Pandas UDFs (user-defined functions), shown in the annotated example below, provide a way to:
    - Work with native Pandas instances
    - Execute vectorized operations
    - Handle batch processing of PySpark DataFrames
    - Maintain type safety through annotations
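
A minimal Series-to-Series pandas UDF with type annotations; the temperature conversion is an illustrative stand-in:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def to_fahrenheit(celsius: pd.Series) -> pd.Series:
    # Called on batches of rows as native pandas Series (vectorized),
    # rather than once per row like a plain Python UDF.
    return celsius * 9.0 / 5.0 + 32.0

sdf = spark.createDataFrame([(0.0,), (25.0,), (100.0,)], ["celsius"])
sdf.select(to_fahrenheit("celsius").alias("fahrenheit")).show()
```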