Hyukjin Kwon - Demystifying pandas with PySpark when scaling out | PyData Vermont 2024

Learn how to effectively scale Pandas code with PySpark, including key differences, debugging tips, dependency management, and performance optimization strategies from Hyukjin Kwon.

Key takeaways
  • PySpark and Pandas have fundamental differences in mutability and evaluation: Pandas is mutable and eagerly evaluated, while PySpark is immutable and lazily evaluated (eager-vs-lazy sketch below)

  • Three main options when Pandas can’t scale:

    • Use a bigger machine
    • Down-sample the data
    • Switch to a distributed framework like PySpark
  • Pandas API on Spark provides familiar Pandas syntax while leveraging PySpark’s distributed capabilities (import swap sketched below)

    • 80-90% API coverage
    • Simply replace the import line to scale existing Pandas code
    • Introduced in 2019 as Koalas, integrated into PySpark in 2021
  • Key debugging and performance considerations (partition check sketched below):

    • Ensure data is properly partitioned and evenly distributed
    • Verify the code works on a single node before scaling
    • Use Spark UI for debugging performance issues
    • Avoid excessive use of for loops
    • Leverage runtime profilers
  • Dependency management improvements in Spark (conda-pack sketch below):

    • Session-scoped dependencies through Spark Connect
    • Support for different versions of dependencies across sessions
    • Can pack conda environments into tar files for distribution
    • UDF-level dependency support coming in Spark 4.0
  • Three types of index handling in a distributed environment (index option sketch below):

    • Sequence (single node)
    • Distributed sequence (triggers shuffle)
    • Distributed (non-sequential)
  • Pandas UDFs (user-defined functions), with an example sketched below, provide a way to:

    • Work with native Pandas instances
    • Execute vectorized operations
    • Handle batch processing of PySpark DataFrames
    • Maintain type safety through annotations
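
Code sketches
A minimal sketch of the mutability and evaluation difference noted in the first takeaway; the column names and values are illustrative, not from the talk.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# pandas: mutable and eager -- the assignment executes immediately
# and modifies the DataFrame in place.
pdf = pd.DataFrame({"a": [1, 2, 3]})
pdf["b"] = pdf["a"] * 2                      # runs right away

# PySpark: immutable and lazy -- withColumn returns a new DataFrame,
# and nothing executes until an action such as show() or count().
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1,), (2,), (3,)], ["a"])
sdf2 = sdf.withColumn("b", F.col("a") * 2)   # only builds a query plan
sdf2.show()                                  # action: triggers execution
```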
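
The import swap behind the Pandas API on Spark: the pandas-style code stays the same while execution is distributed. The file data.csv and its columns a and b are hypothetical.

```python
# Before: plain pandas
#   import pandas as pd
#   df = pd.read_csv("data.csv")

# After: pandas API on Spark -- same syntax, distributed execution.
import pyspark.pandas as ps

df = ps.read_csv("data.csv")     # "data.csv" with columns a, b is hypothetical
df["c"] = df["a"] + df["b"]      # familiar pandas-style column assignment
print(df.head())
```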
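
A rough way to check partitioning and locate the Spark UI when debugging performance, assuming classic (non-Connect) PySpark where the RDD API and SparkContext are reachable; the partition count of 200 is arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# A few huge partitions (or thousands of tiny ones) is a common
# performance smell; check how the data is split.
print(df.rdd.getNumPartitions())

# Repartition to spread rows more evenly across the cluster.
df = df.repartition(200)

# The Spark UI shows per-task timings, shuffle sizes, and skew.
print(spark.sparkContext.uiWebUrl)
```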
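
One way to ship a packed conda environment with a job, roughly following the approach in the PySpark package-management documentation; the environment and archive names are illustrative.

```python
# Shell steps, run once outside Spark (environment name is illustrative):
#   conda create -y -n pyspark_conda_env -c conda-forge pandas pyarrow conda-pack
#   conda activate pyspark_conda_env
#   conda pack -f -o pyspark_conda_env.tar.gz

import os
from pyspark.sql import SparkSession

# Ship the archive with the job; Spark unpacks it on each node under the
# alias after '#', and PYSPARK_PYTHON points workers at the interpreter
# inside the unpacked environment.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
spark = (
    SparkSession.builder
    .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
    .getOrCreate()
)
```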
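
Choosing among the three index types in the Pandas API on Spark; the compute.default_index_type option accepts "sequence", "distributed-sequence", and "distributed".

```python
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder.getOrCreate()

# How pandas-on-Spark attaches an index when one must be created,
# e.g. when converting a Spark DataFrame:
#   "sequence"             -- 0, 1, 2, ... computed on a single node (can bottleneck)
#   "distributed-sequence" -- 0, 1, 2, ... computed distributedly (extra shuffle/compute)
#   "distributed"          -- monotonically increasing but non-sequential; cheapest
ps.set_option("compute.default_index_type", "distributed")

sdf = spark.range(5)
psdf = sdf.pandas_api()   # Spark DataFrame -> pandas-on-Spark DataFrame
print(psdf.index)
```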
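
A small Pandas UDF showing the annotation-driven, vectorized batch model; the function name and column are made up for illustration.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.getOrCreate()

# The pd.Series -> pd.Series annotations plus the "double" return type
# tell Spark how to batch rows with Arrow and what schema to expect.
@pandas_udf("double")
def times_two(s: pd.Series) -> pd.Series:
    # s is a native pandas Series holding a batch of rows, so the
    # multiplication is vectorized over the whole batch.
    return s * 2.0

df = spark.range(10)
df.select(times_two(col("id")).alias("doubled")).show()
```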