Shaurya Agarwal - All Them Data Engines: Data Munging with Python circa 2023 | PyData Global 2023

Explore Python's data processing tools from lists to Spark, learning when to use each for optimal performance. Compare eager vs lazy evaluation and key optimization strategies.

Key takeaways
  • Python evaluation is eager by default - it processes data immediately, which can be inefficient for large datasets
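
    A minimal sketch of the eager/lazy difference in plain Python: the list comprehension materializes every parsed row up front, while the generator pipeline does no work until something consumes it. The file name and column positions (`events.csv`, `r[1]`, `r[2]`) are hypothetical, chosen only for illustration.

    ```python
    from itertools import islice

    # Eager: the list comprehension parses every line immediately and keeps
    # the whole result in memory, even if only a few values are needed.
    rows = [line.rstrip("\n").split(",") for line in open("events.csv")]
    eager_amounts = [float(r[2]) for r in rows if r[1] == "sale"]

    # Lazy: generator expressions describe the same pipeline but do nothing
    # until iterated, so taking the first ten values parses only what it must.
    lines = (line.rstrip("\n").split(",") for line in open("events.csv"))
    lazy_amounts = (float(r[2]) for r in lines if r[1] == "sale")
    first_ten = list(islice(lazy_amounts, 10))
    ```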

  • Key data structures have different performance characteristics:

    • Lists: Flexible but have overhead due to dynamic typing
    • NumPy arrays: 20x+ faster than lists due to contiguous memory and uniform typing
    • Pandas: Builds on NumPy and adds convenient data analysis features, but its extra layers add overhead that shows on large data
    • Spark: Good for data larger than memory, but adds operational complexity
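
    A rough way to see the performance gap described above: the same sum computed over a Python list of boxed integers, a contiguous NumPy array, and a pandas Series wrapping that array. The array size and repetition count are arbitrary, and the exact speedup will vary by machine and operation.

    ```python
    import timeit

    import numpy as np
    import pandas as pd

    n = 1_000_000
    py_list = list(range(n))       # boxed Python ints, pointers scattered on the heap
    np_array = np.arange(n)        # uniform dtype packed into one contiguous buffer
    series = pd.Series(np_array)   # pandas wraps the NumPy buffer and adds an index

    print("list  :", timeit.timeit(lambda: sum(py_list), number=10))
    print("numpy :", timeit.timeit(lambda: np_array.sum(), number=10))
    print("pandas:", timeit.timeit(lambda: series.sum(), number=10))
    ```
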
  • Data type handling is critical for performance:

    • Python’s dynamic duck typing adds per-value overhead but provides flexibility
    • NumPy requires uniform types but gains significant speed
    • Being explicit about data types helps optimization
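
    One way to see the cost of that flexibility, and the payoff of being explicit about types: the same three numbers stored as individually boxed Python floats versus a NumPy array with a declared dtype. The exact byte counts are platform-dependent and only illustrative.

    ```python
    import sys

    import numpy as np

    values = [1.0, 2.5, 3.75]

    # A Python list stores pointers to boxed float objects; each object also
    # carries its own type pointer and reference count.
    list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

    # Declaring the dtype up front lets NumPy pack raw 64-bit (or 32-bit)
    # floats into a single contiguous buffer with no per-value boxing.
    arr64 = np.array(values, dtype=np.float64)
    arr32 = np.array(values, dtype=np.float32)

    print("list + boxed floats:", list_bytes, "bytes")
    print("float64 array      :", arr64.nbytes, "bytes")
    print("float32 array      :", arr32.nbytes, "bytes")
    ```
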
  • Data representation matters:

    • Row vs column orientation affects performance
    • Contiguous memory access is faster than scattered access
    • Memory overhead grows with method count and dynamic features
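
    A small sketch of the row-versus-column point: summing one field across row-oriented Python dicts touches scattered boxed objects, while the column-oriented layout walks a single contiguous NumPy buffer. The field names and record count are made up for illustration.

    ```python
    import timeit

    import numpy as np

    n = 200_000

    # Row orientation: one dict per record, values boxed and scattered in memory.
    rows = [{"id": i, "price": float(i), "qty": i % 10} for i in range(n)]

    # Column orientation: one contiguous array per field.
    cols = {
        "id": np.arange(n),
        "price": np.arange(n, dtype=np.float64),
        "qty": np.arange(n) % 10,
    }

    print("rows   :", timeit.timeit(lambda: sum(r["price"] for r in rows), number=5))
    print("columns:", timeit.timeit(lambda: cols["price"].sum(), number=5))
    ```
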
  • Key optimizations to consider:

    • Filter data early before processing
    • Use predicate pushdown when possible
    • Consider memory constraints and data volume
    • Choose between eager and lazy evaluation based on your needs
    • Leverage vectorized operations over loops
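
    A hedged sketch combining a few of these ideas. The pandas expression shows a vectorized filter-then-aggregate in place of a Python loop; the PySpark plan shows lazy evaluation, where filtering early lets the engine push the predicate down to formats like Parquet so non-matching data is never read. The Parquet path and column names are hypothetical, and the Spark part assumes a working PySpark installation.

    ```python
    # Vectorized instead of looped: filter early, then aggregate, all in
    # compiled code rather than a row-by-row Python loop.
    import pandas as pd

    df = pd.DataFrame({"amount": [10.0, 250.0, 42.0], "region": ["eu", "us", "eu"]})
    eu_total = df.loc[df["region"] == "eu", "amount"].sum()

    # Lazy evaluation + predicate pushdown: Spark only builds a plan here;
    # nothing is computed until an action such as show() runs.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()
    plan = (
        spark.read.parquet("sales.parquet")            # hypothetical dataset
             .filter(F.col("region") == "eu")          # applied before the full scan
             .groupBy("region")
             .agg(F.sum("amount").alias("total"))
    )
    plan.show()
    ```
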
  • Start with simpler tools first:

    • Use vanilla Python for basic analysis
    • Move to pandas for structured data analysis
    • Consider Spark only when data exceeds single machine capacity
    • Match tool complexity to actual requirements
  • Data parsing considerations:

    • Handle CSV dialects and quoting carefully
    • Consider file headers and schema definitions
    • Plan for data type conversion and cleaning
    • Watch for edge cases in real data
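
    A sketch of defensive CSV loading with pandas that touches each of the points above: sniff the dialect instead of guessing, state the header and dtypes explicitly, and coerce messy values rather than letting them crash the pipeline. The file, column names, and null sentinels are assumptions for illustration.

    ```python
    import csv

    import pandas as pd

    # Sniff the dialect (delimiter, quoting) from a sample of the file.
    with open("orders.csv", newline="") as f:
        dialect = csv.Sniffer().sniff(f.read(4096))

    df = pd.read_csv(
        "orders.csv",
        sep=dialect.delimiter,
        quotechar=dialect.quotechar,
        header=0,                          # first row is the header
        dtype={"order_id": "string"},      # be explicit where inference guesses wrong
        na_values=["", "N/A", "missing"],  # real data hides nulls in many ways
    )

    # Convert and clean after load: bad values become NaN/NaT instead of raising.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    ```
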
  • Focus on understanding the data first, then optimize:

    • Start with working code before optimization
    • Build reusable data cleaning functions
    • Document data idiosyncrasies
    • Test with realistic data volumes