Shaurya Agarwal - All Them Data Engines: Data Munging with Python circa 2023 | PyData Global 2023
Explore Python's data processing tools from lists to Spark, learning when to use each for optimal performance. Compare eager vs lazy evaluation and key optimization strategies.
- Python evaluation is eager by default: it processes data immediately, which can be inefficient for large datasets (see the sketch below)
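A minimal sketch of that eager/lazy distinction using only the standard library; the ten-million-element range is an arbitrary size chosen for illustration:

```python
from itertools import islice

# Eager: the list comprehension materializes all ten million squares at
# once, even though only the first five are ever used.
eager = [x * x for x in range(10_000_000)]
print(eager[:5])

# Lazy: the generator expression computes values on demand, so taking
# the first five results touches only five elements.
lazy = (x * x for x in range(10_000_000))
print(list(islice(lazy, 5)))
```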
- Key data structures have different performance characteristics (see the timing sketch below):
  - Lists: flexible, but carry per-element overhead due to dynamic typing
  - NumPy arrays: 20x+ faster than lists thanks to contiguous memory and uniform typing
  - pandas: builds on NumPy and adds convenient data analysis features, but can be slower on very large data
  - Spark: suits data larger than a single machine's memory, but adds operational complexity
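A rough sketch of the list-vs-NumPy gap; the exact ratio depends on the machine, and the 20x figure above is the talk's ballpark:

```python
import timeit
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n)

# Pure-Python sum iterates over a million boxed int objects.
list_time = timeit.timeit(lambda: sum(py_list), number=10)

# NumPy sums a contiguous block of machine integers in compiled code.
numpy_time = timeit.timeit(lambda: np_array.sum(), number=10)

print(f"list: {list_time:.3f}s  numpy: {numpy_time:.3f}s  "
      f"ratio: {list_time / numpy_time:.0f}x")
```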
- Data type handling is critical for performance (see the dtype sketch below):
  - Python's dynamic (duck) typing adds overhead but provides flexibility
  - NumPy requires uniform types but gains significant speed
  - Being explicit about data types helps optimization
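A sketch of being explicit about dtypes in pandas; the column names and inline CSV are made up for illustration:

```python
import io
import pandas as pd

csv_data = io.StringIO("user_id,age,score\n1,34,9.5\n2,29,7.25\n")

# Left to infer, pandas would default to int64/float64; declaring
# narrower dtypes up front saves memory and a second inference pass.
df = pd.read_csv(
    csv_data,
    dtype={"user_id": "int32", "age": "int8", "score": "float32"},
)
print(df.dtypes)
print(df.memory_usage(deep=True))
```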
- Data representation matters (see the access-pattern sketch below):
  - Row vs column orientation affects performance
  - Contiguous memory access is faster than scattered access
  - Memory overhead grows with an object's method count and dynamic features
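A small sketch of contiguous vs scattered access: the same number of elements, but the strided view skips through memory, defeating cache prefetching:

```python
import timeit
import numpy as np

contiguous = np.ones(10_000_000)
# Every-other-element view over a doubled buffer: same 10M elements,
# but each read jumps over memory instead of streaming through it.
strided = np.ones(20_000_000)[::2]

t_contig = timeit.timeit(lambda: contiguous.sum(), number=10)
t_strided = timeit.timeit(lambda: strided.sum(), number=10)
print(f"contiguous: {t_contig:.3f}s  strided: {t_strided:.3f}s")
```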
- Key optimizations to consider (two are illustrated in the sketch below):
  - Filter data early, before heavy processing
  - Use predicate pushdown when the engine supports it
  - Consider memory constraints and data volume
  - Choose between eager and lazy evaluation based on needs
  - Prefer vectorized operations over Python loops
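A sketch combining early filtering with a vectorized operation; the column names, sizes, and the 7% rate are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["us", "eu", "apac"], size=1_000_000),
    "revenue": rng.random(1_000_000) * 100,
})

# Filter early: shrink the working set before any derived computation.
us = df[df["region"] == "us"]

# Vectorized: one compiled operation over the whole column instead of a
# Python loop over rows.
print((us["revenue"] * 0.07).sum())
```

For file-backed data the same idea becomes predicate pushdown, e.g. the `filters=` argument that pandas forwards to PyArrow when reading Parquet, so rows are dropped inside the reader rather than after loading.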
- Start with simpler tools first:
  - Use vanilla Python for basic analysis
  - Move to pandas for structured data analysis
  - Consider Spark only when data exceeds single-machine capacity
  - Match tool complexity to actual requirements
- Data parsing considerations (a CSV-handling sketch follows):
  - Handle CSV dialects and quoting carefully
  - Consider file headers and schema definitions
  - Plan for data type conversion and cleaning
  - Watch for edge cases in real-world data
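A sketch of careful CSV handling with the standard library; the sample record is fabricated to show two classic traps:

```python
import csv
import io

# A quoted field containing the delimiter, plus an escaped inner quote:
# exactly the cases that naive str.split(",") parsing gets wrong.
raw = 'name,city,notes\n"Doe, Jane",Berlin,"said ""hi"""\n'

reader = csv.DictReader(io.StringIO(raw), quotechar='"')
for row in reader:
    print(row["name"], "|", row["notes"])
# -> Doe, Jane | said "hi"
```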
- Focus on understanding the data first, then optimize:
  - Start with working code before optimizing
  - Build reusable data cleaning functions (see the sketch below)
  - Document data idiosyncrasies
  - Test with realistic data volumes
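A sketch of a small reusable cleaning function in the spirit of these notes; the column names, the "MISSING" sentinel, and the rules are hypothetical:

```python
import io
import pandas as pd

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize one raw orders extract; keep every rule in one place."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower()
    # Document the idiosyncrasy: this upstream writes "MISSING" for
    # nulls, which read_csv does not treat as NA by default.
    out = out.replace({"MISSING": pd.NA})
    # Coerce instead of crash: unparseable values become NaN/NaT so bad
    # rows can be inspected later rather than aborting the whole load.
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    return out

raw = io.StringIO("Order_Date , AMOUNT\n2023-01-05,19.99\nbad-date,MISSING\n")
print(clean_orders(pd.read_csv(raw)))
```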