The pragmatic Pythonic data engineer [PyCon DE & PyData Berlin 2024]

Learn pragmatic Python data engineering practices covering data validation, Apache Airflow orchestration, PySpark processing, and monitoring in production environments.

Key takeaways
  • Data engineering requires a pragmatic approach focused on solving real problems rather than chasing trendy tools or silver bullets

  • Key components of a data product architecture:

    • Data sources and ingestion
    • Common data format for transfer/storage (like Parquet/Arrow); see the Parquet sketch after this list
    • Data validation and testing
    • Orchestration and monitoring
    • Serving layer for consumption
  • Data validation is critical:

    • Use tools like Great Expectations for data quality checks
    • Use Pydantic for business validation rules (sketched below)
    • Test data as rigorously as code
    • Implement validation in production pipelines
  • Apache Airflow best practices:

    • Organize DAGs in a standard folder structure
    • Use templating for reusability (see the DAG example below)
    • Implement proper testing of DAGs and operators
    • Leverage built-in monitoring and alerting
    • Create custom operators when needed
  • PySpark considerations:

    • Handles Python-to-JVM serialization automatically
    • Good for large-scale data processing
    • Works well with pandas DataFrames (see the PySpark example below)
    • Supports both batch and streaming
    • Native Python integration without heavy Java configuration
  • Data scale and architecture choices:

    • Consider data volume when choosing tools
    • Use partitioning for large datasets (see the partitioning example below)
    • Balance cost vs performance
    • Traditional methods still valid for smaller scales
    • Think about future scaling needs
  • Observability and monitoring:

    • Track pipeline health
    • Monitor data quality
    • Set up alerts for validation errors (see the monitoring sketch below)
    • Use metrics for optimization
    • Integrate with existing monitoring tools
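
Code sketches

A common data format such as Parquet decouples producers from consumers. A minimal sketch, assuming a small pandas DataFrame stands in for ingested data (the file name and columns are illustrative):

```python
import pandas as pd

# Illustrative in-memory data standing in for an ingested source.
events = pd.DataFrame(
    {
        "event_id": [1, 2, 3],
        "country": ["DE", "DE", "FR"],
        "amount": [9.99, 14.50, 3.20],
    }
)

# Parquet keeps schema and types and is readable from pandas, PySpark,
# DuckDB, and most warehouses, which makes it a good interchange format.
events.to_parquet("events.parquet", engine="pyarrow", index=False)

# Downstream consumers read the same file without re-parsing CSVs.
restored = pd.read_parquet("events.parquet", engine="pyarrow")
print(restored.dtypes)
```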
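
For the "test data as rigorously as code" point, a minimal Pydantic sketch that splits raw rows into validated records and rejects; the Order model, its fields, and the positive-amount rule are assumptions, not rules from the talk:

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical record schema; field names and constraints are illustrative.
class Order(BaseModel):
    order_id: int
    country: str = Field(min_length=2, max_length=2)  # ISO country code
    amount: float = Field(gt=0)                       # business rule: positive amounts only

def validate_rows(rows: list[dict]) -> tuple[list[Order], list[dict]]:
    """Split raw rows into validated records and rejects kept for inspection."""
    valid, rejected = [], []
    for row in rows:
        try:
            valid.append(Order(**row))
        except ValidationError as exc:
            rejected.append({"row": row, "errors": exc.errors()})
    return valid, rejected

good, bad = validate_rows(
    [
        {"order_id": 1, "country": "DE", "amount": 19.9},
        {"order_id": 2, "country": "DE", "amount": -5.0},  # violates the amount rule
    ]
)
print(len(good), "valid,", len(bad), "rejected")
```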
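
A minimal DAG sketch for the Airflow takeaways, assuming a recent Airflow 2.x; it shows the conventional dags/ folder placement and Jinja templating so one definition is reused for every execution date. The DAG id, schedule, and task logic are assumptions:

```python
# dags/daily_ingest.py -- lives in the standard Airflow dags/ folder
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def load_partition(ds: str) -> None:
    """Placeholder load step; 'ds' is Airflow's execution date (YYYY-MM-DD)."""
    print(f"loading partition for {ds}")


with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Jinja templating: {{ ds }} is rendered per run, so the same task
    # definition works for every execution date.
    extract = BashOperator(
        task_id="extract",
        bash_command="echo extracting data for {{ ds }}",
    )

    load = PythonOperator(
        task_id="load",
        python_callable=load_partition,
        op_kwargs={"ds": "{{ ds }}"},
    )

    extract >> load
```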
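
A minimal PySpark sketch for the pandas interop point: a local session with no special Java tuning, a pandas DataFrame handed to Spark (PySpark takes care of the Python-to-JVM serialization), an aggregation, and the small result pulled back to pandas. Data and column names are assumptions:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local session; no cluster or heavy Java configuration for small experiments.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# Start from an illustrative pandas DataFrame and hand it to Spark.
pdf = pd.DataFrame({"country": ["DE", "DE", "FR"], "amount": [9.99, 14.50, 3.20]})
sdf = spark.createDataFrame(pdf)

# The aggregation runs on the JVM side and scales out on a cluster unchanged.
totals = sdf.groupBy("country").agg(F.sum("amount").alias("total_amount"))
totals.show()

# Small results can come back to pandas for plotting or serving.
result_pdf = totals.toPandas()
print(result_pdf)
spark.stop()
```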
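
A minimal partitioning sketch for the data-scale takeaways, writing Parquet partitioned by date so readers only scan the directories they need; the path, columns, and data are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("partition-demo").getOrCreate()

# Illustrative events table; at real scale this would come from the ingest step.
events = spark.createDataFrame(
    [("2024-04-22", "DE", 9.99), ("2024-04-22", "FR", 3.20), ("2024-04-23", "DE", 14.50)],
    ["event_date", "country", "amount"],
)

# One directory per event_date: readers filtering on event_date touch only
# the partitions they need, which keeps cost and latency in check.
events.write.mode("overwrite").partitionBy("event_date").parquet("warehouse/events")

# Partition pruning: this read skips the 2024-04-23 directory entirely.
daily = spark.read.parquet("warehouse/events").where("event_date = '2024-04-22'")
print(daily.count())
spark.stop()
```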
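
A minimal monitoring sketch, assuming a Prometheus Pushgateway is part of the existing monitoring stack; the metric names, the 5% rejection threshold, and the gateway address are assumptions, not details from the talk:

```python
import logging

from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

log = logging.getLogger("pipeline.monitoring")

def report_run(rows_total: int, rows_rejected: int) -> None:
    """Push basic pipeline-health metrics after a run and flag quality regressions."""
    registry = CollectorRegistry()
    Counter("pipeline_rows_total", "Rows processed", registry=registry).inc(rows_total)
    Counter("pipeline_rows_rejected", "Rows failing validation", registry=registry).inc(rows_rejected)
    Gauge("pipeline_last_run_unixtime", "Last completed run", registry=registry).set_to_current_time()

    # Hypothetical Pushgateway address; replace with the one in your stack.
    push_to_gateway("pushgateway.internal:9091", job="daily_ingest", registry=registry)

    # Alert on data-quality regressions, not only on hard failures.
    if rows_total and rows_rejected / rows_total > 0.05:
        log.error("validation rejects above threshold: %d/%d", rows_rejected, rows_total)
        raise RuntimeError("too many validation errors; failing the run so the orchestrator alerts")
```

Failing the run on a quality threshold lets the orchestrator's built-in alerting (for example Airflow's failure notifications) handle escalation instead of building a separate channel.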