The pragmatic Pythonic data engineer [PyCon DE & PyData Berlin 2024]

Learn pragmatic Python data engineering practices covering data validation, Apache Airflow orchestration, PySpark processing, and monitoring in production environments.

Key takeaways
  • Data engineering requires a pragmatic approach focused on solving real problems rather than chasing trendy tools or silver bullets

  • Key components of a data product architecture:

    • Data sources and ingestion
    • Common data format for transfer/storage (like Parquet/Arrow); see the Parquet sketch after this list
    • Data validation and testing
    • Orchestration and monitoring
    • Serving layer for consumption
  • Data validation is critical:

    • Use tools like Great Expectations for data quality checks
    • Use Pydantic for business validation rules (sketched below)
    • Test data as rigorously as code
    • Implement validation in production pipelines
  • Apache Airflow best practices:

    • Organize DAGs in a standard folder structure
    • Use templating for reusability (see the DAG example below)
    • Implement proper testing of DAGs and operators
    • Leverage built-in monitoring and alerting
    • Create custom operators when needed
  • PySpark considerations:

    • Handles Python-to-JVM serialization automatically
    • Good for large-scale data processing
    • Works well with pandas DataFrames (see the PySpark example below)
    • Supports both batch and streaming
    • Native Python integration without heavy Java configuration
  • Data scale and architecture choices:

    • Consider data volume when choosing tools
    • Use partitioning for large datasets (see the partitioning example below)
    • Balance cost vs performance
    • Traditional methods still valid for smaller scales
    • Think about future scaling needs
  • Observability and monitoring:

    • Track pipeline health
    • Monitor data quality
    • Set up alerts for validation errors (see the monitoring sketch below)
    • Use metrics for optimization
    • Integrate with existing monitoring tools
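
Code sketches

A common data format such as Parquet decouples producers from consumers. A minimal sketch, assuming a small pandas DataFrame stands in for ingested data (the file name and columns are illustrative):

```python
import pandas as pd

# Illustrative in-memory data standing in for an ingested source.
events = pd.DataFrame(
    {
        "event_id": [1, 2, 3],
        "country": ["DE", "DE", "FR"],
        "amount": [9.99, 14.50, 3.20],
    }
)

# Parquet keeps schema and types and is readable from pandas, PySpark,
# DuckDB, and most warehouses, which makes it a good interchange format.
events.to_parquet("events.parquet", engine="pyarrow", index=False)

# Downstream consumers read the same file without re-parsing CSVs.
restored = pd.read_parquet("events.parquet", engine="pyarrow")
print(restored.dtypes)
```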
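
For the "test data as rigorously as code" point, a minimal Pydantic sketch that splits raw rows into validated records and rejects; the Order model, its fields, and the positive-amount rule are assumptions, not rules from the talk:

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical record schema; field names and constraints are illustrative.
class Order(BaseModel):
    order_id: int
    country: str = Field(min_length=2, max_length=2)  # ISO country code
    amount: float = Field(gt=0)                       # business rule: positive amounts only

def validate_rows(rows: list[dict]) -> tuple[list[Order], list[dict]]:
    """Split raw rows into validated records and rejects kept for inspection."""
    valid, rejected = [], []
    for row in rows:
        try:
            valid.append(Order(**row))
        except ValidationError as exc:
            rejected.append({"row": row, "errors": exc.errors()})
    return valid, rejected

good, bad = validate_rows(
    [
        {"order_id": 1, "country": "DE", "amount": 19.9},
        {"order_id": 2, "country": "DE", "amount": -5.0},  # violates the amount rule
    ]
)
print(len(good), "valid,", len(bad), "rejected")
```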
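
A minimal DAG sketch for the Airflow takeaways, assuming a recent Airflow 2.x; it shows the conventional dags/ folder placement and Jinja templating so one definition is reused for every execution date. The DAG id, schedule, and task logic are assumptions:

```python
# dags/daily_ingest.py -- lives in the standard Airflow dags/ folder
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def load_partition(ds: str) -> None:
    """Placeholder load step; 'ds' is Airflow's execution date (YYYY-MM-DD)."""
    print(f"loading partition for {ds}")


with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Jinja templating: {{ ds }} is rendered per run, so the same task
    # definition works for every execution date.
    extract = BashOperator(
        task_id="extract",
        bash_command="echo extracting data for {{ ds }}",
    )

    load = PythonOperator(
        task_id="load",
        python_callable=load_partition,
        op_kwargs={"ds": "{{ ds }}"},
    )

    extract >> load
```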
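
A minimal PySpark sketch for the pandas interop point: a local session with no special Java tuning, a pandas DataFrame handed to Spark (PySpark takes care of the Python-to-JVM serialization), an aggregation, and the small result pulled back to pandas. Data and column names are assumptions:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local session; no cluster or heavy Java configuration for small experiments.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# Start from an illustrative pandas DataFrame and hand it to Spark.
pdf = pd.DataFrame({"country": ["DE", "DE", "FR"], "amount": [9.99, 14.50, 3.20]})
sdf = spark.createDataFrame(pdf)

# The aggregation runs on the JVM side and scales out on a cluster unchanged.
totals = sdf.groupBy("country").agg(F.sum("amount").alias("total_amount"))
totals.show()

# Small results can come back to pandas for plotting or serving.
result_pdf = totals.toPandas()
print(result_pdf)
spark.stop()
```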
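
A minimal partitioning sketch for the data-scale takeaways, writing Parquet partitioned by date so readers only scan the directories they need; the path, columns, and data are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("partition-demo").getOrCreate()

# Illustrative events table; at real scale this would come from the ingest step.
events = spark.createDataFrame(
    [("2024-04-22", "DE", 9.99), ("2024-04-22", "FR", 3.20), ("2024-04-23", "DE", 14.50)],
    ["event_date", "country", "amount"],
)

# One directory per event_date: readers filtering on event_date touch only
# the partitions they need, which keeps cost and latency in check.
events.write.mode("overwrite").partitionBy("event_date").parquet("warehouse/events")

# Partition pruning: this read skips the 2024-04-23 directory entirely.
daily = spark.read.parquet("warehouse/events").where("event_date = '2024-04-22'")
print(daily.count())
spark.stop()
```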
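
A minimal monitoring sketch, assuming a Prometheus Pushgateway is part of the existing monitoring stack; the metric names, the 5% rejection threshold, and the gateway address are assumptions, not details from the talk:

```python
import logging

from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

log = logging.getLogger("pipeline.monitoring")

def report_run(rows_total: int, rows_rejected: int) -> None:
    """Push basic pipeline-health metrics after a run and flag quality regressions."""
    registry = CollectorRegistry()
    Counter("pipeline_rows_total", "Rows processed", registry=registry).inc(rows_total)
    Counter("pipeline_rows_rejected", "Rows failing validation", registry=registry).inc(rows_rejected)
    Gauge("pipeline_last_run_unixtime", "Last completed run", registry=registry).set_to_current_time()

    # Hypothetical Pushgateway address; replace with the one in your stack.
    push_to_gateway("pushgateway.internal:9091", job="daily_ingest", registry=registry)

    # Alert on data-quality regressions, not only on hard failures.
    if rows_total and rows_rejected / rows_total > 0.05:
        log.error("validation rejects above threshold: %d/%d", rows_rejected, rows_total)
        raise RuntimeError("too many validation errors; failing the run so the orchestrator alerts")
```

Failing the run on a quality threshold lets the orchestrator's built-in alerting (for example Airflow's failure notifications) handle escalation instead of building a separate channel.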