Jayce @ BETA - Big Data Engineering With Python and AWS | PyData Vermont 2024

Learn how BETA Technologies processes vast amounts of aircraft data using Python and AWS. Explore their data lake architecture, processing patterns, and tools for handling 11GB/hour of flight data.

Key takeaways
  • BETA Technologies builds electric aircraft prototypes with extensive data collection capabilities - ~11 GB/hour of video and 570 million telemetry points/hour

  • Core data stack uses AWS services:

    • S3 for data lake storage
    • DynamoDB for metadata
    • Redshift for data warehouse
    • Fargate for container orchestration
    • Managed Airflow for workflow orchestration
  • Data architecture follows medallion pattern:

    • Bronze: raw ingested data
    • Silver: processed queryable data
    • Gold: transformed data ready for BI/dashboards
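As a minimal sketch of how a medallion layout can map onto S3 keys, the snippet below partitions objects by layer, source, and date. The bucket layout, layer names as prefixes, and source names are illustrative assumptions, not BETA's actual key scheme:

```python
from datetime import datetime, timezone

# Hypothetical medallion-style S3 key layout; prefixes and
# source names are illustrative, not BETA's actual scheme.
LAYERS = ("bronze", "silver", "gold")

def lake_key(layer: str, source: str, ts: datetime, filename: str) -> str:
    """Build an S3 object key partitioned by layer, source, and date."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}/{source}/{ts:%Y/%m/%d}/{filename}"

ts = datetime(2024, 7, 1, tzinfo=timezone.utc)
print(lake_key("bronze", "telemetry", ts, "flight_0042.parquet"))
# bronze/telemetry/2024/07/01/flight_0042.parquet
```

Date-based partitioning like this keeps raw (bronze) and processed (silver/gold) copies of the same flight addressable side by side, so a reprocessing job only needs to rewrite keys under one prefix.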
  • Uses micro-batch processing instead of Spark for time series data, avoiding the complexity of distributed computing
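The micro-batch idea above can be sketched as splitting a flight's time range into fixed windows and processing each window independently on a single worker. The 15-minute window size and the generator shape are assumptions for illustration:

```python
from datetime import datetime, timedelta

def micro_batches(start: datetime, end: datetime, window: timedelta):
    """Yield (batch_start, batch_end) pairs covering [start, end)."""
    cursor = start
    while cursor < end:
        yield cursor, min(cursor + window, end)
        cursor += window

# One hour of telemetry split into 15-minute micro-batches.
start = datetime(2024, 7, 1, 0, 0)
end = datetime(2024, 7, 1, 1, 0)
batches = list(micro_batches(start, end, timedelta(minutes=15)))
print(len(batches))  # 4
```

Because each window is self-contained, batches can run sequentially on Fargate tasks or be fanned out by Airflow without any Spark-style shuffle or cluster coordination.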

  • All data is stored in “tall” format (timestamp, field, value) for flexibility and scalability
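A tall layout means every sensor reading becomes its own (timestamp, field, value) row, so adding a new sensor adds rows rather than columns. A minimal sketch of the wide-to-tall conversion, with made-up field names:

```python
# Convert one "wide" telemetry record into "tall" rows of
# (timestamp, field, value). Field names here are illustrative.
def to_tall(record: dict) -> list[tuple]:
    ts = record["timestamp"]
    return [(ts, field, value)
            for field, value in record.items()
            if field != "timestamp"]

row = {"timestamp": "2024-07-01T12:00:00Z",
       "altitude_ft": 1200.0,
       "battery_pct": 87.5}
rows = to_tall(row)
# [('2024-07-01T12:00:00Z', 'altitude_ft', 1200.0),
#  ('2024-07-01T12:00:00Z', 'battery_pct', 87.5)]
```

The same transformation is what `pandas.melt` does at scale; the payoff is that downstream queries and schemas never change when a new field starts appearing in the stream.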

  • Grafana serves as primary BI tool, chosen for:

    • Time series optimization
    • Technical user base compatibility
    • Interactive visualizations
    • Video sync capabilities
  • Custom Python tooling handles data decoding and processing from multiple sources:

    • Aircraft sensors
    • Video feeds
    • Simulation data
    • Test environment data
  • Focus on enabling engineers rather than replacing them - ML models augment human expertise

  • Total data volume around 150 terabytes from two aircraft prototypes

  • System designed for extensibility with minimal downstream impact when adding new sensors or data sources