Maico Timmerman - From Data Pipelines to a Data Platform: Embracing Monorepo Architecture

Learn how monorepo architecture can transform data pipelines into a scalable data platform, with insights on tooling, automation, team autonomy, and managing large-scale systems.

Key takeaways
  • Monorepo architecture helps improve cross-team collaboration and dependency management by keeping all code in one repository, which streamlines making and testing changes

  • Invest early in tooling and automation so the monorepo can handle scale:

    • Test automation
    • Dependency management
    • Build and deployment pipelines
    • Access controls
    • Code review workflows
  • Two key rules for monorepo organization:

    • Applications should be independent leaf nodes
    • No circular dependencies between packages/libraries
  • Scale metrics of the monorepo presented in the talk:

    • 850 pipelines
    • 2,500 materialized tables
    • 500 nodes with 25,000 CPUs
    • 240 TB of memory
    • 23 PB of storage
  • Benefits of monorepo for data platforms:

    • Easier data validation across teams
    • Simplified dependency management
    • Better visibility into data lineage
    • Faster rollout of changes
    • Improved testing capabilities
  • Challenges to address:

    • CI/CD complexity
    • Test suite management at scale
    • Access controls for sensitive data
    • Preventing dependency hell
    • Maintaining high speed of execution
  • Teams maintain autonomy within their own applications while benefiting from shared libraries and tooling

  • Establish conventions and golden paths for teams to follow, while allowing experimentation within those bounds

  • Security and compliance become easier to manage with centralized access controls and visibility

  • Migration and large-scale changes become more manageable through automated tooling and testing
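
The two organization rules in the takeaways (applications are independent leaf nodes, and packages have no circular dependencies) are the kind of convention that can be checked automatically in CI. The following is a minimal sketch of such a check, not the tooling used in the talk. It assumes a hypothetical layout in which applications live under `apps/` and shared libraries under `libs/`, and that the dependency map has already been extracted from the repository; the names `DEPENDENCIES`, `is_application`, and `check_layout` are illustrative only.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each key depends on the names in its set.
# In a real monorepo this would be derived from build or package metadata.
DEPENDENCIES = {
    "apps/orders_pipeline": {"libs/io", "libs/validation"},
    "apps/billing_pipeline": {"libs/io"},
    "libs/validation": {"libs/io"},
    "libs/io": set(),
}


def is_application(name: str) -> bool:
    """Assumed convention: anything under apps/ is an application."""
    return name.startswith("apps/")


def check_layout(deps: dict[str, set[str]]) -> list[str]:
    """Return a list of human-readable rule violations (empty if none)."""
    violations = []

    # Rule 1: applications are leaf nodes, so nothing may depend on one.
    for name, requires in sorted(deps.items()):
        for dep in sorted(requires):
            if is_application(dep):
                violations.append(f"{name} depends on application {dep}")

    # Rule 2: no circular dependencies. The standard-library
    # TopologicalSorter raises CycleError from prepare() when it finds one.
    try:
        TopologicalSorter(deps).prepare()
    except CycleError as exc:
        cycle = " -> ".join(exc.args[1])
        violations.append(f"circular dependency: {cycle}")

    return violations


if __name__ == "__main__":
    problems = check_layout(DEPENDENCIES)
    for problem in problems:
        print(problem)
    raise SystemExit(1 if problems else 0)
```

Run as a CI step, a script along these lines would fail the build when a change makes one application depend on another or introduces a dependency cycle, which keeps each team's application independently changeable while shared libraries remain the only common surface.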