Maico Timmerman - From Data Pipelines to a Data Platform: Embracing Monorepo Architecture

Testing, Automation, DevOps

Learn how monorepo architecture can transform data pipelines into a scalable data platform, with insights on tooling, automation, team autonomy, and managing large-scale systems.

Key takeaways

Monorepo architecture improves cross-team collaboration and dependency management by keeping all code in one repository, which streamlines making and testing changes.
It is important to invest early in tooling and automation so the monorepo can handle scale:
- Test automation
- Dependency management
- Build and deployment pipelines
- Access controls
- Code review workflows
Two key rules for monorepo organization:
- Applications should be independent leaf nodes
- No circular dependencies between packages/libraries
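The two rules above can be checked mechanically against a dependency graph. The following is a minimal sketch, not the tooling from the talk: the graph format (a dict mapping each node to the nodes it depends on), the function names, and the `apps/...` and `libs/...` paths are all hypothetical. It also assumes that "independent leaf node" means nothing else in the repository depends on an application.

```python
from typing import Dict, List, Optional, Set, Tuple


def find_leaf_violations(
    deps: Dict[str, List[str]], apps: Set[str]
) -> List[Tuple[str, str]]:
    """Return (dependent, application) pairs where something depends on an application."""
    return sorted(
        (node, dep)
        for node, node_deps in deps.items()
        for dep in node_deps
        if dep in apps
    )


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in sorted(deps):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical repository layout used only for illustration.
deps = {
    "apps/ingest": ["libs/io", "libs/schemas"],
    "apps/report": ["libs/schemas"],
    "libs/io": ["libs/schemas"],
    "libs/schemas": [],
}
apps = {"apps/ingest", "apps/report"}

leaf_violations = find_leaf_violations(deps, apps)
cycle = find_cycle(deps)
print("leaf violations:", leaf_violations)
print("cycle:", cycle)
```

A check like this could run in CI on every change, failing the build when either function reports a problem. The recursive traversal is fine for a sketch; a very deep graph would call for an iterative version.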
Scale of the monorepo-based platform presented in the talk:
- 850 pipelines
- 2500 materialized tables
- 500 nodes with 25,000 CPUs
- 240 TB of memory
- 23 PB of storage
Benefits of monorepo for data platforms:
- Easier data validation across teams
- Simplified dependency management
- Better visibility into data lineage
- Faster rollout of changes
- Improved testing capabilities
Challenges to address:
- CI/CD complexity
- Test suite management at scale
- Access controls for sensitive data
- Preventing dependency hell
- Keeping execution fast
Teams maintain autonomy within their own applications while benefiting from shared libraries and tooling.
It is critical to establish conventions and golden paths for teams to follow, while still allowing experimentation within bounds.
Security and compliance become easier to manage with centralized access controls and visibility.
Migrations and large-scale changes become more manageable through automated tooling and testing.
