Martin Trautmann - pydiverse pipedag: A library for data pipeline orchestration

Python Testing Automation

Learn about pydiverse pipedag, a data pipeline library enabling fast iteration, auto cache invalidation & schema swapping. See how it bridges data science & software engineering.

Key takeaways

pydiverse pipedag is designed for data pipeline orchestration, with a focus on high iteration speed and automated cache invalidation.
Key recommendations for effective pipeline development:
- Use multiple pipeline instances rather than developing on one large pipeline
- Use the full software engineering toolset (debuggers, CI/CD, testing)
- Keep initial pipeline runs fast (seconds) by using minimal input data
- Enable gradual codebase improvement without big-bang migrations
- Combine data scientist and software engineer workflows closely
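The first and third recommendations can be illustrated with a minimal sketch in plain Python. It does not use pipedag's own configuration mechanism; the `PipelineInstance` class, the instance names, and the `load_input` helper are hypothetical names chosen for illustration. The idea is that the same pipeline code runs against several instances that differ only in how much input they read.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PipelineInstance:
    """Hypothetical description of one pipeline instance."""
    name: str
    row_limit: Optional[int]  # None means "read all input rows"


# Hypothetical instances: a tiny one for second-scale runs, a full one for production.
INSTANCES = {
    "mini": PipelineInstance("mini", row_limit=100),
    "midi": PipelineInstance("midi", row_limit=10_000),
    "full": PipelineInstance("full", row_limit=None),
}


def load_input(source: Iterable[int], instance: PipelineInstance) -> List[int]:
    """Read at most `row_limit` rows from the source (all rows if the limit is None)."""
    return list(islice(source, instance.row_limit))


if __name__ == "__main__":
    rows = load_input(range(1_000_000), INSTANCES["mini"])
    print(len(rows))
```

Developing against the smallest instance keeps the feedback loop short, while the larger instances exercise the same code on realistic data volumes.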
The library supports multiple syntax options:
- Raw SQL
- Pandas
- Polars (both eager and lazy)
- Ibis expressions
- SQLAlchemy
Core features:
- Automatic cache invalidation - only changed components need rerunning
- Schema swapping and stage level transactions
- Integration with various orchestration engines (Dask, Prefect)
- Table store abstraction (primarily for relational databases)
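The idea behind schema swapping and stage-level transactions can be sketched with the standard-library `sqlite3` module. This is an analogy under stated assumptions, not pipedag's implementation: SQLite has no schemas, so the sketch swaps a temporary table instead of a schema, and `run_stage` and the `build_*` functions are hypothetical names. The point is that a stage's previous output stays visible until the new output has been built successfully.

```python
import sqlite3


def run_stage(conn, stage, build):
    """Build the stage under a temporary name and swap it in only on success."""
    tmp = f"{stage}__tmp"
    conn.execute("BEGIN")
    try:
        conn.execute(f'DROP TABLE IF EXISTS "{tmp}"')
        build(conn, tmp)  # writes the new output into the temporary table
        conn.execute(f'DROP TABLE IF EXISTS "{stage}"')
        conn.execute(f'ALTER TABLE "{tmp}" RENAME TO "{stage}"')
        conn.execute("COMMIT")
    except Exception:
        # Any failure leaves the previous stage output untouched.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


def build_v1(conn, table):
    conn.execute(f'CREATE TABLE "{table}" (x INTEGER)')
    conn.executemany(f'INSERT INTO "{table}" VALUES (?)', [(1,), (2,)])


def build_failing(conn, table):
    conn.execute(f'CREATE TABLE "{table}" (x INTEGER)')
    raise RuntimeError("task failed halfway through the stage")


if __name__ == "__main__":
    # isolation_level=None: autocommit mode, transactions are managed explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    run_stage(conn, "stage_1", build_v1)
    try:
        run_stage(conn, "stage_1", build_failing)
    except RuntimeError:
        pass
    print(conn.execute('SELECT x FROM "stage_1" ORDER BY x').fetchall())
```

Readers of the stage never see a half-written result: they see either the old output or the complete new one.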
Optimization for iteration speed through:
- Fast debugging with breakpoints
- Easy query exploration and testing
- Minimal boilerplate code
- Support for interactive development
- Automatic detection of dependencies and changes
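Automatic change detection and cache invalidation can be illustrated with a toy cache in plain Python. This is a sketch of the general technique, not pipedag's actual mechanism; `TinyCache` and `run_pipeline` are hypothetical names, and hashing a function's bytecode is an assumption made here to detect code changes. A task is rerun only when its code or its inputs differ from the previous run.

```python
import hashlib


class TinyCache:
    """Toy cache: a task reruns only if its code or its inputs changed."""

    def __init__(self):
        self._store = {}    # task name -> (cache key, result)
        self.executed = []  # names of tasks that actually ran

    @staticmethod
    def _key(fn, inputs):
        code = fn.__code__
        h = hashlib.sha256()
        h.update(code.co_code)                    # bytecode of the task
        h.update(repr(code.co_consts).encode())   # constants used by the task
        h.update(repr(code.co_names).encode())    # global names referenced
        h.update(repr(inputs).encode())           # values passed in
        return h.hexdigest()

    def run(self, name, fn, *inputs):
        key = self._key(fn, inputs)
        cached = self._store.get(name)
        if cached is not None and cached[0] == key:
            return cached[1]  # cache hit: skip execution
        result = fn(*inputs)
        self._store[name] = (key, result)
        self.executed.append(name)
        return result


def load():
    return [1, 2, 3]


def total(rows):
    return sum(rows)


def scale(value):
    return value * 2


def scale_v2(value):  # an edited version of the last task
    return value * 3


def run_pipeline(cache, scale_fn):
    rows = cache.run("load", load)
    summed = cache.run("total", total, rows)
    return cache.run("scale", scale_fn, summed)
```

Running `run_pipeline` twice with the same `scale` executes each task once; swapping in `scale_v2` reruns only the `scale` task, because the upstream tasks and their inputs are unchanged.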
Best suited for:
- Projects dealing with tabular data (hundreds of GB to a few TB)
- Teams of 5-10 people collaborating on pipelines
- Economic data analysis and machine learning workflows
- Environments requiring high iteration speed for model improvement
