Martin Trautmann - pydiverse pipedag: A library for data pipeline orchestration

Learn about pydiverse pipedag, a data pipeline library enabling fast iteration, auto cache invalidation & schema swapping. See how it bridges data science & software engineering.

Key takeaways
  • pydiverse pipedag is designed for data pipeline orchestration, with a focus on high iteration speed and automatic cache invalidation

  • Key recommendations for effective pipeline development:

    • Use multiple pipeline instances rather than developing on one large pipeline
    • Use the full software engineering toolset (debuggers, CI/CD, testing)
    • Keep initial pipeline runs fast (on the order of seconds) by using minimal input data
    • Enable gradual codebase improvement without big-bang migrations
    • Closely integrate data science and software engineering workflows
  • The library supports multiple syntax options:

    • Raw SQL
    • Pandas
    • Polars (both eager and lazy)
    • Ibis expressions
    • SQLAlchemy
  • Core features:

    • Automatic cache invalidation - only changed components need rerunning
    • Schema swapping and stage level transactions
    • Integration with various orchestration backends (Dask, Prefect)
    • Table store abstraction (primarily for relational databases)
  • Optimization for iteration speed through:

    • Fast debugging with breakpoints
    • Easy query exploration and testing
    • Minimal boilerplate code
    • Support for interactive development
    • Automatic detection of dependencies and changes
  • Best suited for:

    • Projects dealing with tabular data (hundreds of GB to a few TB)
    • Teams of 5-10 people collaborating on pipelines
    • Economic data analysis and machine learning workflows
    • Environments requiring high iteration speed for model improvement
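
The automatic cache invalidation listed under core features can be illustrated with a small toy sketch. It uses only the Python standard library and does not use or reproduce the pydiverse pipedag API. The class `CachedPipeline`, the method `run_task`, and the task functions are hypothetical names invented for this illustration, and keying the cache on a declared version plus the input values is an assumption about one simple way to decide whether a task must rerun.

```python
import hashlib
import json


class CachedPipeline:
    """Toy in-memory pipeline: a task reruns only if its version or inputs changed."""

    def __init__(self):
        self._cache = {}    # task name -> (cache key, result)
        self.executed = []  # names of tasks that actually ran, in order

    def run_task(self, name, func, version, *inputs):
        # The cache key combines the declared task version with the input values.
        payload = json.dumps({"version": version, "inputs": inputs}, sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()
        cached = self._cache.get(name)
        if cached is not None and cached[0] == key:
            return cached[1]  # cache hit: skip execution
        result = func(*inputs)
        self._cache[name] = (key, result)
        self.executed.append(name)
        return result


# Hypothetical tasks standing in for real pipeline steps.
def load_rows():
    return [1, 2, 3, 4]


def double(rows):
    return [r * 2 for r in rows]


def total(rows):
    return sum(rows)


def run_flow(pipeline, double_version="1"):
    rows = pipeline.run_task("load_rows", load_rows, "1")
    doubled = pipeline.run_task("double", double, double_version, rows)
    return pipeline.run_task("total", total, "1", doubled)
```

Calling `run_flow` twice on the same `CachedPipeline` executes all three tasks the first time and none the second time. Bumping `double_version` afterwards reruns only `double`; in this toy, `total` is skipped again because its input values did not change.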