Developing Maintainable Data Pipelines With Jupyter and Ploomber | PyData Chicago | September Meetup

Discover how to develop maintainable data pipelines with Jupyter and Ploomber, featuring modularized code, scalable execution, and seamless Git management, ideal for research and industry applications.

Key takeaways
  • A Jupyter-based project can be composed of multiple files rather than a single monolithic notebook.
  • .ipynb files are useful because they can be run at scale, but .py files are recommended because they are easier to manage with Git.
  • Ploomber allows for modularized code, making collaboration easier.
  • The tool can also run .py files as notebooks, saving the executed result as a copy rather than overwriting the input file.
  • Jupyter Notebooks can be used for research and industry deployment.
  • Input files can be Jupyter notebooks, .py files, .R files, or SQL scripts; the tool runs them all.
  • The tool supports incremental builds: tasks whose source code has not changed are skipped on re-runs, which speeds up the data analysis process.
  • The tool can also embed tests and data quality checks in the pipeline (see the hook sketch after this list).
  • The tool provides a GUI-like, declarative interface for creating pipelines, making it easy to compose production-ready data workflows (see the pipeline.yaml sketch after this list).
  • The tool can export pipelines to orchestrators such as Airflow, AWS Batch, and Kubernetes.
  • The tool allows for custom naming of output files.
  • The tool makes collaboration easier by allowing data scientists to work independently and then integrate their work.
  • The tool can automatically detect dependencies between scripts and execute them in the correct order (a script sketch also follows this list).
  • The tool can be used for various data science tasks such as data cleaning, data transformation, and model training.
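
To make the spec-based approach concrete, here is a minimal pipeline.yaml sketch; the script and product file names are placeholders, not from the talk. Each task pairs a source file with its products, and the nb key illustrates the custom naming of the executed-notebook copy:

    # pipeline.yaml -- minimal sketch; file names are placeholders
    tasks:
      # Each .py source runs as a notebook; the executed copy is saved
      # under the custom name given by the "nb" product key.
      - source: scripts/clean.py
        product:
          nb: products/clean.ipynb    # executed copy of the input script
          data: products/clean.csv    # data output consumed downstream

      - source: scripts/train.py
        product:
          nb: products/train.ipynb
          model: products/model.pickle

Running "ploomber build" executes the tasks in order; re-running it skips tasks whose source has not changed, which is the incremental-build behavior noted above.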
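
Dependency detection works through an upstream declaration inside each source file. Below is a sketch of what scripts/train.py from the example above might look like in jupytext's percent format; the column names and model choice are placeholders:

    # scripts/train.py -- sketch of a Ploomber task in percent format
    import pickle

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # %% tags=["parameters"]
    # Declaring 'clean' as upstream tells Ploomber to run scripts/clean.py
    # first; at runtime Ploomber injects a cell that replaces these
    # placeholders with the actual product paths from pipeline.yaml.
    upstream = ['clean']
    product = None

    # %%
    # Load the upstream task's data product (path injected by Ploomber).
    df = pd.read_csv(upstream['clean']['data'])
    X, y = df.drop(columns='target'), df['target']

    # %%
    # Train and persist the model under this task's product path.
    model = LogisticRegression().fit(X, y)
    with open(product['model'], 'wb') as f:
        pickle.dump(model, f)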
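
The embedded data-quality tests mentioned in the takeaways can be attached as task hooks. A sketch assuming an on_finish hook referenced from pipeline.yaml as "on_finish: tests.check_clean_data"; the checks and column names are hypothetical:

    # tests.py -- sketch of a data-quality hook; wired up in pipeline.yaml
    # via "on_finish: tests.check_clean_data" on the clean task.
    import pandas as pd

    def check_clean_data(product):
        # Ploomber calls this right after the task finishes and passes the
        # task's product, so bad data fails fast before downstream tasks run.
        df = pd.read_csv(str(product['data']))
        assert not df.isna().any().any(), 'clean data contains missing values'
        assert df['id'].is_unique, 'duplicate ids in clean data'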