Developing Maintainable Data Pipelines With Jupyter and Ploomber | PyData Chicago | September Meetup

Discover how to develop maintainable data pipelines with Jupyter and Ploomber, featuring modularized code, scalable execution, and seamless Git management, ideal for research and industry applications.

Key takeaways
  • A Jupyter-based project can be composed of multiple files rather than a single monolithic notebook.
  • .ipynb files are useful because they can be run at scale, but .py files are recommended because they are easier to manage with Git.
  • Ploomber allows for modularized code, making collaboration easier.
  • The tool can also run .py files as notebooks, saving the executed result as a copy rather than overwriting the input file.
  • Jupyter Notebooks can be used for research and industry deployment.
  • Input files can be Jupyter notebooks, .py files, .R files, or SQL scripts; the tool runs them all.
  • The tool supports incremental builds: tasks whose source code has not changed are skipped on re-runs, which speeds up the data analysis process.
  • The tool can also embed tests and data quality checks in the pipeline (see the hook sketch after this list).
  • The tool provides a GUI-like, declarative interface for creating pipelines, making it easy to compose production-ready data workflows (see the pipeline.yaml sketch after this list).
  • The tool can export pipelines to orchestrators such as Airflow, AWS Batch, and Kubernetes.
  • The tool allows for custom naming of output files.
  • The tool makes collaboration easier by allowing data scientists to work independently and then integrate their work.
  • The tool can automatically detect dependencies between scripts and execute them in the correct order (a script sketch also follows this list).
  • The tool can be used for various data science tasks such as data cleaning, data transformation, and model training.
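
To make the spec-based approach concrete, here is a minimal pipeline.yaml sketch; the script and product file names are placeholders, not from the talk. Each task pairs a source file with its products, and the nb key illustrates the custom naming of the executed-notebook copy:

    # pipeline.yaml -- minimal sketch; file names are placeholders
    tasks:
      # Each .py source runs as a notebook; the executed copy is saved
      # under the custom name given by the "nb" product key.
      - source: scripts/clean.py
        product:
          nb: products/clean.ipynb    # executed copy of the input script
          data: products/clean.csv    # data output consumed downstream

      - source: scripts/train.py
        product:
          nb: products/train.ipynb
          model: products/model.pickle

Running "ploomber build" executes the tasks in order; re-running it skips tasks whose source has not changed, which is the incremental-build behavior noted above.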
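
Dependency detection works through an upstream declaration inside each source file. Below is a sketch of what scripts/train.py from the example above might look like in jupytext's percent format; the column names and model choice are placeholders:

    # scripts/train.py -- sketch of a Ploomber task in percent format
    import pickle

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # %% tags=["parameters"]
    # Declaring 'clean' as upstream tells Ploomber to run scripts/clean.py
    # first; at runtime Ploomber injects a cell that replaces these
    # placeholders with the actual product paths from pipeline.yaml.
    upstream = ['clean']
    product = None

    # %%
    # Load the upstream task's data product (path injected by Ploomber).
    df = pd.read_csv(upstream['clean']['data'])
    X, y = df.drop(columns='target'), df['target']

    # %%
    # Train and persist the model under this task's product path.
    model = LogisticRegression().fit(X, y)
    with open(product['model'], 'wb') as f:
        pickle.dump(model, f)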
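
The embedded data-quality tests mentioned in the takeaways can be attached as task hooks. A sketch assuming an on_finish hook referenced from pipeline.yaml as "on_finish: tests.check_clean_data"; the checks and column names are hypothetical:

    # tests.py -- sketch of a data-quality hook; wired up in pipeline.yaml
    # via "on_finish: tests.check_clean_data" on the clean task.
    import pandas as pd

    def check_clean_data(product):
        # Ploomber calls this right after the task finishes and passes the
        # task's product, so bad data fails fast before downstream tasks run.
        df = pd.read_csv(str(product['data']))
        assert not df.isna().any().any(), 'clean data contains missing values'
        assert df['id'].is_unique, 'duplicate ids in clean data'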