How to quickly build Data Pipelines for Data Scientists - Geert Jongen | PyData Eindhoven 2021

Learn how to quickly build data pipelines for data scientists, including defining pipelines, data versioning, and easy data processing, with Geert Jongen at PyData Eindhoven 2021.

Key takeaways
  • Define data pipelines: a data pipeline is an explicit, repeatable sequence of steps that ingests and transforms data; defining it up front lets data scientists build and manage their workflows quickly.
  • Importance of data versioning: Data versioning is crucial for data consistency and reproducibility.
  • Delta Lake: Delta Lake adds data versioning (time travel) and schema management on top of a data lake; a minimal versioning sketch follows this list.
  • Easy data processing: use notebooks to process data as proper pipeline steps instead of copying and pasting results by hand.
  • Version control: use a version control system such as Git to track changes to the pipeline and its data.
  • Scheduling: schedule data pipelines so they run automatically at regular intervals; a scheduling sketch follows this list.
  • Data quality: data quality is critical in pipelines that feed machine learning models; a simple quality-check sketch also follows this list.
  • Training models: with versioned data, new models can be trained on earlier snapshots of the data, which makes performance comparisons and retraining straightforward.
  • Reproducibility: Make data pipelines reproducible to ensure consistent results.
  • Use notebooks for data analysis: notebooks are suited to analyzing and processing data, not just visualizing it.
  • Data engineering: Data engineering is a critical part of data pipelines, ensuring data consistency and reproducibility.
  • Meet the speaker: Geert Jongen, a data engineer and data scientist, shares his expertise on building data pipelines.
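
To make the Delta Lake takeaway concrete, here is a minimal sketch of versioned writes and time travel using the `deltalake` (delta-rs) Python package with pandas; the table path, column names, and the choice of package are illustrative assumptions, not taken from the talk.

```python
# Minimal data-versioning sketch with Delta Lake via the `deltalake` package.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_path = "data/sales_delta"  # hypothetical local table location

# Version 0: initial write of the table.
write_deltalake(table_path, pd.DataFrame({"day": [1, 2], "amount": [10.0, 12.5]}))

# Version 1: append new rows; Delta Lake records this as a new table version.
write_deltalake(
    table_path,
    pd.DataFrame({"day": [3], "amount": [9.0]}),
    mode="append",
)

# Read the current version and an older snapshot ("time travel").
current = DeltaTable(table_path).to_pandas()
snapshot_v0 = DeltaTable(table_path, version=0).to_pandas()
print(len(current), len(snapshot_v0))  # 3 rows now, 2 rows in version 0
```

Reading an older version this way is what makes retraining a model on exactly the data it originally saw reproducible.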
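For the scheduling takeaway, a small sketch of running a pipeline function at a fixed time each day, here using the third-party `schedule` package; the talk leaves the scheduler choice open (cron or a workflow orchestrator would work just as well), and the pipeline body below is a placeholder.

```python
# Minimal scheduling sketch: run the pipeline automatically every day at 02:00.
import time
import schedule

def run_pipeline():
    # Placeholder for the real steps (ingest, clean, write the Delta table).
    print("pipeline run finished")

schedule.every().day.at("02:00").do(run_pipeline)

while True:
    schedule.run_pending()
    time.sleep(60)  # check once a minute whether a run is due
```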
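And for the data-quality takeaway, a small sketch of fail-fast checks on an incoming batch with pandas; the column names and rules are hypothetical examples, not from the talk.

```python
# Minimal data-quality sketch: reject a batch before it reaches model training.
import pandas as pd

def check_quality(df: pd.DataFrame) -> None:
    """Fail fast if the batch would poison downstream models."""
    assert not df.empty, "batch is empty"
    assert df["amount"].notna().all(), "missing values in 'amount'"
    assert (df["amount"] >= 0).all(), "negative amounts found"

check_quality(pd.DataFrame({"day": [1, 2], "amount": [10.0, 12.5]}))
```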