How to quickly build Data Pipelines for Data Scientists - Geert Jongen | PyData Eindhoven 2021
Learn how to quickly build data pipelines for data scientists, including defining pipelines, data versioning, and easy data processing, with Geert Jongen at PyData Eindhoven 2021.
- Define data pipelines: Start by defining the pipeline explicitly (inputs, transformations, outputs) so data scientists can build and manage their data workflows quickly.
- Importance of data versioning: Data versioning is crucial for data consistency and reproducibility.
- Delta Lake: Delta Lake provides data versioning and schema management (a minimal time-travel sketch follows this list).
- Easy data processing: Use notebooks to process data programmatically instead of copying and pasting results by hand (a parameterised-notebook sketch follows this list).
- Version control: Implement version control systems like Git to manage data changes.
- Scheduling: Schedule data pipelines to run automatically at regular intervals (a scheduling sketch follows this list).
- Data quality: Data quality is a critical aspect of data pipelines, especially for machine learning models (a basic check sketch follows this list).
- Training models: Versioned data makes it possible to retrain new models on older snapshots of the data and compare their performance.
- Reproducibility: Make data pipelines reproducible to ensure consistent results.
- Use notebooks for data analysis: Use notebooks to analyze and process data, not just for visualization.
- Data engineering: Data engineering is a critical part of data pipelines, ensuring data consistency and reproducibility.
- Meet the author: Geert Jongen, a data engineer and scientist, shares his expertise on building data pipelines.
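For the Delta Lake point above, here is a minimal sketch of data versioning with Delta Lake on PySpark. It is not taken from the talk; the table path, sample data, and Spark configuration are assumptions, but it shows the "time travel" idea of reading a table as it existed at an earlier version.

```python
# Minimal Delta Lake versioning sketch (assumes the delta-spark package is installed).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-versioning-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "/tmp/customers_delta"  # hypothetical path

# Write an initial batch; Delta records this as table version 0.
batch_1 = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
batch_1.write.format("delta").mode("overwrite").save(table_path)

# Append a second batch, creating table version 1.
batch_2 = spark.createDataFrame([(3, "carol")], ["id", "name"])
batch_2.write.format("delta").mode("append").save(table_path)

# "Time travel": read the table exactly as it was at version 0,
# e.g. to reproduce the dataset an older model was trained on.
old_df = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
old_df.show()
```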
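For the notebook-processing point, a small sketch using papermill (an assumption; the talk only argues for running notebooks programmatically rather than copy-pasting cells). The notebook paths and parameters are hypothetical.

```python
# Run a processing notebook end to end with explicit parameters,
# keeping the executed copy as a record of the run.
import papermill as pm

pm.execute_notebook(
    "process_data.ipynb",                  # input notebook with a "parameters" cell
    "output/process_data_2021-11.ipynb",   # executed copy, kept for inspection
    parameters={"input_path": "data/raw/2021-11.csv", "sample_frac": 1.0},
)
```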
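For the scheduling point, a minimal sketch assuming Apache Airflow as the scheduler; any tool that triggers the pipeline at fixed intervals would serve the same purpose. The DAG and task names are hypothetical.

```python
# Run a pipeline step automatically once a day.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_pipeline():
    # Placeholder for the actual pipeline step (load, transform, write).
    print("running the daily data pipeline")


with DAG(
    dag_id="daily_data_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    PythonOperator(task_id="run_pipeline", python_callable=run_pipeline)
```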
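For the data-quality point, a simple sketch of batch checks with pandas before data is handed to model training; the column names, file path, and rules are hypothetical placeholders.

```python
# Basic data-quality checks on a processed batch.
import pandas as pd


def check_quality(df: pd.DataFrame) -> None:
    """Raise if the batch violates basic expectations."""
    assert not df.empty, "batch is empty"
    assert df["customer_id"].notna().all(), "missing customer_id values"
    assert df["customer_id"].is_unique, "duplicate customer_id values"
    assert (df["amount"] >= 0).all(), "negative amounts found"


df = pd.read_csv("data/processed/2021-11.csv")  # hypothetical processed batch
check_quality(df)
```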