Talks - Juliana Ferreira Alves: Improve Your ML Projects: Embrace Reproducibility and Production...

Learn how to improve machine learning projects with Kedro, a Python framework for building reproducible data pipelines that bridge data science and production.

Key takeaways
  • Kedro is a Python framework for creating scalable data science pipelines that emphasizes reproducibility and production readiness

  • Key features of Kedro:

    • Project template with standardized directory structure
    • Data catalog for managing data sources and connections
    • Pipeline organization with nodes (processing steps)
    • Automatic experiment tracking and metrics logging
    • Visualization tools like Kedro-Viz for pipeline inspection
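The node-and-catalog idea above can be sketched in plain Python. This is not the Kedro API itself, just an illustration of the concept: each node is a pure function with named inputs and outputs, and a runner resolves those names against a catalog (all function and dataset names here are made up):

```python
# Sketch of Kedro's node/pipeline/catalog concept (illustrative, not the real API).
def make_node(func, inputs, outputs):
    # A node bundles a function with the catalog names it reads and writes.
    return {"func": func, "inputs": inputs, "outputs": outputs}

def run_pipeline(nodes, catalog):
    # Run nodes in order, resolving inputs from and writing outputs to the catalog.
    for n in nodes:
        args = [catalog[name] for name in n["inputs"]]
        catalog[n["outputs"]] = n["func"](*args)
    return catalog

def clean(raw):
    return [x for x in raw if x is not None]

def summarize(rows):
    return {"count": len(rows), "total": sum(rows)}

catalog = {"raw_data": [1, None, 2, 3]}
pipeline = [
    make_node(clean, ["raw_data"], "clean_data"),
    make_node(summarize, ["clean_data"], "summary"),
]
run_pipeline(pipeline, catalog)
print(catalog["summary"])  # {'count': 3, 'total': 6}
```

Because nodes only talk to the catalog by name, the same pipeline definition can run against different data sources, which is what makes the structure reproducible and shareable.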
  • Helps bridge the gap between data scientists and ML engineers:

    • Makes code more production-ready
    • Improves communication between team members
    • Standardizes project structure
    • Makes projects more reproducible and shareable
  • Supports multiple data sources and environments:

    • Local files
    • Cloud storage (Amazon S3, Google Cloud Storage, Azure)
    • Hadoop filesystems
    • HTTP endpoints
    • Multiple runtime environments (dev, prod)
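The environment support above works by layering configuration: a base catalog plus per-environment overrides, so the same dataset names resolve to local files in dev and cloud storage in prod. A minimal sketch of that layering in plain Python (entry names, paths, and bucket names are illustrative; real Kedro reads these from YAML in `conf/base` and `conf/<env>`):

```python
# Sketch of Kedro-style layered catalog configuration (illustrative names/paths).
BASE_CATALOG = {
    "companies": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/companies.csv"},
    "model_input": {"type": "pandas.ParquetDataset", "filepath": "data/05_model_input/table.parquet"},
}

ENV_OVERRIDES = {
    # In prod, the same dataset names point at cloud storage instead of local files.
    "prod": {
        "companies": {"filepath": "s3://my-bucket/raw/companies.csv"},
        "model_input": {"filepath": "s3://my-bucket/model_input/table.parquet"},
    },
    "dev": {},  # dev uses the base (local) paths unchanged
}

def resolve_catalog(env):
    # Merge base entries with any overrides for the chosen environment.
    catalog = {}
    for name, entry in BASE_CATALOG.items():
        merged = dict(entry)
        merged.update(ENV_OVERRIDES.get(env, {}).get(name, {}))
        catalog[name] = merged
    return catalog

print(resolve_catalog("prod")["companies"]["filepath"])  # s3://my-bucket/raw/companies.csv
```

Pipeline code never sees the paths, only the dataset names, so switching from dev to prod is a configuration change rather than a code change.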
  • Best practices for using Kedro:

    • Start experimentation in notebooks
    • Move successful experiments to Kedro pipelines
    • Use configuration files for parameters
    • Store metrics and model artifacts systematically
    • Implement version control
    • Create Docker containers for deployment
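Two of the practices above, parameters in configuration files and systematic storage of metrics, can be sketched as follows. This uses JSON and a timestamped directory purely for illustration; Kedro's actual layout uses YAML parameter files and versioned datasets, and every path and key here is made up:

```python
# Sketch of config-driven parameters and versioned metrics storage (illustrative).
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# parameters.json plays the role of a parameters config file:
# tunables live here, not hard-coded in pipeline code.
(workdir / "parameters.json").write_text(
    json.dumps({"test_size": 0.2, "random_state": 42})
)
params = json.loads((workdir / "parameters.json").read_text())

def train(test_size, random_state):
    # Stand-in for a real training step that returns run metrics.
    return {"accuracy": 0.91, "test_size": test_size}

metrics = train(**params)

# Version metrics by run timestamp instead of overwriting a single file,
# so every experiment's results remain reproducible and comparable.
run_id = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S")
metrics_path = workdir / "metrics" / run_id / "metrics.json"
metrics_path.parent.mkdir(parents=True)
metrics_path.write_text(json.dumps(metrics))
```

Keeping parameters in configuration and metrics in versioned artifacts is what lets a notebook experiment be replayed later, by the original author or a teammate, with identical results.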
  • Not meant to replace:

    • Data infrastructure
    • ML ops frameworks
    • Other orchestration tools
    • Experimentation environments