Yonatan Alexander - Best practices and solution design using data versioning for machine learning

Explore best practices and solution design for machine learning with Yonatan Alexander, covering data versioning, Git management, simple models, experiment tracking, and more.

Key takeaways
  • Use data versioning for machine learning: Version your data and models to track changes and easily reproduce experiments.
  • Use Git for data management: Git can be used to manage data and models, making it easier to collaborate and track changes.
  • Keep models simple: Avoid over-engineering models so they stay reproducible and easy to maintain.
  • Upload data to raw storage: Store data in raw storage such as S3 to avoid duplication and versioning issues.
  • Use MLflow for experiment tracking: Record each experiment along with the versions of the models and data it used.
  • Version data and models: Use Git to version data and models so changes are tracked and results stay reproducible.
  • Keep code and data separate: Separating code from data supports reproducibility and eases maintenance.
  • Use data drift alerts: Detect changes in data distribution and adjust models accordingly.
  • Keep models up to date: Retrain models regularly so they remain accurate and effective.
  • Use data versioning for monitoring: Track changes in data distribution and model performance across data versions.
  • Keep experimentation reproducible: Version data, models, and code so every experiment can be rerun.
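
The takeaways on versioning data and keeping experiments reproducible can be illustrated with a minimal sketch. It is not the speaker's implementation and does not use Git or MLflow; it only shows the underlying idea with the Python standard library. The function names (dataset_version, record_experiment), the JSON-lines run log, and the 12-character hash length are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def dataset_version(data_dir: Path) -> str:
    """Return a short content hash over every file under data_dir.

    Hypothetical helper: any change to a file's bytes or relative path
    produces a different version string.
    """
    digest = hashlib.sha256()
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # Hash the relative path so renames change the version too.
        digest.update(path.relative_to(data_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # Reads the whole file into memory; fine for a sketch, not for huge files.
        digest.update(path.read_bytes())
    return digest.hexdigest()[:12]


def record_experiment(run_log: Path, data_dir: Path, code_version: str,
                      params: dict, metrics: dict) -> dict:
    """Append one experiment record that ties data, code, and parameters together."""
    record = {
        "data_version": dataset_version(data_dir),
        "code_version": code_version,  # e.g. a Git commit hash supplied by the caller
        "params": params,
        "metrics": metrics,
    }
    with run_log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Calling record_experiment after each training run leaves a log in which every result points at the exact data and code versions that produced it. A dedicated tracker such as MLflow covers the same need with more features; the sketch only makes the principle concrete.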