Wojtek Kuberski - The ML Monitoring Flow for Models Deployed to Production | PyData Amsterdam 2024

Learn how to effectively monitor ML models in production, detect performance degradation, and implement best practices for model maintenance and selective retraining.

Key takeaways
  • Models deployed to production commonly experience performance degradation over time, with studies showing ~20% average degradation and some models degrading >80%

  • Two main types of model drift:

    • Covariate shift: Changes in input data distribution
    • Concept drift: Changes in relationship between features and target
  • Traditional data drift detection methods have limitations:

    • High false positive rates
    • Cannot reliably indicate actual model performance impact
    • Univariate drift methods miss important multivariate changes (see the first sketch after this list)
  • Key monitoring approaches:

    • Confidence-Based Performance Estimation (CBPE) for classification tasks
    • Direct Loss Estimation (DLE) for regression tasks
    • Model calibration to get reliable probability estimates
    • Estimating performance metrics without access to ground truth labels (CBPE and DLE are sketched after this list)
  • Best practices for production ML monitoring:

    • Don’t rely solely on data drift signals
    • Consider business impact and costs of false positives/negatives
    • Monitor performance across different data segments
    • Set up early warning systems before business impact occurs
    • Account for seasonality in monitoring metrics
  • Model retraining considerations:

    • Retrain selectively based on detected concept drift
    • Retraining may not help if the issue is pure covariate shift
    • Focus retraining on specific data segments showing degradation
    • Validate retraining impact with proper performance metrics
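
The limitation of univariate drift checks called out above can be shown with a small example. The sketch below is an illustration, not code from the talk: it builds two datasets whose per-feature distributions are identical but whose feature correlation flips, so per-feature KS tests report no drift even though the joint distribution has clearly changed.

```python
# Two features whose marginals stay identical while their correlation flips:
# univariate tests see nothing, the multivariate structure has changed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5_000

# Reference data: x1 and x2 positively correlated.
reference = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n)

# Production data: same marginals, but the correlation sign flips.
production = rng.multivariate_normal([0, 0], [[1.0, -0.8], [-0.8, 1.0]], size=n)

for i, name in enumerate(["x1", "x2"]):
    ks_stat, p_value = stats.ks_2samp(reference[:, i], production[:, i])
    print(f"{name}: KS p-value = {p_value:.3f}")  # large p-values -> "no drift"

# Any model relying on the x1/x2 interaction now sees very different inputs,
# yet a per-feature drift monitor would stay silent.
```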
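The intuition behind CBPE can also be sketched in a few lines. Assuming a calibrated binary classifier (the dataset, model, and calibration choices below are illustrative, not from the talk), the expected accuracy on unlabeled production data is the mean of max(p, 1 − p) over the predicted probabilities, which is why calibration is listed as a prerequisite.

```python
# Rough sketch of the CBPE idea for binary classification (not a specific
# library's implementation): calibrated probabilities let us estimate
# accuracy on production data before ground truth labels arrive.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=20_000, n_features=10, random_state=0)
X_train, X_prod, y_train, y_prod = train_test_split(X, y, test_size=0.5, random_state=0)

# Calibrate the classifier so its probabilities can be trusted.
model = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic", cv=5)
model.fit(X_train, y_train)

# "Production": pretend y_prod is unavailable and estimate accuracy
# from the calibrated probabilities alone.
proba = model.predict_proba(X_prod)[:, 1]
estimated_accuracy = np.mean(np.maximum(proba, 1 - proba))

# Compare with the accuracy we would observe once labels arrive.
realized_accuracy = accuracy_score(y_prod, model.predict(X_prod))
print(f"estimated: {estimated_accuracy:.3f}, realized: {realized_accuracy:.3f}")
```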
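For regression, Direct Loss Estimation follows a similar pattern: a second "nanny" model is trained to predict the monitored model's loss from the features and its prediction, and the average predicted loss on unlabeled production data serves as the performance estimate. The sketch below is a simplified illustration under those assumptions, not a specific library's API.

```python
# Simplified Direct Loss Estimation (DLE) sketch for a regression model:
# a nanny model learns the monitored model's absolute error on reference
# data, then estimates MAE on production data without ground truth.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=15_000, n_features=8, noise=10.0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.6, random_state=0)
X_ref, X_prod, y_ref, y_prod = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# The monitored model.
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Fit the nanny model on reference data where ground truth is available:
# it predicts the absolute error from the features plus the prediction.
ref_pred = model.predict(X_ref)
ref_abs_error = np.abs(y_ref - ref_pred)
nanny = GradientBoostingRegressor(random_state=0).fit(
    np.column_stack([X_ref, ref_pred]), ref_abs_error
)

# On production data, estimate MAE without looking at y_prod.
prod_pred = model.predict(X_prod)
estimated_mae = nanny.predict(np.column_stack([X_prod, prod_pred])).mean()
realized_mae = mean_absolute_error(y_prod, prod_pred)  # for comparison only
print(f"estimated MAE: {estimated_mae:.2f}, realized MAE: {realized_mae:.2f}")
```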