Gatha Varma - Production Data to the Model: “Are You Getting My Drift?” | PyData Global 2023

Learn how to detect and handle data drift in ML models through statistical tests, monitoring methods, and best practices for maintaining model performance over time.

Key takeaways
  • Data drift occurs when the probability distribution of a model's input data changes over time relative to the data it was trained on, which can degrade model performance

  • Four main types of data drift:

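    • Gradual drift (the distribution changes slowly over time)
    • Sudden drift (the distribution changes sharply at a point in time)
    • Incremental drift (a new distribution slowly takes over from the old one)
    • Seasonal drift (changes that recur in cyclical patterns)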
  • Common causes of data drift:

    • Changes in data collection/preprocessing
    • External factors (demographics, geography)
    • Business changes
    • Training data becoming obsolete over time
    • Regulatory changes
  • Detection methods:

    • Two-sample statistical tests
    • Distribution comparisons
    • Monitoring feature correlations
    • Tracking performance metrics
    • Checking for data integrity and outliers
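
As a rough illustration of the two-sample testing idea listed above, the sketch below compares a reference sample of one numeric feature against a production sample. It uses the two-sample Kolmogorov-Smirnov statistic with a permutation-based p-value, which is an assumption chosen for illustration and not necessarily the test used in the talk. The function names, sample sizes, and synthetic Gaussian data are all hypothetical.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The maximum gap between two step functions is reached at a sample point.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def permutation_p_value(reference, current, n_permutations=200, seed=0):
    """Share of random relabelings whose KS statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(reference, current)
    pooled = list(reference) + list(current)
    n_ref = len(reference)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n_ref], pooled[n_ref:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_permutations + 1)


# Synthetic data, invented for illustration only.
rng = random.Random(42)
reference = [rng.gauss(0.0, 1.0) for _ in range(300)]  # training-time values
stable = [rng.gauss(0.0, 1.0) for _ in range(300)]     # production, unchanged
shifted = [rng.gauss(1.0, 1.0) for _ in range(300)]    # production, mean moved by 1

p_stable = permutation_p_value(reference, stable)
p_shifted = permutation_p_value(reference, shifted)
print(f"stable:  p = {p_stable:.3f}")
print(f"shifted: p = {p_shifted:.3f}")
```

A small p-value suggests the two samples come from different distributions. With large production samples even tiny, harmless differences can become statistically significant, so a test like this is usually paired with an effect-size threshold.
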
  • Best practices for handling drift:

    • Monitor metadata and feature distributions
    • Create subsegments for detailed analysis
    • Set percentage-based drift thresholds
    • Analyze root causes before retraining
    • Watch for integrity issues and outliers
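
One possible reading of the percentage-based thresholds mentioned above is sketched here: a feature is flagged when its mean moves by more than a set fraction of the reference mean, and the dataset is flagged when more than a set share of features are flagged. Both thresholds, the use of the mean as the drift signal, and the feature names and values are assumptions made for illustration.

```python
from statistics import mean


def relative_mean_change(reference, current):
    """Absolute change in the mean, as a fraction of the reference mean."""
    ref_mean = mean(reference)
    if ref_mean == 0:
        raise ValueError("reference mean is zero; relative change is undefined")
    return abs(mean(current) - ref_mean) / abs(ref_mean)


def drift_report(reference, current, feature_threshold=0.10, dataset_threshold=0.50):
    """Flag features whose mean moved by more than feature_threshold, then flag
    the dataset if more than dataset_threshold of all features are flagged."""
    drifted = [
        name
        for name in reference
        if relative_mean_change(reference[name], current[name]) > feature_threshold
    ]
    share = len(drifted) / len(reference)
    return {
        "drifted_features": drifted,
        "drifted_share": share,
        "dataset_drift": share > dataset_threshold,
    }


# Hypothetical feature values, invented for illustration only.
reference = {"age": [30, 40, 50], "income": [100, 200, 300], "visits": [2, 4, 6]}
current = {"age": [31, 41, 51], "income": [150, 250, 350], "visits": [4, 6, 8]}

report = drift_report(reference, current)
print(report)
```

Here `income` and `visits` exceed the 10% per-feature threshold while `age` does not, so two of three features are flagged and the dataset-level flag is raised. In practice the per-feature signal could be a statistical test result instead of a change in the mean.
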
  • Text data considerations:

    • Less prone to dramatic drift than numerical data
    • Monitor meta-features like sentiment and complexity
    • Watch for changes in language patterns and vocabulary
    • Track embedding drift
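
To make the idea of monitoring text meta-features more concrete, the sketch below computes two simple descriptors per batch: average token count and the share of tokens not seen in the reference vocabulary. These two descriptors are stand-ins chosen because they need no external libraries; the sentiment and complexity features mentioned above would be tracked the same way. The tokenizer, function names, and example texts are hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer with surrounding punctuation stripped."""
    tokens = (word.strip(".,!?;:\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def text_descriptors(texts, known_vocabulary):
    """Average token count per text and share of tokens outside known_vocabulary."""
    token_lists = [tokenize(t) for t in texts]
    all_tokens = [tok for toks in token_lists for tok in toks]
    if not all_tokens:
        raise ValueError("no tokens found in the batch")
    unseen = sum(1 for tok in all_tokens if tok not in known_vocabulary)
    return {
        "avg_tokens": len(all_tokens) / len(token_lists),
        "oov_rate": unseen / len(all_tokens),
    }


# Example texts invented for illustration only.
reference_texts = ["The delivery was fast.", "Great product, fast delivery!"]
current_texts = ["App crashed after the update.", "Update broke the login screen"]

# Vocabulary is built from the reference batch only.
vocabulary = {tok for text in reference_texts for tok in tokenize(text)}

reference_stats = text_descriptors(reference_texts, vocabulary)
current_stats = text_descriptors(current_texts, vocabulary)
print("reference:", reference_stats)
print("current:  ", current_stats)
```

A rising out-of-vocabulary rate is one inexpensive signal that language patterns or topics are changing, and each descriptor can be fed into the same numeric drift checks used for tabular features.
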
  • Recommendations after detecting drift:

    • Investigate upstream processes
    • Check for data quality issues
    • Analyze performance impact
    • Consider model recalibration
    • Retrain only after thorough analysis
  • Model monitoring should include:

    • Regular distribution checks
    • Performance metric tracking
    • Feature correlation analysis
    • Data integrity validation
    • Demographic bias assessment
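
The feature correlation analysis listed above could be sketched as follows: compute the Pearson correlation between a pair of features in the reference window and in the current window, then look at the difference. The feature names and values are invented, and the choice of Pearson correlation is an assumption; the talk summary does not specify a correlation measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / sqrt(var_x * var_y)


def correlation_shift(reference, current, feature_a, feature_b):
    """Change in correlation between two features from reference to current."""
    before = pearson(reference[feature_a], reference[feature_b])
    after = pearson(current[feature_a], current[feature_b])
    return after - before


# Hypothetical windows in which the price/demand relationship flips sign.
reference = {"price": [1, 2, 3, 4, 5], "demand": [10, 8, 6, 4, 2]}
current = {"price": [1, 2, 3, 4, 5], "demand": [2, 4, 6, 8, 10]}

shift = correlation_shift(reference, current, "price", "demand")
print(f"correlation shift: {shift:+.2f}")
```

A large change in correlation can reveal drift in the relationship between features even when each feature's own distribution looks stable.
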