Gatha Varma - Production Data to the Model: “Are You Getting My Drift?” | PyData Global 2023

Gatha Varma

Learn how to detect and handle data drift in ML models through statistical tests, monitoring methods, and best practices for maintaining model performance over time.

Key takeaways
  • Data drift occurs when the probability distribution of a model's input data changes over time, which can degrade model performance

  • Four main types of data drift:

    • Gradual drift (slow changes)
    • Sudden shifts (sharp changes)
    • Incremental drift (new distribution slowly takes over)
    • Seasonal drift (cyclical patterns)
  • Common causes of data drift:

    • Changes in data collection/preprocessing
    • External factors (demographics, geography)
    • Business changes
    • Training data becoming outdated over time
    • Regulatory changes
  • Detection methods:

    • Two-sample statistical tests
    • Distribution comparisons
    • Monitoring feature correlations
    • Tracking performance metrics
    • Checking for data integrity and outliers
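
As an illustration of the two-sample test idea above, here is a minimal sketch of a Kolmogorov-Smirnov check written against the Python standard library only. The function names, the 0.05 significance level, and the simulated data are assumptions for this example, not details from the talk. The critical value uses the usual large-sample approximation, so the result is only a rough guide for small samples.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Return the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # Both empirical CDFs are step functions, so checking every
    # observed value is enough to find the largest gap.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def ks_drift(reference, current, alpha=0.05):
    """Flag drift when the statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(current)
    statistic = ks_statistic(reference, current)
    # Large-sample approximation: c(alpha) * sqrt((n + m) / (n * m)).
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return statistic, critical, statistic > critical


if __name__ == "__main__":
    rng = random.Random(0)
    training = [rng.gauss(0, 1) for _ in range(1000)]
    # Simulated production feature whose mean moved by one standard deviation.
    production = [rng.gauss(1, 1) for _ in range(1000)]
    print(ks_drift(training, production))
```

In practice a library routine such as `scipy.stats.ks_2samp` would normally replace the hand-written statistic; the standard-library version is used here only to keep the sketch self-contained.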
  • Best practices for handling drift:

    • Monitor metadata and feature distributions
    • Create subsegments for detailed analysis
    • Set percentage-based drift thresholds
    • Analyze root causes before retraining
    • Watch for integrity issues and outliers
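
One possible reading of a percentage-based drift threshold is "raise an alert when more than a set share of features has drifted". The sketch below follows that reading, which is an assumption rather than something the summary spells out. It scores each feature with the Population Stability Index (PSI) and reports the share of features above a cut-off. The function names, the bin count, and the 0.2 PSI cut-off are illustrative choices.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index using equal-width bins from the reference data."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def drifted_share(reference_cols, current_cols, psi_cutoff=0.2):
    """Return the share of drifted features and the per-feature flags."""
    flags = {
        name: psi(reference_cols[name], current_cols[name]) > psi_cutoff
        for name in reference_cols
    }
    return sum(flags.values()) / len(flags), flags


if __name__ == "__main__":
    reference = {"age": list(range(100)), "income": list(range(100))}
    current = {"age": list(range(100)), "income": [v + 80 for v in range(100)]}
    share, flags = drifted_share(reference, current)
    # Hypothetical alert rule: act when more than 30% of features drifted.
    print(share, flags, share > 0.3)
```

Running the same function on a subsegment (for example, rows from one region only) is one way to apply the subsegment analysis mentioned above without changing the code.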
  • Text data considerations:

    • Less prone to dramatic drift than numerical data
    • Monitor meta-features like sentiment and complexity
    • Watch for changes in language patterns and vocabulary
    • Track embedding drift
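
To make the meta-feature idea concrete, the following sketch computes a few simple text descriptors for a batch of documents: average length in tokens, average token length as a crude complexity proxy, and the share of tokens not seen in the reference vocabulary. These three descriptors, the tokenizer, and the sample sentences are assumptions chosen for illustration. Sentiment and embedding drift would need a model or library and are left out.

```python
import re
from statistics import mean

TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return TOKEN.findall(text.lower())


def meta_features(texts, known_vocab):
    """Summarise a non-empty batch of texts with a few drift-sensitive descriptors."""
    docs = [tokenize(t) for t in texts]
    tokens = [tok for doc in docs for tok in doc]
    return {
        "avg_tokens_per_text": mean(len(doc) for doc in docs),
        "avg_token_length": mean(len(tok) for tok in tokens),
        # Share of tokens that never appeared in the reference data.
        "oov_share": sum(tok not in known_vocab for tok in tokens) / len(tokens),
    }


if __name__ == "__main__":
    reference = ["the order arrived late", "great service and fast delivery"]
    current = ["omg delivery was mid"]
    vocab = {tok for text in reference for tok in tokenize(text)}
    print(meta_features(reference, vocab))
    print(meta_features(current, vocab))
```

Each descriptor is a plain number per batch, so it can be tracked over time with the same distribution checks used for numerical features.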
  • Recommendations after detecting drift:

    • Investigate upstream processes
    • Check for data quality issues
    • Analyze performance impact
    • Consider model recalibration
    • Retrain only after thorough analysis
  • Model monitoring should include:

    • Regular distribution checks
    • Performance metric tracking
    • Feature correlation analysis
    • Data integrity validation
    • Demographic bias assessment
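
Feature correlation analysis can be sketched as a comparison of pairwise Pearson correlations between a reference window and a current window. The helper names and the 0.3 minimum change below are hypothetical choices for this example. The sketch assumes numeric, non-constant features of equal length.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant feature has zero spread and would divide by zero here.
    return cov / (spread_x * spread_y)


def correlation_shifts(reference, current, min_change=0.3):
    """Return feature pairs whose correlation moved by at least min_change."""
    names = sorted(reference)
    shifts = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            before = pearson(reference[first], reference[second])
            after = pearson(current[first], current[second])
            if abs(after - before) >= min_change:
                shifts[(first, second)] = (before, after)
    return shifts


if __name__ == "__main__":
    reference = {"price": [1, 2, 3, 4], "demand": [8, 6, 4, 2], "stock": [5, 5, 6, 7]}
    current = {"price": [1, 2, 3, 4], "demand": [2, 4, 6, 8], "stock": [5, 5, 6, 7]}
    print(correlation_shifts(reference, current))
```

A pair can change its correlation even when each feature's own distribution looks stable, which is why this check complements the per-feature distribution checks rather than replacing them.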