Gatha Varma - Production Data to the Model: “Are You Getting My Drift?” | PyData Global 2023

Learn how to detect and handle data drift in ML models through statistical tests, monitoring methods, and best practices for maintaining model performance over time.

Key takeaways

Data drift occurs when the probability distribution of a model's input data changes over time, which can degrade model performance
Four main types of data drift:
- Gradual drift (slow changes)
- Sudden shifts (sharp changes)
- Incremental drift (new distribution slowly takes over)
- Seasonal drift (cyclical patterns)
Common causes of data drift:
- Changes in data collection/preprocessing
- External factors (demographics, geography)
- Business changes
- Time-based data obsolescence
- Regulatory changes
Detection methods:
- Two-sample statistical tests
- Distribution comparisons
- Monitoring feature correlations
- Tracking performance metrics
- Checking for data integrity and outliers
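
To make the two-sample test idea concrete, here is a minimal sketch of a two-sample Kolmogorov-Smirnov check in plain Python. It is an illustration and not necessarily the test used in the talk. The function names `ks_statistic` and `ks_pvalue` are hypothetical, and the p-value uses a standard asymptotic approximation that is only rough for small samples. In practice a library routine such as `scipy.stats.ks_2samp` would usually be preferred.

```python
import bisect
import math


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is reached at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def ks_pvalue(d, n, m):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The series below converges poorly here; the true value is ~1.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * total))


def feature_drifted(reference, current, alpha=0.05):
    """Flag drift when the approximate p-value falls below alpha (assumed cutoff)."""
    d = ks_statistic(reference, current)
    p = ks_pvalue(d, len(reference), len(current))
    return p < alpha, d, p
```

Here `reference` would typically be a sample of a feature from training time and `current` a recent production window. The `alpha` cutoff of 0.05 is an assumption and should be tuned, since very large samples make even tiny shifts statistically significant.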
Best practices for handling drift:
- Monitor metadata and feature distributions
- Create subsegments for detailed analysis
- Set percentage-based drift thresholds
- Analyze root causes before retraining
- Watch for integrity issues and outliers
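
One possible reading of a percentage-based drift threshold is to flag the dataset as drifted only when the share of individually drifted features crosses a chosen percentage. The sketch below assumes per-feature p-values have already been computed by some two-sample test. The function names, the `alpha` cutoff, and the 30% `max_share` default are all hypothetical choices for illustration.

```python
def drifted_share(p_values, alpha=0.05):
    """Return the fraction of features whose p-value is below alpha, plus their names."""
    if not p_values:
        return 0.0, []
    flagged = sorted(name for name, p in p_values.items() if p < alpha)
    return len(flagged) / len(p_values), flagged


def dataset_drifted(p_values, alpha=0.05, max_share=0.3):
    """Flag dataset-level drift when the drifted share reaches max_share."""
    share, flagged = drifted_share(p_values, alpha)
    return share >= max_share, share, flagged
```

Returning the list of flagged features alongside the decision keeps the root-cause step possible: the names point to which upstream fields to inspect before any retraining is considered.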
Text data considerations:
- Less prone to dramatic drift than numerical data
- Monitor meta-features like sentiment and complexity
- Watch for changes in language patterns and vocabulary
- Track embedding drift
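
As a rough sketch of the text checks above, the following computes a few simple meta-features, the rate of tokens unseen in the reference data, and the cosine distance between embedding centroids. Whitespace tokenization and all function names are simplifying assumptions. Sentiment is left out because it needs a separate model, and the embeddings are assumed to come from whatever encoder the production system already uses.

```python
import math


def _tokens(texts):
    """Lowercased whitespace tokens across all texts (a deliberately crude tokenizer)."""
    return [tok for text in texts for tok in text.lower().split()]


def text_meta_features(texts):
    """Mean tokens per document and type-token ratio as simple complexity proxies."""
    tokens = _tokens(texts)
    if not texts or not tokens:
        return {"mean_tokens": 0.0, "type_token_ratio": 0.0}
    return {
        "mean_tokens": len(tokens) / len(texts),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


def oov_rate(reference_texts, current_texts):
    """Share of current tokens that never appear in the reference vocabulary."""
    vocab = set(_tokens(reference_texts))
    current = _tokens(current_texts)
    if not current:
        return 0.0
    return sum(1 for tok in current if tok not in vocab) / len(current)


def centroid_cosine_distance(ref_vectors, cur_vectors):
    """Cosine distance between the mean embedding of each set of vectors."""
    def centroid(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    a, b = centroid(ref_vectors), centroid(cur_vectors)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # undefined for zero vectors; treated as no drift here
    return 1.0 - dot / norm
```

A rising out-of-vocabulary rate is one cheap signal of changing vocabulary, while the centroid distance is a coarse summary of embedding drift that can miss changes which leave the mean unchanged.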
Recommendations after detecting drift:
- Investigate upstream processes
- Check for data quality issues
- Analyze performance impact
- Consider model recalibration
- Retrain only after thorough analysis
Model monitoring should include:
- Regular distribution checks
- Performance metric tracking
- Feature correlation analysis
- Data integrity validation
- Demographic bias assessment
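
Feature correlation analysis can be sketched as comparing pairwise Pearson correlations between a reference window and a current window, then reporting the pairs whose correlation moved the most. The function names and the `min_change` default of 0.3 are hypothetical. Both inputs are assumed to be dictionaries mapping a feature name to a list of numeric values, with the same features in each.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0  # constant feature: correlation undefined, treated as 0 here
    return cov / (spread_x * spread_y)


def correlation_shifts(reference, current, min_change=0.3):
    """Return feature pairs whose correlation changed by at least min_change."""
    shifts = {}
    for a, b in combinations(sorted(reference), 2):
        before = pearson(reference[a], reference[b])
        after = pearson(current[a], current[b])
        change = abs(after - before)
        if change >= min_change:
            shifts[(a, b)] = change
    return shifts
```

A pair can show up here even when each feature's own distribution looks stable, which is why this check complements the per-feature distribution checks.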
