Gatha Varma - Production Data to the Model: “Are You Getting My Drift?” | PyData Global 2023

Learn how to identify and mitigate data drift, a common challenge for machine learning models in production, and discover the types of drift, the signs to watch for, and the tools for detecting and addressing it.

Key takeaways
  • Data drift is a challenge for machine learning models, especially those exposed to changing data distributions.
  • When features drift, models struggle to adapt, leading to decreased performance, misleading results, and errors in deployment.
  • There are multiple types of drift: concept drift (the relationship between the features and the target changes), covariate drift (the distribution of the input features changes), and combined concept and covariate drift (feature distributions and the feature-target relationship change together).
  • Drift can appear suddenly or build gradually, which makes it challenging to recognize: an abrupt shift, such as changed user behavior, or a subtle, gradual change, such as a slowly shifting relationship between features.
  • Common signs of data drift include changes in training datasets, changes in testing results, and changes in performance metric curves.
  • Maintaining data quality, handling unexpected data, and regularly assessing model performance are vital to mitigating data drift.
  • Drift doesn’t necessarily mean the model has failed or is flawed; rather, it reveals the need to tune the model or adjust the data collection process.
  • Tools such as the two-sample Kolmogorov-Smirnov (KS) test can help detect data drift through distribution comparisons (see the sketch after this list).
  • Detecting and addressing data drift will likely require a combination of manual quality assessments, automated monitoring, and exploratory analysis using statistical tests.
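
As an initial check, a two-sample KS test from SciPy can compare a feature's training-time distribution against a recent production window. The synthetic feature values, window sizes, and significance threshold below are illustrative; this is a minimal sketch, not the talk's exact recipe.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Reference window: values of one feature as seen at training time.
reference = rng.normal(loc=0.0, scale=1.0, size=1_000)
# Production window: the same feature after its mean has shifted.
production = rng.normal(loc=0.5, scale=1.0, size=1_000)

# ks_2samp compares the two empirical distributions; a small
# p-value suggests they differ, i.e. the feature may have drifted.
statistic, p_value = stats.ks_2samp(reference, production)

ALPHA = 0.05  # significance level; tune to your tolerance for false alarms
if p_value < ALPHA:
    print(f"Possible drift: KS statistic={statistic:.3f}, p={p_value:.4f}")
else:
    print(f"No drift detected: KS statistic={statistic:.3f}, p={p_value:.4f}")
```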

Keep in mind that various Python libraries can aid with this process, and that two-sample tests can be powerful initial checks for detecting data shifts.
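
For automated monitoring across many features, the same test can be applied per column. The function name, DataFrame layout, and default threshold below are hypothetical; this is a minimal sketch assuming numeric features stored in pandas DataFrames.

```python
import pandas as pd
from scipy import stats

def flag_drifted_features(reference: pd.DataFrame,
                          production: pd.DataFrame,
                          alpha: float = 0.05) -> list[str]:
    """Return the numeric columns whose distributions appear to have shifted."""
    drifted = []
    for column in reference.select_dtypes(include="number").columns:
        # Compare the reference and production distributions of this feature.
        result = stats.ks_2samp(reference[column].dropna(),
                                production[column].dropna())
        if result.pvalue < alpha:
            drifted.append(column)
    return drifted
```

A check like this could run on a schedule against each new batch of production data, with flagged features prompting a closer manual look, echoing the combination of automated monitoring and manual assessment described above.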