Data valuation for machine learning [PyCon DE & PyData Berlin 2024]
Learn how PyDVL helps assess training data quality through data valuation methods, explore use cases from debugging to acquisition guidance, and discover implementation best practices.
-
Data valuation helps quantify each training point's contribution to model performance, but it should not be used as an automatic, black-box process
-
PyDVL is an open-source library that implements a range of data valuation methods, including:
- Global methods for overall model performance assessment
- Local methods for point-to-point influence analysis
- Leave-one-out and influence function approximations (a leave-one-out sketch follows this list)
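To make the leave-one-out idea concrete, here is a minimal sketch using plain NumPy and scikit-learn rather than pyDVL's own API: a point's value is the drop in validation performance when it is removed. Dataset, model, and sizes are illustrative assumptions.

```python
# Minimal leave-one-out (LOO) valuation sketch with plain scikit-learn.
# It illustrates the idea only; pyDVL ships optimized implementations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
n = len(X_train)

def utility(idx):
    """Validation accuracy of a model trained on the given subset of indices."""
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return model.score(X_val, y_val)

full_score = utility(np.arange(n))
# LOO value of point i: drop in utility when i is removed from training.
loo_values = np.array(
    [full_score - utility(np.delete(np.arange(n), i)) for i in range(n)]
)
```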
-
Key use cases:
- Debugging training data and models
- Identifying potentially mislabeled or problematic training points (see the flagging sketch after this list)
- Acquisition guidance when dealing with multiple data sources
- Understanding training point influences on specific test points
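A common pattern for the mislabeling use case is to sort points by value and inspect the lowest-valued ones by hand. The sketch below uses random stand-in values and labels (real ones would come from a valuation run such as the LOO sketch above), and the cutoff of ten points is an arbitrary illustration.

```python
# Sketch: flag the lowest-valued points for manual review. Values and
# labels here are random stand-ins for the output of a real valuation run.
import numpy as np

rng = np.random.default_rng(0)
loo_values = rng.normal(size=150)        # stand-in for computed values
y_train = rng.integers(0, 2, size=150)   # stand-in labels

# Points whose removal would *improve* validation performance (negative
# value) are candidates for mislabeling or other data problems.
suspects = np.argsort(loo_values)[:10]   # ten lowest-valued points
for i in suspects:
    print(f"index={i}  value={loo_values[i]:+.4f}  label={y_train[i]}")
# Flagged points should be inspected by hand, never deleted automatically.
```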
-
Computational challenges:
- Exact computation is prohibitively expensive: O(2^n) subset evaluations for n training points
- Solutions include Monte Carlo sampling (sketched after this list), parallelization, and approximation techniques
- Memory constraints with large datasets require batch processing
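The following sketch shows the standard permutation-sampling (Monte Carlo) approximation of Shapley values, which sidesteps the O(2^n) enumeration by averaging each point's marginal contribution over random permutations. All names, sizes, and the permutation count are illustrative assumptions, not pyDVL code.

```python
# Permutation-sampling (Monte Carlo) Shapley sketch: instead of all 2^n
# subsets, average each point's marginal contribution over random permutations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=120, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
n = len(X_train)

def utility(idx):
    if len(set(y_train[idx])) < 2:
        return 0.0  # cannot fit a classifier on a single-class subset
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return model.score(X_val, y_val)

rng = np.random.default_rng(0)
values = np.zeros(n)
n_permutations = 20  # accuracy/cost trade-off; exact Shapley needs 2^n fits
for _ in range(n_permutations):
    perm = rng.permutation(n)
    prev = 0.0
    for pos in range(n):
        curr = utility(perm[: pos + 1])
        values[perm[pos]] += curr - prev  # marginal contribution of this point
        prev = curr
values /= n_permutations
```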
-
Framework support:
- Works with NumPy, Scikit-learn
- Uses Joblib for parallelization (see the sketch after this list)
- Dask for large datasets
- Influence functions are currently PyTorch-based, with JAX support planned
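Each permutation in the Monte Carlo estimator is an independent sample, so the work parallelizes trivially. Below is a sketch of how one might distribute it with Joblib; it mirrors the loop above and is not pyDVL's internal code, and the toy utility is a stand-in for a real train-and-score step.

```python
# Sketch: distributing permutations with Joblib. Each permutation is an
# independent Monte Carlo sample, so they parallelize trivially.
import numpy as np
from joblib import Parallel, delayed

n = 50

def utility(idx):
    return len(idx) / n  # toy utility: larger subsets score higher

def one_permutation(seed):
    """Marginal contributions along a single random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    contribs = np.zeros(n)
    prev = 0.0
    for pos in range(n):
        curr = utility(perm[: pos + 1])
        contribs[perm[pos]] = curr - prev
        prev = curr
    return contribs

# Fan the permutations out across all available cores.
results = Parallel(n_jobs=-1)(delayed(one_permutation)(s) for s in range(20))
values = np.mean(results, axis=0)
```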
-
Best practices:
- Manual inspection of flagged data points is essential
- Use smaller/simpler proxy models for initial analysis (see the sketch after this list)
- Remember that values depend on the choice of validation set
- Results also depend on the model's optimization state and the chosen performance metric
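One way to apply the proxy-model practice is to parameterize the utility by a model factory, so the expensive target model can be swapped for a cheap one during the many fits valuation requires. The models and dataset below are illustrative assumptions.

```python
# Sketch of the proxy-model practice: run the many fits valuation needs
# with a cheap proxy, and keep the expensive model for spot checks.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def make_utility(model_factory):
    """Build a utility whose model can be swapped without other changes."""
    def utility(idx):
        model = model_factory().fit(X_train[idx], y_train[idx])
        return model.score(X_val, y_val)
    return utility

cheap_utility = make_utility(lambda: LogisticRegression(max_iter=1000))
costly_utility = make_utility(GradientBoostingClassifier)
# Run the valuation loop (LOO, Monte Carlo Shapley, ...) with cheap_utility;
# reserve costly_utility for verifying the highest- and lowest-valued points.
```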
-
Not meant for automatic data removal or as a silver-bullet solution; data valuation requires human judgment and careful interpretation of results