Data valuation for machine learning [PyCon DE & PyData Berlin 2024]

Learn how PyDVL helps assess training data quality through data valuation methods, explore use cases from debugging to acquisition guidance, and discover implementation best practices.

Key takeaways
  • Data valuation helps quantify how much each training point contributes to model performance, but it should not be used as an automatic, black-box process

  • PyDVL is an open-source library that implements a range of data valuation methods, including:

    • Global methods that score each training point's contribution to overall model performance
    • Local methods that trace the influence of individual training points on individual test predictions
    • Leave-one-out values and influence-function approximations (a minimal sketch follows this list)
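To make the leave-one-out idea concrete, here is a minimal from-scratch sketch using only scikit-learn. PyDVL ships optimized implementations of this and the other methods, so the function name `loo_values` and the toy dataset below are illustrative assumptions, not PyDVL's API:

```python
# Minimal sketch of leave-one-out (LOO) valuation: the value of point i
# is the drop in validation accuracy when i is removed from training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def loo_values(model, X_train, y_train, X_val, y_val):
    """Value of point i = score(full data) - score(data without i)."""
    base = model.fit(X_train, y_train).score(X_val, y_val)
    values = np.empty(len(X_train))
    for i in range(len(X_train)):
        mask = np.arange(len(X_train)) != i  # every point except i
        score = model.fit(X_train[mask], y_train[mask]).score(X_val, y_val)
        values[i] = base - score  # positive: point i helps the model
    return values


X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
values = loo_values(LogisticRegression(max_iter=1000), X_tr, y_tr, X_val, y_val)
```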
  • Key use cases:

    • Debugging training data and models
    • Identifying potentially mislabeled or otherwise problematic training points (see the triage sketch after this list)
    • Acquisition guidance when dealing with multiple data sources
    • Understanding training point influences on specific test points
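Continuing the LOO sketch above, one plausible triage step is to rank training points by value and surface the lowest-valued ones for manual review; the cutoff of ten points is an arbitrary choice for illustration:

```python
# Surface the lowest-valued points for inspection instead of deleting
# them automatically. Reuses `values`, `y_tr` from the sketch above.
import numpy as np

suspect = np.argsort(values)[:10]  # indices of the ten least valuable points
for i in suspect:
    print(f"train index {i}: value {values[i]:+.4f}, label {y_tr[i]}")
# A strongly negative value suggests the point hurts validation
# performance, e.g. due to a wrong label -- only inspection can confirm.
```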
  • Computational challenges:

    • Exact computation is prohibitively expensive: Shapley-style values require on the order of O(2^n) utility evaluations for n training points
    • Workarounds include Monte Carlo sampling, parallelization, and further approximation techniques (see the sketch after this list)
    • Memory constraints with large datasets require batch processing
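As a rough illustration of the Monte Carlo approach, the sketch below estimates Shapley values by sampling permutations and truncating a permutation once the running utility is close to the full-data score. It reuses the data from the LOO sketch; the zero utility assigned to the empty coalition, the simple truncation rule, and all names are simplifying assumptions, not PyDVL's actual algorithm:

```python
# Monte Carlo permutation sampling for Shapley values: each point is
# credited with its marginal contribution when added along a random
# permutation, averaged over permutations.
import numpy as np


def montecarlo_shapley(model, X_tr, y_tr, X_val, y_val,
                       n_permutations=50, atol=1e-3, rng=None):
    rng = np.random.default_rng(rng)
    n = len(X_tr)
    values = np.zeros(n)
    full_score = model.fit(X_tr, y_tr).score(X_val, y_val)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        prev = 0.0  # utility of the empty coalition (a simplification)
        for k in range(1, n + 1):
            # Truncation: once the running score is within atol of the
            # full-data score, remaining contributions are negligible.
            if abs(full_score - prev) < atol:
                score = prev
            else:
                subset = perm[:k]
                try:
                    score = model.fit(X_tr[subset], y_tr[subset]).score(X_val, y_val)
                except ValueError:  # e.g. subset contains a single class
                    score = prev
            values[perm[k - 1]] += score - prev
            prev = score
    return values / n_permutations
```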
  • Framework support:

    • Works with NumPy and scikit-learn models
    • Uses joblib for parallelization (see the sketch after this list)
    • Dask for datasets that do not fit in memory
    • Influence functions are currently PyTorch-based, with JAX support planned
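Since permutations are independent, one natural pattern is to farm them out to worker processes with joblib and average the results. This is only a sketch reusing `montecarlo_shapley` and the data from the sketches above; the helper `one_permutation` and the worker count are assumptions, not PyDVL's internals:

```python
# Parallelize Monte Carlo sampling: each task estimates the marginal
# contributions from a single random permutation.
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import LogisticRegression


def one_permutation(seed):
    return montecarlo_shapley(LogisticRegression(max_iter=1000),
                              X_tr, y_tr, X_val, y_val,
                              n_permutations=1, rng=seed)


# 40 permutations spread over 4 worker processes, then averaged.
results = Parallel(n_jobs=4)(delayed(one_permutation)(s) for s in range(40))
values = np.mean(results, axis=0)
```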
  • Best practices:

    • Manual inspection of flagged data points is essential
    • Use smaller/simpler proxy models for initial analysis
    • Consider how strongly results depend on the chosen validation set (see the sketch after this list)
    • Results depend on the model's optimization state and the chosen performance metric
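One way to probe validation-set dependence, reusing `loo_values` and the data from the first sketch, is to compute values against two disjoint validation halves and compare the rankings; the 50/50 split and the Spearman rank correlation are illustrative choices:

```python
# If the rank correlation between the two halves is low, any ranking of
# "good" and "bad" points is an artifact of the particular split.
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_val_a, X_val_b, y_val_a, y_val_b = train_test_split(
    X_val, y_val, test_size=0.5, random_state=1)
model = LogisticRegression(max_iter=1000)
v_a = loo_values(model, X_tr, y_tr, X_val_a, y_val_a)
v_b = loo_values(model, X_tr, y_tr, X_val_b, y_val_b)
rho, _ = spearmanr(v_a, v_b)
print(f"rank correlation between validation splits: {rho:.2f}")
```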
  • Not meant for automatic data removal or as a silver-bullet solution; results require human judgment and careful interpretation