Eitan Netzer & Oren Netzer - Real Time Machine Learning | PyData Global 2023

Learn how Core Sets and the Data Heroes Framework enable efficient real-time ML with automated retraining, distributed processing, and massive compute cost savings.

Key takeaways
  • A core set is a small weighted subset of the data that preserves the statistical properties of the full dataset, which significantly reduces training time and compute costs

  • The Data Heroes Real-Time ML Framework enables:

    • Automated high-frequency model retraining
    • Training on multiple date ranges
    • Efficient hyperparameter tuning
    • Distributed processing across geographic locations
  • Model retraining frequency improvements:

    • Monthly retraining saved 82% of compute costs
    • Weekly retraining improved accuracy by 22%
    • Daily retraining increased accuracy by up to 20%
  • Core set tree structure benefits:

    • Infinitely scalable and distributed
    • Built once but usable multiple times
    • Allows training on any subset of data
    • Processing time stays consistent as data grows
  • Real-world case study results:

    • Reduced hyperparameter tuning time from 154 hours to 5 hours
    • Maintained or improved model accuracy compared with training on the full dataset
    • Expanded the hyperparameter search from 144 to 864 combinations
    • Achieved 11% average accuracy increase over 26 weeks
  • Implementation features:

    • No data needs to leave the local environment
    • Works with existing ML libraries (XGBoost, LightGBM, etc.)
    • Supports both classification and regression
    • Automated data structure conversion
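
The weighted-subset and tree ideas above can be illustrated with a minimal sketch. This is not the Data Heroes API: the helper names (leaf_coreset, merge_reduce, build_tree) are hypothetical, and plain weighted sampling stands in for a real core set construction, which would normally use importance (sensitivity) sampling tailored to the model. The sketch only shows the general shape: each data chunk is compressed into a small weighted sample once, pairs of samples are merged and reduced up a tree, and the root stays a fixed size no matter how much data sits underneath.

```python
import random


def leaf_coreset(values, m, rng):
    """Naive stand-in for a core set: uniform sample of m points, each weighted n/m."""
    n = len(values)
    m = min(m, n)
    sample = rng.sample(values, m)
    return [(v, n / m) for v in sample]


def merge_reduce(left, right, m, rng):
    """Merge two weighted sets, then resample m points in proportion to weight.

    Every resampled point gets weight total/m, so the total weight is preserved.
    """
    merged = left + right
    if len(merged) <= m:
        return merged
    total = sum(w for _, w in merged)
    picks = rng.choices(merged, weights=[w for _, w in merged], k=m)
    return [(v, total / m) for v, _ in picks]


def build_tree(chunks, m, rng):
    """Return the tree as levels: levels[0] holds the leaves, levels[-1][0] is the root."""
    levels = [[leaf_coreset(c, m, rng) for c in chunks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = []
        for i in range(0, len(prev), 2):
            if i + 1 < len(prev):
                nxt.append(merge_reduce(prev[i], prev[i + 1], m, rng))
            else:
                nxt.append(prev[i])  # odd node out is carried up unchanged
        levels.append(nxt)
    return levels


def weighted_mean(coreset):
    """Weighted mean of a list of (value, weight) pairs."""
    total = sum(w for _, w in coreset)
    return sum(v * w for v, w in coreset) / total


# Synthetic data: 8 chunks (think: 8 days) of 500 one-dimensional points each.
rng = random.Random(0)
chunks = [[rng.gauss(10.0, 2.0) for _ in range(500)] for _ in range(8)]

levels = build_tree(chunks, m=200, rng=rng)
root = levels[-1][0]

all_values = [v for c in chunks for v in c]
true_mean = sum(all_values) / len(all_values)

print("full data size:", len(all_values), "root core set size:", len(root))
print("full mean:", round(true_mean, 3), "core set estimate:", round(weighted_mean(root), 3))
```

Because the leaves and intermediate nodes are kept, a subset such as a single date range can be served by combining only the nodes that cover it, without rebuilding anything. The resulting weights would typically be passed to a learner through a per-sample weight argument, such as the sample_weight parameter accepted by the scikit-learn style fit methods of XGBoost and LightGBM.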