Guillaume Lemaitre - Get the best from your scikit-learn classifier | PyData Global 2023
Learn practical tips for optimizing scikit-learn classifiers, from handling imbalanced data to selecting metrics, calibrating probabilities, and incorporating business costs.
- Class imbalance itself is not necessarily a problem; the real issue is optimizing for the right metrics and decisions
- Use proper scoring rules (log loss, Brier score) instead of accuracy or recall when training models, as they directly reward good probability estimates
- Grid search and hyperparameter tuning are critical; models can perform poorly without proper optimization
- Resampling techniques can break probability calibration; if resampling is used, models need to be recalibrated afterwards
- The default decision threshold (0.5) should not be relied upon; thresholds should be tuned against business metrics and costs
- Business metrics and cost-sensitive learning are preferable to statistical metrics like accuracy for real-world applications
- New metadata routing features in scikit-learn allow incorporating business metrics and costs directly into model optimization
- Cross-validation is essential for reliable model evaluation and comparison
- Model calibration should be verified with reliability diagrams whenever probability estimates matter
- Random forests may require tuning of parameters such as the maximum number of leaf nodes to prevent overfitting, especially with imbalanced data
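The proper-scoring-rule point can be sketched with a minimal example (the imbalanced toy dataset and the logistic regression model here are illustrative choices, not from the talk): accuracy only grades the hard 0/1 decisions, while log loss and the Brier score grade the full probability estimates.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: roughly 10% positives.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Accuracy evaluates thresholded predictions only; the two proper scoring
# rules evaluate the probability estimates themselves (lower is better).
print(f"accuracy:    {accuracy_score(y_test, proba > 0.5):.3f}")
print(f"log loss:    {log_loss(y_test, proba):.3f}")
print(f"Brier score: {brier_score_loss(y_test, proba):.3f}")
```

On a 90/10 problem a constant "always negative" classifier already reaches 0.9 accuracy, which is why the talk steers model selection toward the scoring rules instead.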
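The grid-search and random-forest points combine naturally: a sketch, with an illustrative parameter grid, of tuning `max_leaf_nodes` under a proper scoring rule so the forest's tree complexity is chosen by cross-validated log loss rather than left at its default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)

# Fully grown trees (max_leaf_nodes=None) tend to give overconfident
# probabilities on imbalanced data; let cross-validated log loss pick
# the complexity instead.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_leaf_nodes": [5, 20, 50, None]},
    scoring="neg_log_loss",
    cv=5,
).fit(X, y)
print(grid.best_params_, grid.best_score_)
```

The same pattern works for any estimator; the key choice is `scoring="neg_log_loss"` (or `"neg_brier_score"`) rather than accuracy.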
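The resampling caveat can be demonstrated with plain NumPy undersampling (a stand-in for imbalanced-learn's resamplers, which the talk references) followed by a hand-rolled Platt-style recalibration; the specific dataset and model are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Naive random undersampling of the majority class down to a 50/50 split.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_train == 1)
neg = rng.choice(np.flatnonzero(y_train == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
resampled = LogisticRegression().fit(X_train[idx], y_train[idx])

# Trained on balanced data, the model's average predicted probability
# drifts far above the true base rate: calibration is broken.
print("base rate:           ", y_test.mean())
print("mean proba, resampled:", resampled.predict_proba(X_test)[:, 1].mean())

# Platt-style fix: fit a 1-d logistic regression on the resampled model's
# scores, using the original (imbalanced) training distribution.
scores = resampled.decision_function(X_train).reshape(-1, 1)
calibrator = LogisticRegression().fit(scores, y_train)
recal_proba = calibrator.predict_proba(
    resampled.decision_function(X_test).reshape(-1, 1)
)[:, 1]
print("mean proba, recalibrated:", recal_proba.mean())
```

In practice scikit-learn's `CalibratedClassifierCV` packages this recalibration step; the point stands either way: resample, then recalibrate.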
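The threshold-tuning and business-cost points can be sketched together: sweep candidate thresholds and pick the one minimizing a hypothetical cost function (the 10:1 false-negative/false-positive cost ratio below is an assumption for illustration, not a figure from the talk).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

def total_cost(y_true, y_pred, fn_cost=10, fp_cost=1):
    """Hypothetical business cost: a miss is 10x worse than a false alarm."""
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return fn * fn_cost + fp * fp_cost

# Sweep thresholds instead of trusting the 0.5 default.
thresholds = np.linspace(0.01, 0.99, 99)
costs = [total_cost(y_test, proba >= t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"cost at default 0.5:  {total_cost(y_test, proba >= 0.5)}")
print(f"cost at best {best:.2f}: {min(costs)}")
```

Recent scikit-learn (1.5+) automates this sweep with cross-validation via `TunedThresholdClassifierCV`, and a cost function like the one above can be wrapped with `make_scorer`; the metadata routing machinery mentioned above is what lets per-sample costs flow into such scorers.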
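For the cross-validation point, a minimal sketch (model and scorer chosen for illustration): `cross_validate` reports the score on each fold, so a comparison between models can account for the spread, not just a single split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)

# Five folds, scored with a proper scoring rule (negated Brier score,
# higher is better); mean +/- std gives a sense of the estimate's noise.
cv_results = cross_validate(LogisticRegression(), X, y, cv=5, scoring="neg_brier_score")
scores = cv_results["test_score"]
print(f"neg Brier score: {scores.mean():.3f} +/- {scores.std():.3f}")
```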
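Finally, the calibration check: a reliability diagram bins predictions and compares each bin's mean predicted probability against the observed fraction of positives. A non-plotting sketch using `calibration_curve` (scikit-learn's `CalibrationDisplay` draws the same curve as a figure; the dataset here is again an illustrative assumption):

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# A well-calibrated model stays close to the diagonal: predicted ~ observed.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=5, strategy="quantile")
for fp, mp in zip(frac_pos, mean_pred):
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```

Large gaps between the two columns are the signal that probabilities need recalibration before being fed into cost-sensitive decisions.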