Guillaume Lemaitre - Get the best from your scikit-learn classifier | PyData Global 2023
Learn practical tips for optimizing scikit-learn classifiers, from handling imbalanced data to selecting metrics, calibrating probabilities, and incorporating business costs.
- Class imbalance is not necessarily a problem in itself. The real issue is optimizing for the right metrics and decisions.
- Use proper scoring rules (log loss, Brier score) rather than accuracy or recall when training and tuning models, because they reward accurate probability estimates.
- Grid search and hyperparameter tuning are critical: a model left at its default settings can perform poorly.
- Resampling techniques can break probability calibration. If you resample, recalibrate the model afterwards.
- Do not rely on the default decision threshold of 0.5. Tune the threshold against business metrics and costs.
- For real-world applications, business metrics and cost-sensitive learning are preferable to statistical metrics such as accuracy.
- The new metadata routing features in scikit-learn allow business metrics and costs to be incorporated directly into model optimization.
- Cross-validation is essential for reliable model evaluation and comparison.
- When probability estimates matter, verify model calibration with reliability diagrams.
- Random forests may need parameters such as the maximum number of leaves tuned to prevent overfitting, especially with imbalanced data.
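The points above about proper scoring rules, cross-validated tuning, and decision thresholds can be sketched as follows. This is a minimal illustration rather than the speaker's code: the synthetic dataset, the logistic regression model, the `C` grid, and the costs `COST_FN` and `COST_FP` are all assumptions made for the example. The threshold is chosen by a plain NumPy sweep over out-of-fold probabilities so that the sketch does not depend on any recent scikit-learn feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict, train_test_split

# Synthetic imbalanced data (about 5% positives); purely illustrative.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Step 1: tune hyperparameters against a proper scoring rule (log loss)
# with cross-validation, instead of accuracy.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",
    cv=5,
)
search.fit(X_train, y_train)

# Hypothetical business costs: a missed positive is assumed to cost
# ten times as much as a false alarm. Correct predictions cost nothing.
COST_FN = 10.0
COST_FP = 1.0


def total_cost(y_true, proba, threshold):
    """Total cost of the decisions obtained by thresholding `proba`."""
    y_pred = proba >= threshold
    n_false_neg = np.sum((y_true == 1) & ~y_pred)
    n_false_pos = np.sum((y_true == 0) & y_pred)
    return COST_FN * n_false_neg + COST_FP * n_false_pos


# Step 2: choose the threshold on out-of-fold probabilities from the
# training set, so the test set stays untouched for the final check.
oof_proba = cross_val_predict(
    search.best_estimator_, X_train, y_train, cv=5, method="predict_proba"
)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)
costs = [total_cost(y_train, oof_proba, t) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]

# Step 3: compare the tuned threshold with the default 0.5 on test data.
test_proba = search.predict_proba(X_test)[:, 1]
print("best C:", search.best_params_["C"])
print("chosen threshold:", round(float(best_threshold), 2))
print("test cost at 0.5:", total_cost(y_test, test_proba, 0.5))
print("test cost at chosen threshold:", total_cost(y_test, test_proba, best_threshold))
```

With well-calibrated probabilities and these assumed costs, the cost-minimizing threshold is expected to sit near `COST_FP / (COST_FP + COST_FN)`, about 0.09, which is far from the default 0.5.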
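The warning that resampling can break calibration can be checked with a small experiment. The sketch below is an assumption-laden illustration, not material from the talk: it undersamples the majority class by hand with NumPy, fits the same logistic regression on the original and on the balanced training set, and compares the predicted probabilities with the observed prevalence on untouched test data. `calibration_curve` returns the points that a reliability diagram would plot.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); purely illustrative.
X, y = make_classification(
    n_samples=10000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Random undersampling of the majority class down to the minority size.
rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = np.flatnonzero(y_train == 0)
neg_sample = rng.choice(neg_idx, size=pos_idx.size, replace=False)
balanced_idx = np.concatenate([pos_idx, neg_sample])

# Same model, fitted on the original and on the balanced training set.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
resampled = LogisticRegression(max_iter=1000).fit(
    X_train[balanced_idx], y_train[balanced_idx]
)

proba_plain = plain.predict_proba(X_test)[:, 1]
proba_resampled = resampled.predict_proba(X_test)[:, 1]

# A calibrated model's mean predicted probability should be close to the
# observed prevalence; the Brier score summarizes probability quality.
print("test prevalence:", y_test.mean())
print("mean predicted (original data):", proba_plain.mean())
print("mean predicted (undersampled):", proba_resampled.mean())
print("Brier (original data):", brier_score_loss(y_test, proba_plain))
print("Brier (undersampled):", brier_score_loss(y_test, proba_resampled))

# Points of the reliability diagram for the undersampled model:
# observed fraction of positives against mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(
    y_test, proba_resampled, n_bins=10, strategy="quantile"
)
for observed, predicted in zip(frac_pos, mean_pred):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

If the undersampled model's predictions sit well above the observed frequencies, it needs recalibrating before its probabilities are used. One option is scikit-learn's `CalibratedClassifierCV`, fitted on data that follows the original class distribution.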