Guillaume Lemaitre - Get the best from your scikit-learn classifier | PyData Global 2023

Learn practical tips for optimizing scikit-learn classifiers, from handling imbalanced data to selecting metrics, calibrating probabilities, and incorporating business costs.

Key takeaways
  • Class imbalance itself is not necessarily a problem - the real issue is optimizing for the right metrics and decisions

  • Use proper scoring rules (log loss, Brier score) rather than accuracy or recall when training and selecting models - they directly evaluate the quality of the probability estimates

  • Hyperparameter tuning (e.g. grid search) is critical - a model left at its defaults can perform far below its potential

  • Resampling techniques can break probability calibration - if using resampling, models need to be recalibrated afterwards

  • Default decision thresholds (0.5) should not be relied upon - thresholds should be tuned based on business metrics and costs

  • Business metrics and cost-sensitive learning are preferable to statistical metrics like accuracy for real-world applications

  • New metadata routing features in scikit-learn allow incorporating business metrics and costs directly into model optimization

  • Cross-validation is essential for reliable model evaluation and comparison

  • Model calibration should be verified using reliability diagrams when probability estimates are important

  • Random forests may require tuning parameters such as max_leaf_nodes to prevent overfitting, especially with imbalanced data