Guillaume Lemaitre - Get the best from your scikit-learn classifier | PyData Global 2023

Learn practical tips for optimizing scikit-learn classifiers, from handling imbalanced data to selecting metrics, calibrating probabilities, and incorporating business costs.

Key takeaways
  • Class imbalance itself is not necessarily a problem - the real issue is optimizing for the right metrics and decisions

  • Use proper scoring rules (log loss, Brier score) rather than accuracy or recall when training and selecting models - they directly evaluate the quality of the probability estimates

  • Hyperparameter tuning (e.g. grid search) is critical - a model left at its defaults can perform far below its potential

  • Resampling techniques can break probability calibration - if using resampling, models need to be recalibrated afterwards

  • Default decision thresholds (0.5) should not be relied upon - thresholds should be tuned based on business metrics and costs

  • Business metrics and cost-sensitive learning are preferable to statistical metrics like accuracy for real-world applications

  • New metadata routing features in scikit-learn allow incorporating business metrics and costs directly into model optimization

  • Cross-validation is essential for reliable model evaluation and comparison

  • Model calibration should be verified using reliability diagrams when probability estimates are important

  • Random forests may require tuning parameters such as max_leaf_nodes to prevent overfitting, especially with imbalanced data