Guillaume Lemaitre - Get the best from your scikit-learn classifier | PyData Global 2023

Discover strategies to improve scikit-learn classifier performance, including tuning, calibration, and proper scoring rules, and learn how to optimize for business metrics to achieve reliable and effective model results.

Key takeaways
  • Resampling is not a good approach for class imbalance: It’s not a proper solution and can actually make things worse.
  • Tuning the model is important: Grid search and hyperparameter tuning are crucial to optimize the model.
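  • Use proper scoring rules: Log loss, the Brier score, and other proper scoring rules evaluate the predicted probabilities directly, which makes them better suited to model selection than threshold-dependent metrics such as accuracy, precision, and recall.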
  • Business metrics are important: Define a business metric that aligns with the problem you’re trying to solve and optimize for that.
  • Calibration is key: Make sure the model is well calibrated, so that its predicted probabilities match the frequencies actually observed.
  • Random forest can be improved: Balanced random forest can be a good approach to handle class imbalance.
  • Resampling can be problematic: It changes the class distribution the model sees during training, which can degrade the calibration of its predicted probabilities and lead to overfitting.
  • Grid search can be useful: Use grid search to tune the hyperparameters of the model.
  • Thresholding is important: Tune the decision threshold, rather than relying on the default of 0.5, to optimize the model for the specific problem.
  • Resampling is not necessary: If you’re using a well-calibrated model, resampling may not be necessary.
  • Use metadata: Use per-sample metadata (for example, a cost or amount attached to each sample) to define the business metric and optimize the model against it.
  • Imbalanced classification is a problem: Class imbalance can lead to overfitting and poor performance, particularly on the minority class.
  • Proper calibration is important: Calibrated probabilities make the model's outputs reliable enough to base thresholds and decisions on.
  • Business metrics are the goal: The goal is to optimize the business metric, not just the accuracy of the model.
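
The takeaways on tuning and proper scoring rules can be illustrated with a minimal sketch. It is not code from the talk: the synthetic dataset, the choice of `LogisticRegression`, and the grid of `C` values are all assumptions made for illustration. The sketch tunes a hyperparameter against log loss and then reports log loss and the Brier score on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (roughly 5% positives); an assumption for illustration.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Tune the regularization strength against log loss (a proper scoring rule)
# instead of accuracy. The grid values are arbitrary.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the predicted probabilities, not the hard class labels.
proba = search.predict_proba(X_test)[:, 1]
test_log_loss = log_loss(y_test, proba)
test_brier = brier_score_loss(y_test, proba)
print(search.best_params_, test_log_loss, test_brier)
```

No resampling is applied here: the classifier is trained on the original class distribution and judged on the quality of its probabilities.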
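
The takeaways on thresholding, metadata, and business metrics can be sketched in the same spirit. Everything specific below is hypothetical and not taken from the talk: the `amount` array stands in for per-sample metadata, and `business_gain` with its `review_cost` parameter is an invented metric. The sketch sweeps the decision threshold on a validation set and keeps the value that maximizes the metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data plus a hypothetical per-sample "amount" (metadata).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
rng = np.random.default_rng(0)
amount = rng.uniform(10, 1000, size=y.shape[0])

X_train, X_val, y_train, y_val, _, amount_val = train_test_split(
    X, y, amount, stratify=y, random_state=0
)


def business_gain(y_true, y_pred, amount, review_cost=5.0):
    """Hypothetical metric: a caught positive saves its amount, a missed
    positive loses it, and every flagged sample costs `review_cost`."""
    caught = np.sum(amount[(y_true == 1) & (y_pred == 1)])
    missed = np.sum(amount[(y_true == 1) & (y_pred == 0)])
    flagged = np.sum(y_pred == 1)
    return caught - missed - review_cost * flagged


model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_val = model.predict_proba(X_val)[:, 1]

# Sweep candidate thresholds and score each one with the business metric.
thresholds = np.linspace(0.0, 1.0, 101)
gains = np.array(
    [
        business_gain(y_val, (proba_val >= t).astype(int), amount_val)
        for t in thresholds
    ]
)
best_threshold = thresholds[np.argmax(gains)]
default_gain = business_gain(y_val, (proba_val >= 0.5).astype(int), amount_val)
print(best_threshold, gains.max(), default_gain)
```

A single validation split keeps the sketch short; selecting the threshold with cross-validation would be more robust. scikit-learn 1.5 and later provide `TunedThresholdClassifierCV` in `sklearn.model_selection` for this kind of threshold tuning.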