Guillaume Lemaitre - Get the best from your scikit-learn classifier | PyData Global 2023

Discover strategies to improve scikit-learn classifier performance, including tuning, calibration, and proper scoring rules, and learn how to optimize for business metrics to achieve reliable and effective model results.

Key takeaways

Resampling is not a good approach for class imbalance: It changes the class distribution the model is trained on, so it is not a proper solution and can actually make the predicted probabilities worse.
Tuning the model is important: Grid search and hyperparameter tuning are crucial to optimize the model.
Use proper scoring rules: Log loss, Brier score, and other proper scoring rules evaluate the predicted probabilities directly, whereas accuracy, precision, and recall depend on a chosen decision threshold.
Business metrics are important: Define a business metric that aligns with the problem you’re trying to solve and optimize for that.
Calibration is key: Make sure the model is well-calibrated, meaning its predicted probabilities match the observed frequencies of the positive class.
Random forest can be improved: Balanced random forest can be a good approach to handle class imbalance.
Resampling can be problematic: It can distort the calibration of the model and lead to overfitting.
Grid search can be useful: Use grid search to tune the hyperparameters of the model.
Thresholding is important: Tune the decision threshold for the specific problem instead of relying on the default cut-off of 0.5.
Resampling is not necessary: If you’re using a well-calibrated model, resampling may not be necessary.
Use metadata: Use metadata to define the business metric and optimize the model.
Imbalanced classification is a problem: Imbalanced classification can lead to overfitting and poor performance.
Proper calibration is important: Well-calibrated probabilities are what make the model's predictions reliable enough to base decisions on.
Business metrics are the goal: The goal is to optimize the business metric, not just the accuracy of the model.
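
The takeaways about resampling and calibration can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the talk: the synthetic dataset, the choice of logistic regression, and the hand-rolled random undersampling with NumPy (a stand-in for a dedicated resampling library). The sketch compares the mean predicted probability of the positive class with the actual positive rate on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 5% positives); purely illustrative.
X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Model fitted on the data as-is.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hand-rolled random undersampling of the majority class to a 50/50 ratio.
rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = rng.choice(
    np.flatnonzero(y_train == 0), size=pos_idx.size, replace=False
)
keep = np.concatenate([pos_idx, neg_idx])
resampled = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])

# Compare predicted probabilities with the true positive rate on the test set.
proba_plain = plain.predict_proba(X_test)[:, 1]
proba_resampled = resampled.predict_proba(X_test)[:, 1]
prevalence = y_test.mean()
mean_plain = proba_plain.mean()
mean_resampled = proba_resampled.mean()

print(f"positive rate in test set:     {prevalence:.3f}")
print(f"mean probability (plain):      {mean_plain:.3f}")
print(f"mean probability (resampled):  {mean_resampled:.3f}")
print(f"Brier score (plain):           {brier_score_loss(y_test, proba_plain):.4f}")
print(f"Brier score (resampled):       {brier_score_loss(y_test, proba_resampled):.4f}")
```

In this setup the model trained on the undersampled data tends to report much higher probabilities than the positive rate warrants, which is the calibration distortion the list refers to.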
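
Combining the points about tuning and proper scoring rules, a grid search can be scored with log loss rather than accuracy. The sketch below is one possible setup, not the speaker's: the dataset, the pipeline, and the grid of `C` values are assumptions chosen for illustration. `scoring="neg_log_loss"` is scikit-learn's built-in scorer name for log loss (negated so that higher is better).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (roughly 10% positives); purely illustrative.
X, y = make_classification(
    n_samples=5_000, n_features=10, weights=[0.9, 0.1], random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the regularization strength of the logistic regression.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Select hyperparameters with a proper scoring rule instead of accuracy.
search = GridSearchCV(model, param_grid, scoring="neg_log_loss", cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated log loss:", -search.best_score_)
```

Swapping in `scoring="neg_brier_score"` would select on the Brier score instead.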
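
The thresholding and business-metric takeaways can be sketched as a simple search over decision thresholds. The gain and cost values, the `business_gain` helper, and the dataset are all hypothetical and only serve to show the mechanics. The threshold is picked on a held-out validation split to keep the example short; cross-validation would be a more careful choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical business values per outcome (assumptions, not from the talk).
GAIN_TP = 100.0   # value of catching a positive case
COST_FP = -10.0   # cost of a false alarm
COST_FN = -50.0   # cost of missing a positive case


def business_gain(y_true, y_pred):
    """Total gain of a set of hard predictions under the values above."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return float(GAIN_TP * tp + COST_FP * fp + COST_FN * fn)


# Synthetic imbalanced data (roughly 5% positives); purely illustrative.
X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

# Candidate thresholds 0.05, 0.10, ..., 0.95 (0.5 is included exactly).
thresholds = np.arange(1, 20) / 20
gains = [business_gain(y_val, (proba >= t).astype(int)) for t in thresholds]

best_threshold = float(thresholds[int(np.argmax(gains))])
gain_default = business_gain(y_val, (proba >= 0.5).astype(int))

print("best threshold:", best_threshold)
print("gain at best threshold:", max(gains))
print("gain at default 0.5:", gain_default)
```

Recent scikit-learn releases (1.5 and later) also provide `TunedThresholdClassifierCV` in `sklearn.model_selection`, which wraps this kind of search with cross-validation; the manual loop above is used here only because it runs on older versions as well.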
