Olivier Grisel - Predictive survival analysis with scikit-learn, scikit-survival and lifelines

Learn how to perform predictive survival analysis in Python using scikit-learn, scikit-survival & lifelines. Master key concepts, models & applications in this tutorial.

Key takeaways
  • Survival analysis deals with right-censored time-to-event data, where some observations don’t experience the event during the study period

  • The Kaplan-Meier estimator gives a non-parametric estimate of survival probabilities that stays valid with censored data (provided censoring is independent of the event), and serves as a baseline unconditional model that ignores covariates

  • Two key metrics for evaluating survival models:

    • Integrated Brier Score (IBS) - measures both calibration and discrimination (lower is better)
    • Concordance Index (C-index) - measures ranking/discriminative ability only (higher is better, 0.5 corresponds to random ranking)
  • Cox Proportional Hazards is a popular predictive model for survival analysis, but its proportional hazards assumption means the predicted survival curves of different individuals can never cross

  • More flexible models can capture non-linear effects and interactions between features:

    • Gradient Boosting Incidence
    • Survival Forests
  • Key Python libraries for survival analysis:

    • lifelines: Core survival analysis functionality
    • scikit-survival: Extension of scikit-learn for survival
    • hazardous: Experimental library with newer models
  • Naive approaches, such as discarding censored observations or replacing censored durations with an arbitrarily large value, introduce significant bias

  • Survival analysis has applications in:

    • Medical research (patient survival)
    • Predictive maintenance
    • Customer churn
    • Insurance claim modeling
  • The hazard rate represents the instantaneous risk of event occurrence, conditional on survival up to that point

  • Feature preprocessing like splines and polynomial features can help capture non-linear relationships in survival models
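
Several of the takeaways above (the Kaplan-Meier estimator, the bias from discarding censored data, and the concordance index) can be illustrated with a minimal from-scratch sketch in plain Python. This is not the implementation used by lifelines or scikit-survival; the function names `kaplan_meier` and `concordance_index`, the toy data, and the simplified handling of tied durations are assumptions made for illustration only.

```python
from itertools import combinations


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct observed event time.

    durations: time to the event, or to censoring
    observed:  True if the event occurred, False if right-censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival, curve = 1.0, []
    for t in event_times:
        # Individuals still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Events that happen exactly at time t
        events = sum(1 for d, e in zip(durations, observed) if e and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


def concordance_index(durations, observed, risk_scores):
    """Fraction of comparable pairs that the risk scores rank correctly.

    Simplification (assumption): pairs with tied durations are skipped.
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(durations)), 2):
        # Reorder so that i is the individual with the shorter duration
        if durations[j] < durations[i]:
            i, j = j, i
        # A pair is comparable only if the shorter duration is an observed event
        if not observed[i] or durations[i] == durations[j]:
            continue
        comparable += 1
        if risk_scores[i] > risk_scores[j]:
            concordant += 1.0  # higher risk matched the earlier event
        elif risk_scores[i] == risk_scores[j]:
            concordant += 0.5  # tied scores count as half
    return concordant / comparable


# Toy data (hypothetical): three observed events, three censored observations
durations = [2, 4, 4, 6, 8, 10]
observed = [True, True, False, True, False, False]

km_curve = kaplan_meier(durations, observed)

# Naive approach: drop the censored rows and treat the rest as the full sample
kept = [d for d, e in zip(durations, observed) if e]
naive_curve = kaplan_meier(kept, [True] * len(kept))

# Risk scores that rank individuals perfectly, and uninformative constant scores
c_perfect = concordance_index(durations, observed, [-d for d in durations])
c_constant = concordance_index(durations, observed, [0.0] * len(durations))

print("Kaplan-Meier:", km_curve)
print("Censored rows dropped:", naive_curve)
print("C-index (perfect ranking):", c_perfect)
print("C-index (constant scores):", c_constant)
```

On this toy data the Kaplan-Meier estimate of surviving past t = 6 is 4/9 (about 0.44), while dropping the censored rows drives the estimate to 0, which shows the pessimistic bias of the naive approach. The concordance index is 1.0 for a perfect ranking and 0.5 for constant scores, regardless of how well calibrated the predicted probabilities are.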