Missing Data, Bayesian Imputation and People Analytics with PyMC [PyCon DE & PyData Berlin 2024]

Learn how to handle missing data in people analytics using Bayesian imputation with PyMC. Explore statistical types of missing data, hierarchical modeling, and best practices.

Key takeaways
  • Missing data in surveys can be classified into three main types:

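    • Missing Completely at Random (MCAR): missingness is unrelated to any variable, observed or unobserved
    • Missing at Random (MAR): missingness depends only on variables that were observed
    • Missing Not at Random (MNAR): missingness depends on unobserved values, including the missing value itself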
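To make the three mechanisms concrete, here is a minimal sketch in plain Python that assigns a non-response probability to a survey answer under each one. The variable names (`tenure_years`, `score`), the thresholds, and the probabilities are illustrative assumptions, not values from the talk.

```python
import random


def missing_probability(mechanism, tenure_years, score):
    """Probability that `score` goes unreported (illustrative numbers only)."""
    if mechanism == "MCAR":
        # Same chance for everyone: depends on nothing in the data.
        return 0.3
    if mechanism == "MAR":
        # Depends only on tenure, which is always observed.
        return 0.6 if tenure_years < 2 else 0.1
    if mechanism == "MNAR":
        # Depends on the very value that may go missing.
        return 0.6 if score < 3 else 0.1
    raise ValueError(f"unknown mechanism: {mechanism}")


def apply_missingness(rows, mechanism, seed=0):
    """Replace scores with None according to the chosen mechanism."""
    rng = random.Random(seed)
    result = []
    for tenure_years, score in rows:
        p = missing_probability(mechanism, tenure_years, score)
        result.append((tenure_years, None if rng.random() < p else score))
    return result
```

Under MNAR the observed scores alone carry no record of why the low ones are absent, which is why that case needs extra modeling assumptions.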
  • Bayesian imputation methods are recommended for theory-informed missing data analysis because they:

    • Allow flexible model specification
    • Handle different types of distributions
    • Provide built-in sensitivity analysis
    • Enable a workflow for assessing model adequacy
  • Hierarchical modeling is valuable for handling missing data because:

    • Accounts for team and management structures
    • Helps isolate the estimated impact of different factors
    • Can turn an MNAR situation into a MAR one by conditioning on the group structure that drives non-response
    • Allows for team-specific parameter estimates
  • In a people analytics context:

    • Decisions about careers need justifiable models
    • Power relationships and hierarchies influence data collection
    • Survey non-response patterns may reveal organizational inefficiencies
    • Team-management mismatches can be identified
  • Technical implementation considerations:

    • Variables should be ordered by degree of missingness
    • Multiple distribution types can be handled in the same model
    • Priors should be carefully selected based on domain knowledge
    • Cross-validation and model adequacy checks are essential
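The ordering step above can be sketched with a small helper that ranks survey variables by their share of missing values. The summary does not say which direction the ordering should run, so ascending order (most complete variable first) is an assumption here, and the column names are hypothetical.

```python
def order_by_missingness(columns, descending=False):
    """Return column names sorted by their fraction of missing (None) values."""

    def missing_fraction(values):
        # Treat an empty column as having no missing values.
        if not values:
            return 0.0
        return sum(v is None for v in values) / len(values)

    return sorted(
        columns,
        key=lambda name: missing_fraction(columns[name]),
        reverse=descending,
    )


# Hypothetical survey columns; None marks a non-response.
survey = {
    "manager_rating": [None, None, None, 5],
    "tenure": [1, 4, 7, 2],
    "engagement": [3, None, None, 4],
}
ordered = order_by_missingness(survey)
```

With this ordering, each variable's imputation model can condition on the more complete variables that precede it.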
  • Practical recommendations:

    • Run pilot experiments to gather information for priors
    • Consult subject matter experts for model construction
    • Validate models and repeat as necessary
    • Document your data-generating process assumptions