Atieno Ouma - Synthetic Data for Localized Solutions | PyData Amsterdam 2024

Discover how synthetic healthcare data generation in Kenya balances cultural practices, local realities, and technology constraints across 47 counties. Learn key approaches to ethical data creation.

Key takeaways
  • Healthcare data solutions in Kenya must account for a healthcare system devolved to 47 counties, each with distinct local realities and cultural practices

  • Synthetic data generation must consider cultural nuances like polygamy, wife inheritance, traditional medicine practices, and local economic activities that impact healthcare access

  • Data collection involves working directly with communities through:

    • Medical tests and family history gathering
    • Community healthcare volunteers
    • Local “chamas” (community groups)
    • Direct observation rather than just questionnaires
  • Key challenges in developing localized healthcare solutions:

    • Limited technology access, which calls for USSD services instead of smartphone apps
    • Bureaucratic obstacles
    • Infrastructure gaps (roads, electricity, water)
    • Cultural beliefs and practices
    • Language and terminology differences
  • Synthetic data generation best practices:

    • Use structured data extraction before synthesis
    • Maintain medical terminology dictionaries
    • Consider outliers carefully when scaling
    • Balance generalization and localization
    • Validate against real patient distributions
  • Account for socioeconomic factors like:

    • Insurance coverage and accessibility
    • Mobile money usage (preferred over traditional banking)
    • Distance to healthcare facilities
    • Family financial obligations (“black tax”)
    • Education levels
  • Different synthetic data approaches are needed for different data types:

    • Tabular data (TGAN, CTGAN)
    • Images (DCGAN, StyleGAN)
    • Time series (RNN, TimeGAN)
    • Text (sequence GANs such as SeqGAN)
  • Real and synthetic data complement each other rather than compete; the choice between them depends on the use case, privacy requirements, and data completeness needs
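
As a minimal sketch of the "validate against real patient distributions" practice listed above, the Python below compares a real and a synthetic sample column by column. It is not the speaker's method: the function names, the choice of metrics (the two-sample Kolmogorov-Smirnov statistic for numeric columns, total variation distance for categorical ones), and the sample values are all illustrative assumptions, and the values are made up rather than drawn from any patient data.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the two empirical CDFs
    (0.0 = identical samples, 1.0 = no overlap at all).
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def total_variation(real, synthetic):
    """Total variation distance between two categorical samples.

    Returns 0.0 when category shares match exactly and 1.0 when the
    two samples share no category.
    """
    p, q = Counter(real), Counter(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synthetic)) for c in categories
    )


if __name__ == "__main__":
    # Hypothetical columns, invented for illustration only.
    real_distance_km = [2, 5, 8, 12, 20, 35]
    synthetic_distance_km = [3, 5, 9, 11, 22, 30]
    real_payment = ["mobile_money", "mobile_money", "cash", "insurance"]
    synthetic_payment = ["mobile_money", "cash", "cash", "insurance"]

    print("distance KS:", ks_statistic(real_distance_km, synthetic_distance_km))
    print("payment TVD:", total_variation(real_payment, synthetic_payment))
```

Lower values mean the synthetic column tracks the real one more closely. What counts as close enough is a project decision, and a per-county or per-subgroup check may be worth running separately so that local patterns are not averaged away by a good overall score.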