Continuous Delivery for Data • Dave Farley • YOW! 2020

Learn how to apply Continuous Delivery principles to data changes, with proven patterns for safe schema migrations, version control, testing, and managing ML models in production.

Key takeaways
  • Continuous Delivery should be applied to data changes as well as code changes: without the ability to deploy schema changes safely, a team is not really doing CD

  • Three main categories of data to consider:

    • Transactional data (generated during system operation)
    • Reference/lookup data (static, read-only)
    • Configuration data (defines system behavior)
  • Key data migration patterns:

    • Deployment time migration (simple but requires downtime)
    • Lazy reader (translates on read, good for hot deployments)
    • Lazy migrator (background migration during idle time)
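
As an illustration of the lazy reader pattern listed above, here is a minimal Python sketch. It assumes each stored record carries a `schema_version` field and that upgrades can be chained one version at a time. The record shape, the field names, and the helpers `read_record` and `UPGRADERS` are hypothetical and do not come from the talk.

```python
from typing import Callable, Dict

CURRENT_VERSION = 3  # hypothetical current schema version


def _v1_to_v2(rec: dict) -> dict:
    # Assumed change: v2 split the single "name" field into first and last name.
    first, _, last = rec["name"].partition(" ")
    out = {k: v for k, v in rec.items() if k != "name"}
    out.update(first_name=first, last_name=last, schema_version=2)
    return out


def _v2_to_v3(rec: dict) -> dict:
    # Assumed change: v3 added an optional "email" field (an additive change).
    return {**rec, "email": None, "schema_version": 3}


# Maps a stored version to the function that upgrades it by one step.
UPGRADERS: Dict[int, Callable[[dict], dict]] = {1: _v1_to_v2, 2: _v2_to_v3}


def read_record(stored: dict) -> dict:
    """Return the record in the current shape, translating old versions on read."""
    rec = dict(stored)  # copy, so the stored record itself is left untouched
    while rec["schema_version"] < CURRENT_VERSION:
        rec = UPGRADERS[rec["schema_version"]](rec)
    return rec
```

A lazy migrator could plausibly reuse the same upgrade chain: a background job would call `read_record` on old records during idle time and write the result back, so that fewer and fewer reads need translation.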
  • Version control everything:

    • Schema changes
    • Migration scripts
    • Data models
    • Configuration
    • Infrastructure code
  • Best practices for schema changes:

    • Make additive changes when possible
    • Version schemas with sequential numbers
    • Keep schema version info with application code
    • Write and test both upgrade and rollback scripts
    • Test migrations thoroughly
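
One way to picture sequentially numbered schema versions with paired upgrade and rollback scripts is the sketch below. It uses Python's standard `sqlite3` module and SQLite's `PRAGMA user_version` to record the applied version. The tables, the `MIGRATIONS` list, and the `migrate` helper are assumptions made for illustration, not a tool described in the talk.

```python
import sqlite3

# Hypothetical migrations: (version, upgrade SQL, rollback SQL).
# This list would live in version control next to the application code.
MIGRATIONS = [
    (1,
     "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
     "DROP TABLE customer"),
    (2,  # additive change: a new table, existing data is untouched
     "CREATE TABLE customer_email ("
     "customer_id INTEGER REFERENCES customer(id), email TEXT)",
     "DROP TABLE customer_email"),
]


def schema_version(conn: sqlite3.Connection) -> int:
    """Read the schema version stored in the database file itself."""
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn: sqlite3.Connection, target: int) -> None:
    """Move the schema up or down to `target`, one numbered step at a time."""
    current = schema_version(conn)
    if target > current:
        steps = [(v, up) for v, up, _ in MIGRATIONS if current < v <= target]
    else:
        # Roll back in reverse order; after undoing step v the version is v - 1.
        steps = [(v - 1, down) for v, _, down in reversed(MIGRATIONS)
                 if target < v <= current]
    for new_version, sql in steps:
        conn.execute(sql)
        # PRAGMA values cannot be bound as parameters, hence the int() guard.
        conn.execute(f"PRAGMA user_version = {int(new_version)}")
        conn.commit()
```

Calling `migrate(conn, 2)` on an empty database applies both steps in order, and `migrate(conn, 1)` afterwards runs only the rollback script for version 2.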
  • Data testing approaches:

    • Generate synthetic test data in test scope
    • Avoid using production data for tests
    • Focus on testing migration logic, not just final state
    • Include migration tests in CI pipeline
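
The testing points above can be sketched as a small self-contained test that generates synthetic data inside the test, runs a migration against an in-memory SQLite database, and checks that the migration logic preserved every row. The `customer` table, the `migrate_split_name` migration, and the `synthetic_names` generator are hypothetical examples, not taken from the talk.

```python
import random
import sqlite3
import string


def synthetic_names(count: int, seed: int = 0) -> list:
    """Generate repeatable fake 'First Last' names; no production data needed."""
    rng = random.Random(seed)  # fixed seed keeps the test deterministic

    def word() -> str:
        letters = rng.choices(string.ascii_lowercase, k=rng.randint(3, 8))
        return "".join(letters).capitalize()

    return [f"{word()} {word()}" for _ in range(count)]


def migrate_split_name(conn: sqlite3.Connection) -> None:
    # Hypothetical migration under test: add columns, then backfill them.
    conn.execute("ALTER TABLE customer ADD COLUMN first_name TEXT")
    conn.execute("ALTER TABLE customer ADD COLUMN last_name TEXT")
    rows = conn.execute("SELECT id, name FROM customer").fetchall()
    for row_id, name in rows:
        first, _, last = name.partition(" ")
        conn.execute(
            "UPDATE customer SET first_name = ?, last_name = ? WHERE id = ?",
            (first, last, row_id),
        )
    conn.commit()


def test_split_name_migration_preserves_rows() -> None:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    names = synthetic_names(50)
    conn.executemany("INSERT INTO customer (name) VALUES (?)", [(n,) for n in names])
    conn.commit()

    migrate_split_name(conn)

    migrated = conn.execute(
        "SELECT name, first_name, last_name FROM customer ORDER BY id"
    ).fetchall()
    assert len(migrated) == len(names)           # no rows lost or duplicated
    for original, (name, first, last) in zip(names, migrated):
        assert name == original                  # old column still readable
        assert f"{first} {last}" == original     # backfill matches the source
```

Because the test builds its own database and data, it can run on every commit in a CI pipeline without access to any shared environment.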
  • For machine learning systems:

    • Version control training data and models
    • Create deployment pipelines for ML models
    • Monitor model performance in production
    • Enable A/B testing of models
    • Plan for model updates and retraining
  • Make systems deterministic and repeatable:

    • Use infrastructure as code
    • Automate environment setup
    • Version all dependencies together
    • Enable rolling back to previous states
  • Design systems to handle evolution:

    • Allow structure to change over time
    • Plan for data migration needs upfront
    • Keep old data versions readable
    • Build migration capabilities into applications
  • Focus on fast feedback loops:

    • Automate testing and deployment
    • Make changes in small increments
    • Validate changes early
    • Monitor results in production