Continuous Delivery for Data • Dave Farley • YOW! 2020

Dave Farley

Learn how to apply Continuous Delivery principles to data changes, with proven patterns for safe schema migrations, version control, testing, and managing ML models in production.

Key takeaways
  • Continuous Delivery should be applied to data changes as well as code changes: the ability to deploy schema changes safely is crucial for true CD

  • Three main categories of data to consider:

    • Transactional data (generated during system operation)
    • Reference/lookup data (static, read-only)
    • Configuration data (defines system behavior)
  • Key data migration patterns:

    • Deployment-time migration (simple, but requires downtime)
    • Lazy reader (translates old data formats on read; good for hot deployments)
    • Lazy migrator (migrates data in the background during idle time)
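
As an illustration of the lazy reader pattern, here is a minimal sketch, assuming each stored record carries a `version` field. The record layout, the field names, and the `read_record` helper are hypothetical and are not taken from the talk.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]

CURRENT_VERSION = 3  # hypothetical current record format


def _v1_to_v2(rec: Record) -> Record:
    # Assumed change: v2 split the single "name" field into two fields.
    first, _, last = rec["name"].partition(" ")
    out = {k: v for k, v in rec.items() if k != "name"}
    out.update(first_name=first, last_name=last, version=2)
    return out


def _v2_to_v3(rec: Record) -> Record:
    # Assumed change: v3 added an optional "email" field.
    out = dict(rec)
    out.setdefault("email", None)
    out["version"] = 3
    return out


# One upgrade step per old version, applied in sequence.
UPGRADES: Dict[int, Callable[[Record], Record]] = {1: _v1_to_v2, 2: _v2_to_v3}


def read_record(stored: Record) -> Record:
    """Lazy reader: translate an old-format record to the current shape on read.

    The stored record is left untouched; only the returned copy is upgraded.
    """
    rec = dict(stored)
    while rec.get("version", 1) < CURRENT_VERSION:
        rec = UPGRADES[rec.get("version", 1)](rec)
    return rec
```

A lazy migrator could reuse the same upgrade steps, writing the result of `read_record` back to storage from a background job instead of translating on every read.
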
  • Version control everything:

    • Schema changes
    • Migration scripts
    • Data models
    • Configuration
    • Infrastructure code
  • Best practices for schema changes:

    • Make additive changes when possible
    • Version schemas with sequential numbers
    • Keep schema version info with application code
    • Write and test both upgrade and rollback scripts
    • Test migrations thoroughly
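
The schema practices above can be sketched as a small migration runner. This is one possible shape, assuming SQLite and its `PRAGMA user_version` as the place to record the schema version. The `MIGRATIONS` list, the table names, and the `migrate` function are hypothetical.

```python
import sqlite3

# Sequentially numbered migrations: (version, upgrade SQL, rollback SQL).
# Both changes are additive; the tables are made up for illustration.
MIGRATIONS = [
    (1,
     "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
     "DROP TABLE customer"),
    (2,
     "CREATE TABLE customer_email (customer_id INTEGER, email TEXT)",
     "DROP TABLE customer_email"),
]


def current_version(conn: sqlite3.Connection) -> int:
    # SQLite stores an application-defined integer in the database header.
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn: sqlite3.Connection, target: int) -> int:
    """Apply upgrade or rollback scripts one step at a time until target."""
    by_version = {v: (up, down) for v, up, down in MIGRATIONS}
    version = current_version(conn)
    while version < target:
        conn.execute(by_version[version + 1][0])  # upgrade script
        version += 1
        conn.execute(f"PRAGMA user_version = {version:d}")
    while version > target:
        conn.execute(by_version[version][1])  # rollback script
        version -= 1
        conn.execute(f"PRAGMA user_version = {version:d}")
    conn.commit()
    return version
```

Keeping the `MIGRATIONS` list in the same repository as the application lets a deployment compare the version the code expects with the version the database reports, then call `migrate` to close the gap in either direction.
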
  • Data testing approaches:

    • Generate synthetic test data within the scope of each test
    • Avoid using production data for tests
    • Focus on testing migration logic, not just final state
    • Include migration tests in CI pipeline
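
A migration test along these lines might look like the sketch below. It builds the old schema in an in-memory SQLite database, fills it with synthetic rows generated inside the test, runs the migration, and checks that the data survived. The schema, the `migrate_v1_to_v2` function, and the generated names are assumptions made for illustration.

```python
import sqlite3


def create_v1_schema(conn):
    # Hypothetical "old" schema that the migration starts from.
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def migrate_v1_to_v2(conn):
    """Hypothetical additive migration: add split name columns and backfill them."""
    conn.execute("ALTER TABLE customer ADD COLUMN first_name TEXT")
    conn.execute("ALTER TABLE customer ADD COLUMN last_name TEXT")
    rows = conn.execute("SELECT id, name FROM customer").fetchall()
    for row_id, name in rows:
        first, _, last = name.partition(" ")
        conn.execute(
            "UPDATE customer SET first_name = ?, last_name = ? WHERE id = ?",
            (first, last, row_id),
        )
    conn.commit()


def synthetic_customers(n):
    # Synthetic data created in test scope; no production data involved.
    return [(i, f"First{i} Last{i}") for i in range(1, n + 1)]


def test_migration_preserves_and_splits_names():
    conn = sqlite3.connect(":memory:")
    create_v1_schema(conn)
    conn.executemany(
        "INSERT INTO customer (id, name) VALUES (?, ?)", synthetic_customers(5)
    )
    migrate_v1_to_v2(conn)
    rows = conn.execute(
        "SELECT id, name, first_name, last_name FROM customer ORDER BY id"
    ).fetchall()
    assert len(rows) == 5  # no rows lost during migration
    for row_id, name, first, last in rows:
        assert name == f"First{row_id} Last{row_id}"  # old column untouched
        assert (first, last) == (f"First{row_id}", f"Last{row_id}")
    conn.close()
```

Because the test exercises the migration itself rather than a hand-built final state, it can run in the CI pipeline with a test runner such as pytest on every commit.
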
  • For machine learning systems:

    • Version control training data and models
    • Create deployment pipelines for ML models
    • Monitor model performance in production
    • Enable A/B testing of models
    • Plan for model updates and retraining
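
One way to enable A/B testing of models is to route each user to a model version deterministically, so the same user always sees the same variant. The sketch below assumes a hash-based split; the `assign_model` function and the model names are hypothetical and do not come from the talk.

```python
import hashlib
from typing import Dict


def assign_model(user_id: str, weights: Dict[str, float]) -> str:
    """Deterministically map a user to a model version in proportion to weights."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000  # stable value in [0, 1)
    total = sum(weights.values())
    cumulative = 0.0
    # Sort names so the mapping does not depend on dict insertion order.
    for name, weight in sorted(weights.items()):
        cumulative += weight / total
        if bucket < cumulative:
            return name
    return sorted(weights)[-1]  # guard against floating-point rounding


# Example: send roughly 10% of users to a candidate model.
variant = assign_model("user-123", {"model-v1": 0.9, "model-v2": 0.1})
```

Logging the assigned variant alongside each prediction makes it possible to compare the versions when monitoring model performance in production.
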
  • Make systems deterministic and repeatable:

    • Use infrastructure as code
    • Automate environment setup
    • Version all dependencies together
    • Enable rolling back to previous states
  • Design systems to handle evolution:

    • Allow structure to change over time
    • Plan for data migration needs upfront
    • Keep old data versions readable
    • Build migration capabilities into applications
  • Focus on fast feedback loops:

    • Automate testing and deployment
    • Make changes in small increments
    • Validate changes early
    • Monitor results in production