Failure & Change: Principles of Reliable Systems • Mark Hibberd • YOW! 2018

Learn principles for building reliable systems with Mark Hibberd: timeouts, service independence, graceful degradation, immutable logs, and testing against live traffic.

Key takeaways
  • Being reliable means consistently performing well: under stress, serving some requests is better than serving none

  • Timeouts are critical for reliability - they prevent cascading failures and help systems recover. Use aggressive timeouts, and retry with exponential backoff

  • Service independence is key - avoid coupling through shared data stores, health checks, or deployment processes. Keep services truly independent to isolate failures

  • Break systems down by behavior/responsibility rather than entities/nouns. Focus on independent functions rather than shared data models

  • Design for graceful degradation - disable non-critical features when dependencies fail rather than having the whole system fail

  • Test against live traffic - there’s no substitute for real-world verification. Have mechanisms to route a small percentage of traffic to new versions

  • Use immutable logs/append-only data stores rather than mutable state to improve reliability and recovery

  • Control the scope of failures through:

    • Service granularity
    • Circuit breakers
    • Request limiting
    • Graceful degradation

  • Deploy incrementally - avoid “big bang” deployments by gradually routing traffic to the new version

  • Reliability requires holistic thinking across:

    • Architecture
    • Operations
    • Monitoring
    • Testing
    • Deployment
    • Data management

  • Redundancy means accepting more individual component failures in exchange for more ways to handle them

  • Simple health checks (status endpoints) can be very effective for monitoring service health
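
The takeaways on aggressive timeouts, exponential backoff, and graceful degradation can be sketched together in a few lines. This is a minimal illustration, not code from the talk: the names call_with_backoff and backoff_delays, the default values, and the convention that the operation accepts a timeout and raises TimeoutError when it is exceeded are all assumptions made for the example.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, factor: float, attempts: int, cap: float) -> List[float]:
    """Waits between attempts: base, base*factor, base*factor**2, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts - 1)]


def call_with_backoff(
    operation: Callable[[float], T],   # hypothetical: takes a timeout in seconds
    fallback: T,                       # degraded result returned if every attempt fails
    timeout: float = 0.2,              # assumed "aggressive" per-attempt timeout
    attempts: int = 4,
    base: float = 0.1,
    factor: float = 2.0,
    cap: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call operation(timeout) up to `attempts` times, then degrade to `fallback`."""
    delays = backoff_delays(base, factor, attempts, cap)
    for attempt in range(attempts):
        try:
            return operation(timeout)
        except TimeoutError:
            # Wait longer after each failure; no wait after the final attempt.
            if attempt < len(delays):
                sleep(delays[attempt])
    # All attempts timed out: serve a degraded answer instead of failing outright.
    return fallback
```

In practice the operation would pass the timeout on to a socket or HTTP client. Real systems commonly add random jitter to the delays so that many clients do not retry in lockstep; it is left out here to keep the sketch deterministic.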