Failure & Change: Principles of Reliable Systems • Mark Hibberd • YOW! 2018

Learn principles for building reliable systems with Mark Hibberd: timeouts, service independence, graceful degradation, immutable logs, and testing against live traffic.

Key takeaways

Being reliable means performing well consistently - under stress, serving some requests is better than serving none
Timeouts are critical for reliability - they prevent cascading failures and help systems recover. Use aggressive timeouts, and retry with exponential backoff
Service independence is key - avoid coupling through shared data stores, health checks, or deployment processes. Keep services truly independent to isolate failures
Break systems down by behavior/responsibility rather than entities/nouns. Focus on independent functions rather than shared data models
Design for graceful degradation - disable non-critical features when dependencies fail rather than having the whole system fail
Test against live traffic - there’s no substitute for real-world verification. Have a mechanism to route a small percentage of traffic to new versions
Use immutable logs/append-only data stores rather than mutable state to improve reliability and recovery
Control scope of failures through:
- Service granularity
- Circuit breakers
- Request limiting
- Graceful degradation
Deployment should be incremental - avoid “big bang” deployments by gradually routing traffic
Reliability requires holistic thinking across:
- Architecture
- Operations
- Monitoring
- Testing
- Deployment
- Data management
Redundancy means accepting more individual failures but gaining more ways to handle those failures
Simple health checks (status endpoints) can be very effective for monitoring service health
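
The takeaway on aggressive timeouts with exponential backoff can be illustrated with a minimal Python sketch. This is not code from the talk: `call_with_backoff`, its parameters, and the default values are hypothetical, and the random jitter on each delay is an added assumption (a common companion to backoff so that many clients do not retry in lockstep).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[float], T],
    timeout_s: float = 0.5,      # hypothetical "aggressive" per-call timeout
    max_attempts: int = 4,
    base_delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(timeout_s), retrying on TimeoutError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            # The operation is expected to enforce the timeout itself and
            # raise TimeoutError when it is exceeded.
            return operation(timeout_s)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller degrade or fail fast
            # The delay ceiling doubles on each attempt, up to max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Assumption: pick a random delay in [0, ceiling] to spread retries out.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

A caller would pass a function that performs the real request with the given timeout, for example a wrapper around an HTTP client call that raises `TimeoutError` when the deadline passes.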
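
Circuit breakers and graceful degradation, both named in the takeaways above, often work together: when a dependency keeps failing, stop calling it for a while and serve a reduced result instead. The sketch below is one simplified way to do that, not the talk's implementation. The class name, thresholds, and the `fallback` callable are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Simplified breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: skip the dependency entirely and degrade gracefully.
                return fallback()
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

For a non-critical feature such as recommendations, `fallback` might simply return an empty list, so the page still renders while the dependency is down.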
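
Routing a small percentage of live traffic to a new version, as the testing and deployment takeaways suggest, could look roughly like the following. This is a sketch under assumptions: the talk summary does not say how routing is done, and the function name, the "stable" and "canary" labels, and the choice to hash a request ID are all hypothetical. Hashing makes the decision deterministic, so the same ID always lands on the same version.

```python
import hashlib


def route_version(request_id: str, canary_percent: float) -> str:
    """Return "canary" for roughly canary_percent % of request IDs, else "stable"."""
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000),
    # which gives a resolution of 0.01 %.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "stable"
```

Raising `canary_percent` step by step (for example 1, 5, 25, 100) is one way to make a deployment incremental rather than "big bang", with the option to drop back to 0 if the new version misbehaves.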
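
The takeaway about immutable, append-only logs can be made concrete with a small sketch. Instead of updating a balance in place, each change is recorded as an event and the current state is derived by replaying the log. The names `Event`, `AppendOnlyLog`, and `balances` are hypothetical, and the in-memory list stands in for whatever durable store a real system would use.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class Event:
    account: str
    delta: int


class AppendOnlyLog:
    """In-memory stand-in for an append-only store: no update, no delete."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def replay(self) -> Tuple[Event, ...]:
        # Return an immutable snapshot so callers cannot alter history.
        return tuple(self._events)


def balances(events: Iterable[Event]) -> Dict[str, int]:
    """Derive current state by folding over the events in order."""
    totals: Dict[str, int] = {}
    for event in events:
        totals[event.account] = totals.get(event.account, 0) + event.delta
    return totals
```

Because the log is never rewritten, recovering from a bad derived state is a matter of fixing the fold and replaying, and a mistaken entry is corrected by appending a compensating event rather than editing history.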
