Failure & Change: Principles of Reliable Systems • Mark Hibberd • YOW! 2018
Learn principles for building reliable systems with Mark Hibberd: timeouts, service independence, graceful degradation, immutable logs, and testing against live traffic.
- Being reliable means consistently performing well: serving some requests is better than serving none when under stress.
- Timeouts are critical for reliability: they prevent cascading failures and help systems recover. Use aggressive timeouts, with exponential backoff on retries.
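A minimal sketch of that pattern in Go; the endpoint, the 500ms per-attempt budget, and the retry count are illustrative, not from the talk:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// fetchWithRetry gives each attempt an aggressive timeout and backs off
// exponentially between attempts, so a slow dependency fails fast instead
// of tying up every caller.
func fetchWithRetry(url string, attempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 500 * time.Millisecond} // aggressive per-attempt timeout
	backoff := 100 * time.Millisecond
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("server error: %s", resp.Status)
		}
		time.Sleep(backoff) // exponential backoff before retrying
		backoff *= 2
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	if _, err := fetchWithRetry("http://localhost:8080/orders", 3); err != nil {
		fmt.Println("giving up:", err)
	}
}
```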
- Service independence is key: avoid coupling through shared data stores, health checks, or deployment processes. Keep services truly independent to isolate failures.
- Break systems down by behavior/responsibility rather than by entities/nouns. Focus on independent functions rather than shared data models.
- Design for graceful degradation: disable non-critical features when dependencies fail rather than having the whole system fail.
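A hedged Go sketch of what this can look like: the product content is critical, while the recommendations call stands in for a non-critical downstream dependency whose failure is absorbed rather than propagated (service names, ports, and the 300ms budget are assumptions):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// fetchRecommendations stands in for a call to a non-critical downstream service.
func fetchRecommendations() ([]string, error) {
	client := &http.Client{Timeout: 300 * time.Millisecond}
	resp, err := client.Get("http://localhost:9090/recommendations")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("recommendations returned %s", resp.Status)
	}
	// Body parsing is omitted; a placeholder stands in for decoded items.
	return []string{"placeholder-item"}, nil
}

func productHandler(w http.ResponseWriter, r *http.Request) {
	// Critical content is always served.
	fmt.Fprintln(w, "product: widget")

	// Non-critical content degrades: if the dependency fails, omit the
	// recommendations panel instead of failing the whole request.
	if recs, err := fetchRecommendations(); err == nil {
		fmt.Fprintln(w, "recommended:", recs)
	} else {
		fmt.Fprintln(w, "recommendations unavailable")
	}
}

func main() {
	http.HandleFunc("/product", productHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```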
- Test against live traffic: there’s no substitute for real-world verification. Have a mechanism to route a small percentage of traffic to new versions.
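A rough sketch of one such mechanism using Go's standard reverse proxy; the backend addresses and the 5% split are assumptions for illustration:

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Backend addresses are illustrative.
	stable, _ := url.Parse("http://localhost:8081")
	canary, _ := url.Parse("http://localhost:8082")

	stableProxy := httputil.NewSingleHostReverseProxy(stable)
	canaryProxy := httputil.NewSingleHostReverseProxy(canary)

	// Send roughly 5% of live traffic to the new version; watch it, then
	// ramp the percentage up (or roll back) based on what real traffic shows.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if rand.Float64() < 0.05 {
			canaryProxy.ServeHTTP(w, r)
			return
		}
		stableProxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```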
- Use immutable logs/append-only data stores rather than mutable state to improve reliability and recovery.
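A toy illustration of the append-only idea: current state is derived by replaying immutable events, so recovery is a replay rather than a repair of mutable state (the account/delta model is purely illustrative):

```go
package main

import "fmt"

// Event is an immutable fact about what happened; the log only ever appends.
type Event struct {
	Account string
	Delta   int
}

// EventLog is append-only: recovery and audit are a matter of replaying it.
type EventLog struct {
	events []Event
}

func (l *EventLog) Append(e Event) {
	l.events = append(l.events, e)
}

// Balance derives current state by folding over the log.
func (l *EventLog) Balance(account string) int {
	total := 0
	for _, e := range l.events {
		if e.Account == account {
			total += e.Delta
		}
	}
	return total
}

func main() {
	var ledger EventLog
	ledger.Append(Event{Account: "alice", Delta: 100})
	ledger.Append(Event{Account: "alice", Delta: -30})
	fmt.Println(ledger.Balance("alice")) // 70
}
```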
- Control the scope of failures through:
  - Service granularity
  - Circuit breakers (a minimal sketch follows this list)
  - Request limiting
  - Graceful degradation
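A minimal circuit-breaker sketch (the failure threshold and cool-down values are assumptions): after repeated failures it fails fast instead of piling more load onto an already-broken dependency.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open: failing fast")

// Breaker opens after too many consecutive failures and rejects calls
// immediately, limiting the scope of the failure; after a cool-down it
// lets calls through again to probe for recovery.
type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openUntil time.Time
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return ErrOpen // reject immediately while the circuit is open
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openUntil = time.Now().Add(b.cooldown) // open the circuit
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // success closes the circuit again
	return nil
}

func main() {
	b := &Breaker{threshold: 3, cooldown: 5 * time.Second}
	for i := 0; i < 5; i++ {
		fmt.Println(b.Call(func() error { return errors.New("dependency down") }))
	}
}
```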
- Deployment should be incremental: avoid “big bang” deployments by gradually routing traffic to the new version.
- Reliability requires holistic thinking across:
  - Architecture
  - Operations
  - Monitoring
  - Testing
  - Deployment
  - Data management
- Redundancy means accepting more individual failures but gaining more ways to handle those failures.
- Simple health checks (status endpoints) can be very effective for monitoring service health.
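For instance, a trivial status endpoint like the sketch below is often enough for a load balancer or monitor to stop routing to an unhealthy instance (the path and port are illustrative):

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// If the process can answer 200 OK here, monitors and load balancers can
	// treat the instance as healthy, and route away the moment it stops answering.
	http.HandleFunc("/healthcheck", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```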