Lies, Damned Lies & Timeouts Engineering • Yao Yue • YOW! 2017
Learn why timeouts & retries in distributed systems are more complex than they seem, how to measure & configure them properly, and strategies for graceful degradation.
- Timeouts do not imply causality: they are approximations and do not reliably indicate the actual health or state of a remote service
- Modern services run on deep, complex stacks (JVM, containers, OS, hardware) with many hidden dependencies and potential failure points that are often invisible to developers
- Retry logic can make system problems worse by creating positive feedback loops that amplify load on already stressed services
- Multi-tenant environments (shared hosts/containers) make reliability more challenging due to resource contention and interference between services
- Configure timeouts and retries based on actual system behavior and measurements, not just intuition:
  - Test common failure modes frequently
  - Monitor and trace request lifecycles
  - Consider baseline performance characteristics
  - Account for GC pauses and system interruptions
- Apply back pressure mechanisms when possible and fail fast rather than retrying indefinitely
- Preventing failure globally is better than preventing it locally: coordinate across services rather than having each one make independent decisions
- Have clear budgets for retries and timeouts, especially at the top of your dependency chain
- Test catastrophic failure scenarios, not just common cases or ideal conditions
- Perfect reliability is impossible in distributed systems, so design for graceful degradation and acceptable failure modes
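
Several of the takeaways above (measurement-based timeouts, retry budgets, failing fast) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the talk: the names `timeout_from_samples`, `RetryBudget`, and `call_with_deadline` are hypothetical, the percentile, headroom, and backoff values are arbitrary placeholders, and the remote call is assumed to accept a timeout in seconds and raise the built-in `TimeoutError` when it expires.

```python
import math
import random
import time


def timeout_from_samples(latencies, percentile=0.99, headroom=1.5):
    """Derive a timeout from measured latencies rather than intuition."""
    ordered = sorted(latencies)
    # Nearest-rank percentile: the smallest sample that has at least
    # `percentile` of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1] * headroom


class RetryBudget:
    """Caps retries at a fraction of first attempts to limit load amplification."""

    def __init__(self, ratio=0.1, initial_tokens=1.0, max_tokens=10.0):
        self.ratio = ratio
        self.tokens = initial_tokens
        self.max_tokens = max_tokens

    def record_request(self):
        # Every first attempt earns a fraction of one retry token.
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        # A retry costs one whole token; refuse once the budget is exhausted.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_deadline(operation, budget, deadline_s, attempt_timeout_s,
                       base_backoff_s=0.05, clock=time.monotonic,
                       sleep=time.sleep):
    """Call operation(timeout) within an overall deadline, retrying on budget."""
    start = clock()
    budget.record_request()
    attempt = 0
    while True:
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            raise TimeoutError("overall deadline exceeded")
        try:
            # Never give one attempt more time than the caller has left.
            return operation(min(attempt_timeout_s, remaining))
        except TimeoutError:
            if not budget.try_spend():
                raise  # fail fast instead of retrying indefinitely
            # Exponential backoff with full jitter, capped by the time left.
            left = deadline_s - (clock() - start)
            cap = max(0.0, min(left, base_backoff_s * 2 ** attempt))
            sleep(random.uniform(0, cap))
            attempt += 1
```

Sharing one `RetryBudget` across all calls to a given dependency is what keeps the retry rate proportional to normal traffic. Setting `deadline_s` once at the top of the dependency chain and passing the remaining time downstream is one way to give the whole request a single budget.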