Lies, Damned Lies & Timeouts Engineering • Yao Yue • YOW! 2017


Learn why timeouts & retries in distributed systems are more complex than they seem, how to measure & configure them properly, and strategies for graceful degradation.

Key takeaways

Timeouts do not imply causality: a timeout is only an approximation and does not reliably indicate the actual health or state of a remote service
Modern services run on deep, complex stacks (JVM, containers, OS, hardware) with many hidden dependencies and potential failure points that are often invisible to developers
Retry logic can actually make system problems worse by creating positive feedback loops and amplifying load on already stressed services
Multi-tenant environments (shared hosts/containers) make reliability more challenging due to resource contention and interference between services
Configure timeouts and retries based on actual system behavior and measurements, not just intuition:
- Test common failure modes frequently
- Monitor and trace request lifecycles
- Consider baseline performance characteristics
- Account for GC pauses and system interruptions
Apply backpressure where possible, and fail fast rather than retrying indefinitely
Preventing failure globally works better than preventing it locally: coordinate across services rather than letting each one make independent decisions
Have clear budgets for retries and timeouts, especially at the top of your dependency chain
Test catastrophic failure scenarios, not just common cases or ideal conditions
Perfect reliability is impossible in distributed systems, so design for graceful degradation and acceptable failure modes
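
The takeaways on retry amplification, failing fast, and explicit budgets can be made concrete with a small sketch. The code below is illustrative only and is not taken from the talk: `RetryBudget`, `call_with_budget`, `DeadlineExceeded`, and all parameter names are hypothetical. It assumes the wrapped operation accepts a timeout in seconds and raises `TimeoutError` when that timeout expires. Retries are paid for out of a budget earned by first attempts, and every attempt is capped by the time left before an overall deadline.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the overall deadline passes before an attempt can start."""


class RetryBudget:
    """Hypothetical retry budget: each first attempt earns `ratio` tokens,
    and each retry spends one whole token."""

    def __init__(self, ratio=0.25, max_tokens=10.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = 0.0

    def record_request(self):
        # Earn a fraction of a retry for every first attempt, up to a cap.
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        # Allow a retry only if a whole token is available.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_budget(op, budget, deadline, per_try_timeout, clock=time.monotonic):
    """Call op(timeout) until it succeeds, the deadline passes,
    or the retry budget is exhausted.

    `deadline` is an absolute time on the same scale as `clock()`.
    """
    budget.record_request()
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            # Fail fast: no time left, so do not add load downstream.
            raise DeadlineExceeded("overall deadline passed")
        try:
            # Never wait longer than the time left in the overall deadline.
            return op(min(per_try_timeout, remaining))
        except TimeoutError:
            if not budget.try_spend():
                # Budget exhausted: surface the failure instead of retrying.
                raise


if __name__ == "__main__":
    def flaky(timeout):
        # Stand-in for a remote call that always times out.
        raise TimeoutError("simulated timeout")

    budget = RetryBudget(ratio=0.5)
    for i in range(4):
        try:
            call_with_budget(flaky, budget, time.monotonic() + 1.0, 0.1)
        except TimeoutError:
            print("request", i, "failed; retry tokens left:", budget.tokens)
```

With a ratio of 0.5, at most one retry is issued for every two first attempts, so retry traffic stays bounded relative to normal traffic even when every call is failing. The ratio, the token cap, and the per-try timeout are placeholders; in line with the takeaways above, they would need to come from measurements of the actual system rather than from intuition.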
