Lies, Damned Lies & Timeouts Engineering • Yao Yue • YOW! 2017

Learn why timeouts & retries in distributed systems are more complex than they seem, how to measure & configure them properly, and which strategies support graceful degradation.

Key takeaways
  • A timeout does not establish causality - it is only an approximation and does not reliably indicate the actual health or state of a remote service

  • Modern services run on deep, complex stacks (JVM, containers, OS, hardware) with many hidden dependencies and potential failure points that are often invisible to developers

  • Retry logic can actually make system problems worse by creating positive feedback loops and amplifying load on already stressed services

  • Multi-tenant environments (shared hosts/containers) make reliability more challenging due to resource contention and interference between services

  • Configure timeouts and retries based on measured system behavior, not intuition alone:

    • Test common failure modes frequently
    • Monitor and trace request lifecycles
    • Consider baseline performance characteristics
    • Account for GC pauses and other system interruptions

  • Apply back pressure where possible, and fail fast rather than retrying indefinitely

  • Prevent failures globally rather than locally - coordinate across services instead of letting each one make independent decisions

  • Set explicit budgets for retries and timeouts, especially at the top of the dependency chain

  • Test catastrophic failure scenarios, not just common cases or ideal conditions

  • Perfect reliability is impossible in distributed systems - design for graceful degradation and acceptable failure modes
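
The takeaways on retry amplification, failing fast, and explicit budgets can be made concrete with a small sketch. The Python below is an illustration only, not code from the talk: `RetryBudget`, `call_with_budget`, the two exception classes, and every default value (the 10% retry ratio, the 50 ms base backoff) are hypothetical names and numbers chosen for the example. It assumes the wrapped operation accepts a per-attempt timeout in seconds and raises the built-in `TimeoutError` when that timeout expires.

```python
import time


class DeadlineExceeded(Exception):
    """The overall time budget for the request was used up."""


class RetryBudgetExhausted(Exception):
    """No retry tokens are left, so the caller fails fast."""


class RetryBudget:
    """Hypothetical helper: retries are allowed only as a fraction of first attempts."""

    def __init__(self, ratio=0.1, initial_tokens=1.0, max_tokens=10.0):
        self.ratio = ratio            # assumed value: one retry earned per ten requests
        self.max_tokens = max_tokens  # cap so an idle period cannot bank unlimited retries
        self.tokens = initial_tokens

    def record_request(self):
        # Every first attempt earns a fraction of a retry token.
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        # A retry costs one whole token; refuse when the budget is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_budget(op, budget, deadline_s, per_try_timeout_s,
                     base_backoff_s=0.05, clock=time.monotonic, sleep=time.sleep):
    """Call op(timeout_s) until it succeeds, the deadline passes, or the budget is empty."""
    deadline = clock() + deadline_s
    budget.record_request()
    attempt = 0
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise DeadlineExceeded("overall deadline spent")
        try:
            # Never give one attempt more time than the overall deadline has left.
            return op(min(per_try_timeout_s, remaining))
        except TimeoutError:
            if not budget.try_spend():
                # Fail fast instead of adding load to a service that is already slow.
                raise RetryBudgetExhausted("no retry tokens left")
            backoff = base_backoff_s * (2 ** attempt)  # exponential backoff
            attempt += 1
            # Wait before retrying, but never sleep past the overall deadline.
            sleep(min(backoff, max(0.0, deadline - clock())))
```

Sharing one `RetryBudget` instance across all calls to the same downstream service caps retries at roughly `ratio` times the request rate, however many individual requests time out. The reason to budget at the top of the chain is arithmetic: if three stacked layers each made up to three retries on their own, a single user request could become as many as (1 + 3)^3 = 64 attempts at the bottom layer.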