"Hodor: Detecting and Addressing Overload in LinkedIn Microservices" by Bryan Barkley

Hodor is the framework LinkedIn uses to detect and address overload in its microservices: it detects overload early, sheds traffic gradually, and adapts to changing traffic patterns.

Key takeaways
  • Overload detection and remediation are crucial for microservices, because an overwhelmed service can quickly trigger cascading failures in the services that depend on it.
  • Hodor is a monitoring framework developed by LinkedIn to detect and address overload in microservices.
  • Design principles of Hodor include detecting overload early, conservatively signaling overload, and shedding traffic progressively.
  • Hodor has three main components: overload detectors, a load shedding strategy, and data analysis.
  • Overload detectors include heartbeat, garbage collection, and thread pool detectors, which monitor specific metrics to detect overload.
  • The load shedding strategy sheds traffic gradually rather than all at once, to prevent cascading failures and retry storms.
  • Data analysis involves collecting and analyzing metrics to refine overload detection and improve load shedding strategies.
  • Hodor has been deployed to close to a thousand services in production, with no measurable overhead.
  • Hodor is designed to be extensible and modular, allowing for easy addition of new detectors and integration with existing systems.
  • The framework is also designed to be self-healing, allowing it to adapt to changing traffic patterns and service behavior.
  • Future plans for Hodor include adding additional detectors and improving data analysis capabilities.
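
The detector and load shedding takeaways above can be made concrete with a small sketch. This is not Hodor's implementation: the class names, thresholds, and step sizes are all hypothetical, chosen only to illustrate the ideas. The detector assumes a heartbeat scheduled at a fixed interval and treats late wake-ups as a sign of resource starvation; it signals conservatively by requiring several consecutive late heartbeats. The shedder raises or lowers the share of rejected traffic one step at a time instead of dropping everything at once.

```python
import random
from dataclasses import dataclass


@dataclass
class HeartbeatDetector:
    """Flags overload when a periodic heartbeat wakes up late repeatedly.

    All values are illustrative assumptions, not Hodor's real settings.
    """
    interval_ms: float = 10.0    # assumed scheduling interval of the heartbeat
    max_delay_ms: float = 5.0    # lateness beyond this counts as a violation
    violations_needed: int = 3   # consecutive violations before signaling
    _streak: int = 0             # current run of consecutive violations

    def observe(self, actual_gap_ms: float) -> bool:
        """Record the measured gap between two heartbeats.

        Returns True once enough consecutive heartbeats were late.
        """
        delay = actual_gap_ms - self.interval_ms
        self._streak = self._streak + 1 if delay > self.max_delay_ms else 0
        return self._streak >= self.violations_needed


@dataclass
class ProgressiveShedder:
    """Adjusts the percentage of rejected requests in small steps."""
    step_percent: int = 10       # hypothetical change applied per update
    max_drop_percent: int = 90   # never reject all traffic in this sketch
    drop_percent: int = 0        # current share of requests to reject

    def update(self, overloaded: bool) -> int:
        """Shed a bit more while overloaded, recover a bit when healthy."""
        if overloaded:
            self.drop_percent = min(self.max_drop_percent,
                                    self.drop_percent + self.step_percent)
        else:
            self.drop_percent = max(0, self.drop_percent - self.step_percent)
        return self.drop_percent

    def should_reject(self, rng=random.random) -> bool:
        """Decide for one request; rng must return a float in [0, 1)."""
        return rng() * 100 < self.drop_percent


# Usage: feed measured heartbeat gaps (in ms) through detector and shedder.
detector = HeartbeatDetector()
shedder = ProgressiveShedder()
history = [shedder.update(detector.observe(gap))
           for gap in [10, 11, 30, 40, 50, 60, 10, 10]]
# history == [0, 0, 0, 0, 10, 20, 10, 0]
```

In this run, shedding starts only at the third consecutive late heartbeat and backs off one step at a time once heartbeats are on schedule again. A production system would likely measure gaps from a real scheduled thread and tune every constant from observed data.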