A Field Guide to Reliability Engineering at Zalando • Heinrich Hartmann • GOTO 2024

Learn how Zalando builds reliable systems through user-focused observability, automated SLOs, tight feedback loops, standardized monitoring, and a blameless culture.

Key takeaways
  • Focus on user experience as the north star metric: alerts and monitoring should be based on user-facing symptoms rather than internal metrics

  • Build observability with distributed tracing as the foundation; this enables a top-down understanding of reliability and rapid debugging

  • Implement tight feedback loops through:

    • Automated alerting based on SLOs
    • Incident management processes
    • Regular operational reviews
    • Postmortem analysis
  • Balance the reliability triangle between:

    • System reliability
    • Developer productivity
    • On-call health
  • Standardize observability through:

    • Common dashboards
    • OpenTelemetry instrumentation
    • Golden signals (request rate, errors, duration, saturation)
    • Service health metrics
  • Enable teams to operate autonomously while maintaining organizational visibility through:

    • Clear team ownership boundaries
    • “You build it, you run it” culture
    • A clear split between platform capabilities and team responsibilities
  • Quantify reliability everywhere using:

    • SLOs tied to business metrics
    • Impact analysis of incidents
    • Risk management tracking
    • Regular reporting to leadership
  • Create a blameless culture focused on:

    • Learning from incidents
    • System improvements
    • Cross-team knowledge sharing
    • Continuous reliability improvements
  • Treat reliability as a socio-technical system requiring:

    • Process and people considerations
    • Technology solutions
    • Management support
    • Organizational alignment
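
To make "automated alerting based on SLOs" concrete, here is a minimal sketch of burn-rate alerting on a user-facing operation. It is an illustration under stated assumptions, not Zalando's implementation: the names Window, burn_rate, and should_page are hypothetical, and the multi-window check with a threshold such as 14.4 follows a commonly cited convention rather than anything stated in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts for one user-facing operation over a time window (hypothetical type)."""
    total_requests: int
    failed_requests: int


def error_rate(window: Window) -> float:
    """Fraction of failed requests; an empty window counts as healthy."""
    if window.total_requests == 0:
        return 0.0
    return window.failed_requests / window.total_requests


def burn_rate(window: Window, slo_target: float) -> float:
    """Speed at which the error budget is consumed; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate(window) / error_budget


def should_page(short: Window, long: Window, slo_target: float, threshold: float) -> bool:
    """Page only if both windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening. The threshold is an assumed tuning knob.
    """
    return (
        burn_rate(short, slo_target) >= threshold
        and burn_rate(long, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Assumed example numbers for a checkout operation with a 99.9% SLO.
    last_5_minutes = Window(total_requests=10_000, failed_requests=300)
    last_hour = Window(total_requests=100_000, failed_requests=2_000)
    print(should_page(last_5_minutes, last_hour, slo_target=0.999, threshold=14.4))
```

Because the inputs are failed and total requests as seen by the user, the alert fires on user-facing symptoms rather than internal metrics, and the same burn-rate number can be reused in operational reviews and leadership reporting.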