A Field Guide to Reliability Engineering at Zalando • Heinrich Hartmann • GOTO 2024
Learn how Zalando builds reliable systems through user-focused observability, automated SLOs, tight feedback loops, standardized monitoring, and a blameless culture.
-
Focus on user experience as the north star metric: alerts and monitoring should be based on user-facing symptoms rather than internal metrics
-
Build observability with distributed tracing as the foundation: it enables a top-down understanding of reliability and rapid debugging
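To illustrate what "top-down" can mean in practice, here is a minimal sketch that treats a trace as a tree of spans and walks from the user-facing root operation down to the deepest failing span. The `Span` class, the `error_path` helper, and the span names are hypothetical and chosen for illustration; they are not taken from the talk or from any tracing library.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work in a trace (hypothetical, simplified model)."""
    name: str
    error: bool = False
    children: list = field(default_factory=list)


def error_path(span):
    """Return the span names from the root down to the deepest failing span.

    An empty list means the root operation succeeded, so there is
    nothing to debug from the user's point of view.
    """
    if not span.error:
        return []
    for child in span.children:
        path = error_path(child)
        if path:
            return [span.name] + path
    # No failing child: this span is where the error originated.
    return [span.name]


# Hypothetical trace of a failed checkout request.
trace = Span("checkout", error=True, children=[
    Span("load-cart"),
    Span("payment", error=True, children=[
        Span("payment-provider-call", error=True),
    ]),
])

print(" -> ".join(error_path(trace)))
```

Starting from the user-facing operation keeps the investigation anchored on the symptom a customer actually saw, and the trace narrows it down to the responsible dependency.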
-
Implement tight feedback loops through:
- Automated alerting based on SLOs
- Incident management processes
- Regular operational reviews
- Postmortem analysis
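As a rough sketch of SLO-based alerting, the snippet below pages only when the error budget is burning fast in both a long and a short window. The `Window` type, the function names, the 99.9% target, and the 14.4 threshold are assumptions for illustration (the threshold is a commonly cited value for a 1-hour window paired with a 5-minute window); the talk does not specify Zalando's actual alerting rules.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one time window (hypothetical model)."""
    total: int   # all requests seen in the window
    failed: int  # requests that failed from the user's point of view


def burn_rate(window: Window, slo_target: float) -> float:
    """Error rate divided by the allowed error rate; 1.0 means exactly on budget."""
    if window.total == 0:
        return 0.0
    error_rate = window.failed / window.total
    allowed_error_rate = 1.0 - slo_target
    return error_rate / allowed_error_rate


def should_page(long_window: Window, short_window: Window,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert resets quickly after recovery.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


# Ongoing incident: 2% errors over the last hour, 3% over the last 5 minutes.
print(should_page(Window(100_000, 2_000), Window(10_000, 300)))

# Recovered incident: the last 5 minutes are healthy again.
print(should_page(Window(100_000, 2_000), Window(10_000, 5)))
```

Because the inputs are counts of failed user requests, the alert fires on user-facing symptoms rather than on internal signals such as CPU usage.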
-
Balance the reliability triangle between:
- System reliability
- Developer productivity
- On-call health
-
Standardize observability through:
- Common dashboards
- OpenTelemetry instrumentation
- Golden signals (request errors, duration, saturation)
- Service health metrics
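The signals listed above can be sketched as a small computation over raw request records. This is only an illustration: the `Request` type, the `golden_signals` function, the choice of the 99th percentile, and the worker-pool notion of saturation are assumptions, not the standard dashboards described in the talk.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    duration_ms: float
    status: int


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, busy_workers, total_workers):
    """Summarize request errors, duration, and saturation for one service."""
    count = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        "request_count": count,
        "error_ratio": errors / count if count else 0.0,
        "p99_duration_ms": percentile(durations, 99) if count else 0.0,
        # Saturation here is the share of the worker pool in use (an assumption).
        "saturation": busy_workers / total_workers,
    }


sample = [Request(10, 200)] * 98 + [Request(500, 500), Request(900, 503)]
print(golden_signals(sample, busy_workers=6, total_workers=8))
```

Computing the same small set of signals for every service is what makes dashboards comparable across teams.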
-
Enable teams to operate autonomously while maintaining organizational visibility through:
- Clear team ownership boundaries
- “You build it, you run it” culture
- A clear split between platform capabilities and team responsibilities
-
Quantify reliability everywhere using:
- SLOs tied to business metrics
- Impact analysis of incidents
- Risk management tracking
- Regular reporting to leadership
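One common way to put a number on reliability is an error budget derived from the SLO. The sketch below assumes an availability SLO measured over a 28-day period and treats each incident as minutes of full outage, which is a simplification; the function names and figures are hypothetical and do not come from the talk.

```python
def error_budget_minutes(slo_target: float, period_days: int = 28) -> float:
    """Minutes of full unavailability the SLO tolerates over the period."""
    return (1.0 - slo_target) * period_days * 24 * 60


def budget_remaining(slo_target: float, incident_minutes, period_days: int = 28) -> float:
    """Fraction of the error budget left after the given incidents.

    A negative value means the SLO was violated in this period.
    """
    budget = error_budget_minutes(slo_target, period_days)
    return (budget - sum(incident_minutes)) / budget


# A 99.9% SLO over 28 days, with two incidents of 12 and 8 minutes.
print(round(error_budget_minutes(0.999), 2))
print(round(budget_remaining(0.999, [12, 8]), 3))
```

A single figure such as "half of this period's budget is spent" is easy to include in a regular leadership report and to compare across services.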
-
Create a blameless culture focused on:
- Learning from incidents
- System improvements
- Cross-team knowledge sharing
- Continuous reliability improvements
-
Treat reliability as a socio-technical system requiring:
- Process and people considerations
- Technology solutions
- Management support
- Organizational alignment