A Field Guide to Reliability Engineering at Zalando • Heinrich Hartmann • GOTO 2024


Learn how Zalando builds reliable systems through user-focused observability, automated SLOs, tight feedback loops, standardized monitoring, and a blameless culture.

Key takeaways

Treat user experience as the north-star metric: base alerts and monitoring on user-facing symptoms rather than internal metrics
Build observability on distributed tracing as the foundation: it enables a top-down understanding of reliability and rapid debugging
Implement tight feedback loops through:
- Automated alerting based on SLOs
- Incident management processes
- Regular operational reviews
- Postmortem analysis
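
As a rough illustration of "automated alerting based on SLOs", the sketch below pages only when the error budget is burning quickly in both a short and a long look-back window. This is a commonly used multi-window burn-rate pattern, not necessarily how Zalando implements it. The names `Window`, `burn_rate`, and `should_page`, the 99.9% target, and the 14.4 threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one look-back window (hypothetical shape)."""
    total: int
    failed: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how fast the error budget is consumed; 1.0 means exactly on budget."""
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    error_ratio = window.failed / window.total
    budget = 1.0 - slo_target  # allowed error ratio under the SLO
    return error_ratio / budget


def should_page(short: Window, long: Window,
                slo_target: float = 0.999,   # assumed SLO target
                threshold: float = 14.4) -> bool:  # assumed burn-rate threshold
    """Page only if both windows burn budget at or above the threshold.

    The short window makes the alert reset quickly once the problem stops;
    the long window filters out brief blips.
    """
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)
```

For example, `should_page(Window(1000, 20), Window(12000, 200))` pages, because both windows show a user-facing error ratio well above what the budget allows, while a spike that appears only in the short window does not.
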
Balance the reliability triangle between:
- System reliability
- Developer productivity
- On-call health
Standardize observability through:
- Common dashboards
- OpenTelemetry instrumentation
- Golden signals (requests, errors, duration, saturation)
- Service health metrics
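
To make the golden signals concrete, here is a minimal sketch that summarizes one service's requests over a time window into a request rate, an error ratio, a p99 duration, and a saturation figure. The `Request` record, the `golden_signals` function, the dictionary keys, and the choice of "in-flight requests divided by capacity" as the saturation measure are assumptions for illustration; in practice these values would typically be derived from tracing or metrics data rather than computed by hand.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests: list[Request], window_s: float,
                   in_flight: int, capacity: int) -> dict[str, float]:
    """Summarize a window of requests into four standardized signals."""
    saturation = in_flight / capacity  # assumed definition of saturation
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0,
                "p99_ms": 0.0, "saturation": saturation}

    # Count server-side failures only; 4xx responses are treated as client errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]

    # quantiles() needs at least two data points; the "inclusive" method
    # keeps the estimate inside the observed range.
    if len(durations) > 1:
        p99 = quantiles(durations, n=100, method="inclusive")[98]
    else:
        p99 = durations[0]

    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p99_ms": p99,
        "saturation": saturation,
    }
```

Computing the same four numbers the same way for every service is what makes a common dashboard possible: any team's service can be read without first learning its custom metrics.
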
Enable teams to operate autonomously while maintaining organizational visibility through:
- Clear team ownership boundaries
- “You build it, you run it” culture
- A clear split between platform capabilities and team responsibilities
Quantify reliability everywhere using:
- SLOs tied to business metrics
- Impact analysis of incidents
- Risk management tracking
- Regular reporting to leadership
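
One way to quantify incident impact against an SLO is to express each incident as user-weighted "bad minutes" and subtract it from the period's error budget. The sketch below assumes an availability-style SLO over a 30-day period. The `Incident` record, the function names, the incident names used in the usage note, and the impact formula (duration multiplied by the fraction of users affected) are illustrative assumptions, not a description of Zalando's reporting.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Minimal incident record (hypothetical fields)."""
    name: str
    duration_min: float
    users_affected: float  # fraction of users affected, between 0.0 and 1.0


def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Total minutes of full outage the SLO tolerates in the period."""
    return (1.0 - slo_target) * period_days * 24 * 60


def budget_report(slo_target: float, incidents: list[Incident],
                  period_days: int = 30) -> dict[str, float]:
    """Summarize budget consumption for a leadership-style report."""
    budget = error_budget_minutes(slo_target, period_days)
    # Weight each incident by how many users actually saw it.
    consumed = sum(i.duration_min * i.users_affected for i in incidents)
    return {
        "budget_min": budget,
        "consumed_min": consumed,
        "remaining_min": budget - consumed,
        "consumed_fraction": consumed / budget,
    }
```

With a 99.9% target the 30-day budget is about 43.2 minutes. A 20-minute incident affecting half of all users plus a 60-minute incident affecting a tenth of them, for instance `Incident("checkout-outage", 20, 0.5)` and `Incident("search-degraded", 60, 0.1)`, consume 16 of those minutes, which gives a single number that can be tracked and reported over time.
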
Create a blameless culture focused on:
- Learning from incidents
- System improvements
- Cross-team knowledge sharing
- Continuous reliability improvements
Treat reliability as a socio-technical system requiring:
- Process and people considerations
- Technology solutions
- Management support
- Organizational alignment
