Cultivating Production Excellence • Liz Fong-Jones • YOW! 2019

Production excellence takes more than building systems.

Key takeaways
  • Building the system is only part of the job; it also has to work reliably for users.
  • Redundancy and failover are essential, but don’t assume everything will always work as expected.
  • Obscure or noisy alerts are detrimental; it’s important to have a clear understanding of what constitutes an emergency.
  • The concept of service level objectives (SLOs) should be revisited, and we need to measure availability and reliability in a more meaningful way.
  • Rather than over-documenting, we should focus on understanding critical user journeys and what affects users’ experience of them.
  • Complexity should be addressed through ergonomic instrumentation paths, efficient data storage, and collaboration.
  • Teams need to communicate effectively and have shared views on data to make decisions.
  • Engineers should be empowered to make decisions and ask questions, and should be valued and rewarded for their contributions.
  • Culture and processes play a significant role in making systems reliable and friendly.
  • Measuring what’s important, such as user satisfaction and experience, is crucial.
  • It’s important to iterate on our approach, testing and refining our methods, rather than sticking to a single framework or tool.
  • We should prioritize building up the skills and abilities of individuals and teams, rather than relying on tools alone.
  • The concept of a “blameless postmortem” is important, as it encourages learning from failures and improves our understanding of the system.
  • Collaboration and communication are key to addressing outages and incidents, and we should focus on empowering individuals to make decisions.
  • The frequency and severity of outages can have a significant impact on users, and we should prioritize reducing the impact of these incidents.
  • Observability is essential, and we should invest in tools and culture to enable this.
  • We should not be afraid to ask questions or challenge assumptions, and should prioritize the well-being and satisfaction of our users.
  • The human element is essential to making systems reliable; tools alone do not get us there.
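
The takeaway about service level objectives can be made concrete with a small calculation. The sketch below is not taken from the talk: the `SloWindow` class, its field names, and the numbers in the usage note are all hypothetical, and it assumes availability is measured as the fraction of "good" events (for example, requests that succeeded fast enough) over a window.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Hypothetical event counts for one SLO measurement window."""

    target: float       # assumed SLO target, e.g. 0.999 for "three nines"
    total_events: int   # all eligible events (e.g. user requests) in the window
    good_events: int    # events that met the success criteria

    def availability(self) -> float:
        """Fraction of events that were good; an empty window counts as healthy."""
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events

    def allowed_bad_events(self) -> float:
        """Error budget: how many bad events the target tolerates in this window."""
        return (1.0 - self.target) * self.total_events

    def budget_remaining(self) -> float:
        """Error budget left; a negative value means the SLO has been missed."""
        bad_events = self.total_events - self.good_events
        return self.allowed_bad_events() - bad_events
```

As a usage example with made-up numbers, `SloWindow(target=0.999, total_events=1_000_000, good_events=999_500)` has an availability of 0.9995 and has spent roughly half of its budget of about 1,000 bad events. Counting events that users actually experienced, rather than machine uptime, is one way to tie the measurement to critical user journeys.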