Ramón Medrano Llamas – Oops. I broke the Google, now what?

Site Reliability Engineers: learn how to develop and run production services, prevent outages, and improve service availability and customer satisfaction through techniques like automation, logging, and code reviews.

Key takeaways
  • In software development, measuring and understanding service health is crucial. This can be achieved through Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
  • Monitoring tools can be used to detect anomalies, but postmortems are necessary to understand the root cause of outages.
  • Automating quota management can help prevent overload and the outages it causes.
  • Site Reliability Engineers (SREs) are responsible for developing and running production services.
  • Reducing toil, that is, cutting the time spent on manual, repetitive maintenance tasks, can improve service availability and customer satisfaction.
  • Providing training and guidance on writing effective postmortems can help teams learn from failures and improve their services.
  • Detecting and responding to issues quickly is important; techniques like exponential back-off on retries and thorough logging can help with this.
  • Code reviews, unit tests, and automation can help reduce bugs and improve service reliability.
  • Having multiple replicas of services can help with redundancy and availability.
  • Failures are inevitable, and a blameless postmortem culture is important for learning from failures and improving services.
  • Monitoring and logging can help identify and diagnose issues.
  • Predictive maintenance can help prevent outages by detecting anomalies and issues before they become problems.
  • SREs should be responsible for the development and running of production services.
  • Code coverage and automated testing can help ensure reliability.
  • There is no single solution for preventing all outages, but a combination of techniques and practices can help reduce their frequency and impact.
  • Monitoring tools can help identify and alert operators to issues, but human judgment is still necessary to interpret data and make decisions.
  • The goal of SLOs is to define acceptable service levels and to measure the service against them.
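
As a rough illustration of the SLI and SLO takeaways, the sketch below computes a request-based availability SLI and compares it against an SLO target. The names `evaluate_slo` and `SloReport`, the sample request counts, and the 99.9% target are all hypothetical and not taken from the talk. The "error budget" figure is the common SRE notion of how many failures the SLO still tolerates, assumed here for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    slo_met: bool            # True when the SLI is at or above the target
    budget_remaining: float  # fraction of the error budget still unspent


def evaluate_slo(good: int, total: int, target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target.

    Hypothetical helper: `good` and `total` are request counts over some
    measurement window, `target` is the SLO as a fraction (e.g. 0.999).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    sli = good / total
    allowed_bad = (1.0 - target) * total  # failures the SLO tolerates
    actual_bad = total - good
    # Goes negative once more failures occurred than the SLO allows.
    budget_remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli, sli >= target, budget_remaining)


# Made-up numbers: 500 failed requests out of one million, 99.9% target.
report = evaluate_slo(good=999_500, total=1_000_000, target=0.999)
print(report)
```

With these made-up numbers the service meets its objective and has spent about half of its error budget. A negative `budget_remaining` would mean the objective was missed for the window.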