Tanat Lokejaroenlarb - Observability to Resolution: The Journey Through a Production K8s Incident

Join Tanat Lokejaroenlarb as he shares his journey through a production Kubernetes incident, highlighting the importance of observability, incident management, and data-driven decision-making in resolving complex issues.

Key takeaways
  • Obscurity is a problem: Complexity can mask issues, and the more obscure a problem is, the harder it is to identify.
  • Incident management is crucial: Establishing a shared document or incident report can help team members stay on the same page and reduce the risk of miscommunication.
  • Don’t ignore indicators: Even good indicators are easy to overlook, and it’s essential to act on them to prevent further issues.
  • Learn from failures: Incidents can be valuable learning experiences, and it’s essential to reflect on what went wrong and how to improve.
  • Focus on delivery: In the midst of an incident, it’s easy to get distracted by secondary issues, but it’s crucial to stay focused on delivering a solution.
  • Use data to drive decision-making: Data can be a powerful tool in incident resolution, and it’s essential to rely on facts rather than intuition.
  • Don’t give up: Incidents can be complex and challenging, but it’s essential to persevere until the issue is resolved.
  • SLIs are important: Service-level indicators (SLIs) can help identify issues and prevent them from becoming major problems.
  • Don’t blame others: Incidents can be stressful, but it’s essential to avoid blame and focus on finding a solution.
  • Use a data-driven approach: Relying on data and evidence can help identify the root cause of an issue and prevent similar problems in the future.
  • Don’t assume you know the answer: Incidents can be complex, and it’s essential to avoid assumptions and instead rely on data and evidence to drive decision-making.
  • Use automation to simplify response: Automating repetitive steps can make complex issues easier to handle and reduce the risk of human error.
  • Use metrics to monitor performance: Metrics can help monitor performance and identify issues before they become major problems.
  • Focus on delivering value: In the midst of an incident, it’s essential to focus on delivering value to customers and stakeholders.
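
To make the SLI takeaway concrete, here is a minimal sketch of how an availability SLI and the remaining error budget might be computed from request counts. It is an illustration only, not the method used in the talk: the names SliWindow, availability_sli, and error_budget_remaining, the request counts, and the 99.5% SLO target are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SliWindow:
    """Request counts observed over one measurement window (hypothetical shape)."""
    good_requests: int
    total_requests: int


def availability_sli(window: SliWindow) -> float:
    """Fraction of requests that succeeded; an empty window counts as fully available."""
    if window.total_requests == 0:
        return 1.0
    return window.good_requests / window.total_requests


def error_budget_remaining(window: SliWindow, slo_target: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    allowed_failures = (1 - slo_target) * window.total_requests
    actual_failures = window.total_requests - window.good_requests
    if allowed_failures == 0:
        # No failures are permitted: any failure exhausts the budget.
        return 0.0 if actual_failures else 1.0
    return 1 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Assumed example numbers: 10 failed requests out of 10,000, SLO of 99.5%.
    window = SliWindow(good_requests=9_990, total_requests=10_000)
    print(f"SLI: {availability_sli(window):.4f}")
    print(f"Error budget remaining: {error_budget_remaining(window, 0.995):.2%}")
```

Tracking the remaining budget rather than the raw SLI alone gives a team a data-based signal for when an indicator needs action, in line with the "don't ignore indicators" and "use data to drive decision-making" takeaways.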