Tanat Lokejaroenlarb - Observability to Resolution: The Journey Through a Production K8s Incident

Join Tanat Lokejaroenlarb as they share their Kubernetes incident journey, highlighting key takeaways on identifying issues, driving resolution, and incident management.

Key takeaways
  • Keep a solid incident management process in place to ensure consistent communication and investigation
  • Use service level indicators (SLIs) to track performance and identify issues early
  • Don’t jump to conclusions; keep investigating and ruling out theories
  • Keep a focused mindset, avoid distractions, and prioritize corrective action
  • Use data and metrics to inform decisions and drive resolution
  • Maintain transparency and communication with stakeholders throughout the incident
  • Create a dashboard to visualize data and track progress
  • Continuously review and improve incident management processes
  • Prioritize observability and monitoring to detect issues early
  • Consider using a managed solution for DNS resolution
  • Keep in mind that system complexity can make issues harder to diagnose and resolve
  • Use a systematic approach to investigation, and don’t be afraid to pivot when necessary
  • Make incremental improvements to fix issues, rather than trying to solve everything at once
  • Use alerts to notify teams of issues and ensure prompt action
  • Keep documentation of incidents and lessons learned for future reference
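
As an illustration of the SLI and alerting takeaways above, here is a minimal sketch of an availability SLI checked against a target. It is not taken from the talk: the Window type, the function names, and the 99.9% target are assumptions chosen for the example. A real setup would read these counts from a metrics system rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical shape)."""
    total: int
    failed: int


def availability_sli(window: Window) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if window.total == 0:
        return 1.0
    return (window.total - window.failed) / window.total


def breaches_slo(window: Window, slo_target: float = 0.999) -> bool:
    """True when the measured SLI falls below the assumed target."""
    return availability_sli(window) < slo_target


if __name__ == "__main__":
    # Example numbers are made up for illustration only.
    window = Window(total=10_000, failed=25)
    print(f"SLI: {availability_sli(window):.4f}")
    print("alert" if breaches_slo(window) else "ok")
```

Treating an empty window as healthy is a design choice. Some teams prefer to alert on missing traffic separately, since no data can itself be a symptom.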