Tanat Lokejaroenlarb - Observability to Resolution: The Journey Through a Production K8s Incident

Join Tanat Lokejaroenlarb as they share their Kubernetes incident journey, highlighting key takeaways on identifying issues, driving resolution, and incident management.

Key takeaways

Keep a solid incident management process in place to ensure consistent communication and investigation
Use service level indicators (SLIs) to track performance and identify issues early
Don’t jump to conclusions; keep investigating and rule out theories one at a time
Keep a focused mindset, avoid distractions, and prioritize corrective action
Use data and metrics to inform decisions and drive resolution
Maintain transparency and communication with stakeholders throughout the incident
Create a dashboard to visualize data and track progress
Continuously review and improve incident management processes
Prioritize observability and monitoring to detect issues early
Consider using a managed solution for DNS resolution
Keep in mind that system complexity can make issues harder to diagnose and resolve
Use a systematic approach to investigation, and don’t be afraid to pivot when necessary
Make incremental improvements to fix issues, rather than trying to solve everything at once
Use alerts to notify teams of issues and ensure prompt action
Keep documentation of incidents and lessons learned for future reference
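
As an illustration of the SLI and alerting takeaways above, here is a minimal sketch of an availability SLI with an error-budget burn-rate check. It is not taken from the talk: the `Window` type, the function names, and the SLO target and threshold values are all hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical shape)."""
    total: int
    failed: int


def availability_sli(window: Window) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if window.total == 0:
        return 1.0
    return (window.total - window.failed) / window.total


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means errors arrive exactly as fast as the SLO permits.
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (1.0 - availability_sli(window)) / error_budget


def should_alert(window: Window, slo_target: float, threshold: float) -> bool:
    """Alert when the error budget burns at or above the chosen threshold."""
    return burn_rate(window, slo_target) >= threshold
```

For example, `should_alert(Window(total=1000, failed=10), slo_target=0.999, threshold=5.0)` evaluates a window with a 1% error rate against a 99.9% target. In practice the counts would come from whatever metrics system is already in place, and the threshold would be tuned to the team's paging policy.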

