Tanat Lokejaroenlarb - Observability to Resolution: The Journey Through a Production K8s Incident
Join Tanat Lokejaroenlarb as they share their Kubernetes incident journey, highlighting key takeaways on identifying issues, driving resolution, and incident management.
- Keep a solid incident management process in place to ensure consistent communication and investigation
- Use service level indicators (SLIs) to track performance and identify issues early
- Don’t jump to conclusions; keep investigating and rule out theories one at a time
- Keep a focused mindset, avoid distractions, and prioritize corrective action
- Use data and metrics to inform decisions and drive resolution
- Maintain transparency and communication with stakeholders throughout the incident
- Create a dashboard to visualize data and track progress
- Continuously review and improve incident management processes
- Prioritize observability and monitoring to detect issues early
- Consider using a managed solution for DNS resolution
- Keep in mind that system complexity can make issues harder to diagnose and resolve
- Use a systematic approach to investigation, and don’t be afraid to pivot when necessary
- Make incremental improvements to fix issues, rather than trying to solve everything at once
- Use alerts to notify teams of issues and ensure prompt action
- Keep documentation of incidents and lessons learned for future reference
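
The takeaways on SLIs and alerting can be made concrete with a small sketch. The example below is not taken from the talk: it assumes a simple availability SLI (successful requests divided by total requests) and a burn-rate style alert, and every name in it (`Window`, `availability_sli`, `burn_rate`, `should_alert`) as well as the 99.9% target and the threshold of 10 are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one measurement window (hypothetical type)."""
    total: int
    failed: int


def availability_sli(window: Window) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if window.total == 0:
        return 1.0
    return (window.total - window.failed) / window.total


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly as fast
    as the SLO permits; higher values mean it is being spent faster.
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    if window.total == 0:
        return 0.0
    return (window.failed / window.total) / error_budget


def should_alert(window: Window, slo_target: float = 0.999,
                 threshold: float = 10.0) -> bool:
    """Alert when the budget burns at least `threshold` times faster than allowed.

    Both defaults are assumptions for this sketch, not recommendations.
    """
    return burn_rate(window, slo_target) >= threshold


if __name__ == "__main__":
    quiet = Window(total=10_000, failed=50)
    noisy = Window(total=10_000, failed=200)
    print(f"quiet SLI: {availability_sli(quiet):.4f}, alert: {should_alert(quiet)}")
    print(f"noisy SLI: {availability_sli(noisy):.4f}, alert: {should_alert(noisy)}")
```

Alerting on how fast the error budget is consumed, rather than on a raw error count, is one way to tie alerts back to the SLI so that teams are paged only when user-facing reliability is actually at risk.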