Tanat Lokejaroenlarb - Observability to Resolution: The Journey Through a Production K8s Incident
Join Tanat Lokejaroenlarb as he shares his journey through a production Kubernetes incident, highlighting the importance of observability, incident management, and data-driven decision-making in resolving complex issues.
- Obscurity is a problem: Complexity can mask issues, and the more obscure a problem is, the harder it is to identify.
- Incident management is crucial: Establishing a shared document or incident report can help team members stay on the same page and reduce the risk of miscommunication.
- Don’t ignore indicators: Even good indicators are easy to ignore; act on them when they signal trouble to prevent further issues.
- Learn from failures: Incidents can be valuable learning experiences, and it’s essential to reflect on what went wrong and how to improve.
- Focus on delivery: In the midst of an incident, it’s easy to get distracted by secondary issues, but it’s crucial to stay focused on delivering a solution.
- Use data to drive decision-making: Data can be a powerful tool in incident resolution, and it’s essential to rely on facts rather than intuition.
- Don’t give up: Incidents can be complex and challenging, but it’s essential to persevere.
- SLIs are important: Service-level indicators (SLIs) can help identify issues and prevent them from becoming major problems.
- Don’t blame others: Incidents can be stressful, but it’s essential to avoid blame and focus on finding a solution.
- Use a data-driven approach: Relying on data and evidence can help identify the root cause of an issue and prevent similar problems in the future.
- Don’t assume you know the answer: Incidents can be complex, and it’s essential to avoid assumptions and instead rely on data and evidence to drive decision-making.
- Use automation to simplify issues: Automation can help simplify complex issues and reduce the risk of human error.
- Use metrics to monitor performance: Metrics can help monitor performance and identify issues before they become major problems.
- Focus on delivering value: In the midst of an incident, it’s essential to focus on delivering value to customers and stakeholders.
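
To make the points about SLIs and acting on indicators concrete, here is a minimal sketch of an availability SLI checked against an SLO target. It is not taken from the talk: the `RequestWindow` type, the function names, the request counts, and the 99.9% target are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one measurement window (hypothetical shape)."""
    total: int
    failed: int


def availability_sli(window: RequestWindow) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if window.total == 0:
        return 1.0
    return (window.total - window.failed) / window.total


def error_budget_remaining(window: RequestWindow, slo_target: float) -> float:
    """Share of the error budget left: 1.0 is untouched, below 0 is overspent."""
    allowed_failures = (1.0 - slo_target) * window.total
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no room for any failure.
        return 1.0 if window.failed == 0 else float("-inf")
    return 1.0 - window.failed / allowed_failures


if __name__ == "__main__":
    # Assumed numbers: 10,000 requests, 25 failures, 99.9% availability target.
    window = RequestWindow(total=10_000, failed=25)
    budget = error_budget_remaining(window, slo_target=0.999)
    print(f"SLI: {availability_sli(window):.4f}")
    print(f"Error budget remaining: {budget:.2f}")
    if budget < 0:
        print("SLO breached: act on the indicator instead of ignoring it")
```

Expressing the indicator as a remaining budget rather than a raw percentage gives the team a fact to act on during an incident, which fits the data-driven approach described above.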