Devoxx Greece 2024 - Kubernetes Resiliency by Chris Ayers

Learn how to design resilient and observable Kubernetes applications by setting baselines for resource requests, leveraging availability zones, and implementing monitoring, observability, and backup strategies.

Key takeaways

Creating a Baseline for Kubernetes

  • Set a baseline for resource requests and limits in Kubernetes applications
  • Base requests on average observed usage, not on the minimum
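
As a rough illustration of the baseline idea above, the following Python sketch derives a requests/limits stanza from usage samples. Requests follow the average, as the talk recommends. Sizing limits from the observed peak plus 20% headroom is an assumption made for this example, and the function name `baseline_resources` and the sample numbers are hypothetical.

```python
import json
import math
from statistics import mean


def baseline_resources(cpu_millicores, memory_mib, headroom_percent=20):
    """Build a Kubernetes-style resources stanza from usage samples.

    Requests use the average of the samples. Limits use the observed peak
    plus headroom_percent (an assumption of this sketch, not from the talk).
    """
    if not cpu_millicores or not memory_mib:
        raise ValueError("need at least one sample of each metric")

    def with_headroom(peak):
        # Multiply before dividing so whole-number inputs stay exact.
        return math.ceil(peak * (100 + headroom_percent) / 100)

    return {
        "requests": {
            "cpu": f"{math.ceil(mean(cpu_millicores))}m",
            "memory": f"{math.ceil(mean(memory_mib))}Mi",
        },
        "limits": {
            "cpu": f"{with_headroom(max(cpu_millicores))}m",
            "memory": f"{with_headroom(max(memory_mib))}Mi",
        },
    }


# Hypothetical samples: CPU in millicores, memory in MiB.
resources = baseline_resources([120, 180, 150, 400], [256, 300, 280, 310])
print(json.dumps(resources, indent=2))
```

The resulting dictionary has the same shape as the `resources` field of a container spec, so it can be dropped into a manifest that is serialized as JSON.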

Resiliency and Availability

  • Availability zones are critical for resiliency in Kubernetes
  • Spread workloads across multiple availability zones within a region
  • Haikus can be used to monitor availability
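
One common way to spread replicas across zones is a topology spread constraint on the pod spec. The sketch below builds such a spec as a Python dictionary and prints it as JSON, which `kubectl` also accepts. The field names are the standard Kubernetes ones, while the helper name `zone_spread_pod_spec`, the `app` label, and the image are hypothetical. The talk summary does not say which mechanism the speaker used.

```python
import json


def zone_spread_pod_spec(app_label, image, max_skew=1):
    """Return a pod spec that asks the scheduler to balance pods across zones."""
    return {
        "topologySpreadConstraints": [
            {
                # Allow at most max_skew more pods in one zone than in another.
                "maxSkew": max_skew,
                # Well-known node label that carries the availability zone.
                "topologyKey": "topology.kubernetes.io/zone",
                # Keep pods pending rather than violate the constraint.
                "whenUnsatisfiable": "DoNotSchedule",
                "labelSelector": {"matchLabels": {"app": app_label}},
            }
        ],
        "containers": [{"name": app_label, "image": image}],
    }


pod_spec = zone_spread_pod_spec("web", "example.com/web:1.0")
print(json.dumps(pod_spec, indent=2))
```

Setting `whenUnsatisfiable` to `ScheduleAnyway` instead would treat the spread as a preference rather than a hard requirement.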

Monitoring and Observability

  • Use metrics like CPU usage, memory usage, and queue lengths to monitor applications
  • Leverage tools like OpenTelemetry and distributed tracing for observability

Node and Resource Management

  • Use node pools and resource requests to manage compute resources
  • Resource limits are crucial for resource management
  • Use feature flags to manage rollout of new features and versions

Scaling and Autoscaling

  • Use horizontal pod autoscalers to scale applications based on demand
  • Leverage pod disruption budgets to keep enough replicas available during voluntary disruptions such as node drains and scale-downs
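
To make the autoscaling behavior concrete, here is a simplified Python sketch of the replica calculation documented for the horizontal pod autoscaler: desired replicas are the current replicas scaled by the ratio of the current metric to the target, rounded up. It ignores details such as pod readiness, missing metrics, and stabilization windows. The function name `desired_replicas` and the example numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: ceil(current_replicas * current / target)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Skip scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Multiply before dividing to limit floating-point rounding error.
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 90% average CPU with a 60% target: scale up.
print(desired_replicas(4, 90, 60))
# 4 replicas at 63% with a 60% target: within tolerance, no change.
print(desired_replicas(4, 63, 60))
# 6 replicas at 20% with a 60% target: scale down.
print(desired_replicas(6, 20, 60))
```

The tolerance band keeps the replica count from flapping when the metric hovers near its target.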

Failure Domains and Rollback

  • Identify failure domains in Kubernetes applications
  • Use probes like liveness, readiness, and startup probes to detect failures
  • Roll back deployments when failures occur
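
The three probe types can be sketched as a container spec built in Python and printed as JSON. The probe field names are the standard Kubernetes ones. The endpoints `/healthz` and `/ready`, the port, the timings, and the helper names are hypothetical values chosen for illustration.

```python
import json


def http_probe(path, port, period_seconds, failure_threshold):
    """Return an HTTP GET probe definition."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


container = {
    "name": "web",
    "image": "example.com/web:1.0",
    # Holds off the other probes until the application has started.
    "startupProbe": http_probe("/healthz", 8080, 5, 30),
    # The container is restarted after repeated liveness failures.
    "livenessProbe": http_probe("/healthz", 8080, 10, 3),
    # The pod is removed from Service endpoints while not ready.
    "readinessProbe": http_probe("/ready", 8080, 5, 2),
}


def startup_budget_seconds(spec):
    """Longest time the startup probe allows before the container is killed."""
    probe = spec["startupProbe"]
    return probe["periodSeconds"] * probe["failureThreshold"]


print(json.dumps(container, indent=2))
print(startup_budget_seconds(container))
```

With these example values the application gets up to 150 seconds to start before the kubelet gives up on it.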

Backup and Disaster Recovery

  • Plan for backup and disaster recovery in Kubernetes applications
  • Use chaos engineering tools like Chaos Mesh to test and validate recovery

Monitoring and Testing

  • Monitor applications and nodes in Kubernetes
  • Load test applications to ensure they can handle demand
  • Validate test results using metrics and logging
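
One simple way to validate load test results against metrics is to compare tail latency and error rate with a budget. The Python sketch below uses a nearest-rank percentile. The thresholds, sample latencies, and function names are hypothetical and not taken from the talk.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def load_test_passed(latencies_ms, error_count, p95_budget_ms, max_error_rate):
    """Pass only if p95 latency and error rate are both within budget."""
    total = len(latencies_ms) + error_count
    error_rate = error_count / total
    return (percentile(latencies_ms, 95) <= p95_budget_ms
            and error_rate <= max_error_rate)


# Hypothetical latencies (ms) of successful requests from one test run.
latencies = [80, 95, 110, 120, 90, 85, 100, 105, 98, 450]
print(percentile(latencies, 50))
print(percentile(latencies, 95))
print(load_test_passed(latencies, error_count=0,
                       p95_budget_ms=300, max_error_rate=0.01))
```

In this sample a single slow request pushes the p95 over the budget even though the median looks healthy, which is why tail percentiles are usually checked alongside averages.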