Ara Pulido – Kubernetes at Datadog Scale

Datadog's journey to scale Kubernetes to over 1,000 nodes across multiple clouds, including their approach to networking, policy enforcement, and developer experience.

Key takeaways
  • Datadog runs clusters of over a thousand nodes each, across multiple clouds.
  • They started migrating to Kubernetes, initially using IPv6 and Cilium for networking.
  • They chose Cilium over Istio for host-to-host encryption because of its simplicity and flexibility.
  • Datadog’s engineers contributed to Cilium, making it a more reliable choice.
  • They adopted the Vertical Pod Autoscaler to reduce node count and save costs.
  • Direct port routing was chosen over IPVS or iptables for pod networking.
  • They wrote policies in Rego and enforced them with Gatekeeper, which let them validate and mutate requests.
  • Datadog’s approach to policy enforcement is to expose as much Kubernetes as possible to developers.
  • They prioritize developer experience, using APIs to make Kubernetes more accessible.
  • They believe in the importance of extending Kubernetes through CustomResourceDefinitions (CRDs).
  • Datadog uses Kubernetes as a platform to build its platform, focusing on simplicity and extensibility.
  • They are committed to making Kubernetes API-driven, using APIs to enable automation tools.
  • Gatekeeper is an easy-to-use policy enforcement tool, which makes it suitable for new users.
  • Datadog’s journey with Kubernetes has been a six-year-long process of learning and adaptation.
  • They emphasize the importance of understanding Kubernetes internals to build a scalable and maintainable system.
  • Datadog’s success with Kubernetes is attributed to its adoption of managed Kubernetes, as well as its own engineering efforts.
  • They are hiring engineers who are experienced in Kubernetes and container networking.
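
The validate-and-mutate pattern mentioned in the takeaways can be illustrated with a small sketch. This is not Rego and not Gatekeeper's API; it is plain Python that mimics the idea of an admission policy under stated assumptions. The required labels, the defaulted field, and the function names (`validate`, `mutate`, `admit`) are all hypothetical and were not given in the talk.

```python
import copy

# Hypothetical policy: every pod must carry these labels.
REQUIRED_LABELS = {"team", "service"}


def validate(pod):
    """Return a list of violation messages; an empty list means no violations."""
    labels = pod.get("metadata", {}).get("labels", {})
    missing = sorted(REQUIRED_LABELS - set(labels))
    return [f"missing required label: {name}" for name in missing]


def mutate(pod):
    """Return a copy of the pod with a default filled in; the input is not modified."""
    patched = copy.deepcopy(pod)
    for container in patched.get("spec", {}).get("containers", []):
        # Hypothetical default: only set the field when the author left it out.
        container.setdefault("imagePullPolicy", "IfNotPresent")
    return patched


def admit(pod):
    """Mutate first, then validate the mutated object, as admission chains usually do."""
    mutated = mutate(pod)
    violations = validate(mutated)
    return {"allowed": not violations, "violations": violations, "object": mutated}


if __name__ == "__main__":
    request = {
        "metadata": {"labels": {"team": "infra"}},
        "spec": {"containers": [{"name": "app", "image": "example/app:1.0"}]},
    }
    print(admit(request))
```

In a real cluster the same two responsibilities would be expressed as Gatekeeper policies rather than Python functions; the sketch only shows why separating "reject bad requests" from "fill in defaults" keeps each rule small.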