Using Go to Scale Audit Logging at Cloudflare - Arti Phugat, Cloudflare

Learn how Cloudflare scaled its audit logging from roughly 30 to roughly 3,000 messages/second using Go, Kafka optimizations, and architectural choices aimed at high throughput.

Key takeaways
  • Audit logs track changes to system configuration, recording who made a change, what was changed, when the change occurred, and through which interface (API or UI)

  • Cloudflare scaled their audit logging system from 30-35 messages/second to 2,500-3,000 messages/second by:

    • Using goroutines for concurrent processing
    • Implementing batch processing of messages
    • Horizontally scaling Kafka consumers
    • Caching internal service responses
  • Key Kafka consumer optimizations:

    • Using consumer groups instead of single consumers
    • Setting appropriate batch sizes (500 in their case)
    • Configuring optimal session timeouts (20 seconds)
    • Running multiple consumer pods
    • Matching partition count to expected throughput
  • System bottlenecks were identified and resolved through:

    • CPU and memory profiling
    • Metrics collection and visualization with Grafana
    • Monitoring database latency
    • Tracking consumer lag
  • Performance improvements implemented:

    • Batch database insertions instead of individual queries
    • Parallel request transformation using goroutines
    • Redis caching for internal service responses
    • Horizontal scaling of application pods in Kubernetes
  • Go was chosen for its:

    • Strong concurrency support via goroutines and channels
    • Extensive standard library
    • High performance characteristics
    • Gentle learning curve
  • Architecture decisions included:

    • Event-driven design using Kafka
    • Kubernetes for container orchestration
    • Multiple service replicas for high availability
    • Decoupled components for better scalability