"KalDB: A cloud native log search platform" by Suman Karumuri (Strange Loop 2022)

Suman Karumuri, architect on Slack's observability team, presents KalDB, a cloud-native log search platform used to manage a petabyte of log data, highlighting its architecture, features, and scalability.

Key takeaways
  • KalDB manages a petabyte of log data with a 7-day retention period at Slack.
  • Slack’s use cases center on full-text search; older logs are indexed eventually rather than immediately.
  • Lucene is a feasible storage engine for log data.
  • The indexing process can be optimized by storing older logs in S3 and using tiered storage.
  • The common fields in log messages can be extracted into key-value pairs.
  • Schema-less data allows for easier data management and query efficiency.
  • KalDB prioritizes indexing fresh logs over older logs.
  • Using cache nodes allows for faster query responses and better hardware utilization.
  • At scale, log search runs into four kinds of problems: high operational overhead, delayed logs, noisy neighbors, and field conflicts.
  • The cluster manager assigns tasks to recovery indexers and manages data life cycles.
  • Metadata stores are crucial for efficient data retrieval.
  • Using S3 as a deep store for logs reduces storage costs.
  • KalDB’s architecture allows for elastic scalability and Kubernetes-native integration.
  • The system employs cache nodes that download segments from S3 and serve queries.
  • Queries typically revolve around last-day data, making it essential to have efficient query execution.
  • Duplicate information in logs and traces can be reduced by using aggregation support and Elasticsearch-compatible APIs.
  • Fauna and KalDB can be used to overcome field conflicts.
  • Suman Karumuri is an architect on the observability team at Slack, building and running petabyte-scale systems.
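
The takeaways about prioritizing fresh logs and handing older ones to recovery indexers can be illustrated with a minimal sketch. It is not KalDB's actual code: the names `plan_indexing`, `RecoveryTask`, and `max_lag`, and the use of integer stream offsets, are assumptions made for illustration. The idea shown is that an indexer that has fallen too far behind jumps to the newest data and leaves the skipped range as a background task.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RecoveryTask:
    """Hypothetical unit of catch-up work: offsets [start_offset, end_offset)."""
    start_offset: int
    end_offset: int


def plan_indexing(
    committed_offset: int, head_offset: int, max_lag: int
) -> Tuple[int, Optional[RecoveryTask]]:
    """Decide where the live indexer should resume reading.

    If the backlog is within max_lag, the indexer continues from
    committed_offset. Otherwise it jumps to head_offset so that fresh logs
    become searchable first, and the skipped range is returned as a
    RecoveryTask to be indexed in the background.
    """
    if committed_offset > head_offset:
        raise ValueError("committed_offset cannot be ahead of head_offset")
    lag = head_offset - committed_offset
    if lag <= max_lag:
        # Small backlog: just keep indexing in order.
        return committed_offset, None
    # Large backlog: serve fresh logs now, recover the older range later.
    return head_offset, RecoveryTask(committed_offset, head_offset)
```

With `max_lag=100`, an indexer at offset 100 facing a head at 150 keeps reading from 100, while the same indexer facing a head at 1000 resumes at 1000 and emits a recovery task for offsets 100 through 999. In a real deployment something like the cluster manager would hand that task to a recovery indexer.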