Sh... Fail Happens: Fail-aware Events Processing on AWS • Marcin Sodkiewicz • GOTO 2024

Learn essential patterns for resilient event processing on AWS: dead letter queues, batch processing, circuit breakers, proper monitoring, and handling failures at scale.

Key takeaways
  • Always set up dead letter queues (DLQ) for event-driven architectures - they should not be optional
  • Implement proper failure handling with bisection for batch processing - split the batch in half on failure instead of failing the entire batch
  • Use circuit breakers for handling third-party integration failures - can be implemented using CloudWatch alarms and event source mapping controls
  • Add retention policies and monitor message age to prevent data loss - Kinesis Data Streams keeps records for 24h by default, extendable up to 365 days; SQS keeps messages for 4 days by default, with a 14-day maximum
  • Implement exponential backoff with jitter for retries to prevent thundering herd problems
  • Consider cost implications when choosing between SQS and Kinesis - Kinesis becomes more cost-effective at high throughput
  • Use distributed tracing (OpenTelemetry with W3C Trace Context propagation) to track message flow and failures across the system
  • Set up proper monitoring and alerts:
    • Dead letter queue metrics
    • Processing lag
    • Throttling events
    • Message retention age
  • Ensure message processing is idempotent and message schemas are properly versioned
  • Assign clear ownership for dead letter queues and failure handling processes
  • Consider using Amazon Data Firehose (formerly Kinesis Data Firehose) for failure auditing and analysis
  • Implement partial failure handling to avoid reprocessing successful items in batches
  • Test failure scenarios using chaos engineering tools like AWS Fault Injection Service (FIS, formerly Fault Injection Simulator)
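
The partial failure handling takeaway can be sketched as a Lambda handler for an SQS event source. This is a minimal sketch, assuming the event source mapping has `ReportBatchItemFailures` enabled; `process_record` and its `fail` flag are hypothetical stand-ins for real business logic and are not from the talk.

```python
import json


def process_record(body: dict) -> None:
    """Hypothetical business logic: raises if the item cannot be processed."""
    if body.get("fail"):
        raise ValueError("simulated processing failure")


def handler(event, context=None):
    """Process an SQS batch and report only the items that failed.

    Lambda removes the successful messages from the queue and makes only
    the listed message IDs visible again for retry.
    """
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Malformed JSON and processing errors are both reported as failures.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list marks the whole batch as successful. For Kinesis event sources the identifier would be the record's sequence number rather than the message ID, and bisection is a separate event source mapping setting (`BisectBatchOnFunctionError`).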
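
The exponential backoff with jitter takeaway could look roughly like the following. It is a sketch of the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; the `base`, `cap`, and `max_attempts` values are illustrative assumptions, not recommendations from the talk.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Return a full-jitter delay in seconds for a zero-based attempt number."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def retry(operation, max_attempts=5, sleep=time.sleep):
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the caller (or a DLQ) take over.
            sleep(backoff_delay(attempt))
```

Randomizing the delay spreads retries from many clients over time, so they do not all hit a recovering dependency at the same instant.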
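
For the circuit breaker takeaway, one possible shape is a small function that reacts to a CloudWatch alarm state change by pausing or resuming the Lambda event source mapping. This is a sketch under assumptions: the function name `apply_alarm_state` is hypothetical, `lambda_client` is expected to be a boto3 Lambda client (or anything exposing `update_event_source_mapping`), and treating every state other than `ALARM` as healthy is a design choice made for this example.

```python
def apply_alarm_state(lambda_client, mapping_uuid, alarm_state):
    """Open or close the circuit based on a CloudWatch alarm state.

    alarm_state is one of "OK", "ALARM", or "INSUFFICIENT_DATA".
    Returns True if the event source mapping is left enabled.
    """
    # Circuit open (polling paused) only while the alarm is firing.
    enabled = alarm_state != "ALARM"
    lambda_client.update_event_source_mapping(UUID=mapping_uuid, Enabled=enabled)
    return enabled
```

While the mapping is disabled, messages stay in the queue or stream instead of failing repeatedly against the broken third party, which is why retention and message age monitoring matter alongside this pattern.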
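
The idempotency takeaway can be illustrated with a consumer that remembers which message IDs it has already handled. This is a deliberately simplified sketch: the `IdempotentProcessor` class is hypothetical, and the in-memory set stands in for a durable store (for example a table with conditional writes) that a real deployment would need.

```python
class IdempotentProcessor:
    """Skip messages whose ID has already been processed successfully."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # Assumption: replace with a durable store in practice.

    def process(self, message_id, payload):
        """Return True if the payload was handled, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Record the ID only after success so failed messages can be retried.
        self._seen.add(message_id)
        return True
```

With at-least-once delivery, redelivered messages then become harmless no-ops instead of duplicated side effects.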