Sh... Fail Happens: Fail-aware Events Processing on AWS • Marcin Sodkiewicz • GOTO 2024

Learn essential patterns for resilient event processing on AWS: dead letter queues, batch processing, circuit breakers, proper monitoring, and handling failures at scale.

Key takeaways

Always set up dead letter queues (DLQ) for event-driven architectures - they should not be optional
Implement proper failure handling with bisection for batch processing - split a failed batch in half and retry each half instead of failing the entire batch
Use circuit breakers for handling third-party integration failures - can be implemented using CloudWatch alarms and event source mapping controls
Set retention periods deliberately and monitor message age to prevent data loss - Kinesis Data Streams keeps records for 24 hours by default, and this can be extended up to 365 days
Implement exponential backoff with jitter for retries to prevent thundering herd problems
Consider cost implications when choosing between SQS and Kinesis - Kinesis becomes more cost-effective at high throughput
Use distributed tracing (OpenTelemetry/W3C standard) to track message flow and failures across the system
Set up proper monitoring and alerts:
- Dead letter queue metrics
- Processing lag
- Throttling events
- Message retention age
Ensure message processing is idempotent and message schemas are properly versioned
Assign clear ownership for dead letter queues and failure handling processes
Consider using Kinesis Data Firehose for failure auditing and analysis
Implement partial failure handling to avoid reprocessing successful items in batches
Test failure scenarios using chaos engineering tools like AWS Fault Injection Simulator
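
The bisection takeaway can be illustrated with a small local simulation. For Kinesis and DynamoDB stream sources, Lambda offers this behaviour as the `BisectBatchOnFunctionError` event source mapping setting. The sketch below is not how the service implements it; it only shows why halving isolates poison records. The names `process_with_bisection` and `process_batch` are hypothetical.

```python
def process_with_bisection(batch, process_batch):
    """Return the items that still fail once isolated in a batch of one.

    `process_batch` is assumed to be all-or-nothing: it raises if any
    item in the batch cannot be processed.
    """
    if not batch:
        return []
    try:
        process_batch(batch)
        return []  # whole batch succeeded
    except Exception:
        if len(batch) == 1:
            return list(batch)  # isolated poison record, e.g. send to a DLQ
        mid = len(batch) // 2
        # Retry each half separately so healthy records are not blocked.
        return (process_with_bisection(batch[:mid], process_batch)
                + process_with_bisection(batch[mid:], process_batch))
```

Records that share a batch with a poison record are processed more than once on the way down, which is one reason the idempotency takeaway matters.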
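
Exponential backoff with jitter could look roughly like the following. This is a minimal sketch of the "full jitter" variant (a random delay between zero and an exponentially growing ceiling). The function names and the default `base`, `cap`, and `max_attempts` values are illustrative assumptions, not values from the talk.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter delay in seconds for a zero-based retry attempt."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def retry(operation, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call `operation` until it succeeds or attempts run out."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Randomised delay spreads retries from many clients apart,
            # which is what avoids the thundering herd.
            sleep(backoff_delay(attempt, base, cap))
```

The `sleep` parameter is injectable only so the behaviour can be exercised without real waiting.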
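
Partial failure handling for an SQS-triggered Lambda function might be sketched as below. It assumes the event source mapping has `ReportBatchItemFailures` enabled, so that only the message IDs returned in `batchItemFailures` are retried. `process_record` is a hypothetical stand-in for real business logic.

```python
import json


def process_record(body):
    """Hypothetical business logic; raises when a payload cannot be handled."""
    if body.get("fail"):
        raise ValueError("cannot process")


def handler(event, context):
    """Report only the failed messages instead of failing the whole batch."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Malformed JSON and processing errors are both reported per item.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list tells Lambda the whole batch succeeded, so successfully processed messages are deleted rather than reprocessed.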
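
Idempotent processing can be approximated by remembering which message IDs have already been handled. The sketch below keeps that record in an in-memory set purely for illustration; a real consumer would need a durable, shared store with atomic writes. The class name `IdempotentProcessor` is an assumption made for this example.

```python
class IdempotentProcessor:
    """Skip messages whose ID has already been processed successfully."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # in-memory stand-in for a durable store

    def handle(self, message_id, payload):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Recorded only after success, so a failed message can be retried.
        self._seen.add(message_id)
        return True
```

With at-least-once delivery, and with retries caused by bisection or partial batch failures, the same message can arrive more than once; a check like this keeps the side effect from being applied twice.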
