Sh... Fail Happens: Fail-aware Events Processing on AWS • Marcin Sodkiewicz • GOTO 2024
Learn essential patterns for resilient event processing on AWS: dead letter queues, batch processing, circuit breakers, proper monitoring, and handling failures at scale.
- Always set up dead letter queues (DLQ) for event-driven architectures - they should not be optional
- Implement proper failure handling with bisection for batch processing - split a failing batch in half and retry each half instead of failing the entire batch
- Use circuit breakers for handling third-party integration failures - these can be implemented using CloudWatch alarms and event source mapping controls
- Add retention policies and monitor message age to prevent data loss - for Kinesis Data Streams, the default 24-hour retention can be extended up to 1 year (365 days)
- Implement exponential backoff with jitter for retries to prevent thundering herd problems
- Consider cost implications when choosing between SQS and Kinesis - Kinesis becomes more cost-effective at high throughput
- Use distributed tracing (OpenTelemetry/W3C standard) to track message flow and failures across the system
- Set up proper monitoring and alerts:
  - Dead letter queue metrics
  - Processing lag
  - Throttling events
  - Message retention age
- Ensure message processing is idempotent and messages are properly versioned
- Assign clear ownership for dead letter queues and failure handling processes
- Consider using Kinesis Data Firehose for failure auditing and analysis
- Implement partial failure handling to avoid reprocessing successful items in batches
- Test failure scenarios using chaos engineering tools like AWS Fault Injection Simulator
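
The retry advice above (exponential backoff with jitter) can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the function names `backoff_delay` and `call_with_retries`, the 0.5 s base delay, and the 30 s cap are all assumptions chosen for the example. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and the capped exponential ceiling so that many clients retrying at once do not line up.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Return a randomized delay in seconds for a zero-based retry attempt.

    base and cap are illustrative values, not recommendations from the talk.
    """
    # Exponential growth, capped so delays do not grow without bound.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] to spread retries out.
    return random.uniform(0.0, ceiling)


def call_with_retries(operation, max_attempts=5, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                # Retries exhausted: re-raise so the caller (or a DLQ) takes over.
                raise
            sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable only to make the helper easy to test without waiting; in production code the default `time.sleep` would normally be used.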
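
For the partial failure handling point, a Lambda function consuming from SQS can report only the failed messages back to the service, so successful items in the batch are not redelivered. The sketch below assumes the event source mapping has `ReportBatchItemFailures` enabled; `process_record` and the `order_id` check are hypothetical placeholders for real business logic.

```python
import json


def process_record(body):
    """Hypothetical business logic; raises when a message cannot be handled."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context=None):
    """Process an SQS batch and report only the failed message IDs."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Only this message will be retried (and eventually sent to the DLQ).
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda that the whole batch succeeded.
    return {"batchItemFailures": failures}
```

Because failed messages are retried individually, this approach still depends on idempotent processing: a message that partially succeeded before raising will be seen again.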
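
The bisection idea can be illustrated in plain Python. For Kinesis and DynamoDB stream sources, Lambda can do this splitting itself through the `BisectBatchOnFunctionError` setting on the event source mapping; the sketch below only shows the underlying logic. The names `process_with_bisection`, `process_batch`, and `on_poison` are hypothetical, and the sketch assumes `process_batch` is all-or-nothing (or idempotent), since halves of a failed batch are processed again.

```python
def process_with_bisection(batch, process_batch, on_poison):
    """Process a batch; on failure, split it in half to isolate bad records.

    process_batch(items) is assumed to raise if any item in items fails.
    on_poison(item, exc) receives each single record that still fails alone.
    """
    if not batch:
        return
    try:
        process_batch(batch)
    except Exception as exc:
        if len(batch) == 1:
            # A single failing record cannot be split further: hand it off,
            # for example to a dead letter queue.
            on_poison(batch[0], exc)
            return
        mid = len(batch) // 2
        process_with_bisection(batch[:mid], process_batch, on_poison)
        process_with_bisection(batch[mid:], process_batch, on_poison)
```

With one bad record in a batch of n, the healthy records get through after roughly log2(n) rounds of splitting, rather than the whole batch being retried until it expires.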