Rareş Muşină – Resilient service-to-service calls in a post-Hystrix world

Automation

Rareş Muşină shares insights on resilient service-to-service calls in a post-Hystrix world. The talk explores alternatives such as Resilience4j, Sentry, gRPC, and Envoy, and emphasizes observability, automation, and discipline in achieving reliability and adaptability.

Key takeaways

Rareş Muşină’s presentation covers resilient service-to-service calls in a post-Hystrix world.
Historically, teams relied on Hystrix circuit breakers, but because of its limitations, users are now seeking alternatives.
Resilience4j is a newer, lightweight alternative for Java applications that provides a more straightforward, easy-to-use API.
Sentry is a service that helps prevent and detect common errors, such as node connection failures, and provides metrics to aid in troubleshooting.
Netflix uses gRPC to build resilience features, but this work is still in its early stages.
One of the challenges with resilience is dealing with service providers throttling traffic, which can lead to unhappy users.
Observability is crucial for understanding the performance of a service and identifying areas for improvement.
To deal with sudden spikes in traffic, services need to be designed to handle bursts of requests, and not just average traffic.
Resilience requires discipline in setting timeouts and fallback strategies.
Multiple teams may be involved in ensuring resilience, including DevOps, SRE, and backend teams.
Envoy is a service proxy that can be used to enforce resilience features, such as timeouts and circuit breakers.
Automation and observability are key to ensuring resilience in distributed systems.
When designing for resilience, make operations idempotent so that retried or duplicated requests still behave correctly in the event of failures.
Capacity planning is essential to ensure that services can handle sudden spikes in traffic.
Beyond a certain load threshold, services may not be able to handle additional requests, leading to resource starvation.
When designing services, consider the use of retries, but be aware of the potential for retry storms.
In the event of failures, services should aim to return a valid response to users, rather than simply failing.
Resilience is not a one-time effort, but rather an ongoing process that requires continuous monitoring and improvement.
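
Several of the takeaways above (bounded retries that avoid retry storms, circuit breakers, and returning a valid fallback response instead of failing) can be illustrated with a small sketch. This is not code from the talk, and it does not use the Resilience4j or Envoy APIs. The names CircuitBreaker, call_with_retries, and resilient_call, along with all thresholds and delays, are hypothetical and chosen only for illustration. The sketch also assumes that the wrapped call enforces its own timeout and is idempotent, and it is not thread-safe.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Minimal count-based breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(func, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Retry func with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not keep hammering a dependency the breaker gave up on
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            backoff = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * backoff)  # jitter spreads retries out over time


def resilient_call(func, fallback, breaker, **retry_kwargs):
    """Try func through the breaker with retries; on failure use the fallback."""
    try:
        return call_with_retries(lambda: breaker.call(func), **retry_kwargs)
    except Exception:
        return fallback()
```

A caller might wrap a remote lookup as resilient_call(fetch_profile, lambda: cached_profile, breaker), where fetch_profile and cached_profile are placeholders for a real client call and a stale-but-valid default. Capping the number of attempts and adding jitter is one common way to limit the risk of retry storms, and failing fast while the circuit is open gives an overloaded dependency time to recover.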
