Rareş Muşină – Resilient service-to-service calls in a post-Hystrix world

Rareş Muşină shares insights on resilient service-to-service calls in a post-Hystrix world. He explores alternatives such as Resilience4j, Sentry, gRPC, and Envoy, and emphasizes observability, automation, and discipline as the basis for reliability and adaptability.

Key takeaways
  • Rareş Muşină’s presentation covers resilient service-to-service calls in a post-Hystrix world.
  • Historically, service providers relied on Hystrix circuit breakers, but with Hystrix now in maintenance mode and its limitations well known, users are seeking alternatives.
  • Resilience4j is a newer alternative for JVM applications that provides a more straightforward, easy-to-use API.
  • Sentry is a service that helps prevent and detect common errors, such as node connection failures, and provides metrics to aid in troubleshooting.
  • Netflix builds resilience features on top of gRPC, but that work is still in its early stages.
  • One of the challenges with resilience is dealing with service providers throttling traffic, which can lead to unhappy users.
  • Observability is crucial for understanding the performance of a service and identifying areas for improvement.
  • To deal with sudden spikes in traffic, services need to be designed to handle bursts of requests, and not just average traffic.
  • Resilience requires discipline in setting timeouts and fallback strategies.
  • Multiple teams may be involved in ensuring resilience, including DevOps, SRE, and backend teams.
  • Envoy is a service proxy that can be used to enforce resilience features, such as timeouts and circuit breakers.
  • Automation and observability are key to ensuring resilience in distributed systems.
  • When designing resilience, consider the use of idempotency to ensure correct behavior in the event of failures.
  • Capacity planning is essential to ensure that services can handle sudden spikes in traffic.
  • Beyond a certain load threshold, services may not be able to handle additional requests, leading to resource starvation.
  • When designing services, consider the use of retries, but be aware of the potential for retry storms.
  • In the event of failures, services should aim to return a valid response to users, rather than simply failing.
  • Resilience is not a one-time effort, but rather an ongoing process that requires continuous monitoring and improvement.
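
The takeaways on circuit breakers and fallbacks can be illustrated with a minimal sketch. This is not how Hystrix, Resilience4j, or Envoy implement the pattern; the class name `CircuitBreaker` and its parameters (`max_failures`, `reset_after`, `clock`) are assumptions chosen for illustration. After a number of consecutive failures the breaker opens and returns the fallback without calling the dependency, then allows one trial call once the reset interval has passed.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker (hypothetical, not a library API).

    Opens after `max_failures` consecutive failures, then allows a
    trial call once `reset_after` seconds have passed.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast and answer with the fallback.
                return fallback()
            # Half-open: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, lambda: cached_profile)`, where `fetch_profile` and `cached_profile` are placeholders. Returning a fallback instead of raising matches the takeaway that services should aim to return a valid response rather than simply fail.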
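
The warning about retry storms can be made concrete with a small sketch of bounded retries using exponential backoff and full jitter. The function name `call_with_retries` and its parameters are hypothetical, and the default values are illustrative rather than recommendations from the talk. Capping the number of attempts limits the extra load a failing dependency receives, and the random jitter keeps many clients from retrying at the same moment.

```python
import random
import time


def call_with_retries(func, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call `func`, retrying on any exception with backoff and full jitter.

    Hypothetical helper: `sleep` and `rng` are injectable for testing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                # Attempts exhausted: surface the error to the caller.
                raise
            # Upper bound grows exponentially: base, 2*base, 4*base, ...
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the bound.
            sleep(rng() * cap)
```

In practice the `except` clause would be narrowed to errors known to be transient, and retries would be applied only to operations that are safe to repeat.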
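
The takeaway on idempotency can be sketched as a handler that deduplicates requests by an idempotency key, so that a retried request does not repeat its side effect. `PaymentService`, `charge`, and the in-memory dictionary are hypothetical and not taken from the talk; a real service would persist the keys in a shared store and expire them.

```python
class PaymentService:
    """Hypothetical handler that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self.charges = 0     # counts how often the side effect actually ran

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # Replay of an earlier request: return the stored response.
            return self._results[idempotency_key]
        self.charges += 1    # the side effect happens once per key
        response = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = response
        return response
```

The client generates one key per logical operation (for example with `uuid.uuid4()`) and reuses it on every retry of that operation.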