With many services working together as one large, seamless system, we must assume that communication between those services will not be perfect all the time. Failure will occur at some point: one of the services in a chain of service calls can become unavailable, causing the whole system to behave unpredictably. Luckily, there are several patterns for handling failures in service communication and providing service resiliency - and the best part is that they all come nicely packaged in the Resilience4j library.
In short, “Resilience4j is a lightweight fault tolerance library inspired by Netflix Hystrix”. Since Netflix Hystrix is not being developed anymore, Resilience4j is recommended as a replacement, even by the Netflix Hystrix team. Resilience4j is one of the few libraries officially supported by Spring Cloud Circuit Breaker.
What we will build
In this article, we will fully configure service-to-service communication so that our service always behaves as expected, even when the services it depends on are behaving unpredictably - they may be fully down, temporarily unavailable, or randomly failing to respond.
After reading this article, you will understand which solutions exist and exactly what problems they solve, and you will be able to copy a few configurations into your project and have your service fully set up to handle all of these unpredictable behaviors.
Service resiliency 101
Scenario
Let’s say we want to know the current rating score of a movie, not only from imdb.com but from several popular websites, all at once. To do that, we are going to build a service that fetches movie rating scores from a few services (imdb.com, rottentomatoes.com, metacritic.com, etc.) and returns all gathered results as a single response, along with a calculated combined rating score. The expected response from our service would look like this:
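For illustration, a combined response could look something like the sketch below (the field names and values are hypothetical):

```json
{
  "movieId": "tt0111161",
  "title": "The Shawshank Redemption",
  "ratings": {
    "imdb": 9.3,
    "rottentomatoes": 9.1,
    "metacritic": 8.2
  },
  "combinedRating": 8.9
}
```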
*note: we are assuming that our service has all the needed ‘external ids’ for the movies we support, in order to fetch the rating score from each website.
Problem
As noted, our service will communicate with many external services via HTTP calls in order to gather and provide all of the information in a single request-response call made to our service. Logically, a few questions arise in this situation:
- What if any of the APIs we are using stops working? Will our service still work? What if imdb.com suddenly returns 503 but rottentomatoes.com returns 200 - does our service return anything in that case? Do we just wrap every call in a try/catch?
- Do we keep bombarding an API that is currently unresponsive, hoping it will start working soon and give us a usable response within a decent time?
- Or should we wait some time - 5 seconds, 10 seconds, or maybe 1 minute - before trying to hit the unresponsive API again, meanwhile still not returning anything to the user, as we are waiting for another service to respond?
- And most importantly, do we need to implement logic from scratch to support any of these cases, or is there an elegant way to have everything handled for us, with minimal effort?
Solution
When depending on unpredictable services, and wanting to make sure our service is resilient to all of the problems that could occur, there are several solutions/patterns we can apply. We will focus on the patterns implemented by the Resilience4j library:
- Circuit breaker
- Rate limiter
- Time limiter
- Bulkhead
- Retry
In the following sections, we will explain and configure all of them.
Resilience4j
Step 1: Adding dependency to our Spring Boot project
We will add the Resilience4j dependency to our Spring Boot project. Resilience4j provides several versions of the library, and one of them is made specifically for Spring Boot projects - io.github.resilience4j:resilience4j-spring-boot2. This dependency is basically a Spring Boot starter, so just by adding it to a Spring Boot project, all of the needed Resilience4j beans will be automatically registered in the Spring application context, without any need for custom bean definitions, custom bean scanning, or manual importing of existing configurations. All we have to do is configure the desired behaviors via the application.yaml file. That's all!
Adding the Resilience4j dependency:
*note: resilience4j-spring-boot2 requires org.springframework.boot:spring-boot-starter-actuator and org.springframework.boot:spring-boot-starter-aop, so we will add those as well.
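A minimal sketch of the dependencies, assuming a Gradle build (the version shown is only an example; use the latest available):

```groovy
dependencies {
    // Resilience4j Spring Boot starter
    implementation 'io.github.resilience4j:resilience4j-spring-boot2:1.7.1'
    // Required by resilience4j-spring-boot2
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    implementation 'org.springframework.boot:spring-boot-starter-aop'
}
```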
Step 2: Configure custom behavior
1. Circuit breaker
A circuit breaker represents a connection point between parts of the system. With a circuit breaker we can configure the specific cases in which parts of the system will be ‘connected’, i.e., when services will be able to communicate with each other, when they will not, and for how long. All of that is achieved through the ‘state’ of the circuit breaker.
Concepts of the Circuit breaker:
- The connection between services is defined by the state of the circuit breaker: OPEN, CLOSED or HALF_OPEN.
- If the service we are trying to reach has been failing in the last N calls (COUNT_BASED circuit breaker) or in the last N seconds (TIME_BASED circuit breaker), the state of the circuit breaker switches to OPEN
- In the OPEN state, all calls to that service are blocked for a predefined time period and a CallNotPermittedException is thrown
- During that predefined period, fallback logic is executed on every call
- After the predefined wait time is over, the circuit breaker moves to the HALF_OPEN state, allowing only a certain number of requests to pass through so it can determine the liveness of the service
- Based on that small number of requests, the circuit breaker either changes its state to CLOSED or returns to the OPEN state again
- The CLOSED state is the normal state, allowing normal functioning, i.e., allowing calls to the service
- For more, check out the official circuit breaker documentation
What we want to achieve:
- When an external service, in this case www.imdb.com, fails to respond successfully for more than 50% of calls in a 60-second window, or more than 50% of the calls are slow (taking longer than 5 seconds), we will stop sending requests to it for 30 seconds, to allow it to recover
- After that 30-second period, we will allow 5 calls through, to determine whether the service has started working again, before fully allowing all requests to it
- During the time when no requests are sent to the external service, we will return previously cached data, until the service starts working again
Hands on
Implementing something like this seems quite complex. Let's see how it can be achieved with Resilience4j. First, the configuration.
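A sketch of the application.yaml configuration matching the behavior above (the instance name imdbApi is our own choice; the property names come from resilience4j-spring-boot2):

```yaml
resilience4j:
  circuitbreaker:
    instances:
      imdbApi:
        registerHealthIndicator: true
        slidingWindowType: TIME_BASED      # evaluate calls over a time window
        slidingWindowSize: 60              # 60-second window
        failureRateThreshold: 50           # open when more than 50% of calls fail
        slowCallRateThreshold: 50          # or when more than 50% of calls are slow
        slowCallDurationThreshold: 5s      # a call is 'slow' if it takes longer than 5 seconds
        waitDurationInOpenState: 30s       # stay OPEN for 30 seconds
        permittedNumberOfCallsInHalfOpenState: 5   # probe calls allowed in HALF_OPEN
```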
After the configuration, we are ready to apply the @CircuitBreaker annotation to the methods responsible for calling the external API.
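A minimal sketch of such a method, assuming a hypothetical ImdbClient that uses a RestTemplate and a simple in-memory cache (the class name and endpoint are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class ImdbClient {

    private final RestTemplate restTemplate;
    // simple in-memory cache of the last successfully fetched ratings
    private final Map<String, Double> ratingCache = new ConcurrentHashMap<>();

    public ImdbClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // 'imdbApi' must match the instance name defined in application.yaml
    @CircuitBreaker(name = "imdbApi", fallbackMethod = "getRatingFallback")
    public Double getRating(String externalId) {
        // hypothetical endpoint, for illustration only
        Double rating = restTemplate.getForObject(
                "https://imdb.example.com/ratings/{id}", Double.class, externalId);
        ratingCache.put(externalId, rating);
        return rating;
    }

    // Same signature as the annotated method, plus the thrown exception as the last parameter
    private Double getRatingFallback(String externalId, Exception e) {
        // here we would also log the exception details
        return ratingCache.get(externalId);
    }
}
```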
As can be seen, the @CircuitBreaker annotation has two properties:
- name - Referencing the name of the configuration we just defined in the application.yaml file
- fallbackMethod - Referencing the name of a fallback method defined inside the same class. The fallback method behaves just like a catch block and will be called when an exception is thrown. It has the same return type and parameters as the annotated method, plus one more mandatory parameter - the exception that was caught inside the annotated method. We can implement different behavior depending on the type of exception. In this example, we will just return the cached value and log the exception details
We are all done here! Our complex logic for handling an unpredictable API works exactly the way we wanted, with just configuration and an annotation. Great stuff!
2. Rate limiter
Any service method can be a victim of too many calls in a short period of time. Too many calls to a method may overload the whole service. Luckily, there is an easy way to reject method calls once a specified threshold is reached. Rate limiting sets that threshold - a maximum number of calls in a period of time.
Concepts of Rate limiter
- Ability to limit access to a method, to a specified number of calls in a defined time period
- In case the threshold is reached, we can set a maximum wait time before actually trying to call the method
- If the threshold is still reached after the wait time, a RequestNotPermitted exception is thrown and the fallback method is executed
What we want to achieve
Let's say our application is quite popular now. Many users are requesting the current rating score for a movie. Instead of overloading our server, or returning 429 to users, we will limit the usage of our API to a desired number of calls per second.
Our configuration is:
- Limit our API usage to 10 calls per 1-second period
- In case API usage is above the threshold in a given period, subsequent calls will wait up to 3 seconds
- After the wait time, if API usage is still above the threshold, the cached value will be returned
Hands on
We will start with configuration.
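A sketch of the matching application.yaml entry (the instance name ratingApi is illustrative):

```yaml
resilience4j:
  ratelimiter:
    instances:
      ratingApi:
        registerHealthIndicator: true
        limitForPeriod: 10        # at most 10 calls...
        limitRefreshPeriod: 1s    # ...per 1-second period
        timeoutDuration: 3s       # wait up to 3 seconds when the limit is reached
```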
Next, we apply the annotation to the desired method. The two properties, name and fallbackMethod, are present for the @RateLimiter annotation as well.
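A minimal sketch, assuming a hypothetical aggregation method (CombinedRating, ratingAggregator and combinedRatingCache are illustrative names):

```java
import io.github.resilience4j.ratelimiter.RequestNotPermitted;
import io.github.resilience4j.ratelimiter.annotation.RateLimiter;

// 'ratingApi' must match the instance name from application.yaml
@RateLimiter(name = "ratingApi", fallbackMethod = "getCombinedRatingFallback")
public CombinedRating getCombinedRating(String movieId) {
    // gather the individual scores and calculate the combined rating
    return ratingAggregator.aggregate(movieId);
}

// Called when the rate limit is still exceeded after the wait time
private CombinedRating getCombinedRatingFallback(String movieId, RequestNotPermitted e) {
    return combinedRatingCache.get(movieId);
}
```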
3. Time limiter
Some parts of the system can take a lot of time to finish. In some cases, that can be an indicator that a service is not behaving properly. For example, if fetching data from external services is taking a long time, maybe it’s better not to wait indefinitely and just stop the execution early. The Time limiter allows us to set the maximum execution time we are willing to wait when calling a method.
Concepts of Time limiter
- Setting maximum execution time of a called method
- In case the method execution time exceeds the predefined time, a TimeoutException is thrown and the fallback method is executed
- The method's return type needs to be an implementation of Future
- It requires defining a thread pool from which threads are taken, since we are working with asynchronous operations by using Future
- Since we are working with Future, there are blocking and non-blocking (same as Spring's @Async calls) ways of using the annotated methods, depending on our needs
What we want to achieve
- When our service is fetching the rating score from www.rottentomatoes.com, we want to limit the amount of time we are willing to wait for the response to 3 seconds
- In case the predefined time is up, we will return the cached value instead
Hands on
Configuring the Time limiter, along with the required thread pool configuration.
*note: we are defining a thread pool using one type of Bulkhead, which we will cover in the following sections.
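A sketch of the configuration, assuming the instance name rottenTomatoesApi:

```yaml
resilience4j:
  timelimiter:
    instances:
      rottenTomatoesApi:
        timeoutDuration: 3s          # maximum time we are willing to wait
        cancelRunningFuture: true    # cancel the running future once the timeout is reached
  thread-pool-bulkhead:
    instances:
      rottenTomatoesApi:
        coreThreadPoolSize: 2
        maxThreadPoolSize: 4
        queueCapacity: 20
```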
Next, we provide the name and fallbackMethod values to the @TimeLimiter annotation. Also, with the @Bulkhead annotation we define the thread pool that will be used. More on @Bulkhead in the following sections.
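A minimal sketch of the annotated method (the client and cache names are hypothetical); note that the method has to return a CompletableFuture:

```java
import java.util.concurrent.CompletableFuture;

import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;

// Both names must match the 'rottenTomatoesApi' instances from application.yaml
@TimeLimiter(name = "rottenTomatoesApi", fallbackMethod = "getRatingFallback")
@Bulkhead(name = "rottenTomatoesApi", type = Bulkhead.Type.THREADPOOL)
public CompletableFuture<Double> getRating(String externalId) {
    // the call itself runs on a thread from the configured thread pool bulkhead
    return CompletableFuture.completedFuture(rottenTomatoesClient.fetchRating(externalId));
}

// The fallback must match the return type, so the cached value is wrapped in a completed future
private CompletableFuture<Double> getRatingFallback(String externalId, Exception e) {
    return CompletableFuture.completedFuture(ratingCache.get(externalId));
}
```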
4. Bulkhead
Bulkhead is a way of limiting the number of concurrent executions of a specific method. For example, we can limit the number of concurrent calls to methods that have heavy resource usage, therefore preventing a single feature of the system from affecting the rest of the system by consuming all resources.
Concepts of Bulkhead:
- There are two types of Bulkhead - SEMAPHORE and THREADPOOL type
- The first one uses the thread that initially called the annotated method, with a semaphore keeping track of the number of concurrent calls, while the second one uses threads from a specified thread pool, with a specified thread pool size
- In case the number of concurrent executions of the method is at the maximum limit, the next call will wait for a predefined time
- If the predefined wait time is up, a BulkheadFullException is thrown and the fallback method is executed
- Otherwise, calls to a method are allowed
- For more, check bulkhead documentation
What we want to achieve
Let’s say our application has one exposed API that is doing some resource intensive work. If we allow an unlimited number of concurrent calls to a given API, that single API can eat up all available resources for the whole application. In that case, our initial API for providing movie rating scores will suffer, as there will be no resources to handle requests in a timely manner. With Bulkhead, we can:
- Set a max number of concurrent calls to our data intensive API, so that the whole service is not affected
- Or, we can assign specific thread pool to handle our data intensive logic, and separate thread pool to handle our rating score logic, in case we want fine-grained control
Hands on
Configuration of a simple (SEMAPHORE) Bulkhead (the first approach).
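A sketch of the SEMAPHORE Bulkhead configuration (the instance name heavyTaskApi is illustrative):

```yaml
resilience4j:
  bulkhead:
    instances:
      heavyTaskApi:
        maxConcurrentCalls: 10    # at most 10 concurrent executions
        maxWaitDuration: 500ms    # how long an extra call waits for a free slot
```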
Annotating the method with @Bulkhead and providing the name and fallbackMethod values.
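A minimal sketch (Report, reportGenerator and reportCache are hypothetical names standing in for the resource-intensive logic):

```java
import io.github.resilience4j.bulkhead.BulkheadFullException;
import io.github.resilience4j.bulkhead.annotation.Bulkhead;

// SEMAPHORE is the default Bulkhead type, so 'type' can be omitted here
@Bulkhead(name = "heavyTaskApi", fallbackMethod = "runHeavyTaskFallback")
public Report runHeavyTask(String movieId) {
    return reportGenerator.generate(movieId); // resource-intensive work
}

// Called when the bulkhead is full and the wait time has expired
private Report runHeavyTaskFallback(String movieId, BulkheadFullException e) {
    return reportCache.get(movieId);
}
```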
5. Retry
Simply put, @Retry enables a failed method to be executed again when an exception is thrown inside it. It behaves the same as Spring's @Retryable annotation, but with more customizable properties.
Concepts of Retry
- Configuring the number of retry attempts in case an exception occurs, the wait time between attempts, and the fallback method in case all retries are exhausted
- A full list of configuration properties can be found in the official documentation
What we want to achieve
One of the APIs our application uses to fetch movie rating scores, www.metacritic.com, is known to fail quite often due to internal errors. We have noticed that after a few attempts, the API usually becomes stable and returns the expected response. We can configure our logic to:
- Retry 3 times in case any HttpServerErrorException exception occurs
- Wait 2 seconds before retrying
- After retrying for the 3rd time, in case the service is still failing, return the cached value
Hands on
Retry configuration with a few properties.
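A sketch of the Retry configuration, assuming the instance name metacriticApi:

```yaml
resilience4j:
  retry:
    instances:
      metacriticApi:
        maxAttempts: 3       # 3 attempts in total, including the initial call
        waitDuration: 2s     # wait 2 seconds between attempts
        retryExceptions:
          - org.springframework.web.client.HttpServerErrorException
```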
As always, name and fallbackMethod are provided.
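A minimal sketch of the annotated method (metacriticClient and ratingCache are hypothetical names):

```java
import io.github.resilience4j.retry.annotation.Retry;

// 'metacriticApi' must match the instance name from application.yaml
@Retry(name = "metacriticApi", fallbackMethod = "getMetacriticRatingFallback")
public Double getMetacriticRating(String externalId) {
    return metacriticClient.fetchRating(externalId);
}

// Called once all retry attempts are exhausted
private Double getMetacriticRatingFallback(String externalId, Exception e) {
    return ratingCache.get(externalId);
}
```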
Bonus
Notice that in some of the configurations in application.yaml we used the property registerHealthIndicator with the value set to true. This enables showing details about a specific circuit breaker or rate limiter as part of the /actuator/health API response, if enabled by the management.health properties. To enable it, the following configuration is needed.
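A sketch of the required management configuration:

```yaml
management:
  endpoint:
    health:
      show-details: always    # include component details in the /actuator/health response
  health:
    circuitbreakers:
      enabled: true           # expose circuit breaker health indicators
    ratelimiters:
      enabled: true           # expose rate limiter health indicators
```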
Now, /actuator/health will return detailed information about the current state of the configured circuit breakers and rate limiters that you can keep track of, alongside the regular application health data.
Conclusion
We went through the patterns for solving real use cases when dealing with unpredictable services. The provided solutions will come in handy for sure, as you will encounter these problems at some point. With Resilience4j’s ease of use, solving them is easy, enjoyable, and can bring you peace of mind when problems inevitably occur.
About the author
Radomir Marinkovic is a Software Engineer with over five years of experience working at our engineering hub in Novi Sad.
Radomir is experienced in full-stack development, focusing on the backend, tooling, and databases. He has mainly worked with Java/Spring and Node.js and has used Agile methodologies. One of Radomir's interests is web application security, where he won several internal competitions.