In today’s world, for our service to function and deliver value, it must call other services. Each service we interact with is a potential failure point in our system, and as the number of such integration points grows, so does the failure rate. So we must ask ourselves: how can we protect our service from a failure in another service (even one we wrote ourselves)?
In my previous post, I talked about the request-response style of architecture and the downsides it can bring to our system design.
In this post, we will dive deep into how we can protect the issuer of a request against failures or unpredictable behavior in the responding service. We will learn about timeouts and effective usage patterns using them.
In the next post, we will also examine the Circuit Breaker, a more advanced pattern that provides better protection and enables us to fail fast instead of waiting needlessly.
If you’re not sure exactly what a request-response architectural style is, or what the alternative is, I recommend taking a couple of minutes to read my previous post. It gives good background, although this post is self-contained.
What Can Possibly Go Wrong?
Let’s go over a brief example to illustrate the problem. Whenever a user signs up for our website, we want to send them a welcome email. We do this by sending a request from the user service to the email service.
But what happens to the user service when:
- The email service is down
- The email service responds with an HTTP 500 Internal Server Error
- The email service is under a high load and answers very slowly
If we don’t protect the user service, then those errors in the downstream email service will propagate to our service in a cascading manner.
Let’s analyze these scenarios and see how they might affect us. To do this we have to remember that a “connection” is an abstraction. We will have to peel off some of the layers of abstractions to see how things can go wrong.
The first scenario is that the email service is completely down. It looks like this is the worst scenario among the three but I would say it’s the opposite!
When we try to reach the email service via an HTTP request, the underlying TCP protocol tries to establish a connection with the remote machine the email service is running on. If no one is listening on the port we’re trying to connect to, we get a quick error, which manifests as an exception (or its equivalent status code in some languages) in the calling code. We only have to be aware of this possibility to handle it.
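To make the "quick error" concrete, here is a minimal Python sketch of handling a refused connection. The function name try_connect, its return strings, and the host and port values are assumptions made for illustration; they are not part of any real user or email service.

```python
import socket


def try_connect(host: str, port: int, timeout_seconds: float = 2.0) -> str:
    """Attempt a TCP connection and report the outcome as a short string."""
    try:
        # create_connection raises quickly if nothing listens on the port.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return "connected"
    except ConnectionRefusedError:
        # The remote machine answered, but no process is listening there.
        return "refused"
    except OSError as exc:
        # Covers other failures such as timeouts or unreachable hosts.
        return f"error: {exc.__class__.__name__}"
```

The calling code can then decide what "refused" means for the sign-up flow, for example queueing the welcome email for later instead of failing the whole request.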
The second type of error is a bit more devious. Since we got a response, we may try to parse it. We may expect JSON, but in some cases we get an HTML page describing the failure, which leads the calling application into an error it didn’t foresee.
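One way to guard against this is to check the status code and content type before parsing. The sketch below assumes the email service is expected to answer with a JSON object; EmailServiceError and parse_email_response are hypothetical names invented for this example.

```python
import json
from typing import Optional


class EmailServiceError(Exception):
    """Raised when the (hypothetical) email service returns an unusable response."""


def parse_email_response(status: int, content_type: Optional[str], body: bytes) -> dict:
    """Return the decoded JSON object, or raise EmailServiceError."""
    if not 200 <= status < 300:
        # A 500 often carries an HTML error page; never try to parse it as JSON.
        raise EmailServiceError(f"unexpected HTTP status {status}")
    if content_type is None or "application/json" not in content_type.lower():
        raise EmailServiceError(f"unexpected content type {content_type!r}")
    try:
        payload = json.loads(body)
    except ValueError as exc:  # JSONDecodeError and UnicodeDecodeError are ValueErrors
        raise EmailServiceError("body is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise EmailServiceError("expected a JSON object")
    return payload
```

With this in place the caller only has to handle one well-defined exception type, rather than whatever the JSON parser happens to throw when it is handed HTML.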
By far, I would say that the last option is the worst among the three. Why? Here’s an example. Remember that HTTP over TCP first tries to open a connection. The remote side keeps a listen queue for incoming connection requests. Since the email service is under high load, we may sit in that queue for quite a while. This is the worst place to be. The OS will block the thread in the user service that tried to open the connection until a timeout expires. “Yay, timeouts,” you may think, but the reality is that these default timeouts, which vary from OS to OS, are measured in minutes rather than seconds. Less yay. Moreover, many client-side libraries don’t give us fine-grained control over these timeouts, leaving us unable to handle those situations well.
These are just a handful from a seemingly endless list of things that can go wrong. Each of these errors has kept someone out there awake late at night when their system didn’t handle it well.
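Where the client library does expose timeouts, it is worth setting them explicitly instead of relying on OS defaults. The sketch below uses Python's standard socket module. The helper name read_with_deadline and its parameters are assumptions for illustration. Note that the read timeout applies to each recv call, not to the exchange as a whole.

```python
import socket


def read_with_deadline(host: str, port: int, connect_timeout: float,
                       read_timeout: float, max_bytes: int = 4096) -> bytes:
    """Connect and read once, raising socket.timeout if either step is too slow."""
    # Bound the time spent waiting for the TCP connection to be established.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Bound the time spent waiting for the first bytes of the response.
        sock.settimeout(read_timeout)
        return sock.recv(max_bytes)
    finally:
        sock.close()
```

Keeping the connect and read timeouts separate lets us be strict about connection setup (which should be fast) while allowing a little more time for the remote service to produce a response.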
I want you to have a good night’s sleep. Let’s see what we can do to make our system more stable when the world around us acts crazy.
Timeouts are our first line of defense. If a response doesn’t come back within a predefined amount of time we consider this to be a failure and drop all resources used to issue the request.
After a timeout occurs, we have a few options. First, we can retry, which might help if there was a network glitch. We can also log the event, or display something to the user. In any case, we want to avoid hanging, and timeouts help us achieve this.
While retries are quite well known, the patterns for using them wisely are less so.
The main question is when to retry, and at what intervals.
The worst thing we can do is to try again immediately. If the remote application we are calling is under load, we have just increased that load, making it even less likely to recover. Some backoff is needed.
A naive approach is to retry at constant intervals. If we make the retries within a call the user has made (and they are waiting for a response), this approach is viable only for a short duration. Still, it is much better than having zero backoff or blocking forever. If, however, we are running in a background process, we can keep up the retries for a long time. I recommend being careful here: each retry wastes resources, and if the remote application does not recover, we may reach the point where we have exhausted our thread pool, blocking users from making any future calls and rendering our system useless.
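A constant-interval retry with a hard cap on attempts might look like the following sketch. The name retry_fixed, the default of three attempts, and the choice of which exceptions count as retryable are all assumptions for illustration. The sleep parameter is injectable only so the behaviour can be exercised without real waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_fixed(operation: Callable[[], T],
                attempts: int = 3,
                delay_seconds: float = 0.5,
                retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
                sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation up to `attempts` times, pausing a fixed delay between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the last failure.
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

Capping the number of attempts is what keeps a user-facing call from waiting indefinitely; the last failure is re-raised so the caller can log it or show an error.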
A more advanced pattern, suited to background processing, is exponential backoff. Exponential backoff means that the time between retries increases in a geometric progression. For instance, we can use the following times between attempts: 1 second, 2 seconds, 10 seconds, 1 minute, 5 minutes, 30 minutes, 3 hours, 15 hours. After the first step, every increase is by a factor of roughly 5. Since we make just a handful of requests, we waste far fewer resources.
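A geometric schedule like this can be generated rather than hard-coded. The sketch below is one possible formulation, assuming a strict factor of 5 starting from 1 second and a cap of 15 hours. It therefore produces rounder numbers than the hand-picked values above. The function name and defaults are illustrative.

```python
from typing import List


def backoff_delays(base_seconds: float = 1.0,
                   factor: float = 5.0,
                   max_seconds: float = 15 * 3600.0,
                   attempts: int = 8) -> List[float]:
    """Return the wait before each retry: base * factor**i, capped at max_seconds."""
    return [min(base_seconds * factor ** i, max_seconds) for i in range(attempts)]
```

With the defaults this yields 1, 5, 25, 125, 625, 3125, 15625 and 54000 seconds; the last value is the 15 hour cap, which stops the delay from growing without bound.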
Taking things even further, note that if many calls failed while the remote application was non-responsive, they will all retry on the same schedule, and we will bombard the remote application with all of them at once on every retry.
A clever solution is to spread the requests out by adding some randomness to the time between tries, so that not all requests hit the remote application at the same time. Keep in mind that if you add randomness, requests may be made in a different order than the one in which our system first generated them. For most applications that won’t be a problem, but in some cases it may be.
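Adding randomness to a delay takes only a few lines. This sketch adds a random extra of up to a fraction of the base delay. The name jittered_delay and the 50% default spread are assumptions, and other jitter schemes (such as picking uniformly between zero and the full delay) are equally valid.

```python
import random
from typing import Optional


def jittered_delay(delay_seconds: float, spread: float = 0.5,
                   rng: Optional[random.Random] = None) -> float:
    """Return delay_seconds plus a random extra of up to spread * delay_seconds."""
    generator = rng if rng is not None else random.Random()
    return delay_seconds + generator.uniform(0.0, spread * delay_seconds)
```

Each failed request computes its own jittered delay, so a batch of requests that failed together will come back spread over a window instead of as a single spike.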
Shortcomings of Timeouts
While timeouts have their advantages, they are far from perfect.
While we wait for a timeout to occur, we tie up resources, usually the thread that made the request. If too many threads are blocked waiting for a timeout, we may exhaust our thread pool and have no threads available to handle new incoming traffic. So even with timeouts, errors may still propagate!
When we wait for a timeout, we also need to ask whether the end user is waiting too. If so, wouldn’t it be better to show them immediately that there is a problem instead of letting them wait? Of course, we can’t always do that, but if we have already observed several timeouts, we can assume the problem persists until some time has passed since the last error we saw.
To overcome these shortcomings and to fail fast instead of continually waiting, I will introduce the circuit breaker in my next post.
In this post, we talked about a few of the things that can make our system unstable.
We then talked about timeouts and retries, which are fundamental patterns for increasing the stability of our system. But we also talked about the shortcomings that timeouts and retries have.
In the next post I will introduce the circuit breaker and how it can protect us from errors in other services. We will also discuss some of the design choices behind the circuit breaker and give a few extra angles on the subject.
I hope you found this to be an interesting read!
Until next time,