In my previous post, I talked about how errors in one service can cause another service to have problems. In this way, errors propagate through our system, rendering it useless. I also introduced timeouts and advanced patterns built on them, and how they help cope with this problem.
But we also saw that timeouts have their shortcomings, and that it is sometimes better to fail fast instead of needlessly waiting for a timeout to occur.
In this post, we will examine the Circuit Breaker, a more advanced pattern that provides better protection against failures in other services.
The Circuit Breaker
A circuit breaker is a more advanced pattern than a timeout, designed around the fail-fast philosophy.
A circuit breaker is a term borrowed from the domain of electrical installations. The goal of a circuit breaker is to protect an electrical circuit from damage caused by excess current.
Likewise, in software, when the remote server is experiencing high pressure we would like a mechanism that shuts things off, relieving the pressure and allowing for recovery.
A circuit breaker is a component in our system that stands between the caller and the remote machine. Here is a sketch to illustrate:
The gist of it is this: when the remote server works as expected, the circuit breaker just proxies the request seamlessly. The interesting bit is what happens when things go wrong. When the circuit breaker detects errors, it stops calling the remote server and instead just fails fast. After a while, the circuit breaker allows new requests through to the remote server to check if it is back up again.
Now we will describe the circuit breaker more accurately. The circuit breaker has three states (and not just two). That is different from hardware circuit breakers and, as I will explain, it is quite a nice piece of design; we can learn from it and apply such patterns to other design problems we face. The three states are the following:
- Closed State – this is the good state, meaning that everything works as expected. In this state, when the circuit breaker gets a request it makes the call to the remote server.
- Open State – after the circuit breaker sees “too many” errors while making requests, it enters the open state. In this state, when the circuit breaker receives a request it will NOT make the call to the remote service. Since the circuit breaker has seen too many errors, it has good reason to believe that the remote service will not return a valid response in a timely fashion, so it just fails fast.
- Half-Open State – this is an intriguing state. We enter it from the open state after a suitable amount of time has passed. In this state, we allow the call to the remote server and transition to the open state if the call fails, or to the closed state if it succeeds.
Here is a sketch of the state machine of the circuit breaker:
Note that we start in the closed state (we assume everything works until observed otherwise) and that we can stay in the closed and open states for a while (this is represented with a self-loop whose condition says when we remain in these states). In contrast, when we reach the half-open state we stay there only until the next request comes. This is a testing point in time, and we move into the closed state if everything works, or back into the open state otherwise (until a suitable amount of time passes again).
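To make the state machine concrete, here is a minimal sketch of a three-state circuit breaker. The names (`CircuitBreaker`, `failure_threshold`, `recovery_timeout`, `CircuitOpenError`) are my own for illustration, not from any particular library, and a production implementation would also need thread safety and the frequency-based counting discussed below:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker fails fast instead of calling the server."""


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold  # "too many" errors
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == self.OPEN:
            # After a suitable amount of time, let one request test the waters.
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = self.HALF_OPEN
            else:
                raise CircuitOpenError("failing fast")  # do not call the server
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == self.HALF_OPEN:
            self._trip()  # in half-open, a single failure re-opens the circuit
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._trip()

    def _on_success(self):
        self.state = self.CLOSED
        self.failures = 0

    def _trip(self):
        self.state = self.OPEN
        self.opened_at = time.monotonic()
        self.failures = 0
```

Note how the half-open state appears only implicitly between the check in `call` and the outcome of the single test request, mirroring the "until the next request comes" behavior described above.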
Before I continue, I’d like to mention some interesting design decisions:
- We don’t go from the closed state to the open state after just one failure. Otherwise, any one-off error would make us fail fast for the subsequent requests.
- We don’t transition automatically from the open state to the closed state. If we did, we would fail slowly until the required number of failures was reached again, and only then transition back to the desired state (open) that allows us to fail fast as we want.
- Instead, having a third state allows us to test the waters and go back into the open state after just one failure.
- We don’t continually test the integration with the other service – thus avoiding putting more pressure on it.
How many errors are too many?
I wrote that we transition to an open state after “too many” errors have occurred. But what is too many?
Of course, the answer is context-dependent. However, here are some points to consider:
- We must reset the errors counter after some time – we don’t want to count errors from the beginning of time.
- The frequency of the errors is more interesting than their count. For example, even if we reset the counter every minute, 5 failures in 3 seconds are different from a failure every 10 seconds.
- Consider tracking different errors separately – the call might fail for several reasons, and a timeout is very different from an HTTP 429 error (Too Many Requests, e.g. a rate limit). The error threshold should be different for different errors.
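The points above can be sketched with a sliding window of error timestamps, tracked per error kind. The names (`ErrorWindow`, `max_errors`, `window_seconds`) and the thresholds are made up for illustration:

```python
import time
from collections import deque


class ErrorWindow:
    """Counts errors within a sliding time window, per error kind."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, now=None):
        """Record one error; return True if the threshold was crossed."""
        now = time.monotonic() if now is None else now
        self.timestamps.append(now)
        # Drop errors that fell out of the window, so we never count
        # errors "from the beginning of time".
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.max_errors


# Different error kinds get different thresholds:
windows = {
    "timeout":  ErrorWindow(max_errors=5, window_seconds=60),  # 5 per minute trips
    "http_429": ErrorWindow(max_errors=2, window_seconds=60),  # back off sooner
}
```

Because the window slides, 5 failures in 3 seconds trip the breaker just as the text suggests, while a failure every 10 seconds keeps aging out of the window and never does.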
Should We Share the Circuit Breaker’s State?
By design, a circuit breaker has state. When we have more than one instance of the calling service, each instance has its own circuit breaker.
The question is this – should we share the state of the circuit breaker among the different instances?
At first, it may seem like a good idea. Instead of letting each instance learn independently about the failures in the remote server we are trying to call, we can share this knowledge and thus fail faster. For example, if we go into the open state after 5 failures and we have 3 instances, it would take just 5 calls for all instances to go into the open state, vs. 15 otherwise.
However, I argue the risks are high and suggest you do not share state across instances.
The first reason is that the shared state might become a single point of failure. For example, if we keep the circuit breaker state in a DB and that DB instance is unresponsive, it can make all instances fail together. This is again the cascading-failure scenario we wanted to avoid to begin with!
Second, sharing state means that the circuit breaker makes network calls to persist the state somewhere. Even if everything works, this may increase latency if not implemented carefully. Ideally, the circuit breaker itself should be transparent and seamless in its closed state.
Third, we now have a new and interesting failure mode: consider a case where one instance is experiencing problems due to issues with the server it is running on. This one instance may communicate that the remote server is unreachable and make all instances go into the open state. Again, an error in one microservice has propagated to other microservices.
Visibility & Operations
When implementing a circuit breaker, we change the system’s behavior whenever the circuit breaker is not in the closed state.
For the sake of operations (and thus for your peace of mind too!):
- Log state transitions.
- Allow operations to view the current state of the circuit breaker. By that I mean the number of errors it has seen in a given interval, and not just which state it is in.
- Allow operations to manually perform a state transition. You do not want to be stuck in the open state with operations left waiting until it resolves on its own.
- Log the different errors the circuit breaker observes (for example, log the HTTP status code or whether a timeout occurred).
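The operational hooks above can be sketched as follows. This is an illustrative skeleton (the names `ObservableBreaker`, `snapshot`, and `force` are my own), showing only the visibility layer rather than the full breaker logic:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("circuit-breaker")


class ObservableBreaker:
    def __init__(self):
        self.state = "closed"
        self.error_counts = {}  # error kind -> count in the current interval

    def _transition(self, new_state, reason):
        # Log every state transition, with the reason for it.
        log.info("state %s -> %s (%s)", self.state, new_state, reason)
        self.state = new_state

    def record_error(self, kind):
        # Log which kind of error we saw (timeout, HTTP status, ...).
        self.error_counts[kind] = self.error_counts.get(kind, 0) + 1
        log.info("observed error: %s (count=%d)", kind, self.error_counts[kind])

    def snapshot(self):
        # Expose the error counts in the interval, not just the state name.
        return {"state": self.state, "errors": dict(self.error_counts)}

    def force(self, new_state):
        # Manual override, so operators are never stuck in an unwanted state.
        self._transition(new_state, reason="manual override")
```

In practice `snapshot` would feed a metrics endpoint or dashboard, and `force` would be wired to an admin command with appropriate access control.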
In this post, we introduced the circuit breaker and showed how it can protect us from errors in other services. We also discussed some of the design choices behind the circuit breaker and offered a few extra angles on the subject.
I hope you found this to be an interesting read!
Until next time,