Circuit Breaker - Protect Distributed Systems

What Is a Circuit Breaker?

In short, the Circuit Breaker is a design pattern that adds resilience and fault tolerance to distributed systems. Named after the electrical device, it acts as a proxy between the calling service and the target service, preventing cascading failures from bringing down the entire application.

The Problem It Solves

In a microservices architecture, when service A depends on service B and B becomes unavailable, service A accumulates blocked threads waiting for responses that never arrive. This exhausts A's resources and makes A unavailable as well, so the failure propagates through the application in a domino effect. The Circuit Breaker solves this by failing fast: once B is known to be failing, calls to it are rejected immediately instead of hanging for long periods.
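
To make the resource exhaustion concrete, here is a rough back-of-the-envelope sketch in Python. The function name, the fixed-size worker pool, and the numbers in the usage note are all hypothetical, and the model assumes that every call to the failed dependency blocks for the full timeout.

```python
def seconds_until_pool_exhausted(pool_size, arrival_rate, timeout_seconds):
    """Estimate how long until every worker thread is blocked.

    Assumes (hypothetically) that each incoming request occupies one worker
    and blocks for the full timeout because the dependency never answers.
    Returns None if the pool never fills under these assumptions.
    """
    # Steady-state number of in-flight calls (Little's law):
    # arrival rate multiplied by the time each call stays blocked.
    in_flight = arrival_rate * timeout_seconds
    if in_flight < pool_size:
        return None  # workers are released faster than they are consumed
    # Workers are consumed at the arrival rate until none are left.
    return pool_size / arrival_rate
```

Under these assumptions, a pool of 200 workers receiving 50 requests per second with a 30-second timeout is fully blocked after 4 seconds, while the same pool with a 1-second timeout never fills. Failing fast has a similar effect to a very short timeout, without sending the request at all.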

The Three States

State	Behavior	Transition
Closed	Requests flow normally; failures are monitored	-> Open, when the error threshold is reached
Open	Requests are blocked immediately; returns fallback or error	-> Half-Open, after the recovery timeout
Half-Open	A limited number of test requests is allowed	-> Closed (success) or Open (failure)

In the Closed state, the system operates normally while the circuit breaker monitors failures. When the number of failures reaches a configured threshold (e.g., 15 failures in 60 seconds), it transitions to Open and rejects requests immediately. After a cooling-off period (the recovery timeout), it moves to Half-Open, where it allows a limited number of test requests to check whether the service has recovered. If they succeed, the circuit returns to Closed; if they fail, it goes back to Open.
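
The three states can be sketched as a small state machine. This is a minimal Python illustration rather than a production implementation: the class and parameter names are hypothetical, the 30-second recovery timeout is an assumed default, the breaker is not thread-safe, and Half-Open lets through a single trial call.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the target service."""


class CircuitBreaker:
    # All names and defaults here are illustrative assumptions.
    def __init__(self, failure_threshold=15, window_seconds=60.0,
                 recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so the timing can be tested
        self.state = State.CLOSED
        self.failure_times = []
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        now = self.clock()
        if self.state is State.OPEN:
            if now - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Recovery timeout elapsed: allow one trial request.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure(now)
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            # The trial request succeeded: resume normal operation.
            self.state = State.CLOSED
            self.failure_times.clear()

    def _on_failure(self, now):
        if self.state is State.HALF_OPEN:
            self._trip(now)  # the trial request failed: reopen
            return
        # Keep only failures inside the sliding window, then record this one.
        self.failure_times = [t for t in self.failure_times
                              if now - t < self.window_seconds]
        self.failure_times.append(now)
        if len(self.failure_times) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = State.OPEN
        self.opened_at = now
        self.failure_times.clear()
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a hypothetical client function, and treat `CircuitOpenError` as the signal to fail fast or fall back. A real deployment would typically add locking and a configurable number of Half-Open trial requests.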

Benefits and Challenges

Benefits

Prevents cascading failures
Improves overall system stability by reducing load on failing services
Provides insights into the health and reliability of services

Challenges

Configuring ideal thresholds and timeouts requires deep knowledge of service behavior
Requires well-defined fallback strategies
The pattern assumes the service recovers over time, which is not always true
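
As one example of a fallback strategy, a caller can serve the last known good value when the protected call fails or is rejected. The sketch below is a hypothetical Python helper, not part of any particular library, and it assumes that stale data is acceptable for the use case.

```python
class CachedFallback:
    """Wrap a callable and return the last good result when it fails.

    Hypothetical helper for illustration; assumes stale data is acceptable.
    """

    def __init__(self, func, default):
        self.func = func
        self.last_good = default  # served until the first successful call

    def __call__(self, *args, **kwargs):
        try:
            # Remember every successful result as the new fallback value.
            self.last_good = self.func(*args, **kwargs)
        except Exception:
            # Covers both downstream errors and fast rejections by a breaker.
            pass
        return self.last_good
```

Other strategies, such as returning a default response or surfacing a clear error to the user, fit the same shape. Which one is appropriate depends on how much staleness or degradation the caller can tolerate.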

When to Use

Synchronous calls between microservices
Integrations with external or third-party APIs
Services with a high risk of overload or variable latency
High-availability environments where downtime is costly