What is the Circuit Breaker?

In short, the Circuit Breaker is a design pattern, named after the electrical circuit breaker, that adds resilience and fault tolerance to distributed systems. It acts as a proxy between the calling service and the target service and stops cascading failures from bringing down the entire application.

The Problem It Solves

In a microservices architecture, when service A depends on service B and B becomes unavailable, service A accumulates blocked threads waiting for responses that never arrive. This exhausts A's resources and makes A unavailable as well, so the failure propagates through the application in a domino effect. The Circuit Breaker solves this by failing fast: once B is known to be unhealthy, calls to it are rejected immediately instead of being left to hang for long periods.

The Three States

State     | Behavior                                                    | Transition
----------|-------------------------------------------------------------|---------------------------------------------
Closed    | Requests flow normally; failures are monitored              | -> Open, when the error threshold is reached
Open      | Requests are blocked immediately; returns fallback or error | -> Half-Open, after the recovery timeout
Half-Open | A limited number of test requests is allowed                | -> Closed (success) or Open (failure)

In the Closed state, requests pass through normally while the circuit breaker counts failures. When the failure count reaches a configured threshold (e.g., 15 failures in 60 seconds), the circuit transitions to Open and rejects calls immediately. After a waiting period (the cooling-off period, or recovery timeout), it moves to Half-Open, where it allows a few test requests to check whether the service has recovered: if they succeed, the circuit closes again; if they fail, it reopens.
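The three states can be sketched as a small state machine. The following Python is a minimal illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `window_seconds`, and `recovery_timeout` are assumptions made for this example, the code is not thread-safe, and it lets a single trial call decide the outcome of the Half-Open state.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # All parameter names and defaults are illustrative assumptions.
    def __init__(self, failure_threshold=15, window_seconds=60.0,
                 recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.recovery_timeout = recovery_timeout
        self.clock = clock          # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = deque()     # timestamps of recent failures
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        now = self.clock()
        if self.state is State.OPEN:
            if now - self.opened_at < self.recovery_timeout:
                # Fail fast: do not touch the target service at all.
                raise CircuitOpenError("circuit is open")
            # Recovery timeout elapsed: let this call through as a trial.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure(now)
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            # The trial call succeeded: resume normal operation.
            self.state = State.CLOSED
            self.failures.clear()

    def _on_failure(self, now):
        if self.state is State.HALF_OPEN:
            # The trial call failed: reopen immediately.
            self._trip(now)
            return
        self.failures.append(now)
        # Keep only the failures inside the sliding time window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = State.OPEN
        self.opened_at = now
        self.failures.clear()
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and handle `CircuitOpenError` by returning a fallback or an error. Real deployments usually rely on an established library rather than a hand-rolled class like this one.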

Benefits and Challenges

Benefits

  • Prevents cascading failures
  • Improves overall system stability by reducing load on failing services
  • Provides insights into the health and reliability of services

Challenges

  • Configuring ideal thresholds and timeouts requires deep knowledge of service behavior
  • Requires well-defined fallback strategies
  • The pattern assumes the service recovers over time, which is not always true
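One possible fallback strategy is to serve the last successful response when the protected call is rejected or fails. The sketch below assumes a hypothetical `CircuitOpenError` raised by the breaker and a hypothetical `CachedFallback` helper; whether stale data is acceptable depends entirely on the service in question.

```python
class CircuitOpenError(Exception):
    """Hypothetical error a circuit breaker raises when it rejects a call."""


class CachedFallback:
    """Return the last good result when the protected call cannot be served."""

    # Errors treated as "service unavailable"; this set is an assumption.
    FALLBACK_ERRORS = (CircuitOpenError, ConnectionError, TimeoutError)

    def __init__(self, protected_call, default=None):
        self.protected_call = protected_call
        self.default = default      # used until a first success is cached
        self.last_good = None
        self.has_value = False

    def get(self, *args, **kwargs):
        try:
            value = self.protected_call(*args, **kwargs)
        except self.FALLBACK_ERRORS:
            # Degrade gracefully: stale data if available, else the default.
            return self.last_good if self.has_value else self.default
        self.last_good = value
        self.has_value = True
        return value
```

Other errors, such as a programming bug raising `KeyError`, are deliberately not caught, so the fallback does not hide defects in the caller's own code.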

When to Use

  • Synchronous calls between microservices
  • Integrations with external or third-party APIs
  • Services with high risk of overload or variable latency
  • High-availability environments where downtime is costly
