Introduction
A circuit breaker protects against cascading failures by stopping requests to a failing service. After a timeout period, it transitions to the half-open state and allows a single probe request through. If this probe fails, the circuit returns to the open state and continues to block all requests. When the backend service is recovering slowly or the probe hits a transient issue, the circuit can bounce between open and half-open indefinitely, effectively stuck open and preventing legitimate requests from ever reaching the recovering service.
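The state machine described above can be sketched in a few lines. This is a simplified model for illustration only, not the Resilience4j implementation; the class and method names are invented, and it admits all traffic in half-open rather than a fixed probe count:

```java
import java.time.Duration;
import java.time.Instant;

// Simplified circuit breaker state machine (illustrative; not Resilience4j's implementation)
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private Instant openedAt;
    private final Duration waitInOpen;

    SimpleCircuitBreaker(Duration waitInOpen) {
        this.waitInOpen = waitInOpen;
    }

    State state(Instant now) {
        // After the wait period, OPEN lazily transitions to HALF_OPEN to let probes through
        if (state == State.OPEN && Duration.between(openedAt, now).compareTo(waitInOpen) >= 0) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    void onFailure(Instant now) {
        // Any failure -- including a failed half-open probe -- (re-)opens the circuit
        state = State.OPEN;
        openedAt = now;
    }

    void onSuccess() {
        // A successful probe closes the circuit again
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;
        }
    }
}
```

Note the failure path: a single failed probe in `onFailure` restarts the full `waitInOpen` period, which is exactly the loop that keeps a slowly recovering backend locked out.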
Symptoms
- API continues returning 503 errors even after the backend service has recovered
- Circuit breaker metrics show oscillation between open and half-open states
- Backend logs show no incoming requests despite the service being healthy
- Circuit breaker never transitions to closed state
- Error message: `CircuitBreaker 'backend-service' is OPEN and does not permit further calls`
Common Causes
- Probe request timeout too short, not giving the recovering service enough time
- Backend service needs warmup time after restart but probe is sent immediately
- Only one probe request allowed in half-open state, so a single failed probe re-opens the circuit
- Probe request hitting a cold cache or uninitialized connection pool on the backend
- Half-open timeout too short relative to the backend's recovery time
Step-by-Step Fix
1. Check the circuit breaker state and metrics: verify the stuck state.

   ```bash
   # Resilience4j metrics (Prometheus)
   curl -s http://api-service:8080/actuator/prometheus | grep resilience4j_circuitbreaker_state
   # Should show: state="open" for the affected circuit
   ```
2. Increase the number of permitted probe requests: allow more than one test call.

   ```java
   import java.time.Duration;

   // Resilience4j configuration
   CircuitBreakerConfig config = CircuitBreakerConfig.custom()
       .failureRateThreshold(50)                  // open when >= 50% of windowed calls fail
       .waitDurationInOpenState(Duration.ofSeconds(30))
       .permittedNumberOfCallsInHalfOpenState(5)  // allow 5 probe requests instead of 1
       .slidingWindowSize(10)
       .build();
   ```
3. Increase the half-open probe timeout: give the backend more time to respond.

   ```yaml
   # Resilience4j YAML config
   resilience4j.circuitbreaker:
     instances:
       backend-service:
         waitDurationInOpenState: 30s
         permittedNumberOfCallsInHalfOpenState: 5
         slowCallDurationThreshold: 10s  # increased from 2s
         recordExceptions:
           - java.io.IOException
           - java.util.concurrent.TimeoutException
   ```
4. Implement gradual traffic increase in half-open state: ramp up slowly.

   ```java
   // Custom half-open logic: forward only a fraction of traffic while probing
   if (circuitBreaker.getState() == CircuitBreaker.State.HALF_OPEN) {
       // Send a small percentage of traffic to the recovering backend
       if (random.nextInt(100) < 10) { // 10% of requests
           return forwardToBackend(request);
       }
       // Remaining requests continue to use the fallback path
   }
   ```
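A time-based ramp makes this concrete: rather than a fixed 10%, admit a growing fraction of requests the longer the circuit has been half-open. This is a self-contained sketch with illustrative names, not a built-in Resilience4j feature:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Random;

// Ramp the admitted traffic from 10% to 100% over a window while half-open (illustrative)
class HalfOpenRamp {
    private final Instant halfOpenSince;
    private final Duration rampWindow;
    private final Random random = new Random();

    HalfOpenRamp(Instant halfOpenSince, Duration rampWindow) {
        this.halfOpenSince = halfOpenSince;
        this.rampWindow = rampWindow;
    }

    // Percentage of requests to forward at 'now', growing linearly from 10 to 100
    int admitPercent(Instant now) {
        double elapsed = Duration.between(halfOpenSince, now).toMillis();
        double fraction = Math.min(1.0, elapsed / rampWindow.toMillis());
        return (int) (10 + 90 * fraction);
    }

    boolean shouldForward(Instant now) {
        return random.nextInt(100) < admitPercent(now);
    }
}
```

The linear ramp is a design choice: it keeps early load on the recovering backend low (protecting a cold cache or warming connection pool) while still converging to full traffic within one ramp window.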
5. Manually reset the circuit breaker after confirming backend health: force recovery.

   ```bash
   # Reset the circuit breaker via the management endpoint
   curl -X POST http://api-service:8080/actuator/circuitbreakers/backend-service/reset
   # Verify the circuit is closed
   curl -s http://api-service:8080/actuator/circuitbreakers/backend-service | jq '.state'
   ```
Prevention
- Configure `permittedNumberOfCallsInHalfOpenState` to at least 3-5 probe requests
- Set the half-open probe timeout to match the backend's worst-case response time
- Implement gradual traffic ramp-up during half-open state
- Monitor circuit breaker state transitions and alert on prolonged open states
- Test circuit breaker behavior with simulated backend failures and recoveries
- Include manual circuit breaker reset endpoints in the API management interface
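Simulated failures can exercise the trip condition without a real backend. The sketch below is a simplified count-based sliding window in the spirit of `failureRateThreshold`/`slidingWindowSize`; it is illustrative only, not Resilience4j's implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Count-based sliding window: trips once the failure rate over the last N calls
// reaches the threshold (simplified model; rate is only evaluated on a full window)
class SlidingWindowTrip {
    private final int windowSize;
    private final double failureRateThreshold; // percent
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure

    SlidingWindowTrip(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    void record(boolean failure) {
        outcomes.addLast(failure);
        if (outcomes.size() > windowSize) {
            outcomes.removeFirst();
        }
    }

    boolean shouldOpen() {
        // Don't evaluate the rate until enough calls have been recorded
        if (outcomes.size() < windowSize) {
            return false;
        }
        long failures = outcomes.stream().filter(f -> f).count();
        return 100.0 * failures / outcomes.size() >= failureRateThreshold;
    }
}
```

Feeding this model scripted failure/recovery sequences in a unit test verifies the thresholds chosen above before they reach production.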