Introduction

A circuit breaker protects against cascading failures by stopping requests to a failing service. After a timeout period, it transitions to half-open state and allows a single probe request through. If this probe fails, the circuit returns to open state, continuing to block all requests. When the backend service is recovering slowly or the probe hits a transient issue, the circuit can remain stuck in the open state indefinitely, preventing legitimate requests from ever reaching the recovering service.
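The open → half-open → open loop described above can be sketched as a minimal state machine. This is plain Java for illustration, not the Resilience4j API; all class and method names here are made up:

```java
// Minimal sketch of the three circuit breaker states and the "stuck open" loop.
public class BreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    static class Breaker {
        State state = State.CLOSED;

        void onFailure() {
            // Any failure in CLOSED or HALF_OPEN opens (or re-opens) the circuit.
            state = State.OPEN;
        }

        void onOpenTimeoutElapsed() {
            // After the open-state wait duration, allow a probe through.
            if (state == State.OPEN) state = State.HALF_OPEN;
        }

        void onProbeResult(boolean success) {
            // A failed probe sends the circuit straight back to OPEN --
            // this is the loop that keeps a slowly recovering backend blocked.
            if (state == State.HALF_OPEN) {
                state = success ? State.CLOSED : State.OPEN;
            }
        }
    }

    public static void main(String[] args) {
        Breaker b = new Breaker();
        b.onFailure();            // backend starts failing -> OPEN
        b.onOpenTimeoutElapsed(); // wait elapses -> HALF_OPEN
        b.onProbeResult(false);   // probe fails -> back to OPEN
        System.out.println(b.state);
    }
}
```

If the probe keeps failing (cold cache, warmup, short timeout), the last two transitions repeat forever, which is the stuck state this article addresses.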

Symptoms

  • API continues returning 503 errors even after the backend service has recovered
  • Circuit breaker metrics show oscillation between open and half-open states
  • Backend logs show no incoming requests despite the service being healthy
  • Circuit breaker never transitions to closed state
  • Error message: CircuitBreaker 'backend-service' is OPEN and does not permit further calls

Common Causes

  • Probe request timeout too short, not giving the recovering service enough time
  • Backend service needs warmup time after restart but probe is sent immediately
  • Only one probe request allowed in half-open state, so a single transient failure re-opens the circuit
  • Probe request hitting a cold cache or uninitialized connection pool on the backend
  • Half-open timeout too short relative to the backend's recovery time
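The single-probe cause above can be shown with a small sketch (illustrative, not the Resilience4j API): with a 50% failure-rate threshold, one permitted probe re-opens the circuit on any transient failure, while five probes tolerate an occasional miss.

```java
// Sketch: outcome of the half-open evaluation for a batch of probe results.
public class HalfOpenProbes {
    // Returns the state the circuit moves to after the permitted probes complete.
    static String evaluate(boolean[] probeResults, double failureRateThreshold) {
        int failures = 0;
        for (boolean ok : probeResults) {
            if (!ok) failures++;
        }
        double failureRate = 100.0 * failures / probeResults.length;
        return failureRate >= failureRateThreshold ? "OPEN" : "CLOSED";
    }

    public static void main(String[] args) {
        // One permitted probe: a single transient failure -> back to OPEN.
        System.out.println(evaluate(new boolean[]{false}, 50));
        // Five permitted probes: one failure out of five is under threshold -> CLOSED.
        System.out.println(evaluate(new boolean[]{false, true, true, true, true}, 50));
    }
}
```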

Step-by-Step Fix

  1. Check the circuit breaker state and metrics: Verify the stuck state.

     ```bash
     # Resilience4j metrics (Prometheus)
     curl -s http://api-service:8080/actuator/prometheus | grep resilience4j_circuitbreaker_state
     # Should show: state="open" for the affected circuit
     ```

  2. Increase the number of permitted probe requests: Allow more than one test.

     ```java
     // Resilience4j configuration
     CircuitBreakerConfig config = CircuitBreakerConfig.custom()
         .failureRateThreshold(50)
         .waitDurationInOpenState(Duration.ofSeconds(30))
         .permittedNumberOfCallsInHalfOpenState(5) // Allow 5 probe requests
         .slidingWindowSize(10)
         .build();
     ```

  3. Increase the half-open probe timeout: Give the backend more time to respond.

     ```yaml
     # Resilience4j YAML config
     resilience4j.circuitbreaker:
       instances:
         backend-service:
           waitDurationInOpenState: 30s
           permittedNumberOfCallsInHalfOpenState: 5
           slowCallDurationThreshold: 10s  # Increased from 2s
           recordExceptions:
             - java.io.IOException
             - java.util.concurrent.TimeoutException
     ```

  4. Implement gradual traffic increase in half-open state: Ramp up slowly.

     ```java
     // Custom half-open logic: forward only a fraction of traffic while probing
     if (circuitBreaker.getState() == CircuitBreaker.State.HALF_OPEN) {
         // Send a small percentage of traffic to the recovering backend
         if (ThreadLocalRandom.current().nextInt(100) < 10) { // 10% of requests
             return forwardToBackend(request);
         }
         // Remaining requests short-circuit without touching the backend
     }
     ```

  5. Manually reset the circuit breaker after confirming backend health: Force recovery.

     ```bash
     # Reset the circuit breaker via management endpoint
     curl -X POST http://api-service:8080/actuator/circuitbreakers/backend-service/reset
     # Verify the circuit is closed
     curl -s http://api-service:8080/actuator/circuitbreakers/backend-service | jq '.state'
     ```

Prevention

  • Configure permittedNumberOfCallsInHalfOpenState to at least 3-5 probe requests
  • Set the half-open probe timeout to match the backend's worst-case response time
  • Implement gradual traffic ramp-up during half-open state
  • Monitor circuit breaker state transitions and alert on prolonged open states
  • Test circuit breaker behavior with simulated backend failures and recoveries
  • Include manual circuit breaker reset endpoints in the API management interface
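To act on the monitoring bullet above, a Prometheus alerting rule along these lines can flag a circuit that stays open too long. The metric name matches Resilience4j's Micrometer export shown in step 1; the alert name, 10m duration, and severity label are illustrative and should be tuned to your environment:

```yaml
groups:
  - name: circuit-breaker-alerts
    rules:
      - alert: CircuitBreakerStuckOpen
        # Resilience4j exports one gauge per state; value 1 marks the current state
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} has been open for over 10 minutes"
```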