Introduction
When a message broker experiences degraded performance or temporary unavailability, producer retry logic can amplify the problem. Each timed-out message triggers retries, multiplying the load on an already struggling broker. This cascading retry storm trips the circuit breaker, which then blocks all message production -- including messages that would have succeeded.
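The amplification is easy to quantify. A rough sketch with made-up numbers (1,000 msg/s and 5 attempts per message are assumptions for illustration, not measurements):

```java
// Back-of-the-envelope retry amplification (hypothetical numbers).
public class RetryAmplification {
    // Total broker requests per second when every send times out and is
    // retried: one initial attempt plus (maxAttempts - 1) retries.
    static long effectiveLoad(long baseRatePerSec, int maxAttempts) {
        return baseRatePerSec * maxAttempts;
    }

    public static void main(String[] args) {
        // 1,000 msg/s with up to 5 attempts each puts 5,000 req/s on a
        // broker that was already struggling at 1,000 req/s.
        System.out.println(effectiveLoad(1_000, 5)); // prints 5000
    }
}
```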
Symptoms
- Circuit breaker transitions to OPEN state after exceeding failure threshold
- Producer logs show repeated `RequestTimedOutException` followed by `CircuitBreakerOpenException`
- Broker CPU and network I/O spike from retry traffic amplification
- All producer threads blocked waiting for circuit breaker half-open transition
- Downstream services stop receiving events, causing stale data and processing gaps
Common Causes
- Retry count set too high with no exponential backoff, overwhelming the broker
- Circuit breaker failure threshold too low relative to normal transient error rate
- Broker underprovisioned for peak load, causing request queue buildup and timeouts
- Multiple producer instances all retrying simultaneously without jitter
- No backpressure mechanism between producers and the message broker
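Two of these causes, missing exponential backoff and missing jitter, can be sketched in plain Java. This is an illustrative helper, not any particular client library's API:

```java
import java.util.concurrent.ThreadLocalRandom;

// Exponential backoff with full jitter. Jitter spreads the retries of many
// producer instances over time so they do not hit the broker in
// synchronized waves.
public class Backoff {
    // Deterministic exponential delay for a 1-based attempt number:
    // base * factor^(attempt - 1).
    static long exponentialDelayMillis(int attempt, long baseMillis, double factor) {
        return (long) (baseMillis * Math.pow(factor, attempt - 1));
    }

    // Full jitter: pick a delay uniformly in [0, exponentialDelay].
    static long jitteredDelayMillis(int attempt, long baseMillis, double factor) {
        long cap = exponentialDelayMillis(attempt, baseMillis, factor);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.printf("attempt %d: up to %d ms%n",
                    attempt, exponentialDelayMillis(attempt, 500, 2.0));
        }
    }
}
```

Without the jitter step, every instance that failed at the same moment retries at the same moment, which is exactly the synchronized storm described above.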
Step-by-Step Fix
1. Check circuit breaker state and configuration: Inspect the current circuit breaker status.

```bash
curl -s http://producer-service:8080/actuator/health | jq '.components.circuitBreaker'
```
2. Reduce retry count and add exponential backoff with jitter: Prevent retry storms from overwhelming the broker.

```java
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

RetryConfig config = RetryConfig.custom()
        .maxAttempts(3)
        // Exponential backoff starting at 500 ms, doubling each attempt,
        // with randomized jitter to desynchronize competing producers
        .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(500, 2.0))
        .retryExceptions(RequestTimedOutException.class)
        .build();
Retry retry = Retry.of("producer", config);
```
3. Increase producer request timeout to accommodate broker load spikes: Give the broker more time to process under load.

```properties
request.timeout.ms=60000
delivery.timeout.ms=120000
linger.ms=50
batch.size=65536
```
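As a cross-check when tuning these values: the Kafka producer validates at startup that `delivery.timeout.ms` is at least `linger.ms + request.timeout.ms`, since it bounds the whole send (batching, in-flight time, and retries). A tiny sketch of that arithmetic with the values above:

```java
// Sanity check for the producer timeout hierarchy: delivery.timeout.ms
// must cover both the batching delay and one full request timeout.
public class TimeoutCheck {
    static boolean deliveryTimeoutValid(long deliveryMs, long lingerMs, long requestMs) {
        return deliveryMs >= lingerMs + requestMs;
    }

    public static void main(String[] args) {
        // 120000 >= 50 + 60000, so the configuration above is accepted.
        System.out.println(deliveryTimeoutValid(120_000, 50, 60_000)); // prints true
    }
}
```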
4. Implement backpressure to throttle producers during broker degradation: Use a bounded send queue.

```java
// Block when send buffer is full instead of queuing unboundedly
producer.send(record).get(30, TimeUnit.SECONDS);
```
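If a blocking `get()` per send is too coarse, the same idea generalizes to a bounded in-flight window. A minimal sketch using a JDK `Semaphore` (the class name and `trySend` signature are illustrative assumptions, not a client API):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Bounded in-flight window: callers block (up to a timeout) when the
// window is full instead of queuing messages unboundedly.
public class BoundedSender {
    private final Semaphore inFlight;

    public BoundedSender(int maxInFlight) {
        this.inFlight = new Semaphore(maxInFlight);
    }

    // Returns false if the window stayed full for timeoutMs; that is the
    // backpressure signal telling the caller to slow down or shed load.
    public boolean trySend(Runnable sendAndAwait, long timeoutMs) throws InterruptedException {
        if (!inFlight.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            // With an async client, release from the completion callback
            // instead of a finally block around a synchronous call.
            sendAndAwait.run();
        } finally {
            inFlight.release();
        }
        return true;
    }
}
```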
5. Monitor circuit breaker metrics and broker health: Set up alerting for early warning.

```bash
# Monitor Resilience4j circuit breaker metrics
curl -s http://producer-service:8080/actuator/prometheus | grep resilience4j
```
Prevention
- Configure exponential backoff with jitter for all producer retries to avoid synchronized retry storms
- Set circuit breaker thresholds based on historical error rates, not theoretical expectations
- Implement producer-side rate limiting that adapts to broker response times
- Use separate circuit breakers per topic to prevent a single degraded topic from blocking all production
- Monitor broker request queue depth and response latency as leading indicators of timeout risk
- Implement graceful degradation with local message buffering during circuit breaker open states
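The per-topic breaker idea above can be sketched as a toy consecutive-failure counter. A real deployment would use something like Resilience4j's `CircuitBreakerRegistry` with one instance per topic; the threshold and method names here are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// One breaker per topic, so a degraded topic does not block production
// to healthy topics. This toy version only counts consecutive failures
// and has no half-open recovery state.
public class PerTopicBreakers {
    static final int FAILURE_THRESHOLD = 5;
    private final Map<String, Integer> consecutiveFailures = new ConcurrentHashMap<>();

    public boolean allowSend(String topic) {
        return consecutiveFailures.getOrDefault(topic, 0) < FAILURE_THRESHOLD;
    }

    public void recordSuccess(String topic) {
        consecutiveFailures.put(topic, 0);
    }

    public void recordFailure(String topic) {
        consecutiveFailures.merge(topic, 1, Integer::sum);
    }
}
```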