Introduction

When a message broker experiences degraded performance or temporary unavailability, producer retry logic can amplify the problem. Each timed-out message triggers retries, multiplying the load on an already struggling broker. This cascading retry storm trips the circuit breaker, which then blocks all message production -- including messages that would have succeeded.
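The amplification is easy to underestimate. A back-of-the-envelope calculation (the producer count and retry settings below are illustrative, not from any particular deployment) shows how quickly naive retries multiply load:

```java
public class RetryAmplification {
    public static void main(String[] args) {
        int producers = 50;          // producer instances, all hitting timeouts
        int messagesPerSecond = 100; // per producer, under normal load
        int retries = 5;             // naive retry count, no backoff

        // Every original send plus each of its retries hits the broker.
        int normalLoad = producers * messagesPerSecond;
        int stormLoad = normalLoad * (1 + retries);

        System.out.println("Normal broker load: " + normalLoad + " req/s");
        System.out.println("Retry-storm load:   " + stormLoad + " req/s");
        // A broker already too slow for 5000 req/s now sees 30000 req/s.
    }
}
```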

Symptoms

  • Circuit breaker transitions to OPEN state after exceeding failure threshold
  • Producer logs show repeated RequestTimedOutException followed by CircuitBreakerOpenException
  • Broker CPU and network I/O spike from retry traffic amplification
  • All producer threads blocked waiting for circuit breaker half-open transition
  • Downstream services stop receiving events, causing stale data and processing gaps

Common Causes

  • Retry count set too high with no exponential backoff, overwhelming the broker
  • Circuit breaker failure threshold too low relative to normal transient error rate
  • Broker underprovisioned for peak load, causing request queue buildup and timeouts
  • Multiple producer instances all retrying simultaneously without jitter
  • No backpressure mechanism between producers and the message broker
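Several of these causes reduce to synchronized, un-jittered retries. A minimal sketch of "full jitter" exponential backoff using only the JDK (each retry sleeps a random duration in [0, base × 2^attempt), capped; the parameter values are illustrative):

```java
import java.util.concurrent.ThreadLocalRandom;

public class FullJitterBackoff {
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = Math.min(capMillis, baseMillis * (1L << attempt));
        // Each client picks a random point in [0, ceiling), so retries
        // from many producers spread out instead of arriving in waves.
        return ThreadLocalRandom.current().nextLong(ceiling);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.printf("attempt %d: wait %d ms%n",
                    attempt, backoffMillis(attempt, 500, 30_000));
        }
    }
}
```

Without the random draw, every producer that timed out at the same moment retries at the same moment, recreating the original spike on each attempt.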

Step-by-Step Fix

  1. Check circuit breaker state and configuration: Inspect the current circuit breaker status.

     ```bash
     curl -s http://producer-service:8080/actuator/health | jq '.components.circuitBreaker'
     ```
  2. Reduce retry count and add exponential backoff with jitter: Prevent retry storms from overwhelming the broker.

     ```java
     RetryConfig config = RetryConfig.custom()
         .maxAttempts(3)
         // Start at 500 ms, double each attempt, and randomize so
         // producers do not retry in lockstep
         .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(500, 2.0))
         .retryExceptions(RequestTimedOutException.class)
         .build();
     Retry retry = Retry.of("producer", config);
     ```
  3. Increase producer request timeout to accommodate broker load spikes: Give the broker more time to process under load. Keep the values consistent when raising them: with Kafka-style producer configs, delivery.timeout.ms must be at least linger.ms + request.timeout.ms.

     ```properties
     request.timeout.ms=60000
     delivery.timeout.ms=120000
     linger.ms=50
     batch.size=65536
     ```
  4. Implement backpressure to throttle producers during broker degradation: Bound the number of outstanding sends.

     ```java
     // Wait for each send to complete (up to 30 s) so the producer
     // slows down with the broker instead of queuing unboundedly
     producer.send(record).get(30, TimeUnit.SECONDS);
     ```
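Blocking on every future serializes the producer. A middle ground is a bounded in-flight window: callers block only once a fixed number of sends are outstanding. A JDK-only sketch (the permit count and the `dispatch` stub are illustrative stand-ins for the real client call):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadLocalRandom;

public class BoundedInFlightSender {
    private final Semaphore inFlight;

    BoundedInFlightSender(int maxInFlight) {
        this.inFlight = new Semaphore(maxInFlight);
    }

    // Blocks the caller once maxInFlight sends are outstanding, so a
    // slow broker throttles producers instead of growing a local queue.
    CompletableFuture<Void> send(String record) throws InterruptedException {
        inFlight.acquire();
        return dispatch(record).whenComplete((r, e) -> inFlight.release());
    }

    int availablePermits() {
        return inFlight.availablePermits();
    }

    // Stand-in for the real broker client call.
    private CompletableFuture<Void> dispatch(String record) {
        return CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(ThreadLocalRandom.current().nextInt(10));
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        });
    }

    public static void main(String[] args) throws Exception {
        BoundedInFlightSender sender = new BoundedInFlightSender(4);
        for (int i = 0; i < 20; i++) {
            sender.send("record-" + i); // blocks while 4 sends are in flight
        }
        System.out.println("all sends dispatched");
    }
}
```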
  5. Monitor circuit breaker metrics and broker health: Set up alerting for early warning.

     ```bash
     # Scrape Resilience4j circuit breaker metrics from the Prometheus endpoint
     curl -s http://producer-service:8080/actuator/prometheus | grep resilience4j
     ```
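For alerting on top of that scrape, a Prometheus rule on the breaker state gauge gives early warning. The metric name below follows Resilience4j's Micrometer binding; verify it against your actual scrape output before relying on it:

```yaml
groups:
  - name: producer-circuit-breaker
    rules:
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Producer circuit breaker {{ $labels.name }} is open"
```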

Prevention

  • Configure exponential backoff with jitter for all producer retries to avoid synchronized retry storms
  • Set circuit breaker thresholds based on historical error rates, not theoretical expectations
  • Implement producer-side rate limiting that adapts to broker response times
  • Use separate circuit breakers per topic to prevent a single degraded topic from blocking all production
  • Monitor broker request queue depth and response latency as leading indicators of timeout risk
  • Implement graceful degradation with local message buffering during circuit breaker open states
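The per-topic isolation idea can be sketched with a tiny consecutive-failure breaker keyed by topic. This is a toy model to show the shape of the isolation; a real deployment would use a library such as Resilience4j with one named breaker instance per topic:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PerTopicBreakers {
    static final int FAILURE_THRESHOLD = 5;
    private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

    // Only the degraded topic's breaker trips; other topics keep producing.
    boolean allowSend(String topic) {
        return counter(topic).get() < FAILURE_THRESHOLD;
    }

    void recordFailure(String topic) {
        counter(topic).incrementAndGet();
    }

    void recordSuccess(String topic) {
        counter(topic).set(0);
    }

    private AtomicInteger counter(String topic) {
        return failures.computeIfAbsent(topic, t -> new AtomicInteger());
    }

    public static void main(String[] args) {
        PerTopicBreakers breakers = new PerTopicBreakers();
        for (int i = 0; i < 5; i++) breakers.recordFailure("orders");
        System.out.println("orders allowed:   " + breakers.allowSend("orders"));
        System.out.println("payments allowed: " + breakers.allowSend("payments"));
        // → orders allowed: false / payments allowed: true
    }
}
```

With a single shared breaker, five timeouts on "orders" would block sends to every topic, which is exactly the failure mode the bullet above warns against.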