Introduction

API gateways maintain a pool of connections to upstream backend services. When backend services become slow or unresponsive, connections in the pool wait for responses until they time out. If enough connections are stuck waiting, the pool becomes exhausted and the gateway cannot accept new requests, returning 503 Service Unavailable errors to all clients.
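To see why a single slow endpoint can take down the whole gateway, it helps to run the numbers. A back-of-envelope sketch, using assumed figures (a pool of 100 connections, 10 req/s hitting a stuck endpoint, a 60s backend timeout):

```bash
# Back-of-envelope: how fast a stuck endpoint exhausts the pool.
# All numbers are assumptions for illustration, not measurements.
POOL_SIZE=100   # gateway's upstream connection pool
STUCK_RPS=10    # requests per second hitting the unresponsive endpoint
TIMEOUT=60      # backend timeout; each stuck request holds a connection this long

# Steady-state demand from the stuck endpoint alone:
DEMAND=$((STUCK_RPS * TIMEOUT))                 # connections wanted
SECONDS_TO_EXHAUST=$((POOL_SIZE / STUCK_RPS))   # time until the pool is full
echo "demand: $DEMAND, pool: $POOL_SIZE, exhausted after: ${SECONDS_TO_EXHAUST}s"
```

Because demand (600 connections) far exceeds the pool (100), every connection ends up parked on the stuck endpoint within ten seconds, which matches the sudden onset of 503s described above.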

Symptoms

  • API gateway returns 503 Service Unavailable or 504 Gateway Timeout
  • Gateway logs show upstream connection pool exhausted or no healthy upstream
  • Backend service response times spike before the gateway starts returning 503s
  • Error rate increases suddenly, then stays high even after the backend recovers
  • Error message: upstream connect error or disconnect/reset before headers. reset reason: connection pool overflow

Common Causes

  • Backend service experiencing high latency due to database query slowdown
  • Connection pool size too small for the traffic volume
  • Backend timeout set too long, keeping connections occupied
  • Cascading failure: one slow backend endpoint affects all endpoints sharing the pool
  • No circuit breaker configured to shed load when backends are unhealthy
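The shared-pool cascade above can be mitigated by giving a known-slow endpoint its own upstream cluster, so its stuck connections cannot starve other routes. A minimal Envoy-style sketch; the cluster names and the `/api/reports/` prefix are illustrative, not taken from any real incident:

```yaml
# Sketch: isolate a slow endpoint in a dedicated cluster so its pool
# exhaustion cannot affect other routes. Names here are assumptions.
routes:
  - match: { prefix: "/api/reports/" }    # the slow endpoint
    route: { cluster: backend-reports }   # dedicated connection pool
  - match: { prefix: "/api/" }
    route: { cluster: backend-service }   # everything else
```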

Step-by-Step Fix

  1. Check the gateway's connection pool status: verify pool exhaustion.

```bash
# Envoy proxy stats
curl localhost:9901/stats | grep upstream_cx
# Check active vs. max connections
curl localhost:9901/stats | grep cx_pool_overflow
```
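The raw stats from step 1 can be turned into a utilization percentage. A sketch using a canned stats line in place of the live admin endpoint; the cluster name `backend-service` and the stat format are assumptions to adjust for your deployment:

```bash
# Sketch: compute pool utilization from Envoy stats output.
# stats_snapshot fakes the admin endpoint for illustration; in production,
# replace its body with: curl -s localhost:9901/stats
stats_snapshot() {
  echo "cluster.backend-service.upstream_cx_active: 250"
}

ACTIVE=$(stats_snapshot | awk -F': ' '/upstream_cx_active/ {print $2}')
MAX=1000   # must match the cluster's configured max_connections
echo "utilization: $((ACTIVE * 100 / MAX))%"
```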
  2. Reduce the backend request timeout: free up connections faster.

```yaml
# API gateway route configuration
routes:
  - match: { prefix: "/api/" }
    route:
      cluster: backend-service
      timeout: 10s  # Reduced from 60s
      retry_policy:
        retry_on: "5xx"
        num_retries: 2
        per_try_timeout: 5s
```
  3. Increase the connection pool size: allow more concurrent connections.

```yaml
clusters:
  - name: backend-service
    connect_timeout: 5s
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 1000
          max_pending_requests: 1000
          max_requests: 1000
```
  4. Implement a circuit breaker to shed load: prevent pool exhaustion. Note that in Envoy the consecutive-5xx ejection settings belong under outlier_detection, not circuit_breakers.

```yaml
# Envoy circuit breaking plus outlier detection
circuit_breakers:
  thresholds:
    - max_connections: 100
      max_pending_requests: 100
      max_requests: 100
outlier_detection:
  consecutive_5xx: 5
  interval: 30s
  base_ejection_time: 30s
```
  5. Verify the gateway recovers after the fix: monitor connection pool status.

```bash
# Monitor connection pool metrics
curl localhost:9901/stats | grep -E "cx_active|cx_total|cx_pool_overflow"
# Active connections should stay below the pool limit
```

Prevention

  • Set backend request timeouts based on p99 response time plus a safety margin
  • Configure connection pool sizes based on expected concurrent request volume
  • Implement circuit breakers that eject unhealthy upstream hosts
  • Monitor connection pool utilization and alert when it exceeds 80% capacity
  • Use per-endpoint connection pools to prevent one slow endpoint from affecting others
  • Implement request hedging to send duplicate requests to alternate backends when latency is high
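The 80% utilization alert above can be expressed as a Prometheus alerting rule. A sketch assuming Envoy metrics are scraped under their default Prometheus names; the cluster label and the hard-coded 1000-connection limit are assumptions that must match your configuration:

```yaml
# Sketch: alert when pool utilization exceeds 80% for 5 minutes.
# The 1000-connection limit is hard-coded and must match the cluster's
# configured max_connections threshold.
groups:
  - name: gateway-connection-pool
    rules:
      - alert: UpstreamPoolNearExhaustion
        expr: envoy_cluster_upstream_cx_active{envoy_cluster_name="backend-service"} > 0.8 * 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Upstream connection pool above 80% utilization"
```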