## Introduction
Service mesh 503 Service Unavailable errors occur when the sidecar proxy cannot route a request to a healthy upstream service: all endpoints are unhealthy, a circuit breaker is open, or the load balancing configuration prevents successful routing. In meshes such as Istio, Linkerd, and Consul Connect, the sidecar (Envoy in Istio and Consul Connect, linkerd2-proxy in Linkerd) handles service-to-service communication, so a 503 usually means the proxy itself rejected or failed to route the request rather than the application returning an error. Common causes include all upstream hosts marked unhealthy by health checks, a circuit breaker tripped by consecutive errors, outlier detection ejecting every instance, an exhausted connection pool, a retry policy exhausted without a successful response, a load balancer sending traffic to unhealthy hosts, service discovery returning an empty endpoint list, mTLS handshake failures, and rate limiting at the sidecar. Fixing these errors requires understanding Envoy's upstream health checking, circuit breaker semantics, outlier detection algorithms, and traffic policy configuration. This guide provides production-proven troubleshooting for 503 errors across Istio, Linkerd, Envoy, and Consul Connect.
## Symptoms
- HTTP 503 Service Unavailable returned to clients
- Envoy log: `upstream_rq_no_healthy_cluster` or `no healthy upstream`
- Istio: `upstream connect error or disconnect/reset before headers`
- Linkerd: `l5d: connection refused` or `no endpoints available`
- Prometheus: `envoy_cluster_upstream_cx_total` flatlining
- Kiali/Linkerd dashboard shows all instances unhealthy
- Circuit breaker open: `envoy_cluster_circuit_breakers_default_rq_open`
- Outlier detection ejecting hosts: `envoy_cluster_outlier_detection_ejections_active`
- Retry budget exhausted: `envoy_cluster_upstream_rq_retry_overflow`
- Connection pool full: `envoy_cluster_upstream_cx_overflow`
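The log messages above correspond to response flags in Envoy's sidecar access log (the `%RESPONSE_FLAGS%` field). A small lookup helper, as a sketch (`explain_flag` is a hypothetical name; the flag meanings follow Envoy's access-log documentation):

```shell
# Translate common Envoy response flags into the 503 cause they usually indicate.
explain_flag() {
  case "$1" in
    UH)  echo "no healthy upstream hosts in the cluster" ;;
    UO)  echo "upstream overflow: circuit breaker tripped" ;;
    URX) echo "retry or connection limit exceeded" ;;
    NR)  echo "no route configured for the request" ;;
    UF)  echo "upstream connection failure (check mTLS)" ;;
    UC)  echo "upstream connection terminated mid-request" ;;
    *)   echo "see Envoy access-log docs for flag: $1" ;;
  esac
}

explain_flag UH   # no healthy upstream hosts in the cluster
```

Matching the flag on a failing request against this table narrows the fault to one of the steps below.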
## Common Causes
- All backend instances failing health checks
- Circuit breaker open due to error threshold exceeded
- Outlier detection ejecting all hosts (consecutive 5xx errors)
- Connection pool size too small for traffic volume
- Retry policy exhausting all attempts without success
- Service discovery returning empty endpoint list
- mTLS failures marking all hosts unhealthy
- Load balancer policy sending to unhealthy hosts
- Rate limiting rejecting requests at sidecar
- DestinationRule misconfiguration (wrong subset, port)
## Step-by-Step Fix
### 1. Diagnose 503 error source
Check Envoy cluster health:
```bash
# Istio: check cluster endpoints (quote the cluster name -- it contains pipes)
istioctl proxy-config endpoints <pod-name>.<namespace> \
  --cluster "outbound|8080||service.namespace.svc.cluster.local"

# Endpoint states:
#   HEALTHY   - endpoint can receive traffic
#   UNHEALTHY - endpoint failing health checks
#   DRAINING  - endpoint being removed

# View all clusters
istioctl proxy-config cluster <pod-name>.<namespace>

# Check cluster stats via the Envoy admin API
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  pilot-agent request GET stats | grep -E "upstream_cx|upstream_rq|health"
```
Analyze 503 error details:
```bash
# Enable debug logging
istioctl proxy-config log <pod-name>.<namespace> --level debug

# Or a specific component only
istioctl proxy-config log <pod-name>.<namespace> --level upstream:debug

# Check the access log for 503 responses
kubectl logs <pod-name> -c istio-proxy | grep " 503 "

# Linkerd: tap live traffic and filter for 503s
linkerd viz tap deployment/<service> -n <namespace> | grep "status=503"
```
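With Istio's default JSON access-log format, the `response_flags` field pinpoints why each 503 was returned. A sketch that tallies them from a saved log (`count_503_flags` is a hypothetical helper; it assumes the log was captured with `kubectl logs <pod> -c istio-proxy > access.log`):

```shell
# Count 503 responses per Envoy response flag in a saved JSON access log.
count_503_flags() {
  grep '"response_code":503' "$1" \
    | sed -n 's/.*"response_flags":"\([^"]*\)".*/\1/p' \
    | sort | uniq -c | sort -rn
}

# Usage: count_503_flags access.log
# A count dominated by UH points at health checks (step 2); UO points at
# circuit breakers (step 3); URX points at retry limits (step 6).
```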
### 2. Fix health check configuration
Configure proper health checks:
```yaml
# DestinationRule with health check settings
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
  namespace: production
spec:
  host: my-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5    # Eject after 5 consecutive 5xx
      interval: 30s              # Check interval
      baseEjectionTime: 30s      # How long to eject
      maxEjectionPercent: 50     # Max % of hosts to eject
      minHealthPercent: 10       # Minimum healthy hosts
    loadBalancer:
      simple: ROUND_ROBIN
```
Fix health check endpoint:
```yaml
# Ensure the application exposes health endpoints
apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: production
spec:
  selector:
    app: my-service
  ports:
  - port: 8080
    targetPort: http
---
# Deployment with proper health checks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: app
        ports:
        - name: http
          containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health
            port: http
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: http
          initialDelaySeconds: 5
          periodSeconds: 5
```
### 3. Fix circuit breaker configuration
Adjust circuit breaker settings:
```yaml
# DestinationRule with circuit breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
  namespace: production
spec:
  host: my-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000            # Max connections to all hosts
      http:
        http1MaxPendingRequests: 100    # Pending request queue depth
        http2MaxRequests: 1000          # Max concurrent requests
        maxRequestsPerConnection: 10    # Requests per connection
        maxRetries: 3                   # Max concurrent retries
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 100           # Allow all hosts to be ejected if needed
```
Monitor circuit breaker state:
```bash
# Check circuit breaker stats via the Envoy admin API
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  pilot-agent request GET stats | grep circuit_breakers

# Key gauges (1 = breaker open, traffic being rejected):
#   cluster.<name>.circuit_breakers.default.cx_open         - connection limit hit
#   cluster.<name>.circuit_breakers.default.rq_pending_open - pending request limit hit
#   cluster.<name>.circuit_breakers.default.rq_retry_open   - retry budget exhausted

# Reset a stuck circuit breaker by restarting the workload
kubectl rollout restart deployment/<deployment-name> -n <namespace>
```
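Scanning a saved stats dump for tripped breakers can be scripted. A minimal sketch (`open_breakers` is a hypothetical helper; gauge names follow Envoy's circuit-breaker stats, and the file name is a placeholder):

```shell
# Report circuit-breaker gauges currently set to 1 (open/blocking)
# in an Envoy stats dump saved to a file.
open_breakers() {
  grep -E 'circuit_breakers\.[a-z_]+\.(cx_open|rq_open|rq_pending_open|rq_retry_open): 1$' "$1"
}

# Usage: open_breakers stats.txt
# Each matching line names the cluster and which limit is being enforced.
```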
### 4. Fix outlier detection
Tune outlier detection:
```yaml
# Too aggressive: transient blips eject hosts and keep them out
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
  namespace: production
spec:
  host: my-service.production.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 3   # Too aggressive
      interval: 5s              # Too frequent
      baseEjectionTime: 60s     # Too long
      maxEjectionPercent: 10    # Too low
---
# Balanced configuration
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
  namespace: production
spec:
  host: my-service.production.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5   # Tolerate some transient errors
      interval: 30s             # Reasonable check interval
      baseEjectionTime: 30s     # Short enough to recover
      maxEjectionPercent: 50    # Keep some capacity
      minHealthPercent: 10      # Ensure a minimum of healthy hosts
```
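`maxEjectionPercent` bounds how many hosts outlier detection may remove at once. The rough capacity math, as a sketch (integer-percentage arithmetic only, not Envoy's exact algorithm, which also honors a minimum-host rule; `max_ejectable` is a hypothetical helper):

```shell
# At most (hosts * maxEjectionPercent / 100) hosts can be ejected at once.
max_ejectable() { echo $(( $1 * $2 / 100 )); }

max_ejectable 4 50    # 2 of 4 hosts may be ejected
max_ejectable 10 10   # only 1 of 10
```

This is why a low `maxEjectionPercent` on a small cluster can round down to zero and leave known-bad hosts in rotation, while 100% can empty the cluster entirely and itself produce 503s.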
### 5. Fix connection pool exhaustion
Increase connection pool size:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
  namespace: production
spec:
  host: my-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 5000    # Increase from Envoy's default of 1024
        connectTimeout: 10s
        tcpKeepalive:
          time: 30s
          interval: 5s
          probes: 3
      http:
        http1MaxPendingRequests: 1000
        http2MaxRequests: 5000
        maxRequestsPerConnection: 100
        maxRetries: 5
        idleTimeout: 60s
        h2UpgradePolicy: UPGRADE
```
Monitor connection pool:
```bash
# Check connection pool stats via the Envoy admin API
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  pilot-agent request GET stats | grep -E "cx_total|cx_active|cx_overflow|rq_pending_overflow"

# Key metrics:
#   upstream_cx_total            - total connections created
#   upstream_cx_active           - currently active connections
#   upstream_cx_overflow         - connection pool limit exceeded
#   upstream_rq_pending_overflow - pending request queue overflowed
```
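For sizing `maxConnections`, Little's law gives a first estimate: concurrent connections needed is roughly request rate times mean latency. A sketch (`needed_connections` is a hypothetical helper and the numbers are illustrative; add headroom on top of the estimate):

```shell
# Approximate concurrent connections needed: rps * latency_ms / 1000.
needed_connections() { echo $(( $1 * $2 / 1000 )); }

needed_connections 2000 50   # 2000 req/s at 50 ms mean latency -> ~100 connections
```

If the measured `upstream_cx_active` regularly approaches this figure, overflow is imminent and the pool limits above should be raised.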
### 6. Fix retry policy issues
Configure retry policy:
```yaml
# VirtualService with retries
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
  namespace: production
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
    retries:
      attempts: 3          # Number of retries
      perTryTimeout: 2s    # Timeout per attempt
      retryOn: 5xx,reset,connect-failure,retriable-4xx
    timeout: 10s           # Total timeout
---
# Don't over-retry (can make things worse)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
  namespace: production
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
    retries:
      attempts: 1          # Single retry
      perTryTimeout: 5s
      retryOn: "503,504"
```
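Why over-retrying makes things worse: when every hop in a call chain retries independently, one client call can multiply into (tries per hop)^hops upstream requests during a full outage, a so-called retry storm. A quick worst-case calculation (`retry_amplification` is a hypothetical helper; "tries" means retries plus the original attempt):

```shell
# Worst-case fan-out: tries_per_hop ^ hops requests from one client call
# when every service in the chain retries and every attempt fails.
retry_amplification() {
  hops=$1; tries=$2; total=1
  while [ "$hops" -gt 0 ]; do
    total=$(( total * tries )); hops=$(( hops - 1 ))
  done
  echo "$total"
}

retry_amplification 3 4   # 3 hops, 3 retries each (4 tries) -> 64 requests
```

This is the reason the second VirtualService above caps retries at one attempt for deep call chains.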
### 7. Fix service discovery issues
Check endpoint discovery:
```bash
# Istio: check endpoints
istioctl proxy-config endpoints <pod-name>.<namespace>

# Should list healthy endpoints as IP:port
# If empty, there is a service discovery issue

# Kubernetes: check endpoints
kubectl get endpoints my-service -n production
kubectl describe endpoints my-service -n production

# Check whether pods are ready
kubectl get pods -l app=my-service -n production -o wide

# Verify the Service selector matches the pod labels
kubectl get service my-service -n production -o jsonpath='{.spec.selector}'
kubectl get pods -l app=my-service -n production --show-labels
```
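The selector-versus-labels comparison can also be done mechanically. A minimal sketch (`selector_matches` is a hypothetical helper; pass the selector and label sets as space-separated `k=v` strings copied from the commands above):

```shell
# Succeed iff every key=value pair of the selector appears in the label set.
selector_matches() {
  selector=$1; labels=$2
  # word-split the selector string into individual k=v pairs (intentional)
  for pair in $selector; do
    case " $labels " in
      *" $pair "*) ;;                       # pair present in the label set
      *) echo "mismatch: $pair"; return 1 ;;
    esac
  done
  echo "selector matches"
}

selector_matches "app=my-service" "app=my-service pod-template-hash=abc"   # selector matches
```

A reported mismatch means the Service selects no pods, so its endpoint list is empty and every request through the mesh returns 503.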
Fix service selector mismatch:
```yaml
# WRONG: selector doesn't match the pod labels
apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: production
spec:
  selector:
    app: wrong-label    # Doesn't match any pods!
  ports:
  - port: 8080
---
# CORRECT: selector matches the pod labels
apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: production
spec:
  selector:
    app: my-service     # Matches deployment labels
  ports:
  - port: 8080
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service   # Must match the Service selector
```
## Prevention
- Monitor endpoint health with alerts on unhealthy ratio
- Set circuit breaker thresholds based on load testing
- Use gradual traffic shifting for new deployments
- Configure appropriate retry policies (avoid retry storms)
- Test failure scenarios with chaos engineering
- Document connection pool sizing based on traffic patterns
- Use canary deployments to catch issues early
- Set up distributed tracing for request flow visibility
- Regular capacity planning and load testing
- Implement proper health check endpoints in applications
## Related Errors
- **Service mesh sidecar injection failed**: Sidecar not injected
- **Service mesh mTLS connection failed**: Certificate issues
- **Service mesh rate limiting configuration error**: Rate limit misconfiguration
- **Service mesh destination rule configuration error**: Traffic policy errors