## Introduction
Envoy proxy sidecar errors occur when the data plane proxy fails to receive configuration from the control plane, cannot reach upstream services, or trips circuit breakers under load. Envoy is the most widely used service mesh data plane, powering Istio, AWS App Mesh, and standalone deployments. As a sidecar, Envoy intercepts all inbound and outbound traffic for the application container, making it a critical dependency. Common causes include xDS (CDS/EDS/LDS/RDS) configuration failures, cluster discovery errors, endpoint health check failures, circuit breaker threshold trips, TLS/mTLS handshake failures, resource exhaustion (memory, file descriptors), incorrect virtual host routing, rate limit service unreachable, and access log configuration errors. The fix requires understanding Envoy's architecture, xDS protocol, configuration validation, and debugging tools. This guide provides production-proven troubleshooting for Envoy sidecar issues across Istio, Linkerd, AWS App Mesh, and standalone Envoy deployments.
## Symptoms
- Envoy sidecar not starting or crashlooping in Kubernetes pod
- Envoy exited with code 137 (OOM killed)
- `upstream_connect_error` in access logs
- `no_healthy_upstream` or `no_cluster_found` errors
- 503 Service Unavailable returned to clients
- 504 Gateway Timeout from Envoy
- Circuit breaker tripped: `rq_pending_overflow` or `rq_total_overflow`
- xDS connection failures: gRPC connection to control plane failed
- TLS handshake failures: `SSL23_GET_SERVER_HELLO:unknown protocol`
- High latency spikes correlating with Envoy CPU usage
- Pod stuck in `Init:0/1` or sidecar not ready
- Envoy admin interface unreachable at localhost:15000
## Common Causes
- Control plane (Istiod, Linkerd destination controller) unreachable
- xDS gRPC connection timeout or certificate validation failure
- Cluster configuration references non-existent service
- Endpoint discovery returns no healthy hosts
- Health check interval too aggressive causing false positives
- Circuit breaker thresholds too low for traffic volume
- Connection pool exhaustion (max_connections, max_pending_requests)
- mTLS certificate rotation failure or expiration
- Virtual service routing rules misconfigured
- Rate limit service timeout or unavailable
- Envoy memory/CPU limits too low for traffic volume
- Init container (istio-init) failed to configure iptables
## Step-by-Step Fix
### 1. Check Envoy sidecar status
Verify sidecar container health:
```bash
# Check pod status with sidecar
kubectl get pods -n namespace -o wide
# Look for: app-pod-xxxx  2/2  Running  0  5m

# Check sidecar specifically
kubectl get pod app-pod-xxxx -n namespace \
  -o jsonpath='{.status.containerStatuses[?(@.name=="istio-proxy")]}'

# Check sidecar logs
kubectl logs app-pod-xxxx -c istio-proxy -n namespace

# Common log patterns:
# xDS connection failure:
#   warning envoy config ... istiod://xds.istio-system.svc:15012 connection failed
# Upstream connection failure:
#   upstream_connect_error: connection_refused
# Circuit breaker trip:
#   rq_pending_overflow: 15
# OOM killed:
#   exit code 137

# Check Envoy admin interface
kubectl exec app-pod-xxxx -c istio-proxy -n namespace -- curl -s localhost:15000/config_dump
kubectl exec app-pod-xxxx -c istio-proxy -n namespace -- curl -s localhost:15000/clusters
kubectl exec app-pod-xxxx -c istio-proxy -n namespace -- curl -s localhost:15000/server_info
```
Envoy admin interface endpoints:
```bash
# Config dump (full xDS configuration)
curl -s localhost:15000/config_dump | jq '.configs[]'

# Cluster status
curl -s localhost:15000/clusters

# Listeners
curl -s localhost:15000/listeners

# Stats (metrics)
curl -s localhost:15000/stats | grep -E "upstream|circuit_breaker|rq_"

# Server info (version, status)
curl -s localhost:15000/server_info

# Reset stats counters (useful before reproducing an issue)
# Note: Envoy has no manual config-reload endpoint; configuration
# arrives via xDS pushes from the control plane.
curl -X POST localhost:15000/reset_counters

# Check readiness
curl localhost:15000/ready
# Returns: LIVE when ready, DRAINING while draining,
# PRE_INITIALIZING / INITIALIZING while starting up
```
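The `/ready` endpoint lends itself to scripting. A minimal sketch (the `envoy_ready_state` helper is hypothetical; the response strings match current Envoy behavior, so verify against your version):

```shell
# Hypothetical helper: map the body returned by the admin /ready endpoint
# to a named state and an exit code, so probes or wrapper scripts can act on it.
envoy_ready_state() {
  case "$1" in
    LIVE*)                           echo "ready";    return 0 ;;
    DRAINING*)                       echo "draining"; return 1 ;;
    PRE_INITIALIZING*|INITIALIZING*) echo "starting"; return 2 ;;
    *)                               echo "unknown";  return 3 ;;
  esac
}

# In practice the argument would come from:
#   kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/ready
envoy_ready_state "LIVE"
```

A wrapper like this keeps probe logic out of the pod spec and makes the drain state distinguishable from a cold start.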
### 2. Diagnose xDS configuration failures
xDS is the protocol Envoy uses to receive configuration:
```bash
# Check overall proxy state
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/server_info | jq '.state'
# Expected: "LIVE"

# Check control plane connectivity via stats
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/stats | grep control_plane
# control_plane.connected_state: 1 means the xDS stream is up

# Check xDS resource status
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/config_dump | jq '
  .configs[] | {
    type: .["@type"],
    version: .version_info,
    last_updated: .last_updated
  }'

# xDS resource types:
# - type.googleapis.com/envoy.config.listener.v3.Listener (LDS)
# - type.googleapis.com/envoy.config.route.v3.RouteConfiguration (RDS)
# - type.googleapis.com/envoy.config.cluster.v3.Cluster (CDS)
# - type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment (EDS)
```
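To see at a glance which xDS resource types a dump contains, counting the `"@type"` markers is enough even without jq. A sketch over a stubbed dump (the variable stands in for real `curl -s localhost:15000/config_dump` output):

```shell
# Stubbed config_dump: the top-level entries are typed as *ConfigDump sections.
config_dump='{"configs":[
 {"@type":"type.googleapis.com/envoy.admin.v3.ListenersConfigDump"},
 {"@type":"type.googleapis.com/envoy.admin.v3.ClustersConfigDump"},
 {"@type":"type.googleapis.com/envoy.admin.v3.RoutesConfigDump"}]}'

# Count each resource-dump type present in the dump
echo "$config_dump" | grep -o '"@type":"[^"]*"' | sort | uniq -c | sort -rn
```

A missing section here (e.g. no `RoutesConfigDump`) points directly at which xDS stream is not delivering.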
Debug xDS connection issues:
```bash
# Istiod logs (control plane)
kubectl logs -l app=istiod -n istio-system

# Look for:
# - Push errors to specific proxies
# - Configuration validation failures
# - Resource exhaustion

# Check xDS push status
istioctl proxy-status

# Output shows all connected proxies:
# NAME                  CLUSTER     CDS     LDS     EDS     RDS     ISTIOD
# app-pod-xxxx.default  Kubernetes  SYNCED  SYNCED  SYNCED  SYNCED  istiod-abc123

# If NOT SYNCED, investigate:
# - Network connectivity to istiod
# - Pilot agent configuration
# - Resource quota on namespace

# Enable Envoy debug logging (all components)
kubectl exec app-pod-xxxx -c istio-proxy -- curl -X POST \
  "localhost:15000/logging?level=debug"

# Or a specific component
kubectl exec app-pod-xxxx -c istio-proxy -- curl -X POST \
  "localhost:15000/logging?upstream=debug"
```
### 3. Fix cluster discovery failures
Clusters define upstream service endpoints:
```bash
# Check cluster status
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/clusters

# Output format (circuit breaker limits per cluster, then stats per endpoint):
# outbound|8080||api-service.default.svc.cluster.local::default_priority::max_connections::1024
# outbound|8080||api-service.default.svc.cluster.local::default_priority::max_pending_requests::1024
# outbound|8080||api-service.default.svc.cluster.local::default_priority::max_requests::1024
# outbound|8080||api-service.default.svc.cluster.local::default_priority::max_retries::3
# outbound|8080||api-service.default.svc.cluster.local::10.0.0.5:8080::cx_total::150
# outbound|8080||api-service.default.svc.cluster.local::10.0.0.5:8080::rq_total::1500
# outbound|8080||api-service.default.svc.cluster.local::10.0.0.5:8080::health_flags::healthy

# Check for endpoints failing health checks or ejected by outlier detection
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/clusters | \
  grep "health_flags::/failed"

# Query EDS for specific cluster
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s \
  "localhost:15000/config_dump?resource=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment" | \
  jq '.configs[] | select(.cluster_name == "outbound|8080||api-service.default.svc.cluster.local")'
```
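Scanning the `/clusters` text output for failed hosts is easy to script. A sketch over stubbed output (the sample lines mirror the real `cluster::host::stat::value` layout, heavily simplified):

```shell
# Stubbed /clusters output: one cluster with a healthy and an ejected host,
# and a second cluster that is fully healthy.
clusters_out='outbound|8080||api.default.svc::10.0.0.5:8080::health_flags::healthy
outbound|8080||api.default.svc::10.0.0.6:8080::health_flags::/failed_outlier_check
outbound|9090||billing.default.svc::10.0.1.2:9090::health_flags::healthy'

# Print "cluster host" for every endpoint whose health_flags contain a failure
echo "$clusters_out" | awk -F'::' '$3 == "health_flags" && $4 ~ /failed/ {print $1, $2}'
```

In a live mesh, replace the stub with `kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/clusters`.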
Fix cluster configuration:
```yaml
# DestinationRule for cluster settings
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service-dr
  namespace: default
spec:
  host: api-service.default.svc.cluster.local
  trafficPolicy:
    # Connection pool settings
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 10s
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
        maxRetries: 3
    # Load balancing settings
    loadBalancer:
      simple: LEAST_REQUEST  # or ROUND_ROBIN, RANDOM
    # Outlier detection (passive health checking)
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
    # TLS settings
    tls:
      mode: ISTIO_MUTUAL  # or DISABLE, SIMPLE, MUTUAL
```

```bash
# Verify service exists and has endpoints
kubectl get svc api-service -n default
kubectl get endpoints api-service -n default

# If no endpoints, check pod selectors
kubectl get svc api-service -n default -o jsonpath='{.spec.selector}'
kubectl get pods -n default -l app=api-service --show-labels
```
### 4. Fix circuit breaker trips
Circuit breakers protect against cascading failures:
```bash
# Check circuit breaker stats
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/stats | \
  grep -E "circuit_breakers|overflow"

# Gauges (1 = breaker currently open):
# cluster.outbound|8080||api-service.circuit_breakers.default.rq_pending_open
# cluster.outbound|8080||api-service.circuit_breakers.default.rq_open
# cluster.outbound|8080||api-service.circuit_breakers.default.cx_open
# Counters (increment each time a request/connection is rejected):
# cluster.outbound|8080||api-service.upstream_rq_pending_overflow
# cluster.outbound|8080||api-service.upstream_cx_overflow

# If an overflow counter is > 0 and still increasing, the breaker is tripping

# Monitor in real time
watch -n1 'kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/stats | grep overflow'
```
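Because overflow stats are cumulative counters, a single reading only says the breaker has tripped at some point; comparing two snapshots shows whether it is tripping now. A sketch with stubbed snapshots (in practice, each would come from a `curl -s localhost:15000/stats | grep overflow` a few seconds apart):

```shell
# Two stats snapshots taken a few seconds apart (stubbed)
before='cluster.outbound|8080||api.upstream_rq_pending_overflow: 15
cluster.outbound|8080||api.upstream_rq_retry_overflow: 0'
after='cluster.outbound|8080||api.upstream_rq_pending_overflow: 22
cluster.outbound|8080||api.upstream_rq_retry_overflow: 0'

# Print only the counters that grew between the snapshots
{ echo "$before"; echo "---"; echo "$after"; } | awk -F': ' '
  /^---$/       { second = 1; next }          # separator between snapshots
  !second       { prev[$1] = $2; next }       # first pass: remember values
  $2+0 > prev[$1]+0 { print $1, "grew by", $2 - prev[$1] }'
```

A counter that grows between snapshots is an active breaker trip; a large but static counter is history.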
Tune circuit breaker thresholds:
```yaml
# DestinationRule with circuit breaker settings
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service-cb
  namespace: default
spec:
  host: api-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        # Maximum pending requests (queued before the breaker opens)
        http1MaxPendingRequests: 1024
        # Maximum active requests
        http2MaxRequests: 2048
        # Max requests per connection before draining
        maxRequestsPerConnection: 100
      tcp:
        # Maximum connections
        maxConnections: 1024
    # Outlier detection (ejection logic)
    outlierDetection:
      # Number of 5xx errors before ejection
      consecutive5xxErrors: 5
      # Time between ejection checks
      interval: 10s
      # How long to eject a host
      baseEjectionTime: 30s
      # Max % of hosts ejected at once
      maxEjectionPercent: 50
      # Minimum % of hosts that must remain healthy
      minHealthPercent: 30
      # Note: Envoy's success-rate ejection fields (success_rate_minimum_hosts,
      # success_rate_stdev_factor) are not exposed through Istio's OutlierDetection
```
Circuit breaker patterns:
```yaml
# For high-traffic services, relax thresholds
trafficPolicy:
  connectionPool:
    http:
      http1MaxPendingRequests: 10000
      http2MaxRequests: 10000
    tcp:
      maxConnections: 10000
  outlierDetection:
    consecutive5xxErrors: 10  # More tolerant
    interval: 5s              # Faster detection
    baseEjectionTime: 10s     # Faster recovery

# For critical services, strict circuit breaker
trafficPolicy:
  outlierDetection:
    consecutive5xxErrors: 3   # Fail fast
    interval: 5s
    baseEjectionTime: 60s     # Longer recovery
    maxEjectionPercent: 100   # Can eject all hosts

# Disable 5xx-based ejection (use with caution)
trafficPolicy:
  outlierDetection:
    consecutive5xxErrors: 0   # 0 disables this check
```
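Threshold values are easier to defend when derived from measured load. By Little's law, steady-state concurrency equals arrival rate times latency; a sizing sketch (the `rps`, `p99_ms`, and `headroom` values are illustrative assumptions, not recommendations):

```shell
# Rough circuit-breaker sizing via Little's law:
#   concurrency = arrival rate (req/s) x latency (s)
rps=2000        # measured peak request rate (assumption)
p99_ms=150      # measured p99 upstream latency (assumption)
headroom=2      # 2x buffer over steady state before the breaker opens

concurrency=$(( rps * p99_ms / 1000 ))      # ~ requests in flight at steady state
suggested_max=$(( concurrency * headroom )) # candidate http2MaxRequests value

echo "steady-state concurrency: $concurrency"
echo "suggested http2MaxRequests: $suggested_max"
```

The point of the headroom factor is that a breaker sized exactly at steady state trips on every minor burst; sized too high, it never protects the upstream. Load-test to confirm.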
### 5. Fix TLS/mTLS handshake failures
Debug TLS connection issues:
```bash
# Check TLS stats
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/stats | \
  grep -E "ssl|tls|handshake"

# Key metrics:
# cluster.outbound|443||external-service.ssl.handshake
# cluster.outbound|443||external-service.ssl.connection_error
# cluster.outbound|443||external-service.ssl.fail_verify_error

# Inspect the workload certificate (SDS-based meshes keep certs in memory)
istioctl proxy-config secret app-pod-xxxx -n namespace -o json

# With file-mounted certs, check dates and verify the chain directly
kubectl exec app-pod-xxxx -c istio-proxy -- cat /etc/certs/cert-chain.pem | \
  openssl x509 -noout -dates -subject -issuer
kubectl exec app-pod-xxxx -c istio-proxy -- openssl verify \
  -CAfile /etc/certs/root-cert.pem /etc/certs/cert-chain.pem

# Check the effective mTLS configuration for a workload
# (istioctl authn tls-check was removed in Istio 1.5)
istioctl x describe pod app-pod-xxxx.default
```
Fix mTLS configuration:
```yaml
# PeerAuthentication for mTLS mode
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  # mTLS mode options:
  # - UNSET: use namespace/global setting
  # - DISABLE: no mTLS
  # - PERMISSIVE: accept both plaintext and mTLS
  # - STRICT: require mTLS
  mtls:
    mode: STRICT
# For gradual migration, use PERMISSIVE first,
# then move to STRICT after all services have sidecars
---
# DestinationRule to enable mTLS for outbound traffic
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service-mtls
  namespace: default
spec:
  host: api-service.default.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL  # Uses Istio-managed certs
---
# For external services with TLS
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: external-api-tls
spec:
  host: api.external.com
  trafficPolicy:
    tls:
      mode: SIMPLE  # Standard TLS (not mTLS)
      caCertificates: /etc/certs/external-ca.pem
```
Certificate rotation issues:
```bash
# Check certificate expiration
kubectl exec app-pod-xxxx -c istio-proxy -- bash -c '
  echo "=== Certificate Chain ==="
  openssl x509 -in /etc/certs/cert-chain.pem -noout -dates
  echo ""
  echo "=== Root Certificate ==="
  openssl x509 -in /etc/certs/root-cert.pem -noout -dates
'

# Force certificate rotation
kubectl delete pod app-pod-xxxx -n namespace
# The replacement pod will get fresh certificates

# Check the sidecar injection webhook CA bundle
kubectl get mutatingwebhookconfiguration istio-sidecar-injector -o yaml | \
  grep -A5 caBundle

# For cert-manager integration
kubectl get certificate -n istio-system
kubectl describe certificate -n istio-system istiod
```
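An expiry check reduces to epoch arithmetic once `notAfter` is extracted. A sketch with stubbed timestamps (in practice, get the end date from `openssl x509 -noout -enddate` and convert it to epoch seconds; the values below are illustrative):

```shell
# Stubbed timestamps: in production, derive both from the live clock
# and the certificate's notAfter field.
now=1700000000          # current time, epoch seconds (assumption)
not_after=1701000000    # certificate notAfter, epoch seconds (assumption)
warn_days=14            # renewal window

days_left=$(( (not_after - now) / 86400 ))
if [ "$days_left" -lt "$warn_days" ]; then
  echo "WARN: certificate expires in ${days_left}d"
else
  echo "OK: ${days_left}d remaining"
fi
```

Run this from cron or a monitoring job so rotation failures surface well before the handshake errors do.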
### 6. Fix connection pool exhaustion
Connection pool tuning:
```bash
# Check connection pool stats
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/stats | \
  grep -E "cx_|upstream_cx_"

# Key metrics:
# cluster.outbound|8080||service.upstream_cx_active    # Current connections
# cluster.outbound|8080||service.upstream_cx_total     # Total connections created
# cluster.outbound|8080||service.upstream_cx_destroy   # Connections destroyed
# cluster.outbound|8080||service.upstream_cx_overflow  # Connection pool full
```
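A quick utilization ratio of `upstream_cx_active` against the configured `maxConnections` shows how close the pool is to overflowing. A sketch with stubbed values (in practice, pull the active count from `/stats` and the limit from the DestinationRule):

```shell
cx_active=420        # from upstream_cx_active (stubbed)
max_connections=500  # from DestinationRule maxConnections (stubbed)

# Integer percentage of the pool in use
pct=$(( cx_active * 100 / max_connections ))
echo "pool utilization: ${pct}%"
if [ "$pct" -ge 80 ]; then
  echo "consider raising maxConnections or adding replicas"
fi
```

Alerting on utilization around 80% gives time to react before `upstream_cx_overflow` starts incrementing.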
Tune connection pool settings:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service-connpool
  namespace: default
spec:
  host: api-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        # Maximum TCP connections
        maxConnections: 500
        # TCP keepalive
        tcpKeepalive:
          time: 7200s    # Send keepalive after 2h idle
          interval: 75s  # Then probe every 75s
          probes: 10     # Close after 10 failed probes
      http:
        # Max HTTP/1.1 pending requests (queued waiting for a connection)
        http1MaxPendingRequests: 256
        # Max concurrent HTTP/2 requests
        http2MaxRequests: 1024
        # Max requests per connection before creating a new one
        maxRequestsPerConnection: 100
        # Idle timeout for HTTP connections
        idleTimeout: 300s
```
HTTP/2 vs HTTP/1.1 considerations:
```yaml
# HTTP/2 multiplexing (recommended)
connectionPool:
  http:
    h2UpgradePolicy: UPGRADE  # Upgrade HTTP/1.1 -> HTTP/2
    http2MaxRequests: 2048
    maxRequestsPerConnection: 1000

# HTTP/1.1 (legacy compatibility)
connectionPool:
  http:
    h2UpgradePolicy: DO_NOT_UPGRADE
    http1MaxPendingRequests: 512
```
### 7. Fix virtual service routing issues
Debug routing configuration:
```bash
# Check virtual service
kubectl get virtualservice api-service -n default -o yaml

# Check route config dump
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/config_dump | \
  jq '.configs[] | select(.["@type"] == "type.googleapis.com/envoy.admin.v3.RoutesConfigDump")'

# Validate mesh configuration
istioctl analyze -n default

# Inspect the routes Envoy actually received
istioctl proxy-config route app-pod-xxxx.default --name http.8080 -o yaml
```
Fix virtual service configuration:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-service
  namespace: default
spec:
  hosts:
    - api-service.default.svc.cluster.local

  # Gateways to apply to (omit for mesh-wide)
  gateways:
    - mesh

  http:
    # Route matching (first match wins)
    - match:
        - headers:
            x-api-version:
              exact: "v2"
          uri:
            prefix: /api/v2
      route:
        - destination:
            host: api-service-v2
            port:
              number: 8080
          weight: 100

    # Default route with traffic splitting
    - route:
        - destination:
            host: api-service-v1
            port:
              number: 8080
          weight: 90
        - destination:
            host: api-service-v2
            port:
              number: 8080
          weight: 10

      # Timeout and retry settings
      timeout: 30s
      retries:
        attempts: 3
        perTryTimeout: 10s
        retryOn: 5xx,reset,connect-failure,retriable-4xx

      # Fault injection (testing only)
      # fault:
      #   abort:
      #     percentage:
      #       value: 1
      #     httpStatus: 500
      #   delay:
      #     percentage:
      #       value: 1
      #     fixedDelay: 5s
```
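Traffic splits are easiest to reason about when destination weights sum to 100 (and older Istio releases rejected routes where they did not), so a pre-apply check is cheap insurance. A sketch over a stubbed manifest fragment:

```shell
# Stubbed manifest fragment; in practice, feed in the real VirtualService YAML
vs_manifest='      weight: 90
      weight: 10'

# Sum all weight: values in the route
total=$(echo "$vs_manifest" | awk '/weight:/ {sum += $2} END {print sum}')
if [ "$total" -eq 100 ]; then
  echo "weights OK: $total"
else
  echo "ERROR: weights sum to $total, expected 100"
fi
```

Note this naive grep sums weights across all routes in the file; for a multi-route VirtualService, check each `route:` block separately.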
### 8. Fix rate limiting issues
Rate limit service configuration:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: rate-limit
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: "envoy.filters.network.http_connection_manager"
              subFilter:
                name: "envoy.filters.http.router"
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.local_ratelimit
          typed_config:
            "@type": type.googleapis.com/udpa.type.v1.TypedStruct
            type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
            value:
              stat_prefix: http_local_rate_limiter
              token_bucket:
                max_tokens: 100
                tokens_per_fill: 100
                fill_interval: 60s
              filter_enabled:
                runtime_key: local_rate_limit_enabled
                default_value:
                  numerator: 100
                  denominator: HUNDRED
              filter_enforced:
                runtime_key: local_rate_limit_enforced
                default_value:
                  numerator: 100
                  denominator: HUNDRED
```
Rate limit service debugging:
```bash
# Check rate limit stats
kubectl exec app-pod-xxxx -c istio-proxy -- curl -s localhost:15000/stats | \
  grep -E "ratelimit|rate_limit"

# Key metrics (under the configured stat_prefix):
# http_local_rate_limiter.http_local_rate_limit.enabled
# http_local_rate_limiter.http_local_rate_limit.enforced
# http_local_rate_limiter.http_local_rate_limit.ok
# http_local_rate_limiter.http_local_rate_limit.rate_limited

# Test rate limiting
for i in {1..150}; do
  curl -s -o /dev/null -w "%{http_code}\n" http://api-service/health
done | sort | uniq -c
```
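The token-bucket parameters above imply concrete pass/reject counts: with `max_tokens: 100` and a 60s `fill_interval`, a burst of 150 requests inside one interval should see roughly 100 accepted and 50 rejected with 429. A simulation sketch (not a live test):

```shell
# Simulate a token bucket: the bucket starts full at max_tokens and is not
# refilled within the burst, since all 150 requests land in one fill_interval.
awk 'BEGIN {
  tokens = 100                   # max_tokens, bucket starts full
  ok = 0; over = 0
  for (i = 1; i <= 150; i++) {   # 150 requests in one fill interval
    if (tokens > 0) { tokens--; ok++ } else { over++ }
  }
  printf "200: %d\n429: %d\n", ok, over
}'
```

If the live curl loop above shows far fewer 429s than this predicts, check that `filter_enforced` is at 100% (otherwise the limit is only counted, not enforced).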
### 9. Monitor Envoy health metrics
Prometheus metrics export:
```yaml
# Scrape config for Envoy metrics
scrape_configs:
  - job_name: 'envoy-stats'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```
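The final `__address__` relabel swaps the discovered pod port for the port in the `prometheus.io/port` annotation. The regex can be sanity-checked locally; note that sed lacks non-capturing groups, so `(?::\d+)?` becomes a capturing group and the replacement indices shift from `$1:$2` to `\1:\3`:

```shell
# Prometheus joins source_labels with ";", so the relabel input looks like
# "<pod-ip>:<pod-port>;<annotation-port>". The regex keeps the IP and
# substitutes the annotated port.
echo "10.244.1.7:8080;15090" | sed -E 's/([^:]+)(:[0-9]+)?;([0-9]+)/\1:\3/'
# -> 10.244.1.7:15090
```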
Key Envoy metrics to monitor:
```yaml
# Prometheus alerting rules for Envoy
groups:
  - name: envoy_sidecar
    rules:
      - alert: EnvoySidecarDown
        expr: up{job="envoy-stats"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Envoy sidecar is down"

      - alert: EnvoyCircuitBreakerTripped
        expr: increase(envoy_cluster_upstream_rq_pending_overflow[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Envoy circuit breaker tripped for pending requests"

      - alert: EnvoyConnectionPoolExhausted
        expr: increase(envoy_cluster_upstream_cx_overflow[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Envoy connection pool exhausted"

      - alert: EnvoyHighErrorRate
        expr: |
          sum(rate(envoy_cluster_upstream_rq{envoy_response_code=~"5.."}[5m]))
          / sum(rate(envoy_cluster_upstream_rq[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Envoy upstream error rate above 5%"

      - alert: EnvoyXdsDisconnected
        expr: envoy_control_plane_connected_state == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Envoy has lost its xDS connection to the control plane"
```
### 10. Debug with Envoy access logs
Access log configuration:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: EnvoyFilter
metadata:
  name: access-log
  namespace: istio-system
spec:
  configPatches:
    - applyTo: NETWORK_FILTER
      match:
        listener:
          filterChain:
            filter:
              name: "envoy.filters.network.http_connection_manager"
      patch:
        operation: MERGE
        value:
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
            access_log:
              - name: envoy.access_loggers.file
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                  path: "/dev/stdout"
                  log_format:
                    json_format:
                      time: "%START_TIME%"
                      method: "%REQ(:METHOD)%"
                      path: "%REQ(:PATH)%"
                      protocol: "%PROTOCOL%"
                      response_code: "%RESPONSE_CODE%"
                      response_flags: "%RESPONSE_FLAGS%"
                      bytes_received: "%BYTES_RECEIVED%"
                      bytes_sent: "%BYTES_SENT%"
                      duration: "%DURATION%"
                      upstream_service_time: "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
                      x_forwarded_for: "%REQ(X-FORWARDED-FOR)%"
                      user_agent: "%REQ(USER-AGENT)%"
                      request_id: "%REQ(X-REQUEST-ID)%"
                      authority: "%REQ(:AUTHORITY)%"
                      upstream_host: "%UPSTREAM_HOST%"
                      upstream_cluster: "%UPSTREAM_CLUSTER%"
                      downstream_remote_address: "%DOWNSTREAM_REMOTE_ADDRESS%"
                      downstream_local_address: "%DOWNSTREAM_LOCAL_ADDRESS%"
```
Analyze access log patterns:
```bash
# Check for error patterns
kubectl logs app-pod-xxxx -c istio-proxy | jq -r '
  select(.response_code >= 500) |
  "\(.response_code) \(.upstream_cluster) \(.response_flags)"
' | sort | uniq -c | sort -rn

# Response flag meanings:
# - UH:  no healthy upstream hosts
# - UF:  upstream connection failure
# - UO:  upstream overflow (circuit breaker open)
# - URX: upstream retry limit exceeded
# - UT:  upstream request timeout
# - UR:  upstream remote reset
# - LR:  connection reset locally
# - NR:  no route configured for the request
# - RL:  request rate limited

# Analyze latency distribution (%DURATION% is already in milliseconds)
kubectl logs app-pod-xxxx -c istio-proxy | jq -r '
  select(.duration != null) | .duration
' | sort -n | uniq -c | head -20

# Check for specific upstream failures
kubectl logs app-pod-xxxx -c istio-proxy | jq -r '
  select(.response_flags == "UF") |
  "\(.upstream_host) \(.upstream_cluster)"
' | sort | uniq -c
```
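Tallying response flags over a batch of log lines quickly shows which failure mode dominates, and it works without jq. A sketch over stubbed lines in the JSON format above (in practice, pipe in `kubectl logs app-pod-xxxx -c istio-proxy`):

```shell
# Stubbed access-log lines: two upstream connection failures (UF),
# one circuit-breaker rejection (UO), one success ("-")
logs='{"response_code":503,"response_flags":"UF"}
{"response_code":503,"response_flags":"UO"}
{"response_code":503,"response_flags":"UF"}
{"response_code":200,"response_flags":"-"}'

# Extract and count the flags, most frequent first
echo "$logs" | grep -o '"response_flags":"[^"]*"' | sort | uniq -c | sort -rn
```

With the stubbed data, UF tops the tally, pointing at connection-level failures rather than breaker trips or routing gaps.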
## Prevention
- Monitor xDS sync status with `istioctl proxy-status`
- Set appropriate circuit breaker thresholds based on load testing
- Configure outlier detection with conservative ejection parameters
- Use connection pooling with HTTP/2 for better resource utilization
- Implement proper health checks for all upstream services
- Set up alerts for circuit breaker trips and connection pool exhaustion
- Use mTLS STRICT mode only after verifying all services have sidecars
- Regular Envoy version upgrades for security patches
## Related Errors
- **503 Service Unavailable**: No healthy upstream or circuit breaker open
- **504 Gateway Timeout**: Upstream request timeout
- **Connection Refused**: Upstream service not accepting connections
- **SSL Handshake Failed**: TLS/mTLS configuration mismatch
- **xDS Connection Failed**: Control plane unreachable