Introduction

An Envoy upstream connect timeout occurs when the proxy cannot establish a connection to the backend cluster within the configured deadline. It surfaces as a UF (upstream connection failure) or UO (upstream overflow, i.e. circuit breaking) response flag in access logs, with HTTP 503 or 504 returned to clients. The error points to connectivity problems between Envoy and the backend service, not application-level failures.
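For orientation, a connect-timeout failure in Envoy's default access-log format looks roughly like the following (illustrative values; the 5001 ms duration lines up with a 5 s connect_timeout):

```
[2024-01-15T10:23:45.123Z] "GET /api/orders HTTP/1.1" 503 UF 0 91 5001 - "-" "curl/8.0.1" ...
```

The fields after the status code are the response flags, bytes received, bytes sent, and total duration in milliseconds.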

Symptoms

  • Access logs show UF, UO, or URX (upstream retry exhausted) flags
  • HTTP 503 Service Unavailable or 504 Gateway Timeout responses
  • Metrics show upstream_cx_connect_timeout increasing
  • Requests succeed when retrying (indicates transient network or health check issue)
  • Issue appears after service mesh deployment, mTLS policy change, or network policy update

Common Causes

  • Cluster timeout thresholds too aggressive for backend startup time
  • Circuit breaker tripped due to max_connections or max_pending_requests limits
  • Health checks failing, causing endpoints to be ejected from the cluster
  • mTLS handshake timeout when STRICT mode requires client certificates
  • Network policies or service mesh authorization blocking egress traffic
  • DNS resolution delays for logical cluster names
  • Backend service not running or listening on expected port

Step-by-Step Fix

### 1. Check Envoy access logs for failure codes

Envoy access logs reveal the specific failure reason:

```bash
# Kubernetes: get Envoy sidecar logs
kubectl logs -n <namespace> <pod-name> -c istio-proxy --tail=100 | grep -E "UF|UO|URX"

# Standalone Envoy: check access log
tail -f /var/log/envoy/access.log | grep -E "503|504"
```

Key failure flags:

- UF (Upstream Connection Failure): Connection failed before any data was sent
- UO (Upstream Overflow): Circuit breaker tripped, no connections available
- URX (Upstream Retry Exhausted): All retry attempts failed
- UT (Upstream Request Timeout): Upstream took longer than the request (or per-try) timeout
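To see which failure flags dominate across a log, tally the response-flag field. This is a sketch: it assumes the default access-log format, where the flag is the sixth whitespace-separated field, and uses inline sample lines in place of a real log file.

```shell
# Count occurrences of each failure flag (field 6 in the default log format).
# The here-doc stands in for a real access log such as /var/log/envoy/access.log.
awk '$6 ~ /^(UF|UO|URX|UT)$/ { count[$6]++ } END { for (f in count) print f, count[f] }' <<'EOF'
[2024-01-15T10:23:45.123Z] "GET /a HTTP/1.1" 503 UF 0 91 5001 - "-" "curl"
[2024-01-15T10:23:46.001Z] "GET /a HTTP/1.1" 503 UO 0 81 0 - "-" "curl"
[2024-01-15T10:23:47.250Z] "GET /a HTTP/1.1" 200 - 0 512 12 11 "-" "curl"
EOF
```

Successful requests carry a `-` flag and are excluded by the pattern.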

### 2. Verify cluster configuration and timeouts

Inspect the cluster definition for timeout settings:

```bash
# Envoy admin API - get cluster config (quote the URL so the shell
# does not interpret the query string)
curl -s "http://localhost:15000/config_dump?include_eds" | jq '.configs[] | select(.cluster.name == "outbound|8080||service.default.svc.cluster.local")'

# Check cluster statistics
curl -s http://localhost:15000/clusters | grep -A5 "service.default.svc.cluster.local"
```

Expected timeout configuration:

```yaml
# Timeout settings (connect_timeout is a cluster-level field; the
# per-try and total request timeouts are configured on the route)
connect_timeout: 5s       # TCP connection establishment
per_try_timeout: 3s       # Per retry attempt deadline
request_timeout: 30s      # Total request deadline
idle_timeout: 3600s       # Idle connection keep-alive
```

```yaml
# Health check configuration
health_checks:
- timeout: 5s
  interval: 10s
  unhealthy_threshold: 3
  healthy_threshold: 2
  http_health_check:
    path: /healthz
```

For slow-starting services (JVM, .NET), increase connect_timeout to 10-30s.
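With Istio, one way to raise the effective connect timeout for a single service is the DestinationRule connection pool. A sketch, with `<service-name>` as a placeholder:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: <service-name>
spec:
  host: <service-name>
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 15s   # generous allowance for slow JVM/.NET startup
```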

### 3. Check circuit breaker configuration

Circuit breakers prevent cascade failures but can cause UO errors:

```bash
# Get circuit breaker stats
curl -s http://localhost:15000/stats | grep -E "circuit_breakers|rq_pending|cx_open"

# Example metrics to check:
#   cluster.<name>.circuit_breakers.default.rq_pending_open
#   cluster.<name>.circuit_breakers.default.cx_open
#   cluster.<name>.circuit_breakers.default.rq_retry_open
```

Default Envoy circuit breaker limits:

```yaml
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 1024        # Simultaneous TCP connections
    max_pending_requests: 1024   # Queued requests awaiting connection
    max_requests: 1024           # Active requests (connection + response)
    max_retries: 3               # Concurrent retry attempts
```

Increase limits for high-throughput services:

```yaml
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 10000
    max_pending_requests: 5000
    max_requests: 10000
```

### 4. Verify health check endpoint responses

Failed health checks eject endpoints from the cluster:

```bash
# Check endpoint health status
curl -s http://localhost:15000/clusters | grep -E "healthy|unhealthy"

# Kubernetes: check endpoint slices
kubectl get endpointslices -n <namespace> -l kubernetes.io/service-name=<service-name> -o yaml

# Test health endpoint directly from sidecar
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- curl -s http://<backend-pod-ip>:8080/healthz
```

Expected: Endpoints reported as healthy, with no `failed_active_hc` or `failed_outlier_check` health flags in the /clusters output.

### 5. Check mTLS and peer certificate validation

In STRICT mTLS mode, connection timeouts can indicate certificate issues:

```bash
# Check mTLS policy
kubectl get peerauthentication -n <namespace> default -o yaml
```

Expected output for STRICT mode:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: <namespace>
spec:
  mtls:
    mode: STRICT
```

Verify certificate chain and expiration:

```bash
# From sidecar, check certificate validity window
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- cat /etc/certs/cert-chain.pem | openssl x509 -text -noout | grep -E "Not Before|Not After"

# Check root CA trust
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- openssl verify -CAfile /etc/certs/root-cert.pem /etc/certs/cert-chain.pem
```

If certificate validation fails, the TLS handshake times out instead of completing.

### 6. Verify network policies allow egress traffic

Network policies can silently block Envoy egress:

```bash
# Check network policies in namespace
kubectl get networkpolicy -n <namespace>

# Test egress connectivity from sidecar
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- nc -zv <backend-service>.<namespace>.svc.cluster.local 8080

# Check if egress is allowed
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- curl -v http://<backend-pod-ip>:8080/healthz
```

If network policy blocks traffic, add egress rule:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-envoy-egress
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: <backend-namespace>
    ports:
    - protocol: TCP
      port: 8080
```

### 7. Check DNS resolution for cluster endpoints

Envoy resolves logical service names via DNS:

```bash
# Check DNS resolution from the sidecar
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- nslookup <service>.<namespace>.svc.cluster.local

# Check CoreDNS logs for resolution errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Inspect resolver configuration (search domains, ndots)
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- cat /etc/resolv.conf
```

Expected: Resolution time < 100ms, TTL aligned with endpoint update frequency.
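For clusters whose endpoints Envoy itself resolves (STRICT_DNS or LOGICAL_DNS), refresh behavior can be tuned on the cluster definition. A sketch of the relevant Envoy fields; note that in Istio these clusters are managed by the control plane, so treat this as the underlying knob rather than something to patch directly:

```yaml
type: STRICT_DNS
dns_refresh_rate: 5s    # Re-resolve this often when record TTLs are not honored
respect_dns_ttl: true   # Honor the TTL returned by the resolver instead
```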

### 8. Analyze retry configuration

Insufficient retries can cause transient failures to surface as timeouts:

```yaml
# VirtualService retry configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: <service-name>
spec:
  hosts:
  - <service-name>
  http:
  - route:
    - destination:
        host: <service-name>
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,reset,connect-failure,retriable-4xx
```

Recommended settings for production:

```yaml
retries:
  attempts: 3
  perTryTimeout: 3s
  retryOn: gateway-error,connect-failure,refused-stream,retriable-4xx
```

The VirtualService API does not expose retry budgets; at the Envoy level a retry budget (e.g. budget_percent: 20) is configured on the cluster under `circuit_breakers.thresholds.retry_budget` (`budget_percent`, `min_retry_concurrency`).
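As a sanity check, the retry settings should fit inside the total request deadline: with 3 attempts at 3 s each, the worst case before backoff is 9 s, comfortably under a 30 s request timeout. A quick shell check using those example values:

```shell
# Worst-case retry time (attempts x per-try timeout) must stay below the
# total request timeout, leaving headroom for retry backoff
attempts=3
per_try_s=3
request_timeout_s=30
worst_case=$((attempts * per_try_s))
if [ "$worst_case" -lt "$request_timeout_s" ]; then
  echo "ok: worst case ${worst_case}s < ${request_timeout_s}s"
else
  echo "warning: retries (${worst_case}s) may exceed the ${request_timeout_s}s request timeout"
fi
# prints: ok: worst case 9s < 30s
```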

### 9. Check for connection pool exhaustion

Monitor active and pending connections:

```bash
# Get connection pool stats
curl -s http://localhost:15000/stats | grep -E "cx_active|cx_pending|rq_active"

# Key metrics:
#   cluster.<name>.upstream_cx_active    # Active TCP connections
#   cluster.<name>.upstream_cx_pending   # Waiting for connection
#   cluster.<name>.upstream_rq_active    # Active requests
#   cluster.<name>.upstream_rq_pending   # Waiting to be sent
```

If upstream_cx_pending is high, increase connection pool:

```yaml
# DestinationRule connection pool settings
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: <service-name>
spec:
  host: <service-name>
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 1000
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
```

### 10. Verify backend service binding

Backend must listen on the expected address and port:

```bash
# Check what address the backend binds to
kubectl exec -n <namespace> <backend-pod> -- netstat -tlnp | grep :8080
# (use `ss -tlnp` if netstat is not installed in the image)

# Expected:    0.0.0.0:8080 or :::8080 (all interfaces)
# Problematic: 127.0.0.1:8080 (localhost only, not reachable from the sidecar)

# Check service port mapping
kubectl get svc <service-name> -n <namespace> -o yaml
```

Verify the Service port matches the container port:

```yaml
spec:
  ports:
  - port: 8080        # Service port
    targetPort: 8080  # Container port (must match the app's binding)
```

Prevention

  • Set connect_timeout to 3x p99 connection establishment time
  • Configure circuit breaker limits based on load testing, not defaults
  • Implement /healthz endpoint that returns within 1s
  • Monitor upstream_cx_connect_timeout as leading indicator
  • Use PERMISSIVE mTLS mode during migration and startup grace periods, tightening to STRICT once workloads are stable
  • Deploy with readiness gates that verify sidecar injection complete
  • Test timeout behavior under failure scenarios in staging
Related Errors

  • **503 Service Unavailable**: Circuit breaker open or no healthy endpoints
  • **504 Gateway Timeout**: Response exceeded per_try_timeout or request_timeout
  • **Connection reset by peer**: Backend closed the connection before the response completed
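The `upstream_cx_connect_timeout` counter called out under Prevention can be turned into an alert. A sketch of a Prometheus rule, assuming Envoy stats are scraped with the conventional `envoy_` prefix and `envoy_cluster_name` label (rule and alert names are placeholders):

```yaml
groups:
- name: envoy-upstream
  rules:
  - alert: EnvoyUpstreamConnectTimeouts
    expr: rate(envoy_cluster_upstream_cx_connect_timeout[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Connect timeouts to {{ $labels.envoy_cluster_name }}"
```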