## Introduction

An Envoy upstream connect timeout occurs when the proxy cannot establish a connection to the backend cluster within the configured deadline. It manifests as `UF` (upstream connection failure) or `UO` (upstream overflow) flags in access logs, with HTTP 503 or 504 responses to clients. The error indicates connectivity issues between Envoy and the backend service, not application-level problems.
## Symptoms

- Access logs show `UF`, `UO`, or `URX` (upstream retry exhausted) flags
- HTTP 503 Service Unavailable or 504 Gateway Timeout responses
- Metrics show `upstream_cx_connect_timeout` increasing
- Requests succeed when retried (indicates a transient network or health check issue)
- Issue appears after service mesh deployment, mTLS policy change, or network policy update
## Common Causes

- Cluster timeout thresholds too aggressive for backend startup time
- Circuit breaker tripped due to `max_connections` or `max_pending_requests` limits
- Health checks failing, causing endpoints to be ejected from the cluster
- mTLS handshake timeout when STRICT mode requires client certificates
- Network policies or service mesh authorization blocking egress traffic
- DNS resolution delays for logical cluster names
- Backend service not running or listening on the expected port
## Step-by-Step Fix
### 1. Check Envoy access logs for failure codes
Envoy access logs reveal the specific failure reason:
```bash
# Kubernetes: get Envoy sidecar logs
kubectl logs -n <namespace> <pod-name> -c istio-proxy --tail=100 | grep -E "UF|UO|URX"

# Standalone Envoy: check access log
tail -f /var/log/envoy/access.log | grep -E "503|504"
```
Key failure flags:
- `UF` (Upstream Connection Failure): connection failed before any data was sent
- `UO` (Upstream Overflow): circuit breaker tripped, no connections available
- `URX` (Upstream Retry Limit Exceeded): all retry attempts failed
- `UT` (Upstream Request Timeout): response took longer than `per_try_timeout`
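To see which flag dominates, a quick tally can be pulled from the access log. This sketch embeds hypothetical log lines in Envoy's default access-log format, where the response flags are the field immediately after the status code:

```shell
# Hypothetical sample lines in Envoy's default access-log format
cat > /tmp/access.log <<'EOF'
[2024-05-01T10:00:00.000Z] "GET /api HTTP/1.1" 503 UF 0 91 3001 - "-" "curl/8.0" "abc" "svc" "10.0.0.5:8080"
[2024-05-01T10:00:01.000Z] "GET /api HTTP/1.1" 503 UO 0 81 0 - "-" "curl/8.0" "def" "svc" "-"
[2024-05-01T10:00:02.000Z] "GET /api HTTP/1.1" 200 - 0 42 12 11 "-" "curl/8.0" "ghi" "svc" "10.0.0.5:8080"
EOF

# Count occurrences of each non-empty response flag (field 6 in this format)
awk '{ flags[$6]++ } END { for (f in flags) if (f != "-") print f, flags[f] }' /tmp/access.log | sort
```

Against a real sidecar, pipe `kubectl logs ... -c istio-proxy` into the same `awk` instead of the sample file.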
### 2. Verify cluster configuration and timeouts
Inspect the cluster definition for timeout settings:
```bash
# Envoy admin API - get cluster config
curl -s http://localhost:15000/config_dump?include_eds | jq '.configs[] | select(.cluster.name == "outbound|8080||service.default.svc.cluster.local")'

# Check cluster statistics
curl -s http://localhost:15000/clusters | grep -A5 "service.default.svc.cluster.local"
```
Expected timeout configuration (`connect_timeout` and `health_checks` are cluster-level fields; `per_try_timeout` and the request timeout are set on the route):

```yaml
# Timeout settings
connect_timeout: 5s      # TCP connection establishment
per_try_timeout: 3s      # Per retry attempt deadline
request_timeout: 30s     # Total request deadline
idle_timeout: 3600s      # Idle connection keep-alive

# Health check configuration
health_checks:
- timeout: 5s
  interval: 10s
  unhealthy_threshold: 3
  healthy_threshold: 2
  http_health_check:
    path: /healthz
```
For slow-starting services (JVM, .NET), increase `connect_timeout` to 10-30s.
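In Istio, the per-cluster connect timeout can be raised without editing raw Envoy config via a DestinationRule. A minimal sketch, assuming a hypothetical resource name and host:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: slow-start-backend          # illustrative name
spec:
  host: service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 15s         # generous deadline for JVM/.NET warm-up
```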
### 3. Check circuit breaker configuration
Circuit breakers prevent cascade failures but can cause UO errors:
```bash
# Get circuit breaker stats
curl -s http://localhost:15000/stats | grep -E "circuit_breakers|rq_pending|cx_open"

# Example metrics to check:
# cluster.<name>.circuit_breakers.default.rq_pending_open
# cluster.<name>.circuit_breakers.default.cx_open
# cluster.<name>.circuit_breakers.default.rq_retry_open
```
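The `*_open` gauges are 1 while a breaker is tripped and 0 otherwise, so a tripped breaker can be isolated with a one-liner. This sketch runs against embedded sample stats rather than a live admin endpoint:

```shell
# Sample /stats output (illustrative cluster name and values)
cat > /tmp/cb_stats.txt <<'EOF'
cluster.svc.circuit_breakers.default.cx_open: 0
cluster.svc.circuit_breakers.default.rq_pending_open: 1
cluster.svc.circuit_breakers.default.rq_retry_open: 0
EOF

# Print only breakers that are currently open (gauge == 1)
awk -F': ' '/circuit_breakers/ && $2 == 1 { print $1 }' /tmp/cb_stats.txt
```

Replace the sample file with `curl -s http://localhost:15000/stats` when running against a live proxy.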
Default Envoy circuit breaker limits:
```yaml
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 1024        # Simultaneous TCP connections
    max_pending_requests: 1024   # Queued requests awaiting connection
    max_requests: 1024           # Active requests (connection + response)
    max_retries: 3               # Concurrent retry attempts
```
Increase limits for high-throughput services:
```yaml
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 10000
    max_pending_requests: 5000
    max_requests: 10000
```
### 4. Verify health check endpoint responses
Failed health checks eject endpoints from the cluster:
```bash
# Check endpoint health status
curl -s http://localhost:15000/clusters | grep -E "healthy|unhealthy"

# Kubernetes: check endpoint slices
kubectl get endpointslices -n <namespace> -l kubernetes.io/service-name=<service-name> -o yaml

# Test health endpoint directly from sidecar
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- curl -s http://<backend-pod-ip>:8080/healthz
```
Expected: endpoints marked `healthy` with no `failed_active_hc` flag set.
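Turning the `/clusters` dump into a quick healthy/ejected count can be scripted. The embedded lines below mimic the `::health_flags::` column of the admin output (cluster name and IPs are illustrative):

```shell
# Sample /clusters output: one healthy endpoint, one ejected by active health checks
cat > /tmp/clusters.txt <<'EOF'
outbound|8080||svc.default.svc.cluster.local::10.0.0.5:8080::health_flags::healthy
outbound|8080||svc.default.svc.cluster.local::10.0.0.6:8080::health_flags::/failed_active_hc
EOF

healthy=$(grep -c '::healthy$' /tmp/clusters.txt)
ejected=$(grep -c 'failed_active_hc' /tmp/clusters.txt)
echo "healthy=$healthy ejected=$ejected"
```

If `ejected` is nonzero, fix the health endpoint before tuning any timeouts.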
### 5. Check mTLS and peer certificate validation
In STRICT mTLS mode, connection timeouts can indicate certificate issues:
```bash
# Check mTLS policy
kubectl get peerauthentication -n <namespace> default -o yaml
```

Expected for STRICT mode:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: <namespace>
spec:
  mtls:
    mode: STRICT
```
Verify certificate chain and expiration:
```bash
# From sidecar, check certificate
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- cat /etc/certs/cert-chain.pem | openssl x509 -text -noout | grep -E "Not Before|Not After"

# Check root CA trust
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- openssl verify -CAfile /etc/certs/root-cert.pem /etc/certs/cert-chain.pem
```
If certificate validation fails, the TLS handshake times out instead of completing.
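Expiry can also be checked mechanically with `openssl x509 -checkend`, which is easier to wire into an alert than eyeballing dates. This sketch generates a throwaway certificate as a stand-in for `/etc/certs/cert-chain.pem`:

```shell
# Generate a throwaway 30-day cert (stand-in for the workload cert)
openssl req -x509 -newkey rsa:2048 -keyout /tmp/key.pem -out /tmp/cert.pem \
  -days 30 -nodes -subj "/CN=demo" 2>/dev/null

# -checkend exits 0 only if the cert is still valid N seconds from now
if openssl x509 -checkend 86400 -noout -in /tmp/cert.pem >/dev/null; then
  echo "cert valid for at least 24h"
else
  echo "cert expires within 24h - rotate before handshakes start timing out"
fi
```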
### 6. Verify network policies allow egress traffic
Network policies can silently block Envoy egress:
```bash
# Check network policies in namespace
kubectl get networkpolicy -n <namespace>

# Test egress connectivity from sidecar
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- nc -zv <backend-service>.<namespace>.svc.cluster.local 8080

# Check if egress is allowed
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- curl -v http://<backend-pod-ip>:8080/healthz
```
If network policy blocks traffic, add egress rule:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-envoy-egress
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: <backend-namespace>
    ports:
    - protocol: TCP
      port: 8080
```
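When `nc` is absent from the sidecar image, bash's `/dev/tcp` pseudo-device gives an equivalent reachability probe. The host and port below are placeholders (port 1 is used here as a deliberately closed port for illustration):

```shell
host=localhost
port=1   # hypothetical backend port; substitute the real service and port

if timeout 2 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
  echo "TCP connect to $host:$port succeeded"
else
  echo "TCP connect to $host:$port failed - check NetworkPolicy and listener"
fi
```

A failure here with a running backend points at network policy or mesh authorization; a success shifts suspicion back to timeouts and TLS.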
### 7. Check DNS resolution for cluster endpoints
Envoy resolves logical service names via DNS:
```bash
# Check DNS resolution from sidecar
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- nslookup <service>.<namespace>.svc.cluster.local

# Check CoreDNS logs for resolution errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Verify endpoint DNS TTL
kubectl exec -n <namespace> <pod-name> -c istio-proxy -- cat /etc/resolv.conf
```
Expected: Resolution time < 100ms, TTL aligned with endpoint update frequency.
### 8. Analyze retry configuration
Insufficient retries can cause transient failures to surface as timeouts:
```yaml
# VirtualService retry configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: <service-name>
spec:
  hosts:
  - <service-name>
  http:
  - route:
    - destination:
        host: <service-name>
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,reset,connect-failure,retriable-4xx
```
Recommended settings for production (note: retry budgets are configured on Envoy's circuit breaker thresholds via `retry_budget`, not in the VirtualService):

```yaml
retries:
  attempts: 3
  perTryTimeout: 3s
  retryOn: gateway-error,connect-failure,refused-stream,retriable-4xx
```
### 9. Check for connection pool exhaustion
Monitor active and pending connections:
```bash
# Get connection pool stats
curl -s http://localhost:15000/stats | grep -E "cx_active|rq_pending|rq_active"

# Key metrics:
# cluster.<name>.upstream_cx_active          # Active TCP connections
# cluster.<name>.upstream_rq_pending_active  # Requests waiting for a connection
# cluster.<name>.upstream_rq_active          # Active requests
```
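A saturation check over the stats dump can be scripted. The sample below embeds illustrative values; the pending-request gauge in Envoy's cluster stats is `upstream_rq_pending_active`:

```shell
# Sample /stats output (illustrative cluster name and values)
cat > /tmp/pool_stats.txt <<'EOF'
cluster.svc.upstream_cx_active: 980
cluster.svc.upstream_rq_pending_active: 45
cluster.svc.upstream_rq_active: 950
EOF

# Warn when requests are queueing for a connection
awk -F': ' '/upstream_rq_pending_active/ && $2 > 0 {
  print "WARN: " $2 " requests pending - connection pool may be saturated"
}' /tmp/pool_stats.txt
```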
If `upstream_rq_pending_active` stays high, increase the connection pool:
```yaml
# DestinationRule connection pool settings
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: <service-name>
spec:
  host: <service-name>
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 1000
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
```
### 10. Verify backend service binding
Backend must listen on the expected address and port:
```bash
# Check what address the backend binds to
kubectl exec -n <namespace> <backend-pod> -- netstat -tlnp | grep :8080

# Expected:    0.0.0.0:8080 or :::8080 (all interfaces)
# Problematic: 127.0.0.1:8080 (localhost only, not reachable from the sidecar)

# Check service port mapping
kubectl get svc <service-name> -n <namespace> -o yaml
```

Verify the service port matches the container port:

```yaml
spec:
  ports:
  - port: 8080        # Service port
    targetPort: 8080  # Container port (must match app binding)
```
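The loopback-bind mistake can be caught automatically. This sketch greps a hypothetical `netstat -tlnp` dump (the local address is the fourth column) for listeners bound only to 127.0.0.1:

```shell
# Hypothetical netstat output; a 127.0.0.1 bind is unreachable from the sidecar
cat > /tmp/listeners.txt <<'EOF'
tcp   0  0 127.0.0.1:8080  0.0.0.0:*  LISTEN  12/app
tcp   0  0 0.0.0.0:9090    0.0.0.0:*  LISTEN  13/metrics
EOF

awk '$4 ~ /^127\.0\.0\.1:/ { print "WARN: " $4 " is loopback-only, unreachable from the sidecar" }' /tmp/listeners.txt
```

Pipe real `netstat` output through the same `awk` to run the check in a live pod.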
## Prevention

- Set `connect_timeout` to 3x the p99 connection establishment time
- Configure circuit breaker limits based on load testing, not defaults
- Implement a `/healthz` endpoint that returns within 1s
- Monitor `upstream_cx_connect_timeout` as a leading indicator
- Use `PERMISSIVE` mTLS mode during the service startup grace period
- Deploy with readiness gates that verify sidecar injection is complete
- Test timeout behavior under failure scenarios in staging
## Related Errors

- **503 Service Unavailable**: Circuit breaker open or no healthy endpoints
- **504 Gateway Timeout**: Response exceeded `per_try_timeout` or `request_timeout`
- **Connection reset by peer**: Backend closed the connection before the response completed