What's Actually Happening
Envoy proxy cannot connect to upstream services. Requests fail with connection refused errors, breaking service-to-service communication.
The Error You'll See
Upstream connection failed:

```
[2024-01-01T00:00:00.000Z] "GET /api/users HTTP/1.1" 503 UC 0 95 0 - "10.0.0.1" "Mozilla/5.0" "abc123" "api-service" "10.0.0.2:8080"
```

Connection refused:

```json
{
  "response_code": 503,
  "response_flags": "UC",
  "upstream_service": "api-service",
  "upstream_host": "10.0.0.2:8080",
  "error": "upstream connect error or disconnect/reset before headers"
}
```

No healthy upstream:

```
upstream connect error or disconnect/reset before headers. reset reason: connection refused
```

Why This Happens
1. Service not running - Upstream service crashed or never started
2. Wrong port - Incorrect port in the cluster configuration
3. Health check failing - Endpoints marked unhealthy
4. DNS resolution - Service name not resolving
5. Firewall blocking - Network policies blocking traffic
6. TLS mismatch - Certificate or protocol mismatch
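The causes above map onto the response flags Envoy writes into its access log (the `UC` in the example line above). The flag meanings below paraphrase Envoy's documented response flags; `decode_flag` is a helper of our own for quick triage, not an Envoy tool:

```bash
# Translate an Envoy access-log response flag into its documented cause.
decode_flag() {  # $1 = response_flags value from the access log
  case "$1" in
    UC)  echo "Upstream connection termination" ;;
    UF)  echo "Upstream connection failure (e.g. connection refused)" ;;
    UH)  echo "No healthy upstream hosts in the cluster" ;;
    UT)  echo "Upstream request timeout" ;;
    URX) echo "Retry or max-connect-attempts limit exceeded" ;;
    *)   echo "See the Envoy access log documentation for $1" ;;
  esac
}

decode_flag UC   # the flag from the example log line above
```

`UC`/`UF` point at causes 1, 2 and 5 above; `UH` points at causes 3 and 4.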
Step 1: Check Envoy Configuration
```bash
# Get the full Envoy config dump
curl http://localhost:9901/config_dump

# Show only the clusters section
curl -s http://localhost:9901/config_dump | jq '.configs[] | select(."@type" | contains("ClustersConfigDump"))'

# Dump only the active cluster config
curl "http://localhost:9901/config_dump?resource=dynamic_active_clusters"

# Dump only the listener config
curl "http://localhost:9901/config_dump?resource=dynamic_listeners"

# Show only the routes section
curl -s http://localhost:9901/config_dump | jq '.configs[] | select(."@type" | contains("RoutesConfigDump"))'

# For Istio, dump the sidecar's config from inside the workload pod
kubectl exec <pod-name> -c istio-proxy -- pilot-agent request GET config_dump
```
Step 2: Check Cluster Health
```bash
# Check cluster status
curl http://localhost:9901/clusters

# Output lines look like:
#   api-service::10.0.0.2:8080::health_flags::/failed_active_hc
#   api-service::10.0.0.2:8080::weight::1

# Health flags to look for:
#   failed_active_hc     - active health check failing
#   failed_outlier_check - outlier detection ejected the host

# Check endpoint health for one cluster
curl -s http://localhost:9901/clusters | grep api-service

# Get endpoint details from the config dump
curl -s http://localhost:9901/config_dump | jq '.configs[] | .dynamic_active_clusters[]? | select(.cluster.name == "api-service")'

# Check outlier detection settings
curl -s http://localhost:9901/config_dump | jq '.configs[] | .dynamic_active_clusters[]? | .cluster.outlier_detection'
```
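Scanning `/clusters` output by eye gets tedious with many endpoints; the `health_flags` lines can be filtered mechanically. A sketch using awk against a made-up sample (in practice, pipe `curl -s http://localhost:9901/clusters` in instead):

```bash
# Illustrative sample of /clusters output (not captured from a live proxy)
sample='api-service::10.0.0.2:8080::health_flags::/failed_active_hc
api-service::10.0.0.3:8080::health_flags::healthy
api-service::10.0.0.2:8080::weight::1'

# Print host:port of every endpoint whose health_flags are not "healthy"
echo "$sample" | awk -F'::' '$3 == "health_flags" && $4 != "healthy" { print $2 }'
```

Here the filter prints `10.0.0.2:8080`, the endpoint failing its active health check.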
Step 3: Check Service Endpoints
```bash
# For Kubernetes:

# Check the Service exists
kubectl get svc api-service

# Check the Endpoints object has addresses
kubectl get endpoints api-service

# Check the pods backing the Service
kubectl get pods -l app=api-service

# Check a pod is running
kubectl describe pod <pod-name>

# Check the pod is listening on the expected port
kubectl exec <pod-name> -- netstat -tlnp | grep 8080

# Port-forward to test the service directly
kubectl port-forward svc/api-service 8080:8080 &
curl http://localhost:8080/health

# Check the Service selector matches the pod labels
kubectl get svc api-service -o yaml | grep -A 5 selector
kubectl get pods -l app=api-service
```
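The selector-vs-labels comparison in the last two commands is the step most often gotten wrong. The rule: every `key=value` pair in the Service selector must appear among the pod's labels. A toy sketch of that rule (the sample data is made up; in practice pull both sides with `kubectl ... -o jsonpath`):

```bash
# Toy selector match check (illustrative data)
selector="app=api-service"
pod_labels="app=api-service,version=v1"

matches=true
IFS=','
for kv in $selector; do
  case ",$pod_labels," in
    *",$kv,"*) : ;;         # this selector pair is present in the labels
    *) matches=false ;;     # missing -> the Service will not select this pod
  esac
done
unset IFS
echo "selector matches pod: $matches"
```

Extra labels on the pod (like `version=v1` here) are fine; a selector key the pod lacks means an empty Endpoints object.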
Step 4: Check Health Checks
```yaml
# Envoy health check configuration
clusters:
  - name: api-service
    connect_timeout: 5s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    health_checks:
      - timeout: 5s
        interval: 10s
        unhealthy_threshold: 3
        healthy_threshold: 1
        http_health_check:
          path: "/health"
          expected_statuses:
            - start: 200
              end: 300   # the range end is exclusive, so this accepts 200-299
```

```bash
# Verify the health check endpoint responds
curl http://api-service:8080/health

# Check health check stats
curl "http://localhost:9901/stats?filter=cluster.api-service.health_check"

# Relevant counters:
#   cluster.api-service.health_check.healthy
#   cluster.api-service.health_check.unhealthy
```
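Envoy marks a host unhealthy after `unhealthy_threshold` consecutive failures and healthy again after `healthy_threshold` consecutive passes. A toy sketch of that state machine (not Envoy code), mirroring the `unhealthy_threshold: 3` / `healthy_threshold: 1` values above:

```bash
# Toy model of Envoy's active health-check thresholds (illustrative only)
UNHEALTHY_THRESHOLD=3
state=healthy
fails=0

record_probe() {  # $1 = pass|fail
  if [ "$1" = fail ]; then
    fails=$((fails + 1))
    if [ "$fails" -ge "$UNHEALTHY_THRESHOLD" ]; then state=unhealthy; fi
  else
    fails=0
    state=healthy   # healthy_threshold: 1 -> a single pass recovers the host
  fi
}

record_probe fail; record_probe fail
echo "$state"   # two failures: still healthy
record_probe fail
echo "$state"   # third consecutive failure: unhealthy
record_probe pass
echo "$state"   # one pass: healthy again
```

One implication: an endpoint that fails two probes and then recovers is never marked unhealthy, which is why intermittent failures can hide from `/clusters`.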
Step 5: Check Network Connectivity
```bash
# Test connectivity from Envoy to the upstream
kubectl exec -it <envoy-pod> -- nc -zv api-service 8080

# Check DNS resolution
kubectl exec -it <envoy-pod> -- nslookup api-service

# Test with curl
kubectl exec -it <envoy-pod> -- curl -v http://api-service:8080/health

# Check network policies
kubectl get networkpolicies -A

# Check the host firewall
iptables -L -n | grep 8080

# Allow traffic if blocked
iptables -I INPUT -p tcp --dport 8080 -j ACCEPT

# Check for IP conflicts
arp -a | grep <upstream-ip>

# Test basic reachability from the Envoy container
kubectl exec -it <envoy-pod> -- ping api-service
```
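Minimal sidecar images often ship without `nc` or `ping`. Bash's built-in `/dev/tcp` pseudo-device is a fallback that needs nothing extra; a sketch (host and port are placeholders):

```bash
# TCP reachability check using bash's /dev/tcp (no nc required).
# Exit status 0 means the port accepted a connection.
check_port() {  # $1 = host, $2 = port
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

if check_port api-service 8080; then   # placeholder host/port
  echo "reachable"
else
  echo "connection refused, unresolvable, or timed out"
fi
```

An instant failure usually means connection refused (wrong port, service down); hitting the 2-second timeout points at routing or a network policy dropping packets.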
Step 6: Check TLS Configuration
```bash
# If using TLS to the upstream:

# Check the cluster's TLS transport socket
curl -s http://localhost:9901/config_dump | jq '.configs[] | .dynamic_active_clusters[]? | .cluster.transport_socket'

# Verify certificates
kubectl exec <envoy-pod> -- openssl s_client -connect api-service:8443 -showcerts

# Test the TLS handshake
kubectl exec <envoy-pod> -- curl -k https://api-service:8443/health

# Common TLS issues:
#   - Certificate expired
#   - Wrong CA
#   - SNI mismatch
#   - Protocol version mismatch

# Check certificate expiry
kubectl exec <envoy-pod> -- openssl s_client -connect api-service:8443 2>/dev/null | openssl x509 -noout -dates
```
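To turn the `notAfter` value into a days-remaining number, the date arithmetic can be done in shell. A sketch assuming GNU `date` (the `notAfter` string is an illustrative value; in practice take it from the `openssl x509 -noout -dates` output above):

```bash
# Days until certificate expiry (GNU date syntax)
not_after="Jan  1 00:00:00 2030 GMT"   # illustrative; from `openssl x509 -noout -enddate`

expiry_epoch=$(date -d "$not_after" +%s)
now_epoch=$(date +%s)
days_left=$(( (expiry_epoch - now_epoch) / 86400 ))

echo "certificate expires in ${days_left} days"
if [ "$days_left" -lt 30 ]; then
  echo "WARNING: renew soon"
fi
```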
Step 7: Check Connection Timeout
```yaml
# Envoy cluster timeout configuration
clusters:
  - name: api-service
    connect_timeout: 5s   # increase if connections are slow to establish

# For slow upstreams:
#   connect_timeout: 30s

# Per-route request timeout
routes:
  - match:
      prefix: "/api"
    route:
      cluster: api-service
      timeout: 30s
```

```bash
# Check connect timeout stats
curl "http://localhost:9901/stats?filter=cluster.api-service.upstream_cx_connect_timeout"

# Check whether timeout is the issue:
#   - UC flag appears immediately -> connection refused
#   - timeout occurs              -> network unreachable
```
Step 8: Check Outlier Detection
```yaml
# Outlier detection configuration
clusters:
  - name: api-service
    outlier_detection:
      consecutive_5xx: 5
      interval: 30s
      base_ejection_time: 30s
      max_ejection_percent: 50
```

```bash
# Check outlier status
curl -s http://localhost:9901/clusters | grep outlier

# Outlier stats
curl "http://localhost:9901/stats?filter=outlier_detection"

# Resetting ejections:
#   ejected hosts are re-admitted automatically after base_ejection_time

# Temporarily disable for debugging:
#   remove the outlier_detection section from the cluster config
```
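How `consecutive_5xx` ejection behaves can be sketched as a counter that any non-5xx response resets (a toy model of the idea, not Envoy internals; it mirrors `consecutive_5xx: 5` from the config above):

```bash
# Toy model of outlier detection's consecutive_5xx ejection (illustrative only)
CONSECUTIVE_5XX=5
errors=0
ejected=false

observe() {  # $1 = HTTP status code of one upstream response
  case "$1" in
    5*) errors=$((errors + 1))
        if [ "$errors" -ge "$CONSECUTIVE_5XX" ]; then ejected=true; fi ;;
    *)  errors=0 ;;   # any success resets the consecutive counter
  esac
}

# Two 503s, one 200, then five 503s in a row
for code in 503 503 200 503 503 503 503 503; do
  observe "$code"
done
echo "ejected=$ejected"   # the 200 reset the count, but 5 consecutive 503s still ejected the host
```

This is why a host that returns occasional 5xx mixed with successes is never ejected, while a fully dead host is ejected within five requests.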
Step 9: Check Envoy Logs
```bash
# Check Envoy access logs
kubectl logs <envoy-pod> -c istio-proxy | grep 503

# Check Envoy error logs
kubectl logs <envoy-pod> -c istio-proxy | grep -iE "error|warning"

# Enable debug logging
curl -X POST "http://localhost:9901/logging?level=debug"

# Watch logs in real time
kubectl logs -f <envoy-pod> -c istio-proxy | grep -i upstream

# Common log patterns:
#   - "upstream connect error"
#   - "connection refused"
#   - "health check failed"
#   - "no healthy upstream"

# For Istio:
istioctl proxy-config cluster <pod-name> -n <namespace>
istioctl proxy-config endpoint <pod-name> -n <namespace>
```
Step 10: Monitor Envoy Metrics
```bash
# Key Envoy metrics
curl -s http://localhost:9901/stats/prometheus | grep -E "upstream_cx|cluster.*api"

# Important metrics:
#   - envoy_cluster_upstream_cx_connect_fail
#   - envoy_cluster_upstream_cx_connect_timeout
#   - envoy_cluster_health_check_failure
#   - envoy_cluster_upstream_rq_5xx

# Create a monitoring script
cat << 'EOF' > /usr/local/bin/monitor-envoy.sh
#!/bin/bash
echo "=== Cluster Status ==="
curl -s http://localhost:9901/clusters

echo ""
echo "=== Upstream Connection Errors ==="
curl -s http://localhost:9901/stats | grep upstream_cx_connect_fail

echo ""
echo "=== Health Check Status ==="
curl -s http://localhost:9901/stats | grep health_check

echo ""
echo "=== 5xx Responses ==="
curl -s http://localhost:9901/stats | grep "_5xx"

echo ""
echo "=== Active Connections ==="
curl -s http://localhost:9901/stats | grep upstream_cx_active
EOF

chmod +x /usr/local/bin/monitor-envoy.sh
```

```yaml
# Prometheus alerts
- alert: EnvoyUpstreamConnectionFailed
  expr: rate(envoy_cluster_upstream_cx_connect_fail[5m]) > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Envoy upstream connection failures"

- alert: EnvoyNoHealthyUpstream
  expr: envoy_cluster_health_check_healthy == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Envoy has no healthy upstream hosts"
```
Envoy Upstream Connection Checklist
| Check | Command | Expected |
|---|---|---|
| Cluster config | config_dump | Correct hosts |
| Endpoints | clusters | Healthy |
| Health check | /health endpoint | 200 OK |
| Connectivity | nc -zv | Connected |
| DNS | nslookup | Resolved |
| TLS | openssl s_client | Valid cert |
Verify the Fix
```bash
# After fixing the upstream connection:

# 1. Check the cluster is healthy (expect no failed_active_hc flag)
curl -s http://localhost:9901/clusters | grep api-service

# 2. Confirm upstream requests are succeeding
curl -s http://localhost:9901/stats | grep upstream_rq_200

# 3. Check there are no new connection failures (expect 0)
curl -s http://localhost:9901/stats | grep connect_fail

# 4. Verify health checks are passing (expect healthy count > 0)
curl -s http://localhost:9901/stats | grep health_check.healthy

# 5. Test from a client (expect HTTP 200)
curl http://api-gateway/api/users

# 6. Keep monitoring
/usr/local/bin/monitor-envoy.sh
```
Related Issues
- [Fix HAProxy Backend Down](/articles/fix-haproxy-backend-down)
- [Fix Traefik Routing Not Working](/articles/fix-traefik-routing-not-working)
- [Fix Kubernetes Service Not Accessible](/articles/fix-kubernetes-service-not-accessible)