What's Actually Happening

Envoy cannot establish a connection to an upstream service. Requests fail with 503 responses and connection refused errors, breaking service-to-service communication.

The Error You'll See

Upstream connection failed:

```text
[2024-01-01T00:00:00.000Z] "GET /api/users HTTP/1.1" 503 UC 0 95 0 - "10.0.0.1" "Mozilla/5.0" "abc123" "api-service" "10.0.0.2:8080"
```

Connection refused:

```json
{
  "response_code": 503,
  "response_flags": "UC",
  "upstream_service": "api-service",
  "upstream_host": "10.0.0.2:8080",
  "error": "upstream connect error or disconnect/reset before headers"
}
```

Response body with the reset reason:

```text
upstream connect error or disconnect/reset before headers. reset reason: connection refused
```
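To see which response flags dominate, the flag field can be tallied straight from the access log. The following is a minimal sketch, assuming the default Envoy access log format shown in the first sample above, where the response code and flags directly follow the quoted request line. `count_503_flags` is a hypothetical helper name, not part of Envoy, and a custom log format would need a different field split.

```shell
#!/usr/bin/env bash
# Tally response flags for 503 responses in default-format Envoy access logs.
# Reads log lines on stdin and prints "<count> <flags>" per distinct flag value.
count_503_flags() {
  awk -F'"' '
    {
      # $3 is the text between the quoted request line and the next quoted
      # field, e.g. " 503 UC 0 95 0 - "
      split($3, f, " ")
      if (f[1] == "503") counts[f[2]]++
    }
    END { for (flag in counts) print counts[flag], flag }
  ' | sort -rn
}
```

One possible use is `kubectl logs <envoy-pod> -c istio-proxy | count_503_flags`, which prints the most frequent flag first.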

Why This Happens

1. Service not running - Upstream service crashed or was never started
2. Wrong port - Incorrect port in cluster configuration
3. Health check failing - Endpoints marked unhealthy
4. DNS resolution - Service name not resolving
5. Firewall blocking - Network policies blocking traffic
6. TLS mismatch - Certificate or protocol mismatch

Step 1: Check Envoy Configuration

```bash
# Get the full Envoy config dump
curl -s http://localhost:9901/config_dump

# Check clusters
curl -s http://localhost:9901/config_dump | jq '.configs[] | select(."@type" | endswith("ClustersConfigDump"))'

# Dump only dynamic clusters (resource must name a repeated config_dump field)
curl -s 'http://localhost:9901/config_dump?resource=dynamic_active_clusters'

# Dump only dynamic listeners
curl -s 'http://localhost:9901/config_dump?resource=dynamic_listeners'

# Check routes
curl -s http://localhost:9901/config_dump | jq '.configs[] | select(."@type" | endswith("RoutesConfigDump"))'

# For Istio, query the sidecar's admin API through pilot-agent
kubectl exec <pod-name> -c istio-proxy -- pilot-agent request GET config_dump
```

Step 2: Check Cluster Health

```bash
# Check cluster status
curl -s http://localhost:9901/clusters

# Output shows:
# api-service::10.0.0.2:8080::health_flags::/failed_active_hc
# api-service::10.0.0.2:8080::weight::1

# Look for health flags:
# - failed_active_hc: active health check failing
# - failed_outlier_check: outlier detection ejected the host

# Check endpoint health
curl -s http://localhost:9901/clusters | grep api-service

# Get cluster details (each entry wraps the config under .cluster)
curl -s http://localhost:9901/config_dump | jq '.configs[] | .dynamic_active_clusters[]? | select(.cluster.name == "api-service")'

# Check outlier detection settings
curl -s http://localhost:9901/config_dump | jq '.configs[] | .dynamic_active_clusters[]? | .cluster.outlier_detection'
```
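Scanning the `/clusters` output by eye gets tedious with many endpoints. The sketch below filters it down to hosts whose `health_flags` value is anything other than `healthy`. It assumes the `cluster::host::health_flags::value` line layout shown in the sample output above and does not handle IPv6 host addresses, which contain `::` themselves. `unhealthy_hosts` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# List hosts whose health_flags value is anything other than "healthy".
# Reads /clusters output on stdin; prints "<cluster> <host> <flags>" per host.
unhealthy_hosts() {
  awk -F'::' '$3 == "health_flags" && $4 != "healthy" { print $1, $2, $4 }'
}
```

It could be fed directly from the admin endpoint, for example `curl -s http://localhost:9901/clusters | unhealthy_hosts`. Empty output means no host is currently flagged.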

Step 3: Check Service Endpoints

```bash
# For Kubernetes:

# Check service exists
kubectl get svc api-service

# Check endpoints
kubectl get endpoints api-service

# Check pods backing service
kubectl get pods -l app=api-service

# Check pod is running
kubectl describe pod <pod-name>

# Check pod listening on port
kubectl exec <pod-name> -- netstat -tlnp | grep 8080

# Port forward to test
kubectl port-forward svc/api-service 8080:8080 &
curl http://localhost:8080/health

# Check service selector matches pods
kubectl get svc api-service -o yaml | grep -A 5 selector
kubectl get pods -l app=api-service
```

Step 4: Check Health Checks

```yaml
# Envoy health check configuration
clusters:
- name: api-service
  connect_timeout: 5s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  health_checks:
  - timeout: 5s
    interval: 10s
    unhealthy_threshold: 3
    healthy_threshold: 1
    http_health_check:
      path: "/health"
      expected_statuses:
      - start: 200
        end: 300  # end is exclusive, so this accepts 200-299
```

```bash
# Verify health check endpoint
curl http://api-service:8080/health

# Check health check stats
curl -s 'http://localhost:9901/stats?filter=cluster.api-service.health_check'

# Stats:
# cluster.api-service.health_check.healthy
# cluster.api-service.health_check.failure
```

Step 5: Check Network Connectivity

```bash
# Test connectivity from Envoy to upstream
kubectl exec -it <envoy-pod> -- nc -zv api-service 8080

# Check DNS resolution
kubectl exec -it <envoy-pod> -- nslookup api-service

# Test with curl
kubectl exec -it <envoy-pod> -- curl -v http://api-service:8080/health

# Check network policies
kubectl get networkpolicies -A

# Check firewall (on host)
iptables -L -n | grep 8080

# Allow traffic
iptables -I INPUT -p tcp --dport 8080 -j ACCEPT

# Check for IP conflicts
arp -a | grep <upstream-ip>

# Test from envoy container
kubectl exec -it <envoy-pod> -- ping api-service
```

Step 6: Check TLS Configuration

```bash
# If using TLS upstream:

# Check TLS context
curl -s http://localhost:9901/config_dump | jq '.configs[] | .dynamic_active_clusters[]? | .cluster.transport_socket'

# Verify certificates
kubectl exec <envoy-pod> -- openssl s_client -connect api-service:8443 -showcerts </dev/null

# Test TLS handshake (-k skips certificate verification)
kubectl exec <envoy-pod> -- curl -k https://api-service:8443/health

# Common TLS issues:
# - Certificate expired
# - Wrong CA
# - SNI mismatch
# - Protocol version mismatch

# Check certificate expiry
kubectl exec <envoy-pod> -- openssl s_client -connect api-service:8443 </dev/null 2>/dev/null | openssl x509 -noout -dates
```

Step 7: Check Connection Timeout

```yaml
# Envoy cluster timeout configuration
clusters:
- name: api-service
  connect_timeout: 5s  # Increase if connections are slow to establish
  # For slow upstreams:
  # connect_timeout: 30s

# Per-route timeout
routes:
- match:
    prefix: "/api"
  route:
    cluster: api-service
    timeout: 30s
```

```bash
# Check timeout stats
curl -s 'http://localhost:9901/stats?filter=cluster.api-service.upstream_cx_connect_timeout'

# Check if timeout is the issue:
# - If the 503 appears immediately, the connection was refused or reset (flags UF or UC)
# - If it appears only after connect_timeout elapses, the upstream is unreachable
```

Step 8: Check Outlier Detection

```yaml
# Outlier detection configuration
clusters:
- name: api-service
  outlier_detection:
    consecutive_5xx: 5
    interval: 30s
    base_ejection_time: 30s
    max_ejection_percent: 50
```

```bash
# Check outlier status
curl -s http://localhost:9901/clusters | grep outlier

# Stats
curl -s 'http://localhost:9901/stats?filter=outlier_detection'

# Reset outliers:
# Ejected hosts return automatically once their ejection time expires
# (base_ejection_time multiplied by the number of consecutive ejections)

# Temporarily disable for debugging:
# Remove the outlier_detection section from the config
```
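When deciding whether to wait for an ejected host to come back, it helps to estimate how long the ejection lasts. The sketch below rests on two assumptions drawn from my reading of the Envoy outlier detection documentation rather than from this article: the ejection time is `base_ejection_time` multiplied by the number of consecutive ejections, and it is capped by `max_ejection_time`, taken here to default to 300s or `base_ejection_time`, whichever is larger. `ejection_seconds` is a hypothetical helper name, and real behaviour can vary by Envoy version.

```shell
#!/usr/bin/env bash
# Estimate how long a host stays ejected, in whole seconds.
# Args: base_ejection_time, consecutive ejection count, optional cap.
# The 300s default cap is an assumption about max_ejection_time.
ejection_seconds() {
  local base=$1 count=$2 cap=${3:-300}
  # Assumed rule: the cap is never smaller than the base ejection time.
  if (( cap < base )); then cap=$base; fi
  local total=$(( base * count ))
  if (( total > cap )); then total=$cap; fi
  echo "$total"
}
```

With the `base_ejection_time: 30s` from the configuration above, `ejection_seconds 30 3` estimates the duration of a third consecutive ejection.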

Step 9: Check Envoy Logs

```bash
# Check Envoy access logs
kubectl logs <envoy-pod> -c istio-proxy | grep 503

# Check Envoy error logs (-E is needed for the | alternation)
kubectl logs <envoy-pod> -c istio-proxy | grep -iE "error|warning"

# Enable debug logging
curl -X POST 'http://localhost:9901/logging?level=debug'

# Watch logs in real-time
kubectl logs -f <envoy-pod> -c istio-proxy | grep -i upstream

# Common log patterns:
# - "upstream connect error"
# - "connection refused"
# - "health check failed"
# - "no healthy upstream"

# For Istio:
istioctl proxy-config cluster <pod-name> -n <namespace>
istioctl proxy-config endpoint <pod-name> -n <namespace>
```

Step 10: Monitor Envoy Metrics

```bash
# Key Envoy metrics
curl -s http://localhost:9901/stats/prometheus | grep -E "upstream_cx|cluster.*api"

# Important metrics:
# - envoy_cluster_upstream_cx_connect_fail
# - envoy_cluster_upstream_cx_connect_timeout
# - envoy_cluster_health_check_failure
# - envoy_cluster_upstream_rq_5xx

# Create monitoring script
cat << 'EOF' > /usr/local/bin/monitor-envoy.sh
#!/bin/bash

echo "=== Cluster Status ==="
curl -s http://localhost:9901/clusters

echo ""
echo "=== Upstream Connection Errors ==="
curl -s http://localhost:9901/stats | grep upstream_cx_connect_fail

echo ""
echo "=== Health Check Status ==="
curl -s http://localhost:9901/stats | grep health_check

echo ""
echo "=== 5xx Responses ==="
curl -s http://localhost:9901/stats | grep "_5xx"

echo ""
echo "=== Active Connections ==="
curl -s http://localhost:9901/stats | grep upstream_cx_active
EOF

chmod +x /usr/local/bin/monitor-envoy.sh
```

```yaml
# Prometheus alerts
- alert: EnvoyUpstreamConnectionFailed
  expr: rate(envoy_cluster_upstream_cx_connect_fail[5m]) > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Envoy upstream connection failures"

- alert: EnvoyNoHealthyUpstream
  expr: envoy_cluster_health_check_healthy == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Envoy has no healthy upstream hosts"
```
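The monitoring script prints one counter line per cluster, which is hard to compare between runs. As a rough sketch, the helper below adds up a counter across all clusters, assuming the plain-text `/stats` layout of `name: value` per line. `sum_stat` is a hypothetical name, not an Envoy tool.

```shell
#!/usr/bin/env bash
# Sum one counter across all clusters from plain-text /stats output.
# Reads "name: value" lines on stdin; $1 is the stat name suffix to match.
sum_stat() {
  local suffix=$1
  awk -F': ' -v s="$suffix" '
    # Match only stat names that end with the requested suffix.
    substr($1, length($1) - length(s) + 1) == s { sum += $2 }
    END { print sum + 0 }
  '
}
```

For example, `curl -s http://localhost:9901/stats | sum_stat upstream_cx_connect_fail` gives a single number that can be recorded before and after a fix.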

Envoy Upstream Connection Checklist

| Check | Command | Expected |
|---|---|---|
| Cluster config | `config_dump` | Correct hosts |
| Endpoints | `clusters` | Healthy |
| Health check | `/health` endpoint | 200 OK |
| Connectivity | `nc -zv` | Connected |
| DNS | `nslookup` | Resolved |
| TLS | `openssl s_client` | Valid cert |

Verify the Fix

```bash
# After fixing upstream connection

# 1. Check cluster healthy
curl -s http://localhost:9901/clusters | grep api-service
# Expect: no failed_active_hc

# 2. Test upstream request
curl -s http://localhost:9901/stats | grep upstream_rq_200
# Expect: requests succeeding

# 3. Check no connection errors
curl -s http://localhost:9901/stats | grep connect_fail
# Expect: 0 failures

# 4. Verify health checks
curl -s http://localhost:9901/stats | grep health_check.healthy
# Expect: > 0

# 5. Test from client
curl http://api-gateway/api/users
# Expect: response 200

# 6. Monitor ongoing
/usr/local/bin/monitor-envoy.sh
# Expect: no errors
```

  • [Fix HAProxy Backend Down](/articles/fix-haproxy-backend-down)
  • [Fix Traefik Routing Not Working](/articles/fix-traefik-routing-not-working)
  • [Fix Kubernetes Service Not Accessible](/articles/fix-kubernetes-service-not-accessible)