What's Actually Happening
Envoy proxy cannot connect to upstream services. Requests fail with connection refused errors, breaking service-to-service communication.
The Error You'll See
Upstream connection failed:

```
[2024-01-01T00:00:00.000Z] "GET /api/users HTTP/1.1" 503 UC 0 95 0 - "10.0.0.1" "Mozilla/5.0" "abc123" "api-service" "10.0.0.2:8080"
```

Connection refused:

```json
{
  "response_code": 503,
  "response_flags": "UC",
  "upstream_service": "api-service",
  "upstream_host": "10.0.0.2:8080",
  "error": "upstream connect error or disconnect/reset before headers"
}
```

No healthy upstream:

```
upstream connect error or disconnect/reset before headers. reset reason: connection refused
```

Why This Happens
1. Service not running - Upstream service crashed or never started
2. Wrong port - Incorrect port in the cluster configuration
3. Health check failing - Endpoints marked unhealthy
4. DNS resolution - Service name not resolving
5. Firewall blocking - Network policies blocking traffic
6. TLS mismatch - Certificate or protocol mismatch
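The causes above map onto the response flags Envoy writes into its access log (the `UC` in the example line above). The flag meanings below paraphrase Envoy's documented response flags; `decode_flag` is a helper of our own for quick triage, not an Envoy tool:

```bash
# Translate an Envoy access-log response flag into its documented cause.
decode_flag() {  # $1 = response_flags value from the access log
  case "$1" in
    UC)  echo "Upstream connection termination" ;;
    UF)  echo "Upstream connection failure (e.g. connection refused)" ;;
    UH)  echo "No healthy upstream hosts in the cluster" ;;
    UT)  echo "Upstream request timeout" ;;
    URX) echo "Retry or max-connect-attempts limit exceeded" ;;
    *)   echo "See the Envoy access log documentation for $1" ;;
  esac
}

decode_flag UC   # the flag from the example log line above
```

`UC`/`UF` point at causes 1, 2 and 5 above; `UH` points at causes 3 and 4.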
Step 1: Check Envoy Configuration
```bash
# Get the full Envoy config dump
curl http://localhost:9901/config_dump

# Show only the clusters section
curl -s http://localhost:9901/config_dump | jq '.configs[] | select(."@type" | contains("ClustersConfigDump"))'

# Dump only the active cluster config
curl "http://localhost:9901/config_dump?resource=dynamic_active_clusters"

# Dump only the listener config
curl "http://localhost:9901/config_dump?resource=dynamic_listeners"

# Show only the routes section
curl -s http://localhost:9901/config_dump | jq '.configs[] | select(."@type" | contains("RoutesConfigDump"))'

# For Istio, dump the sidecar's config from inside the workload pod
kubectl exec <pod-name> -c istio-proxy -- pilot-agent request GET config_dump
```
Step 2: Check Cluster Health
```bash
# Check cluster status
curl http://localhost:9901/clusters

# Output lines look like:
#   api-service::10.0.0.2:8080::health_flags::/failed_active_hc
#   api-service::10.0.0.2:8080::weight::1

# Health flags to look for:
#   failed_active_hc     - active health check failing
#   failed_outlier_check - outlier detection ejected the host

# Check endpoint health for one cluster
curl -s http://localhost:9901/clusters | grep api-service

# Get endpoint details from the config dump
curl -s http://localhost:9901/config_dump | jq '.configs[] | .dynamic_active_clusters[]? | select(.cluster.name == "api-service")'

# Check outlier detection settings
curl -s http://localhost:9901/config_dump | jq '.configs[] | .dynamic_active_clusters[]? | .cluster.outlier_detection'
```
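Scanning `/clusters` output by eye gets tedious with many endpoints; the `health_flags` lines can be filtered mechanically. A sketch using awk against a made-up sample (in practice, pipe `curl -s http://localhost:9901/clusters` in instead):

```bash
# Illustrative sample of /clusters output (not captured from a live proxy)
sample='api-service::10.0.0.2:8080::health_flags::/failed_active_hc
api-service::10.0.0.3:8080::health_flags::healthy
api-service::10.0.0.2:8080::weight::1'

# Print host:port of every endpoint whose health_flags are not "healthy"
echo "$sample" | awk -F'::' '$3 == "health_flags" && $4 != "healthy" { print $2 }'
```

Here the filter prints `10.0.0.2:8080`, the endpoint failing its active health check.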
Step 3: Check Service Endpoints
```bash
# For Kubernetes:

# Check the Service exists
kubectl get svc api-service

# Check the Endpoints object has addresses
kubectl get endpoints api-service

# Check the pods backing the Service
kubectl get pods -l app=api-service

# Check a pod is running
kubectl describe pod <pod-name>

# Check the pod is listening on the expected port
kubectl exec <pod-name> -- netstat -tlnp | grep 8080

# Port-forward to test the service directly
kubectl port-forward svc/api-service 8080:8080 &
curl http://localhost:8080/health

# Check the Service selector matches the pod labels
kubectl get svc api-service -o yaml | grep -A 5 selector
kubectl get pods -l app=api-service
```
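The selector-vs-labels comparison in the last two commands is the step most often gotten wrong. The rule: every `key=value` pair in the Service selector must appear among the pod's labels. A toy sketch of that rule (the sample data is made up; in practice pull both sides with `kubectl ... -o jsonpath`):

```bash
# Toy selector match check (illustrative data)
selector="app=api-service"
pod_labels="app=api-service,version=v1"

matches=true
IFS=','
for kv in $selector; do
  case ",$pod_labels," in
    *",$kv,"*) : ;;         # this selector pair is present in the labels
    *) matches=false ;;     # missing -> the Service will not select this pod
  esac
done
unset IFS
echo "selector matches pod: $matches"
```

Extra labels on the pod (like `version=v1` here) are fine; a selector key the pod lacks means an empty Endpoints object.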
Step 4: Check Health Checks
```yaml
# Envoy health check configuration
clusters:
  - name: api-service
    connect_timeout: 5s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    health_checks:
      - timeout: 5s
        interval: 10s
        unhealthy_threshold: 3
        healthy_threshold: 1
        http_health_check:
          path: "/health"
          expected_statuses:
            - start: 200
              end: 300   # the range end is exclusive, so this accepts 200-299
```

```bash
# Verify the health check endpoint responds
curl http://api-service:8080/health

# Check health check stats
curl "http://localhost:9901/stats?filter=cluster.api-service.health_check"

# Relevant counters:
#   cluster.api-service.health_check.healthy
#   cluster.api-service.health_check.unhealthy
```
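Envoy marks a host unhealthy after `unhealthy_threshold` consecutive failures and healthy again after `healthy_threshold` consecutive passes. A toy sketch of that state machine (not Envoy code), mirroring the `unhealthy_threshold: 3` / `healthy_threshold: 1` values above:

```bash
# Toy model of Envoy's active health-check thresholds (illustrative only)
UNHEALTHY_THRESHOLD=3
state=healthy
fails=0

record_probe() {  # $1 = pass|fail
  if [ "$1" = fail ]; then
    fails=$((fails + 1))
    if [ "$fails" -ge "$UNHEALTHY_THRESHOLD" ]; then state=unhealthy; fi
  else
    fails=0
    state=healthy   # healthy_threshold: 1 -> a single pass recovers the host
  fi
}

record_probe fail; record_probe fail
echo "$state"   # two failures: still healthy
record_probe fail
echo "$state"   # third consecutive failure: unhealthy
record_probe pass
echo "$state"   # one pass: healthy again
```

One implication: an endpoint that fails two probes and then recovers is never marked unhealthy, which is why intermittent failures can hide from `/clusters`.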
Step 5: Check Network Connectivity
```bash
# Test connectivity from Envoy to the upstream
kubectl exec -it <envoy-pod> -- nc -zv api-service 8080

# Check DNS resolution
kubectl exec -it <envoy-pod> -- nslookup api-service

# Test with curl
kubectl exec -it <envoy-pod> -- curl -v http://api-service:8080/health

# Check network policies
kubectl get networkpolicies -A

# Check the host firewall
iptables -L -n | grep 8080

# Allow traffic if blocked
iptables -I INPUT -p tcp --dport 8080 -j ACCEPT

# Check for IP conflicts
arp -a | grep <upstream-ip>

# Test basic reachability from the Envoy container
kubectl exec -it <envoy-pod> -- ping api-service
```
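Minimal sidecar images often ship without `nc` or `ping`. Bash's built-in `/dev/tcp` pseudo-device is a fallback that needs nothing extra; a sketch (host and port are placeholders):

```bash
# TCP reachability check using bash's /dev/tcp (no nc required).
# Exit status 0 means the port accepted a connection.
check_port() {  # $1 = host, $2 = port
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

if check_port api-service 8080; then   # placeholder host/port
  echo "reachable"
else
  echo "connection refused, unresolvable, or timed out"
fi
```

An instant failure usually means connection refused (wrong port, service down); hitting the 2-second timeout points at routing or a network policy dropping packets.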
Step 6: Check TLS Configuration
```bash
# If using TLS to the upstream:

# Check the cluster's TLS transport socket
curl -s http://localhost:9901/config_dump | jq '.configs[] | .dynamic_active_clusters[]? | .cluster.transport_socket'

# Verify certificates
kubectl exec <envoy-pod> -- openssl s_client -connect api-service:8443 -showcerts

# Test the TLS handshake
kubectl exec <envoy-pod> -- curl -k https://api-service:8443/health

# Common TLS issues:
#   - Certificate expired
#   - Wrong CA
#   - SNI mismatch
#   - Protocol version mismatch

# Check certificate expiry
kubectl exec <envoy-pod> -- openssl s_client -connect api-service:8443 2>/dev/null | openssl x509 -noout -dates
```
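To turn the `notAfter` value into a days-remaining number, the date arithmetic can be done in shell. A sketch assuming GNU `date` (the `notAfter` string is an illustrative value; in practice take it from the `openssl x509 -noout -dates` output above):

```bash
# Days until certificate expiry (GNU date syntax)
not_after="Jan  1 00:00:00 2030 GMT"   # illustrative; from `openssl x509 -noout -enddate`

expiry_epoch=$(date -d "$not_after" +%s)
now_epoch=$(date +%s)
days_left=$(( (expiry_epoch - now_epoch) / 86400 ))

echo "certificate expires in ${days_left} days"
if [ "$days_left" -lt 30 ]; then
  echo "WARNING: renew soon"
fi
```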
Step 7: Check Connection Timeout
```yaml
# Envoy cluster timeout configuration
clusters:
  - name: api-service
    connect_timeout: 5s   # increase if connections are slow to establish

# For slow upstreams:
#   connect_timeout: 30s

# Per-route request timeout
routes:
  - match:
      prefix: "/api"
    route:
      cluster: api-service
      timeout: 30s
```

```bash
# Check connect timeout stats
curl "http://localhost:9901/stats?filter=cluster.api-service.upstream_cx_connect_timeout"

# Check whether timeout is the issue:
#   - UC flag appears immediately -> connection refused
#   - timeout occurs              -> network unreachable
```
Step 8: Check Outlier Detection
```yaml
# Outlier detection configuration
clusters:
  - name: api-service
    outlier_detection:
      consecutive_5xx: 5
      interval: 30s
      base_ejection_time: 30s
      max_ejection_percent: 50
```

```bash
# Check outlier status
curl -s http://localhost:9901/clusters | grep outlier

# Outlier stats
curl "http://localhost:9901/stats?filter=outlier_detection"

# Resetting ejections:
#   ejected hosts are re-admitted automatically after base_ejection_time

# Temporarily disable for debugging:
#   remove the outlier_detection section from the cluster config
```
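How `consecutive_5xx` ejection behaves can be sketched as a counter that any non-5xx response resets (a toy model of the idea, not Envoy internals; it mirrors `consecutive_5xx: 5` from the config above):

```bash
# Toy model of outlier detection's consecutive_5xx ejection (illustrative only)
CONSECUTIVE_5XX=5
errors=0
ejected=false

observe() {  # $1 = HTTP status code of one upstream response
  case "$1" in
    5*) errors=$((errors + 1))
        if [ "$errors" -ge "$CONSECUTIVE_5XX" ]; then ejected=true; fi ;;
    *)  errors=0 ;;   # any success resets the consecutive counter
  esac
}

# Two 503s, one 200, then five 503s in a row
for code in 503 503 200 503 503 503 503 503; do
  observe "$code"
done
echo "ejected=$ejected"   # the 200 reset the count, but 5 consecutive 503s still ejected the host
```

This is why a host that returns occasional 5xx mixed with successes is never ejected, while a fully dead host is ejected within five requests.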
Step 9: Check Envoy Logs
```bash
# Check Envoy access logs
kubectl logs <envoy-pod> -c istio-proxy | grep 503

# Check Envoy error logs
kubectl logs <envoy-pod> -c istio-proxy | grep -iE "error|warning"

# Enable debug logging
curl -X POST "http://localhost:9901/logging?level=debug"

# Watch logs in real time
kubectl logs -f <envoy-pod> -c istio-proxy | grep -i upstream

# Common log patterns:
#   - "upstream connect error"
#   - "connection refused"
#   - "health check failed"
#   - "no healthy upstream"

# For Istio:
istioctl proxy-config cluster <pod-name> -n <namespace>
istioctl proxy-config endpoint <pod-name> -n <namespace>
```
Step 10: Monitor Envoy Metrics
```bash
# Key Envoy metrics
curl -s http://localhost:9901/stats/prometheus | grep -E "upstream_cx|cluster.*api"

# Important metrics:
#   - envoy_cluster_upstream_cx_connect_fail
#   - envoy_cluster_upstream_cx_connect_timeout
#   - envoy_cluster_health_check_failure
#   - envoy_cluster_upstream_rq_5xx

# Create a monitoring script
cat << 'EOF' > /usr/local/bin/monitor-envoy.sh
#!/bin/bash
echo "=== Cluster Status ==="
curl -s http://localhost:9901/clusters

echo ""
echo "=== Upstream Connection Errors ==="
curl -s http://localhost:9901/stats | grep upstream_cx_connect_fail

echo ""
echo "=== Health Check Status ==="
curl -s http://localhost:9901/stats | grep health_check

echo ""
echo "=== 5xx Responses ==="
curl -s http://localhost:9901/stats | grep "_5xx"

echo ""
echo "=== Active Connections ==="
curl -s http://localhost:9901/stats | grep upstream_cx_active
EOF

chmod +x /usr/local/bin/monitor-envoy.sh
```

```yaml
# Prometheus alerts
- alert: EnvoyUpstreamConnectionFailed
  expr: rate(envoy_cluster_upstream_cx_connect_fail[5m]) > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Envoy upstream connection failures"

- alert: EnvoyNoHealthyUpstream
  expr: envoy_cluster_health_check_healthy == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Envoy has no healthy upstream hosts"
```
Envoy Upstream Connection Checklist
| Check | Command | Expected |
|---|---|---|
| Cluster config | config_dump | Correct hosts |
| Endpoints | clusters | Healthy |
| Health check | /health endpoint | 200 OK |
| Connectivity | nc -zv | Connected |
| DNS | nslookup | Resolved |
| TLS | openssl s_client | Valid cert |
Verify the Fix
```bash
# After fixing the upstream connection:

# 1. Check the cluster is healthy (expect no failed_active_hc flag)
curl -s http://localhost:9901/clusters | grep api-service

# 2. Confirm upstream requests are succeeding
curl -s http://localhost:9901/stats | grep upstream_rq_200

# 3. Check there are no new connection failures (expect 0)
curl -s http://localhost:9901/stats | grep connect_fail

# 4. Verify health checks are passing (expect healthy count > 0)
curl -s http://localhost:9901/stats | grep health_check.healthy

# 5. Test from a client (expect HTTP 200)
curl http://api-gateway/api/users

# 6. Keep monitoring
/usr/local/bin/monitor-envoy.sh
```
Related Issues
- [Fix HAProxy Backend Down](/articles/fix-haproxy-backend-down)
- [Fix Traefik Routing Not Working](/articles/fix-traefik-routing-not-working)
- [Fix Kubernetes Service Not Accessible](/articles/fix-kubernetes-service-not-accessible)