Introduction
Envoy proxy clusters become unhealthy when endpoints fail health checks, connectivity problems prevent connections, or configuration errors break cluster definitions. When a cluster has no healthy hosts, Envoy cannot route traffic to that destination, and requests fail with 503 errors or connection failures. Both active health checking and outlier detection can take hosts out of rotation, so each needs to be tuned to how the backend actually behaves.
Symptoms
Error messages in Envoy logs:
[health_check][C0] health check failure for endpoint 10.0.0.1:8080
Cluster "backend" has no healthy hosts
Upstream cluster unhealthy, no endpoints available
Health check active timeout
Observable indicators:
- Envoy admin interface shows cluster health 0%
- Upstream requests failing with 503
- cluster.backend.membership_healthy at 0 while cluster.backend.membership_total is non-zero
- Health check failure metrics increasing
- Outlier detection ejecting endpoints
Common Causes
1. Health check path wrong - Endpoint path returns 404 or wrong status
2. Health check timeout too short - Backend slower than configured timeout
3. Network connectivity blocked - Firewall or routing preventing connections
4. EDS/CDS updates delayed - Control plane not updating endpoints
5. Outlier detection too aggressive - Legitimate endpoints ejected
6. TLS configuration mismatch - HTTPS endpoints with HTTP health check
7. Cluster configuration error - Wrong endpoint type or discovery mechanism
Step-by-Step Fix
Step 1: Check Cluster Status via Admin Interface
```bash
# Access the Envoy admin interface (typically port 9901)
curl http://localhost:9901/clusters

# Get specific cluster info (quote the URL so the shell leaves "?" alone)
curl -s "http://localhost:9901/clusters?format=json" | jq '.cluster_statuses[] | select(.name=="backend")'

# Check the health flags of the cluster's hosts
curl -s http://localhost:9901/clusters | grep backend

# List all endpoints
curl -s http://localhost:9901/clusters | grep -A10 "backend::"
```
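The text output of `/clusters` can also be summarized programmatically. The following is a minimal sketch, assuming host lines have the form `cluster::address::health_flags::value`, where the value is `healthy` or a list of flags such as `/failed_active_hc`. The sample text, host addresses, and function names are illustrative, not captured from a real Envoy.

```python
from collections import defaultdict

# Illustrative sample of `/clusters` text output; hosts and values are made up.
SAMPLE = """\
backend::default_priority::max_connections::1024
backend::10.0.0.1:8080::cx_active::0
backend::10.0.0.1:8080::health_flags::/failed_active_hc
backend::10.0.0.2:8080::cx_active::3
backend::10.0.0.2:8080::health_flags::healthy
"""


def summarize_health(clusters_text):
    """Map cluster name -> {host address: health_flags value}."""
    summary = defaultdict(dict)
    for line in clusters_text.splitlines():
        cluster, _, rest = line.partition("::")
        # Split from the right so IPv6 addresses containing "::" stay intact.
        fields = rest.rsplit("::", 2)
        if len(fields) == 3 and fields[1] == "health_flags":
            host, _, flags = fields
            summary[cluster][host] = flags
    return dict(summary)


def healthy_hosts(summary, cluster):
    """Return the sorted hosts of a cluster whose flags are exactly 'healthy'."""
    hosts = summary.get(cluster, {})
    return sorted(h for h, flags in hosts.items() if flags == "healthy")
```

Calling `healthy_hosts(summarize_health(text), "backend")` on real admin output returns an empty list when the cluster has no healthy hosts, which is the condition behind the 503 responses described above.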
Step 2: Check Health Check Status
```bash
# View health check statistics (entries without a name, such as histograms, are skipped)
curl -s "http://localhost:9901/stats?format=json" | jq '.stats[] | select((.name // "") | contains("health_check"))'

# Check health check failures
curl -s http://localhost:9901/stats | grep health_check

# View outlier detection stats
curl -s http://localhost:9901/stats | grep outlier_detection
```
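The health check counters hint at which cause from the list above applies. The sketch below assumes the plain-text `/stats` format of `name: value` lines and assumes that `health_check.failure` counts network failures as well as failed responses, so the difference from `health_check.network_failure` approximates response-level failures. The sample values and the 50% cut-off are arbitrary illustrations.

```python
# Illustrative `/stats` excerpt; the numbers are made up.
SAMPLE_STATS = """\
cluster.backend.health_check.attempt: 120
cluster.backend.health_check.success: 30
cluster.backend.health_check.failure: 90
cluster.backend.health_check.network_failure: 85
cluster.backend.membership_healthy: 0
cluster.backend.membership_total: 2
cluster.backend.upstream_rq_time: No recorded values
"""


def parse_stats(text):
    """Parse 'name: value' lines, keeping only integer-valued stats."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats


def diagnose(stats, cluster):
    """Give a rough hint about why health checks for a cluster are failing."""
    prefix = f"cluster.{cluster}.health_check."
    failures = stats.get(prefix + "failure", 0)
    network = stats.get(prefix + "network_failure", 0)
    if failures == 0:
        return "no health check failures recorded"
    if network * 2 >= failures:  # arbitrary threshold: at least half are network errors
        return "mostly network failures: check connectivity, TLS, and timeout"
    return "mostly response failures: check path and expected_statuses"
```

With the sample above, `diagnose(parse_stats(SAMPLE_STATS), "backend")` points at the network, which would make Step 3 the next thing to check.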
Step 3: Check Backend Connectivity
```bash
# Test endpoint directly from Envoy container/host
curl -v http://10.0.0.1:8080/health

# Test TCP connectivity
nc -zv 10.0.0.1 8080

# Check response timing
curl -w "TTFB: %{time_starttransfer}s Total: %{time_total}s\n" \
  http://10.0.0.1:8080/health
```
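The same checks can be scripted so that the measured latency is compared directly against the health check timeout. This is a rough sketch using only the Python standard library. `probe`, `classify`, and `hc_timeout_s` are hypothetical names, and `classify` assumes the default policy of accepting only 2xx responses. Note that `urlopen` follows redirects, which Envoy's health checker is not assumed to do.

```python
import time
import urllib.error
import urllib.request


def probe(url, client_timeout_s=30.0):
    """Return (status, elapsed_seconds); status is None when no response arrived."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=client_timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx/5xx responses still prove the host is reachable
    except OSError:
        status = None  # refused, unreachable, DNS failure, or client timeout
    return status, time.monotonic() - start


def classify(status, elapsed_s, hc_timeout_s):
    """Compare one probe result with the configured health check timeout."""
    if status is None:
        return "unreachable"
    if elapsed_s >= hc_timeout_s:
        return "too slow"
    if not 200 <= status < 300:
        return "unexpected status"
    return "ok"
```

For example, `classify(*probe("http://10.0.0.1:8080/health"), hc_timeout_s=5.0)` returns `"too slow"` when the backend answers correctly but takes longer than a 5s health check timeout.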
Step 4: Review Health Check Configuration
```yaml
# Static configuration
static_resources:
  clusters:
  - name: backend
    type: STATIC
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: backend
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 10.0.0.1
                port_value: 8080
    health_checks:
    - timeout: 5s
      interval: 10s
      unhealthy_threshold: 3
      healthy_threshold: 2
      http_health_check:
        path: "/health"
        expected_statuses:
        - start: 200
          end: 300
```
Step 5: Fix Health Check Configuration
```yaml
# Robust health check configuration
health_checks:
- timeout: 10s                    # Allow more time for slow responses
  interval: 15s                   # Check interval
  unhealthy_threshold: 3          # Failures needed to mark unhealthy
  healthy_threshold: 2            # Successes needed to mark healthy
  http_health_check:
    path: "/healthz"              # Correct path
    host: "backend.example.com"   # Host header sent with the health check
    expected_statuses:
    - start: 200
      end: 400                    # Accept 200-399 (end is exclusive)
```
Step 6: Fix TLS Health Checks
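```yaml
# HTTPS health check: http_health_check has no scheme or tls_config field.
# Health checks reuse the cluster's transport socket, so configure TLS on the
# cluster itself (the fields below belong to the cluster definition).
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    sni: backend.example.com
    common_tls_context:
      validation_context:
        trusted_ca:
          inline_string: |
            -----BEGIN CERTIFICATE-----
            ...
            -----END CERTIFICATE-----
health_checks:
- timeout: 10s
  interval: 15s
  unhealthy_threshold: 3
  healthy_threshold: 2
  http_health_check:
    path: "/health"

# Or use a connect-only TCP health check. An empty tcp_health_check sends no
# payload; a send.text payload would have to be hex encoded, not raw HTTP text.
health_checks:
- timeout: 10s
  interval: 15s
  unhealthy_threshold: 3
  healthy_threshold: 2
  tcp_health_check: {}
```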
Step 7: Adjust Outlier Detection
```yaml
# Prevent aggressive ejection
cluster:
  name: backend
  outlier_detection:
    consecutive_5xx: 10             # More failures before ejection
    interval: 30s
    base_ejection_time: 30s
    max_ejection_percent: 50        # Upper bound on ejected hosts (default 10); lower values keep more hosts in rotation
    enforcing_consecutive_5xx: 100
    enforcing_success_rate: 50      # Reduce success rate enforcement
```
Step 8: Check EDS Updates (Dynamic Config)
```bash
# For dynamic xDS configuration
# Check whether Envoy is connected to the control plane (1 = connected)
curl -s http://localhost:9901/stats | grep control_plane.connected_state

# Check cluster dynamic config
curl -s http://localhost:9901/config_dump | grep -A20 "dynamic_active_clusters"

# View EDS endpoint updates (endpoints are only dumped with include_eds)
curl -s "http://localhost:9901/config_dump?include_eds" | jq '.configs[] | select(."@type" | endswith("EndpointsConfigDump"))'

# Check xDS update failures and rejected updates
curl -s http://localhost:9901/stats | grep -E "update_failure|update_rejected"
```
Step 9: Verify the Fix
```bash
# Check cluster health after changes
curl -s http://localhost:9901/clusters | grep backend

# Monitor health check stats
curl -s http://localhost:9901/stats | grep -E "backend.*health"

# Test through Envoy
curl -v http://envoy-listener:8080/

# Watch endpoint status
watch -n 1 'curl -s http://localhost:9901/clusters | grep backend'
```
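In a deployment script, the manual `watch` loop can be replaced by a bounded wait. This is a sketch only: `fetch` is a hypothetical callable returning the `/clusters` text (for instance a thin wrapper around `urllib.request.urlopen`), and the parsing assumes host lines end in `::health_flags::healthy` when a host is healthy.

```python
import time

HEALTHY_SUFFIX = "::health_flags::healthy"


def healthy_hosts(clusters_text, cluster):
    """Extract addresses of hosts in `cluster` that are flagged healthy."""
    prefix = cluster + "::"
    hosts = []
    for line in clusters_text.splitlines():
        if line.startswith(prefix) and line.endswith(HEALTHY_SUFFIX):
            hosts.append(line[len(prefix):-len(HEALTHY_SUFFIX)])
    return hosts


def wait_until_healthy(fetch, cluster, timeout_s=60.0, poll_s=1.0, sleep=time.sleep):
    """Poll until the cluster has a healthy host or the timeout expires.

    Returns the healthy hosts seen on the last poll (empty list on timeout).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        hosts = healthy_hosts(fetch(), cluster)
        if hosts or time.monotonic() >= deadline:
            return hosts
        sleep(poll_s)
```

The timeout should comfortably exceed `interval` multiplied by `healthy_threshold`, since a previously failed host needs that many consecutive successful checks before it is marked healthy again.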
Advanced Diagnosis
Check Envoy Logs
```bash
# View Envoy logs (-E enables alternation; docker logs may write to stderr)
docker logs envoy-proxy 2>&1 | grep -Ei "health_check|cluster"

# Check for connection errors
docker logs envoy-proxy 2>&1 | grep -Ei "connection refused|timeout"

# Enable debug logging
curl -X POST "http://localhost:9901/logging?level=debug"

# Reset logging
curl -X POST "http://localhost:9901/logging?level=info"
```
Drain Endpoint Gracefully
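```bash
# Make this Envoy fail its own inbound health checks so that upstream load
# balancers drain it (requires the HTTP health check filter on the listener).
# The admin API has no per-endpoint drain; to take a single upstream endpoint
# out of rotation, remove it or mark it unhealthy via EDS or the static config.
curl -X POST http://localhost:9901/healthcheck/fail

# Check the state of a specific upstream endpoint
curl -s http://localhost:9901/clusters | grep "10.0.0.1"

# Resume passing inbound health checks
curl -X POST http://localhost:9901/healthcheck/ok
```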
Check Circuit Breaker
```bash
# Check circuit breaker stats
curl -s http://localhost:9901/stats | grep -E "upstream_cx_overflow|upstream_rq_pending_overflow|circuit_breakers"
```
```yaml
# Adjust circuit breaker
cluster:
  name: backend
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 1000
      max_pending_requests: 100
      max_requests: 1000
      max_retries: 3
```
Reset Outlier Detection
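```bash
# The admin API has no endpoint that un-ejects hosts. An ejected host returns
# on its own once its ejection time (base_ejection_time multiplied by the
# number of times it has been ejected) has elapsed.
# Reset stat counters to get a clean baseline while watching for new ejections
curl -X POST http://localhost:9901/reset_counters

# Or restart Envoy to reset all state
docker restart envoy-proxy
```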
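How long an ejected host stays out of rotation can be estimated from the outlier detection settings. The sketch below assumes the documented behavior that the ejection time equals `base_ejection_time` multiplied by the number of consecutive ejections, capped by `max_ejection_time` (assumed default 300s, or `base_ejection_time` if that is larger). It ignores jitter, and `ejection_seconds` is a hypothetical helper name.

```python
def ejection_seconds(base_ejection_time_s, times_ejected, max_ejection_time_s=300.0):
    """Estimate how long a host stays ejected, ignoring jitter."""
    if times_ejected < 1:
        return 0.0
    # The cap never drops below base_ejection_time itself.
    cap = max(max_ejection_time_s, base_ejection_time_s)
    return float(min(base_ejection_time_s * times_ejected, cap))


if __name__ == "__main__":
    # With base_ejection_time: 30s, print the estimate for the first ejections.
    for n in range(1, 4):
        print(n, ejection_seconds(30, n))
```

A host that keeps failing is therefore ejected for progressively longer periods, which is why waiting for the ejection to expire is only practical for the first few ejections.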
Common Pitfalls
- Health check path missing - App doesn't have health endpoint
- Timeout shorter than app response - Checks fail unnecessarily
- Wrong expected status range - Expecting 200 but app returns 201
- Outlier detection ejecting good hosts - Too aggressive thresholds
- TLS on non-TLS endpoint - Health check fails due to protocol mismatch
- EDS not updating - Control plane issue, stale endpoints
- Missing Host header - App requires specific Host header
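The status range pitfall comes from the half-open semantics of `expected_statuses`: `start` is inclusive and `end` is exclusive, and when no ranges are configured only 200 counts as healthy. A small sketch of that check follows; `status_expected` and `DEFAULT_RANGES` are illustrative names, not part of Envoy.

```python
# With no expected_statuses configured, only 200 is healthy: the range [200, 201).
DEFAULT_RANGES = [(200, 201)]


def status_expected(status, ranges=DEFAULT_RANGES):
    """Return True if status falls in any half-open [start, end) range."""
    return any(start <= status < end for start, end in ranges)


if __name__ == "__main__":
    print(status_expected(201))                # default policy rejects 201
    print(status_expected(201, [(200, 300)]))  # any 2xx accepted
    print(status_expected(300, [(200, 300)]))  # end of the range is exclusive
```

An application that answers its health endpoint with 201 or 204 therefore needs an explicit range such as `start: 200, end: 300`.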
Best Practices
```yaml
# Complete cluster configuration
static_resources:
  clusters:
  - name: backend_service
    type: EDS
    eds_cluster_config:
      eds_config:
        resource_api_version: V3
        api_config_source:
          api_type: GRPC
          transport_api_version: V3
          grpc_services:
          - envoy_grpc:
              cluster_name: xds_cluster   # must be defined as its own cluster
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 1024
        max_pending_requests: 1024
    health_checks:
    - timeout: 10s
      interval: 15s
      unhealthy_threshold: 3
      healthy_threshold: 2
      http_health_check:
        path: "/healthz"
        host: "backend.internal"
        expected_statuses:
        - start: 200
          end: 300
    outlier_detection:
      consecutive_5xx: 10
      interval: 30s
      base_ejection_time: 30s
      max_ejection_percent: 30
```
Related Issues
- Istio Destination Rule Error
- HAProxy Health Check Failing
- Envoy Upstream Connection Timeout
- Consul Connect Proxy Error