Introduction

An Envoy cluster becomes unhealthy when its endpoints fail health checks, when network problems prevent connections, or when a configuration error breaks the cluster definition. When a cluster has no healthy hosts, Envoy cannot route traffic to that destination, and clients see 503 responses (typically with the body "no healthy upstream") or connection failures. Both active health checking and passive outlier detection can remove hosts from rotation, so a misconfiguration in either one can empty a cluster.

Symptoms

Error messages in Envoy logs:

```text
[health_check][C0] health check failure for endpoint 10.0.0.1:8080
Cluster "backend" has no healthy hosts
Upstream cluster unhealthy, no endpoints available
Health check active timeout
```

Observable indicators:

  • Envoy admin interface shows cluster health 0%
  • Upstream requests failing with 503
  • cluster.backend.membership_healthy at 0 while cluster.backend.membership_total is nonzero
  • Health check failure metrics increasing
  • Outlier detection ejecting endpoints

Common Causes

  1. Health check path wrong - Endpoint path returns 404 or wrong status
  2. Health check timeout too short - Backend slower than configured timeout
  3. Network connectivity blocked - Firewall or routing preventing connections
  4. EDS/CDS updates delayed - Control plane not updating endpoints
  5. Outlier detection too aggressive - Legitimate endpoints ejected
  6. TLS configuration mismatch - HTTPS endpoints with HTTP health check
  7. Cluster configuration error - Wrong endpoint type or discovery mechanism

Step-by-Step Fix

Step 1: Check Cluster Status via Admin Interface

```bash
# Access Envoy admin interface (typically port 9901)
curl -s http://localhost:9901/clusters

# Get specific cluster info (quote the URL so the shell does not expand "?")
curl -s "http://localhost:9901/clusters?format=json" | jq '.cluster_statuses[] | select(.name=="backend")'

# Show every line for the cluster (per-host stats and health flags)
curl -s http://localhost:9901/clusters | grep backend

# List all endpoints with their health flags
curl -s http://localhost:9901/clusters | grep "backend::.*::health_flags"
```
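The text output of `/clusters` can be long, so a small filter helps turn it into a single health figure. The sketch below assumes the usual `cluster::host:port::health_flags::value` line layout and IPv4 endpoints (an IPv6 address containing `::` would confuse the field split). The function name `cluster_health_summary` is hypothetical, not part of Envoy.

```bash
# Summarize healthy vs. total hosts for one cluster.
# Reads /clusters text output on stdin; takes the cluster name as $1.
cluster_health_summary() {
  local cluster="$1"
  awk -F'::' -v c="$cluster" '
    # Only the health_flags line of each host in the requested cluster counts
    $1 == c && $3 == "health_flags" {
      total++
      if ($4 == "healthy") healthy++
    }
    END { printf "%d/%d healthy\n", healthy, total }
  '
}

# Usage (assumes the admin interface listens on localhost:9901):
#   curl -s http://localhost:9901/clusters | cluster_health_summary backend
```

A result such as `0/3 healthy` confirms that every host is failing, while `0/0 healthy` suggests the cluster has no endpoints at all, which points at discovery rather than health checking.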

Step 2: Check Health Check Status

```bash
# View health check statistics (entries without a name, such as histograms, are skipped)
curl -s "http://localhost:9901/stats?format=json" | jq '.stats[] | select((.name // "") | contains("health_check"))'

# Check health check failures
curl -s http://localhost:9901/stats | grep health_check

# View outlier detection stats
curl -s http://localhost:9901/stats | grep outlier_detection
```
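Raw counters are easier to judge as a ratio. The following sketch computes the share of failed health checks for one cluster from the `/stats` text output, assuming the counters appear as `cluster.<name>.health_check.attempt` and `cluster.<name>.health_check.failure`. The function name `hc_failure_rate` is hypothetical.

```bash
# Print the health check failure rate for one cluster.
# Reads /stats text output on stdin; takes the cluster name as $1.
hc_failure_rate() {
  local cluster="$1"
  awk -v p="cluster.${cluster}.health_check." '
    # Keep only counters that start with the cluster health_check prefix
    index($1, p) == 1 {
      name = substr($1, length(p) + 1)
      sub(/:$/, "", name)          # drop the trailing colon from the stat name
      v[name] = $2
    }
    END {
      if (v["attempt"] + 0 > 0)
        printf "%.1f%%\n", 100 * v["failure"] / v["attempt"]
      else
        print "no attempts"
    }
  '
}

# Usage (assumes the admin interface listens on localhost:9901):
#   curl -s http://localhost:9901/stats | hc_failure_rate backend
```

These counters are cumulative since Envoy started, so a cluster that has recovered can still show a high rate. Comparing two readings taken a minute apart gives a better picture of the current state.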

Step 3: Check Backend Connectivity

```bash
# Test endpoint directly from Envoy container/host
curl -v http://10.0.0.1:8080/health

# Test TCP connectivity
nc -zv 10.0.0.1 8080

# Check response timing
curl -w "TTFB: %{time_starttransfer}s Total: %{time_total}s\n" \
  http://10.0.0.1:8080/health
```

Step 4: Review Health Check Configuration

```yaml
# Static configuration
static_resources:
  clusters:
    - name: backend
      type: STATIC
      connect_timeout: 5s
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: backend
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 10.0.0.1
                      port_value: 8080
      health_checks:
        - timeout: 5s
          interval: 10s
          unhealthy_threshold: 3
          healthy_threshold: 2
          http_health_check:
            path: "/health"
            expected_statuses:
              - start: 200
                end: 300
```
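Two settings in this configuration decide whether a single probe passes: the `timeout` and the `expected_statuses` range, whose `end` is exclusive (so `start: 200, end: 300` accepts 200 through 299). The sketch below applies the same two rules to a status code and response time you measured yourself. It is an approximation for reasoning about the config, not Envoy's implementation, and `hc_would_pass` is a hypothetical helper name.

```bash
# Return success (exit 0) if a probe with the given status and duration would
# satisfy the configured timeout and half-open status range [start, end).
# Args: status elapsed_seconds timeout_seconds [start] [end]
hc_would_pass() {
  local status="$1" elapsed="$2" timeout="$3"
  local start="${4:-200}" end="${5:-201}"   # assumed default: only 200 passes
  awk -v s="$status" -v e="$elapsed" -v t="$timeout" -v lo="$start" -v hi="$end" \
    'BEGIN { exit !((s + 0 >= lo + 0) && (s + 0 < hi + 0) && (e + 0 <= t + 0)) }'
}

# Usage with a real measurement (address taken from the example above):
#   read -r code secs < <(curl -s -o /dev/null --max-time 10 \
#       -w '%{http_code} %{time_total}' http://10.0.0.1:8080/health)
#   hc_would_pass "$code" "$secs" 5 200 300 && echo pass || echo fail
```

Running it with the measured values from Step 3 shows quickly whether the failure comes from the status code or from a response slower than the timeout.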

Step 5: Fix Health Check Configuration

```yaml
# Robust health check configuration
health_checks:
  - timeout: 10s            # Allow more time for slow responses
    interval: 15s           # Check interval
    unhealthy_threshold: 3  # Failures needed to mark unhealthy
    healthy_threshold: 2    # Successes needed to mark healthy
    http_health_check:
      path: "/healthz"      # Correct path
      host: "backend.example.com"  # Host header sent with the probe (defaults to the cluster name)
      expected_statuses:
        - start: 200
          end: 400          # Accept 200-399 (end is exclusive)
```

Step 6: Fix TLS Health Checks

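```yaml
# HTTPS health check: probes reuse the cluster's transport socket,
# so TLS is configured on the cluster, not inside http_health_check
clusters:
  - name: backend
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        sni: "backend.example.com"
        common_tls_context:
          validation_context:
            trusted_ca:
              inline_string: |
                -----BEGIN CERTIFICATE-----
                ...
                -----END CERTIFICATE-----
    health_checks:
      - timeout: 10s
        interval: 15s
        unhealthy_threshold: 3
        healthy_threshold: 2
        http_health_check:
          path: "/health"

# Or use a connect-only TCP health check
# (an empty payload only tests the connection; send.text, if set, must be hex-encoded)
health_checks:
  - timeout: 10s
    interval: 15s
    unhealthy_threshold: 3
    healthy_threshold: 2
    tcp_health_check: {}
```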

Step 7: Adjust Outlier Detection

```yaml
# Prevent aggressive ejection
cluster:
  name: backend
  outlier_detection:
    consecutive_5xx: 10         # More failures before ejection (default: 5)
    interval: 30s
    base_ejection_time: 30s
    max_ejection_percent: 50    # Never eject more than half the hosts (default: 10; lower keeps more hosts in rotation)
    enforcing_consecutive_5xx: 100
    enforcing_success_rate: 50  # Reduce success rate enforcement
```
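When tuning `base_ejection_time`, it helps to know how long a repeatedly ejected host stays out. The sketch below assumes the rule described in Envoy's outlier detection documentation: the ejection time is the base time multiplied by the number of times the host has been ejected, capped at `max_ejection_time` (assumed default 300s). It ignores any configured jitter, and `ejection_seconds` is a hypothetical helper name.

```bash
# Estimate the ejection duration in whole seconds.
# Args: base_ejection_seconds times_ejected [max_ejection_seconds]
ejection_seconds() {
  local base="$1" times_ejected="$2" max="${3:-300}"
  local secs=$(( base * times_ejected ))
  # Apply the cap so repeated ejections do not grow without bound
  if (( secs > max )); then
    secs=$max
  fi
  echo "$secs"
}

# Usage: third ejection with the 30s base from the example above
#   ejection_seconds 30 3
```

With a 30s base, a host that keeps failing reaches the 300s cap on its tenth ejection, which is worth keeping in mind when a "fixed" backend seems slow to return to rotation.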

Step 8: Check EDS Updates (Dynamic Config)

```bash
# For dynamic xDS configuration
# Check the control plane connection (1 = connected, 0 = disconnected)
curl -s http://localhost:9901/stats | grep control_plane.connected_state

# Check cluster dynamic config
curl -s http://localhost:9901/config_dump | grep -A20 "dynamic_active_clusters"

# View EDS endpoints (only included in the dump when include_eds is set)
curl -s "http://localhost:9901/config_dump?include_eds" | jq '.configs[] | select(.["@type"] | endswith("EndpointsConfigDump"))'

# Check xDS update failures and rejections
curl -s http://localhost:9901/stats | grep -E "update_(failure|rejected)"
```

Step 9: Verify the Fix

```bash
# Check cluster health after changes
curl -s http://localhost:9901/clusters | grep backend

# Monitor health check stats
curl -s http://localhost:9901/stats | grep -E "backend.*health"

# Test through Envoy
curl -v http://envoy-listener:8080/

# Watch endpoint status
watch -n 1 'curl -s http://localhost:9901/clusters | grep backend'
```

Advanced Diagnosis

Check Envoy Logs

```bash
# View Envoy logs (Envoy logs to stderr by default; -E enables "|" alternation)
docker logs envoy-proxy 2>&1 | grep -iE "health_check|cluster"

# Check for connection errors
docker logs envoy-proxy 2>&1 | grep -iE "connection refused|timeout"

# Enable debug logging
curl -X POST "http://localhost:9901/logging?level=debug"

# Reset logging
curl -X POST "http://localhost:9901/logging?level=info"
```

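Drain Envoy Gracefully

```bash
# The admin API has no per-endpoint drain. /healthcheck/fail makes this Envoy
# instance fail its own inbound health checks (requires the HTTP health check filter).
# To drain a single upstream endpoint, remove it from EDS or the load_assignment instead.
curl -X POST http://localhost:9901/healthcheck/fail

# Check the upstream endpoint's current health flags
curl -s http://localhost:9901/clusters | grep "10.0.0.1"

# Resume passing inbound health checks
curl -X POST http://localhost:9901/healthcheck/ok
```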

Check Circuit Breaker

```bash
# Check circuit breaker state (*_open gauges) and overflow counters
curl -s http://localhost:9901/stats | grep -E "circuit_breakers|overflow"
```

```yaml
# Adjust circuit breaker
cluster:
  name: backend
  circuit_breakers:
    thresholds:
      - priority: DEFAULT
        max_connections: 1000
        max_pending_requests: 100
        max_requests: 1000
        max_retries: 3
```
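A tripped circuit breaker can look like an unhealthy cluster from the client side, so it is worth ruling out before raising thresholds. The sketch below lists breakers that are currently open, assuming the gauges appear in `/stats` as `cluster.<name>.circuit_breakers.<priority>.<kind>_open` with a value of 1 when open. The function name `open_circuit_breakers` is hypothetical.

```bash
# Print the names of circuit breaker gauges that are currently open.
# Reads /stats text output on stdin.
open_circuit_breakers() {
  awk '
    # Match only the *_open gauges under circuit_breakers, and only when set to 1
    $1 ~ /\.circuit_breakers\..*_open:$/ && $2 + 0 == 1 {
      sub(/:$/, "", $1)   # drop the trailing colon from the stat name
      print $1
    }
  '
}

# Usage (assumes the admin interface listens on localhost:9901):
#   curl -s http://localhost:9901/stats | open_circuit_breakers
```

Empty output means no breaker is open at that instant. The overflow counters still show whether one tripped earlier.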

Reset Outlier Detection

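```bash
# The admin API has no endpoint that clears outlier ejections; ejected hosts
# return on their own once their ejection time expires.
# Reset stat counters to get a clean baseline while watching recovery
curl -X POST http://localhost:9901/reset_counters

# Or restart Envoy to reset all state
docker restart envoy-proxy
```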

Common Pitfalls

  • Health check path missing - App doesn't have health endpoint
  • Timeout shorter than app response - Checks fail unnecessarily
  • Wrong expected status range - Expecting 200 but app returns 201
  • Outlier detection ejecting good hosts - Too aggressive thresholds
  • TLS on non-TLS endpoint - Health check fails due to protocol mismatch
  • EDS not updating - Control plane issue, stale endpoints
  • Missing Host header - App requires specific Host header

Best Practices

```yaml
# Complete cluster configuration
static_resources:
  clusters:
    - name: backend_service
      type: EDS
      eds_cluster_config:
        eds_config:
          resource_api_version: V3
          api_config_source:
            api_type: GRPC
            transport_api_version: V3
            grpc_services:
              - envoy_grpc:
                  cluster_name: xds_cluster
      connect_timeout: 5s
      lb_policy: ROUND_ROBIN
      circuit_breakers:
        thresholds:
          - priority: DEFAULT
            max_connections: 1024
            max_pending_requests: 1024
      health_checks:
        - timeout: 10s
          interval: 15s
          unhealthy_threshold: 3
          healthy_threshold: 2
          http_health_check:
            path: "/healthz"
            host: "backend.internal"  # Host header sent with the probe
            expected_statuses:
              - start: 200
                end: 300
      outlier_detection:
        consecutive_5xx: 10
        interval: 30s
        base_ejection_time: 30s
        max_ejection_percent: 30
```