Introduction

GCP Load Balancer backend down errors occur when backend instances, instance groups, or backend services are marked unhealthy and removed from traffic distribution. This can result from health check failures, firewall rules blocking probe traffic, instance group issues, or backend service configuration errors. GCP's global load balancing architecture adds complexity with cross-region health checks.

Symptoms

Error indicators in GCP Console:

bash
Backend instance is unhealthy
Health check failed for backend
No healthy backends available
Backend service has 0 healthy instances

Observable indicators: - Backend instances show "UNHEALTHY" status - HTTP 502/503 responses from load balancer - Traffic not reaching backend service - backend_request_count metric at zero - Cloud Monitoring shows health check failures

Common Causes

  1. 1.Health check firewall rules missing - 1300.61.0.0/16 not allowed
  2. 2.Health check configuration mismatch - Wrong path, port, or protocol
  3. 3.Instance group configuration issues - Autoscaler, template, or zone problems
  4. 4.Backend service timeout too short - Backend can't respond in time
  5. 5.Instance not running - VM stopped or crashed
  6. 6.Application not listening on port - Service not started on expected port
  7. 7.SSL certificate issues - HTTPS health check with invalid cert

Step-by-Step Fix

Step 1: Check Backend Health Status

```bash # List backend services gcloud compute backend-services list

# Get backend health gcloud compute backend-services get-health my-backend --global

# Check specific backend instance gcloud compute backend-services get-health my-backend --global \ --format="table(name,healthState,healthCheck)"

# List instance groups gcloud compute instance-groups list

# Check instance group instances gcloud compute instance-groups list-instances my-ig --zone us-central1-a ```

Step 2: Check Health Check Configuration

```bash # List health checks gcloud compute health-checks list

# Get health check details gcloud compute health-checks describe my-health-check

# Check HTTP health check gcloud compute http-health-checks describe my-http-health-check

# Check HTTPS health check gcloud compute https-health-checks describe my-https-health-check ```

Step 3: Test Health Endpoint from Instance

```bash # SSH to backend instance gcloud compute ssh my-instance --zone us-central1-a

# Test health endpoint locally curl -v http://localhost:80/health

# Test exact path and port curl -v http://127.0.0.1:80/health

# Check timing curl -w "TTFB: %{time_starttransfer}s\n" http://localhost/health

# Check if app is listening sudo netstat -tlnp | grep :80 sudo ss -tlnp | grep :80 ```

Step 4: Check Firewall Rules for Health Checks

```bash # List firewall rules gcloud compute firewall-rules list --filter="network:default"

# Check for health check rule gcloud compute firewall-rules describe allow-health-checks

# Check if 35.191.0.0/16 and 130.211.0.0/22 are allowed gcloud compute firewall-rules list \ --filter="sourceRanges:35.191.0.0/16 OR sourceRanges:130.211.0.0/22" ```

```bash # Create firewall rule for health checks gcloud compute firewall-rules create allow-health-checks \ --network default \ --allow tcp:80,tcp:443 \ --source-ranges 35.191.0.0/16,130.211.0.0/22 \ --target-tags health-check-allow

# Apply tag to instances gcloud compute instances add-tags my-instance \ --tags health-check-allow --zone us-central1-a ```

Step 5: Fix Health Check Configuration

```bash # Update HTTP health check gcloud compute health-checks update http my-health-check \ --port 80 \ --request-path /health \ --check-interval 30s \ --timeout 10s \ --healthy-threshold 2 \ --unhealthy-threshold 3

# Create new health check gcloud compute health-checks create http app-health-check \ --port 80 \ --request-path /health \ --check-interval 30s \ --timeout 10s \ --healthy-threshold 2 \ --unhealthy-threshold 3

# For HTTPS gcloud compute health-checks create https app-https-health-check \ --port 443 \ --request-path /health \ --check-interval 30s ```

Step 6: Update Backend Service

```bash # Update backend service health check gcloud compute backend-services update my-backend --global \ --health-checks app-health-check

# Update timeout settings gcloud compute backend-services update my-backend --global \ --timeout-sec 60

# Check backend service configuration gcloud compute backend-services describe my-backend --global ```

Step 7: Fix Instance Group Issues

```bash # Check instance template gcloud compute instance-templates describe my-template

# Update instance group gcloud compute instance-groups managed set-instance-template my-ig \ --template my-new-template --zone us-central1-a

# Recreate unhealthy instances gcloud compute instance-groups managed recreate-instances my-ig \ --instances my-instance-1 --zone us-central1-a

# Check autoscaler gcloud compute instance-groups managed describe my-ig --zone us-central1-a ```

Step 8: Verify the Fix

```bash # Check backend health after changes gcloud compute backend-services get-health my-backend --global

# Test through load balancer curl -v http://lb-ip-address/

# Monitor health check metrics gcloud monitoring read "metric.type=loadbalancing.googleapis.com/https/backend_request_count" \ --project=my-project --freshness=1h ```

Advanced Diagnosis

Check Health Check Logs

```bash # Enable logging for health checks # Cloud Logging automatically captures health check probe logs

# Query health check logs gcloud logging read "resource.type=gce_instance AND jsonPayload.healthCheckProbeName=my-health-check" \ --project=my-project --freshness=1h

# Filter for failed probes gcloud logging read "resource.type=gce_instance AND jsonPayload.healthCheckProbeResult=UNHEALTHY" \ --project=my-project --freshness=1h ```

Check Instance Status

```bash # Get instance status gcloud compute instances describe my-instance --zone us-central1-a \ --format="table(name,status)"

# Check instance serial console gcloud compute instances get-serial-port-output my-instance --zone us-central1-a

# Check for instance errors gcloud compute operations list --filter="targetLink:my-instance AND status:DONE" ```

Regional vs Global Health Checks

```bash # Regional health check for internal LB gcloud compute health-checks create http regional-health-check \ --region us-central1 \ --port 80 \ --request-path /health

# Use regional health check with regional backend service gcloud compute backend-services create my-internal-backend \ --region us-central1 \ --health-checks regional-health-check \ --load-balancing-scheme INTERNAL ```

TCP/SSL Health Checks

```bash # TCP health check (no HTTP) gcloud compute health-checks create tcp tcp-health-check \ --port 3306 \ --check-interval 30s

# SSL/TLS health check gcloud compute health-checks create ssl ssl-health-check \ --port 443 \ --check-interval 30s ```

Common Pitfalls

  • Missing firewall rule - Most common cause of GCP health check failures
  • Wrong firewall source ranges - Using incorrect IP ranges for probes
  • HTTP health check on HTTPS port - Protocol mismatch
  • Instance tag not applied - Firewall rule targets wrong instances
  • Global vs regional mismatch - Health check region doesn't match backend
  • Timeout shorter than app response - Health check too aggressive
  • Unhealthy threshold too low - Single failure marks unhealthy

Best Practices

```bash # Create comprehensive firewall rule gcloud compute firewall-rules create allow-all-health-checks \ --network default \ --allow tcp \ --source-ranges 35.191.0.0/16,130.211.0.0/22,209.85.152.0/22,209.85.204.0/22 \ --target-tags health-check-allow \ --description "Allow GCP health check probes"

# Create robust health check gcloud compute health-checks create http robust-health-check \ --port 80 \ --request-path /health \ --response-body "healthy" \ --check-interval 30s \ --timeout 10s \ --healthy-threshold 2 \ --unhealthy-threshold 3 \ --log-success

# Configure backend service properly gcloud compute backend-services create app-backend --global \ --protocol HTTP \ --port-name http \ --timeout-sec 60 \ --health-checks robust-health-check \ --enable-cdn false ```

```yaml # Terraform configuration resource "google_compute_health_check" "app" { name = "app-health-check"

http_health_check { port = 80 request_path = "/health" }

check_interval_sec = 30 timeout_sec = 10 healthy_threshold = 2 unhealthy_threshold = 3 }

resource "google_compute_firewall" "health_checks" { name = "allow-health-checks" network = google_compute_network.default.name

allow { protocol = "tcp" ports = ["80", "443"] }

source_ranges = [ "35.191.0.0/16", "130.211.0.0/22" ]

target_tags = ["health-check-allow"] }

resource "google_compute_backend_service" "app" { name = "app-backend" protocol = "HTTP" timeout_sec = 60

health_checks = [google_compute_health_check.app.id]

backend { group = google_compute_instance_group_manager.app.instance_group } } ```

  • AWS ALB Health Check Failing
  • Azure Load Balancer Health Probe
  • HAProxy Backend Down
  • F5 BIG-IP Pool Member Down