Introduction

Load balancer 503 Service Unavailable errors occur when the load balancer cannot route requests to any healthy backend server, so requests fail with HTTP status 503. The error means the load balancer itself is operational but has no valid targets to forward traffic to. Common causes include all backend servers failing health checks, a misconfigured health check endpoint (or one returning non-2xx responses), overloaded backends that do not respond within the timeout, connection pool exhaustion between the load balancer and its backends, expired SSL/TLS certificates on backend servers, network security groups blocking load balancer probes, backend applications stuck in a crash/restart cycle, DNS resolution failures for backend hostnames, upstream timeouts set shorter than actual response times, and load balancer configuration errors (wrong ports, protocols, or paths). The fix requires systematic diagnosis of health check status, backend connectivity, application response times, and load balancer configuration. This guide provides production-proven troubleshooting for 503 errors across AWS Application Load Balancer, NGINX, HAProxy, Kubernetes ingress controllers, and cloud-native load balancing services.
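
A useful first triage step is to compare the status code returned through the load balancer with the code returned by a backend directly: if a backend answers 200 but the load balancer returns 503, the problem sits between the two. A minimal sketch (the hostnames in the usage comment are placeholders):

```bash
#!/usr/bin/env bash
# Compare the response through the load balancer with a direct backend response
# to localize the failure.
# classify LB_CODE BACKEND_CODE -> where the failure most likely sits
classify() {
  local lb_code=$1 backend_code=$2
  if [ "$lb_code" = "503" ] && [ "$backend_code" = "200" ]; then
    echo "lb-to-backend path (health checks, security groups, ports)"
  elif [ "$lb_code" = "503" ]; then
    echo "backend itself (application down or unhealthy)"
  else
    echo "no 503 observed"
  fi
}

# Usage (replace hosts with your own):
# lb=$(curl -s -o /dev/null -w '%{http_code}' http://lb.example.com/health)
# be=$(curl -s -o /dev/null -w '%{http_code}' http://10.0.1.1:8080/health)
# classify "$lb" "$be"
```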

Symptoms

  • HTTP 503 Service Unavailable for all requests through load balancer
  • Intermittent 503 errors affecting subset of requests
  • Load balancer health check dashboard shows all backends unhealthy
  • Backend servers accessible directly but not through load balancer
  • Upstream connect or read timeout errors in load balancer logs
  • Connection refused errors to backend ports
  • SSL handshake failures between load balancer and backends
  • Backend servers show high CPU/memory during 503 periods
  • Load balancer metrics show zero healthy hosts
  • Kubernetes ingress shows endpoints but 503 in access logs
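
The first two symptoms can be told apart by sampling the endpoint repeatedly and computing the 503 rate: 100% points at all backends unhealthy, a partial rate at a subset of backends failing. A small sketch (the URL in the usage comment is a placeholder):

```bash
# Print what percentage of the given status codes are 503.
rate_503() {
  local total=0 errors=0 code
  for code in "$@"; do
    total=$((total + 1))
    [ "$code" = "503" ] && errors=$((errors + 1))
  done
  echo $((errors * 100 / total))
}

# Usage: collect codes from the load balancer, then compute the rate:
# codes=$(for i in $(seq 20); do curl -s -o /dev/null -w '%{http_code} ' http://lb.example.com/; done)
# rate_503 $codes   # 100 = total outage; between 0 and 100 = subset failing
```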

Common Causes

  • All backend servers failing health check probes
  • Health check endpoint returns 4xx or 5xx status
  • Health check timeout shorter than application response time
  • Backend application process crashed or restarting
  • Firewall/security group blocking load balancer IP ranges
  • Backend port mismatch (load balancer configured for wrong port)
  • Connection pool exhausted, new connections rejected
  • SSL certificate expired on backend servers (for HTTPS upstreams)
  • DNS resolution failures for backend hostnames
  • Backend server resource exhaustion (CPU, memory, file descriptors)
  • Load balancer rate limiting or WAF blocking traffic
  • Kubernetes pods not ready (readiness probe failing)

Step-by-Step Fix

### 1. Diagnose backend health status

Check load balancer health check status:

```bash
# AWS ALB - Check target group health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/id

# Output shows:
# {
#   "TargetHealthDescriptions": [
#     {
#       "Target": {"Id": "10.0.1.1", "Port": 8080},
#       "HealthCheckPort": "8080",
#       "TargetHealth": {
#         "State": "unhealthy",
#         "Reason": "Target.FailedHealthChecks",
#         "Description": "Health checks failed"
#       }
#     }
#   ]
# }

# AWS ALB - View health check configuration
aws elbv2 describe-target-groups \
  --target-group-arns <arn>

# Key settings:
# - HealthyThresholdCount (default: 5)
# - UnhealthyThresholdCount (default: 2)
# - HealthCheckIntervalSeconds (default: 30)
# - HealthCheckTimeoutSeconds (default: 5)
# - HealthCheckPath (default: /)
# - HealthCheckProtocol (HTTP/HTTPS)
# - Matcher (HTTP codes considered healthy, default: 200)
```
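
After changing health check settings, a target needs roughly interval × threshold seconds to recover, so a polling loop beats guessing. A sketch with a generic predicate, so the same loop works for any CLI; the `aws` invocation in the usage comment is only one possibility:

```bash
# Poll a command until it succeeds or the retry budget is exhausted.
# wait_until MAX_TRIES SLEEP_SECONDS CMD...
wait_until() {
  local max=$1 delay=$2; shift 2
  local i
  for i in $(seq "$max"); do
    if "$@"; then
      echo "healthy after $i checks"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after $max checks"
  return 1
}

# Usage against an ALB target group (replace <arn>):
# wait_until 20 15 sh -c \
#   'aws elbv2 describe-target-health --target-group-arn <arn> \
#      --query "TargetHealthDescriptions[?TargetHealth.State==\`healthy\`]" \
#      --output text | grep -q .'
```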

NGINX upstream status:

```bash
# Check NGINX connection status (stub_status module)
# Enable in nginx.conf:
#   location /nginx_status {
#     stub_status;
#     allow 127.0.0.1;
#     deny all;
#   }
# (Per-upstream health status requires the NGINX Plus /api endpoint)

curl http://nginx-host/nginx_status

# Check error logs for upstream failures
tail -f /var/log/nginx/error.log | grep -i upstream

# Common errors:
# - upstream timed out (110: Connection timed out)
# - connect() failed (111: Connection refused)
# - no live upstreams while connecting to upstream
# - upstream prematurely closed connection

# Test backend connectivity from the NGINX server
curl -v http://backend-server:port/health
```

HAProxy backend status:

```bash
# HAProxy stats page (enable in config)
#   defaults
#     stats enable
#     stats uri /haproxy?stats

curl http://haproxy-host/haproxy?stats

# HAProxy runtime socket
echo "show servers state" | socat stdio /var/run/haproxy.sock
echo "show stat" | socat stdio /var/run/haproxy.sock

# Check HAProxy logs
tail -f /var/log/haproxy.log

# Common log patterns:
# - "Layer4 timeout" - TCP connection timeout
# - "Layer4 connection problem" - Connection refused
# - "Layer7 invalid response" - Backend returned invalid HTTP
# - "no server available" - All backends down
```

Kubernetes ingress/pod status:

```bash
# Check ingress controller pods
kubectl get pods -n ingress-nginx

# Check endpoint slices (Kubernetes 1.21+)
kubectl get endpointslices -n default

# Check specific service endpoints
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>

# Check pod readiness
kubectl get pods -l app=<app-label>
kubectl describe pod <pod-name> | grep -A5 "Ready"

# View readiness probe configuration
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].readinessProbe}'

# Check ingress resource
kubectl describe ingress <ingress-name>
kubectl get ingress <ingress-name> -o yaml
```

### 2. Fix health check configuration

Adjust health check parameters:

```hcl
# AWS ALB - Terraform configuration
resource "aws_lb_target_group" "main" {
  name     = "main-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  # Health check settings
  health_check {
    enabled             = true
    path                = "/health"
    protocol            = "HTTP"
    interval            = 30          # Seconds between checks
    timeout             = 10          # Must be less than interval
    healthy_threshold   = 2           # Consecutive successes needed
    unhealthy_threshold = 3           # Consecutive failures before marking unhealthy
    matcher             = "200-299"   # Accept any 2xx as healthy
  }

  # Stickiness (if needed)
  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400
  }
}

# Common fixes:
# - Increase timeout if application slow to respond
# - Reduce unhealthy_threshold for faster detection
# - Change path to lightweight health endpoint
# - Adjust matcher to accept application-specific codes
```
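
The same settings can be changed in place with the AWS CLI; the invariant that most often trips people up is that the health check timeout must stay below the interval, so it is worth asserting before applying. A sketch (`<arn>` is a placeholder):

```bash
# AWS requires health_check_timeout < health_check_interval.
hc_valid() {
  local interval=$1 timeout=$2
  [ "$timeout" -lt "$interval" ]
}

interval=30 timeout=10
if hc_valid "$interval" "$timeout"; then
  echo "ok: timeout $timeout < interval $interval"
  # Apply to an existing target group (replace <arn>):
  # aws elbv2 modify-target-group \
  #   --target-group-arn <arn> \
  #   --health-check-path /health \
  #   --health-check-interval-seconds "$interval" \
  #   --health-check-timeout-seconds "$timeout" \
  #   --healthy-threshold-count 2 \
  #   --unhealthy-threshold-count 3
fi
```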

NGINX health check configuration:

```nginx
# nginx.conf - Upstream with health checks
upstream backend {
    least_conn;  # Load balancing algorithm

    server 10.0.1.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.3:8080 max_fails=3 fail_timeout=30s backup;

    # Keepalive connections to backends
    keepalive 32;
    keepalive_timeout 60s;
    keepalive_requests 1000;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;

        # Timeout configuration
        proxy_connect_timeout 5s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Retry on upstream failures
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 3;

        # Health check endpoint (NGINX Plus)
        # health_check interval=5s fails=3 passes=2 uri=/health;
    }

    # Application health endpoint
    location /health {
        access_log off;
        add_header Content-Type text/plain;
        return 200 "healthy\n";
    }
}
```

HAProxy health check configuration:

```haproxy
# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    maxconn 4096

defaults
    log global
    mode http
    option httplog
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    timeout http-request 10s
    retries 3

frontend http_front
    bind *:80
    default_backend app_servers

backend app_servers
    balance roundrobin

    # Health check configuration
    option httpchk GET /health HTTP/1.1\r\nHost:\ localhost
    http-check expect status 200

    # Backend servers with health check parameters
    server app1 10.0.1.1:8080 check inter 5s fall 3 rise 2
    server app2 10.0.1.2:8080 check inter 5s fall 3 rise 2
    server app3 10.0.1.3:8080 check inter 5s fall 3 rise 2 backup

# check: Enable health checks
# inter: Interval between checks (5s)
# fall:  Failures before marking down (3)
# rise:  Successes before marking up (2)
```

Kubernetes readiness probe configuration:

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app  # Must match the selector
    spec:
      containers:
        - name: app
          image: my-app:latest
          ports:
            - containerPort: 8080

          # Readiness probe (determines if pod receives traffic)
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10   # Wait before first probe
            periodSeconds: 10         # Between probes
            timeoutSeconds: 5         # Probe timeout
            successThreshold: 1       # Successes to mark ready
            failureThreshold: 3       # Failures to mark not-ready

          # Liveness probe (determines if pod should be restarted)
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3

          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```
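
After applying the deployment, confirm that pods actually become Ready before declaring the fix done; the READY column of `kubectl get pods` is easy to parse. A sketch (the parsing assumes the default column layout):

```bash
# Count pods whose READY column shows all containers ready (e.g. "1/1").
# Reads `kubectl get pods` output on stdin.
count_ready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] == r[2]) n++ } END { print n + 0 }'
}

# Usage:
# kubectl rollout status deployment/my-app
# kubectl get pods -l app=my-app | count_ready   # should equal spec.replicas
```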

### 3. Fix backend connectivity issues

Test backend accessibility:

```bash
# From load balancer instance, test backend connectivity
# AWS: Use an EC2 instance in the same subnet as the ALB

# TCP connectivity test
nc -zv backend-ip 8080
telnet backend-ip 8080

# HTTP health check simulation
curl -v http://backend-ip:8080/health
curl -v https://backend-ip:8443/health --insecure

# Check if backend is listening
ss -tlnp | grep 8080
netstat -tlnp | grep 8080

# Test from application server itself
curl http://localhost:8080/health

# If localhost works but IP doesn't:
# - Application binding to 127.0.0.1 instead of 0.0.0.0
# - Firewall blocking external access
```
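
With more than a couple of backends, it is faster to sweep them all in one pass. A sketch built around a pluggable check command so the loop can be exercised without a network; the IPs in the usage comment are placeholders:

```bash
# Check a list of HOST:PORT backends and report which ones are unreachable.
# sweep CHECK_CMD HOST:PORT...   (CHECK_CMD receives host and port as args)
sweep() {
  local check=$1; shift
  local target host port bad=0
  for target in "$@"; do
    host=${target%:*}
    port=${target#*:}
    if "$check" "$host" "$port"; then
      echo "$target ok"
    else
      echo "$target UNREACHABLE"
      bad=$((bad + 1))
    fi
  done
  return "$bad"   # exit status = number of unreachable backends
}

# Real TCP probe with a 2-second timeout
tcp_check() { nc -z -w 2 "$1" "$2"; }

# Usage:
# sweep tcp_check 10.0.1.1:8080 10.0.1.2:8080 10.0.1.3:8080
```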

Security group and firewall configuration:

```bash
# AWS Security Group - Allow load balancer to backend
aws ec2 authorize-security-group-ingress \
  --group-id sg-backend \
  --protocol tcp \
  --port 8080 \
  --source-group sg-loadbalancer

# Terraform equivalent
cat > lb_to_backend.tf <<'EOF'
resource "aws_security_group_rule" "lb_to_backend" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.backend.id
  source_security_group_id = aws_security_group.lb.id
  description              = "Allow traffic from load balancer"
}
EOF

# Linux firewall (iptables)
iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.0/16 -j ACCEPT

# Linux firewall (firewalld) - restrict to the LB subnet with a rich rule
firewall-cmd --permanent \
  --add-rich-rule='rule family="ipv4" source address="10.0.0.0/16" port port="8080" protocol="tcp" accept'
firewall-cmd --reload
```

DNS resolution for backend hostnames:

```bash
# If using hostnames instead of IPs
dig backend.example.com
nslookup backend.example.com

# Check /etc/hosts entries
grep backend /etc/hosts

# Test resolution from load balancer
getent hosts backend.example.com

# Kubernetes CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# Fix CoreDNS issues
kubectl rollout restart deployment/coredns -n kube-system
```

### 4. Fix connection pool and timeout issues

NGINX connection tuning:

```nginx
# nginx.conf - Connection pool optimization
http {
    # Upstream keepalive
    upstream backend {
        server 10.0.1.1:8080;
        server 10.0.1.2:8080;
        keepalive 64;  # Idle connections to keep
    }

    server {
        location / {
            proxy_pass http://backend;

            # HTTP/1.1 required for upstream keepalive
            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # Timeout tuning
            proxy_connect_timeout 10s;  # Time to establish connection
            proxy_send_timeout 30s;     # Time between writes to the backend
            proxy_read_timeout 60s;     # Time between reads from the backend

            # Buffer configuration
            proxy_buffering on;
            proxy_buffer_size 4k;
            proxy_buffers 8 4k;

            # Retry configuration
            proxy_next_upstream error timeout http_502 http_503;
            proxy_next_upstream_tries 2;
        }
    }
}
```

HAProxy connection tuning:

```haproxy
global
    maxconn 4096    # Global connection limit
    nbthread 4      # Number of threads

defaults
    timeout connect 10s          # Backend connection timeout
    timeout client 30s           # Client inactivity timeout
    timeout server 60s           # Backend response timeout
    timeout http-request 10s
    timeout http-keep-alive 10s
    timeout queue 30s            # Time in queue if no server available

    # Retry configuration
    retries 3
    option redispatch            # Retry on a different server after failure

backend app_servers
    # Connection limiting per server
    server app1 10.0.1.1:8080 check maxconn 100 maxqueue 50
    server app2 10.0.1.2:8080 check maxconn 100 maxqueue 50

    # Error-rate limiting with stick tables
    # stick-table type ip size 100k expire 30s store http_err_rate(10s)
    # http-request track-sc0 src
    # http-request deny if { sc_http_err_rate(0) gt 50 }
```

AWS ALB connection settings:

```yaml
# CloudFormation - Load Balancer attributes
Resources:
  LoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: application
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: "60"
        - Key: connection_logs.s3.enabled
          Value: "true"
        - Key: deletion_protection.enabled
          Value: "false"

  # Target group attributes
  TargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      TargetGroupAttributes:
        - Key: stickiness.enabled
          Value: "false"
        - Key: deregistration_delay.timeout_seconds
          Value: "300"  # Wait 5 minutes before deregistering
        - Key: slow_start.duration_seconds
          Value: "60"   # Gradually increase traffic
```
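
One timeout interaction worth checking explicitly: the backend's keep-alive timeout should exceed the load balancer's idle timeout, otherwise the backend can close a connection the load balancer is about to reuse, which surfaces as intermittent 5xx errors. A sketch of the check:

```bash
# Backend keep-alive timeout (seconds) must exceed the LB idle timeout.
keepalive_safe() {
  local lb_idle=$1 backend_keepalive=$2
  [ "$backend_keepalive" -gt "$lb_idle" ]
}

# Example: ALB idle_timeout 60s, backend keepalive_timeout 75s
# keepalive_safe 60 75 && echo "safe"
```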

### 5. Fix SSL/TLS backend issues

HTTPS upstream configuration:

```nginx
# NGINX with HTTPS backends
upstream secure_backend {
    server 10.0.1.1:8443;
    server 10.0.1.2:8443;
    keepalive 32;
}

server {
    location / {
        proxy_pass https://secure_backend;

        # SSL verification
        proxy_ssl_verify on;
        proxy_ssl_trusted_certificate /etc/ssl/certs/ca-bundle.crt;
        proxy_ssl_verify_depth 2;
        proxy_ssl_server_name on;  # Send SNI to the backend

        # SSL session reuse
        proxy_ssl_session_reuse on;

        # Protocol and ciphers
        proxy_ssl_protocols TLSv1.2 TLSv1.3;
        proxy_ssl_ciphers HIGH:!aNULL:!MD5;
    }
}
```

HAProxy with SSL backends:

```haproxy
backend secure_app_servers
    mode http
    balance roundrobin
    option httpchk GET /health

    # SSL configuration
    server app1 10.0.1.1:8443 check ssl verify required ca-file /etc/ssl/certs/ca-bundle.crt
    server app2 10.0.1.2:8443 check ssl verify required ca-file /etc/ssl/certs/ca-bundle.crt

    # Or skip verification (not recommended for production)
    # server app1 10.0.1.1:8443 check ssl verify none
```

Certificate validation troubleshooting:

```bash
# Check backend certificate
openssl s_client -connect backend:8443 -showcerts

# Verify certificate chain
openssl verify -CAfile ca-bundle.crt backend.crt

# Check certificate expiration
openssl x509 -in backend.crt -noout -dates

# Common SSL errors:
# - certificate verify failed (CA not trusted)
# - certificate has expired
# - hostname mismatch (CN/SAN doesn't match)
# - self-signed certificate (not in trust store)
```
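
Checking expiry across a fleet is easier with a small helper that converts the certificate's notAfter date into days remaining (GNU `date` assumed; the backend host in the usage comment is a placeholder):

```bash
# Days until a certificate's notAfter date (negative = already expired).
# Assumes GNU date (-d parses arbitrary date strings).
days_left() {
  local not_after=$1
  local end now
  end=$(date -d "$not_after" +%s)
  now=$(date +%s)
  echo $(( (end - now) / 86400 ))
}

# Usage: pull notAfter straight from a live backend:
# exp=$(echo | openssl s_client -connect backend:8443 2>/dev/null \
#        | openssl x509 -noout -enddate | cut -d= -f2)
# days_left "$exp"   # alert if below your renewal threshold, e.g. 14
```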

### 6. Fix Kubernetes-specific 503 issues

Debug ingress controller 503:

```bash
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Look for patterns:
# - "no upstream host found"
# - "service does not exist"
# - "endpoints not found"
# - "connection refused"

# Check service exists and has endpoints
kubectl get service <service-name>
kubectl get endpoints <service-name>

# Verify ingress resource configuration
kubectl get ingress <ingress-name> -o yaml
kubectl describe ingress <ingress-name>

# Check if backend pods are ready
kubectl get pods -l app=<app-label> -o wide
kubectl describe pod <pod-name> | grep -A10 "Conditions"

# Test service directly (bypass ingress)
kubectl run test --rm -it --image=busybox -- sh
# Then, inside the pod:
wget -qO- http://<service-name>.<namespace>.svc.cluster.local/health
```
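
A service with zero endpoint addresses guarantees 503s from the ingress, so counting the registered addresses is a quick sanity check. A sketch; the jsonpath in the usage comment is one way to produce the IP list:

```bash
# Count backend addresses for a service.
# Input: whitespace-separated IP list on stdin; prints the count (0 = guaranteed 503).
count_endpoints() {
  tr -s ' ' '\n' | grep -c . || true
}

# Usage:
# kubectl get endpoints <service-name> \
#   -o jsonpath='{.subsets[*].addresses[*].ip}' | count_endpoints
```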

Fix common Kubernetes ingress issues:

```yaml
# Correct ingress configuration
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-service  # Must match service name
                port:
                  number: 80          # Must match service port
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls-secret
```

Service and endpoint troubleshooting:

```bash
# Service without selector (external endpoint)
# No selector = no endpoints created automatically,
# so create the Endpoints object manually:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: external-service
spec:
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-service
subsets:
  - addresses:
      - ip: 10.0.1.100  # External IP
    ports:
      - port: 8080
EOF

# Check endpoint slice (Kubernetes 1.21+)
kubectl get endpointslices
kubectl describe endpointslice <name>

# Force endpoint refresh
kubectl delete pod <pod-name>  # Pod restarts and re-registers
```

### 7. Implement graceful degradation

Circuit breaker pattern:

```nginx
# NGINX passive health checks as a circuit breaker
upstream backend {
    zone backend_zone 64k;

    # max_fails / fail_timeout act as the circuit breaker settings
    server 10.0.1.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.2:8080 max_fails=3 fail_timeout=30s;
}

server {
    location / {
        proxy_pass http://backend;

        # Fallback on failure
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;

        # Serve the fallback when backends fail
        proxy_intercept_errors on;
        error_page 502 503 504 = @fallback;
    }

    # Fallback endpoint
    location @fallback {
        add_header Content-Type application/json;
        return 200 '{"status": "degraded", "message": "Service temporarily unavailable"}';
    }
}
```

Static error page during outage:

```nginx
# NGINX custom error page
server {
    location / {
        proxy_pass http://backend;

        # Custom error pages
        proxy_intercept_errors on;
        error_page 502 503 504 /50x.html;
    }

    location = /50x.html {
        root /usr/share/nginx/html;
        internal;
    }
}
```
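
Once the error page is in place, verify from outside that an outage serves your page rather than the default one. The status is still 503, but the body should contain a marker string from your 50x.html; the marker below ("maintenance") is an assumption about your page content. A sketch:

```bash
# Distinguish the custom error page from the default 503 page by a marker
# string. "maintenance" is an assumed marker -- adjust to match your 50x.html.
is_custom_503() {
  local code=$1 body=$2
  [ "$code" = "503" ] || return 1
  case $body in
    *maintenance*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (host is a placeholder):
# code=$(curl -s -o /tmp/body -w '%{http_code}' http://lb.example.com/)
# is_custom_503 "$code" "$(cat /tmp/body)" && echo "custom page served"
```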

Prevention

  • Implement meaningful health check endpoints that verify dependencies
  • Configure health check timeouts longer than typical response times
  • Use gradual rollouts with slow_start or canary deployments
  • Set appropriate deregistration_delay to drain connections
  • Monitor backend error rates and latency with alerting
  • Implement circuit breakers to fail fast during outages
  • Use multiple availability zones for backend servers
  • Document runbooks for common 503 scenarios
  • Test failover scenarios regularly with chaos engineering
  • Keep SSL certificates updated with automated renewal

Related Errors

  • **502 Bad Gateway**: Backend returned invalid response
  • **504 Gateway Timeout**: Backend didn't respond within timeout
  • **Connection refused**: Backend not listening on expected port
  • **SSL certificate verify failed**: Backend certificate not trusted
  • **No live upstreams**: All backends marked unhealthy