Introduction
When Nginx load balancing is uneven, some backend servers receive significantly more traffic than others and the distribution doesn't match what the configuration should produce. This can result from configuration errors, health check issues, session persistence, or a poorly chosen balancing algorithm. The result is that some servers are overloaded while others sit underutilized.
Symptoms
Observable indicators:
- One backend server shows significantly higher CPU/memory usage
- Request counts per server vary widely in access logs
- Some servers show high error rates while others are fine
- Nginx status shows all servers UP but distribution is uneven
- Sticky sessions causing traffic imbalance
Error patterns in logs:
```
backend1: 10000 requests
backend2: 100 requests    # Clearly imbalanced
```
Common Causes
1. Weight not configured - All servers treated equally despite different capacities
2. IP hash distribution - Uneven client IP distribution skews `ip_hash` (see the quick check after this list)
3. Sticky sessions - Session persistence pinning users to specific servers
4. Keepalive connections - Connection reuse favoring certain servers
5. Health checks marking servers down - Traffic concentrated on the remaining servers
6. `least_conn` without considering capacity - Connection counts alone don't reflect per-server capacity
7. Upstream zone not shared - Each worker maintains its own balancing state
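A quick way to test cause 2 is to check whether a handful of client networks dominate your traffic; if they do, `ip_hash` will necessarily be uneven. A minimal sketch, assuming the client IP is the first field of the standard combined log format:

```bash
# Group requests by client /24 network; a heavily skewed top-10 list
# means ip_hash cannot distribute evenly no matter how it is tuned
awk '{split($1, ip, "."); print ip[1]"."ip[2]"."ip[3]".0/24"}' \
  /var/log/nginx/access.log | sort | uniq -c | sort -rn | head
```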
Step-by-Step Fix
Step 1: Check Current Load Distribution
```bash
# Check Nginx upstream status
curl http://localhost/nginx_status 2>/dev/null

# Analyze access logs for distribution; adjust the field number to
# match where $upstream_addr sits in your log_format (see Step 7)
awk '{print $NF}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Or, if the log line embeds it as upstream: "...", extract by pattern
grep -oP 'upstream: "\K[^"]+' /var/log/nginx/access.log | sort | uniq -c

# Check backend server metrics
for server in 10.0.0.1 10.0.0.2 10.0.0.3; do
  echo "=== $server ==="
  ssh "$server" 'uptime; netstat -an | grep :8080 | wc -l'
done
```
Step 2: Review Upstream Configuration
```bash
# Show current upstream configuration
nginx -T 2>/dev/null | grep -A20 "upstream"

# Check which load balancing method is in use
nginx -T 2>/dev/null | grep -E "upstream|ip_hash|least_conn|hash"
```
Step 3: Fix Load Balancing Algorithm
```nginx
# Option 1: Round Robin (default) - equal distribution
upstream backend_servers {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

# Option 2: Weighted Round Robin - capacity-based distribution
upstream backend_servers {
    server 10.0.0.1:8080 weight=3;  # 3x traffic
    server 10.0.0.2:8080 weight=2;  # 2x traffic
    server 10.0.0.3:8080 weight=1;  # 1x traffic
}

# Option 3: Least Connections - for varying request durations
upstream backend_servers {
    least_conn;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

# Option 4: IP Hash - session persistence by client IP
upstream backend_servers {
    ip_hash;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

# Option 5: Consistent Hash - for cache efficiency
upstream backend_servers {
    hash $request_uri consistent;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}
```
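Whichever option you choose (they all define the same `backend_servers` group, so keep exactly one), validate and apply it without dropping connections:

```bash
# Test the configuration, then signal a graceful reload
# (on systemd hosts, "systemctl reload nginx" does the same)
nginx -t && nginx -s reload
```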
Step 4: Configure Shared Zone
```nginx
upstream backend_servers {
    # Shared memory zone so all workers share balancing state
    zone backend 64k;

    least_conn;
    server 10.0.0.1:8080 weight=3;
    server 10.0.0.2:8080 weight=2;
    server 10.0.0.3:8080 weight=1;

    # Keepalive connections to the backends
    keepalive 32;
}
```
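To confirm the zone actually made it into the running configuration (and to see how many workers would otherwise each keep private balancing state), something like:

```bash
# Both directives should appear in the dumped config; without the
# zone, every worker_processes instance balances independently
nginx -T 2>/dev/null | grep -E 'zone backend|worker_processes'
```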
Step 5: Fix Session Persistence Issues
```nginx
# Sticky cookie for session persistence
# (the sticky directive requires NGINX Plus or the third-party
# nginx-sticky-module, and belongs inside the upstream block)
upstream backend_servers {
    zone backend 64k;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
    sticky cookie srv_id expires=1h domain=.example.com path=/;
}

server {
    location / {
        proxy_pass http://backend_servers;
    }
}

# Or use the built-in hash directive for session persistence
upstream backend_servers {
    hash $cookie_sessionid consistent;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}
```
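To verify stickiness end to end, replay the same session cookie and confirm the same backend answers every time. This sketch assumes the `X-Upstream-Addr` debug header from the Advanced Diagnosis section below; `sessionid=test123` is a made-up value:

```bash
# Five requests with an identical cookie should all report the same
# upstream address if persistence is working
for i in {1..5}; do
  curl -s -o /dev/null -D - -b 'sessionid=test123' http://localhost/ \
    | grep -i x-upstream-addr
done
```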
Step 6: Configure Health Checks
```nginx
upstream backend_servers {
    zone backend 64k;
    least_conn;

    # Passive health checks: mark a server down after 3 failures,
    # retry it after 30 seconds
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8080 max_fails=3 fail_timeout=30s;
}

# Active health checks (health_check is NGINX Plus only; OpenResty
# offers similar behavior via lua-resty-upstream-healthcheck)
server {
    location / {
        proxy_pass http://backend_servers;
        health_check uri=/health interval=5s;
    }
}
```
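It also helps to probe each backend's health endpoint directly, to see the same responses the checks see. A sketch, assuming the backends expose the `/health` path used above:

```bash
# Report the raw health status of every backend
for server in 10.0.0.1 10.0.0.2 10.0.0.3; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://$server:8080/health")
  echo "$server: HTTP $code"
done
```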
Step 7: Monitor Distribution
```nginx
# Add detailed logging
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" '
                'upstream=$upstream_addr '
                'upstream_status=$upstream_status '
                'request_time=$request_time '
                'upstream_response_time=$upstream_response_time';

access_log /var/log/nginx/access.log main;
```
```bash
# Snapshot the distribution of the most recent requests
# (sort/uniq need complete input, so don't pipe them from tail -f)
tail -n 1000 /var/log/nginx/access.log | grep -oP 'upstream=\K[^ ]+' | sort | uniq -c

# Check per-server distribution across the whole log, grouped by IP
grep -oP 'upstream=\K[^ ]+' /var/log/nginx/access.log | cut -d: -f1 | sort | uniq -c
```
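To turn those raw counts into percentages, which are easier to compare against configured weights, a small awk summary over the `upstream=` field added above:

```bash
# Print each backend's absolute count and share of total requests
grep -oP 'upstream=\K[^ ]+' /var/log/nginx/access.log | sort | uniq -c \
  | awk '{count[$2] = $1; total += $1}
         END {for (s in count)
                printf "%-22s %8d %6.1f%%\n", s, count[s], 100*count[s]/total}'
```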
Advanced Diagnosis
Debug Upstream Selection
```nginx
# Add upstream address to response headers for debugging
server {
    location / {
        proxy_pass http://backend_servers;
        add_header X-Upstream-Addr $upstream_addr always;
        add_header X-Upstream-Status $upstream_status always;
    }
}
```

Check Worker State
```bash
# If using a shared zone, check upstream stats
# (this endpoint requires the NGINX Plus status API; the open-source
# stub_status module only reports connection totals)
curl http://localhost:8080/status/upstreams

# Check process state and the master's open file descriptor count
ps aux | grep nginx
ls /proc/$(cat /var/run/nginx.pid)/fd | wc -l
```
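To see whether work is spread across workers at all, compare open file descriptors per worker process, a rough proxy for active connections:

```bash
# Count open fds for each nginx worker (needs permission to read /proc)
for pid in $(pgrep -f 'nginx: worker'); do
  echo "worker $pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) open fds"
done
```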
Test Distribution Mathematically
```bash
# Send 1000 requests and tally which backend served each one, using
# the X-Upstream-Addr debug header added above
for i in {1..1000}; do
  curl -s -o /dev/null -D - http://localhost/test
done | grep -i '^x-upstream-addr' | sort | uniq -c

# Compare actual vs expected distribution for your method:
# e.g. weights 3:2:1 should yield roughly 50% / 33% / 17%
```
Common Pitfalls
- Multiple upstream blocks - Different locations may route to different upstream groups (see the audit after this list)
- Missing zone directive - Each worker maintains separate balancing state
- IP hash behind a proxy - All requests arrive from the proxy's IP and hash to the same upstream
- Weight mismatch with capacity - Weights don't reflect actual server capacity
- Sticky sessions without failover - Users stay pinned to a failed server
- Keepalive connection bias - Connection reuse favors certain servers
- Fast-failing backends with least_conn - A backend that errors immediately holds few open connections, looks "less busy", and attracts even more traffic
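To audit for the first pitfall, dump the full configuration and line up every upstream block against every `proxy_pass` target:

```bash
# List all upstream groups and proxy_pass destinations; any location
# pointing at a different group explains a skewed distribution
nginx -T 2>/dev/null | grep -nE 'upstream[[:space:]]+[a-zA-Z0-9_]+|proxy_pass'
```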
Best Practices
```nginx
http {
    # Upstream with sane defaults: shared state, capacity weights,
    # passive health checks, a backup server, and keepalive
    upstream backend_servers {
        zone backend 64k;
        least_conn;

        server 10.0.0.1:8080 weight=3 max_fails=3 fail_timeout=30s;
        server 10.0.0.2:8080 weight=2 max_fails=3 fail_timeout=30s;
        server 10.0.0.3:8080 weight=1 max_fails=3 fail_timeout=30s;
        server 10.0.0.4:8080 backup;

        keepalive 32;
        keepalive_timeout 60s;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://backend_servers;
            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # Retry the next upstream on errors and gateway failures
            proxy_next_upstream error timeout http_502 http_503 http_504;
            proxy_next_upstream_tries 2;
        }

        # Debug endpoint (stub_status reports connection totals only;
        # per-upstream stats require NGINX Plus or a third-party module)
        location /nginx_status {
            stub_status;
            allow 127.0.0.1;
            deny all;
        }
    }
}
```
Related Issues
- Nginx Load Balancer Timeout
- HAProxy Backend Down
- HAProxy Health Check Failing
- AWS ALB Target Unhealthy