Introduction
When Nginx load balancing is uneven, some backend servers receive significantly more traffic than others and the distribution doesn't match what the configuration should produce. This can result from configuration errors, health check issues, session persistence, or a poorly chosen balancing algorithm. The result is that some servers are overloaded while others sit underutilized.
Symptoms
Observable indicators:
- One backend server shows significantly higher CPU/memory usage
- Request counts per server vary widely in access logs
- Some servers show high error rates while others are fine
- Nginx status shows all servers UP but distribution is uneven
- Sticky sessions causing traffic imbalance
Error patterns in logs:
```
backend1: 10000 requests
backend2: 100 requests    # Clearly imbalanced
```
Common Causes
1. Weight not configured - All servers treated equally despite different capacities
2. IP hash distribution - Uneven client IP distribution skews `ip_hash` (see the quick check after this list)
3. Sticky sessions - Session persistence pinning users to specific servers
4. Keepalive connections - Connection reuse favoring certain servers
5. Health checks marking servers down - Traffic concentrated on the remaining servers
6. `least_conn` without considering capacity - Connection counts alone don't reflect per-server capacity
7. Upstream zone not shared - Each worker maintains its own balancing state
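A quick way to test cause 2 is to check whether a handful of client networks dominate your traffic; if they do, `ip_hash` will necessarily be uneven. A minimal sketch, assuming the client IP is the first field of the standard combined log format:

```bash
# Group requests by client /24 network; a heavily skewed top-10 list
# means ip_hash cannot distribute evenly no matter how it is tuned
awk '{split($1, ip, "."); print ip[1]"."ip[2]"."ip[3]".0/24"}' \
  /var/log/nginx/access.log | sort | uniq -c | sort -rn | head
```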
Step-by-Step Fix
Step 1: Check Current Load Distribution
```bash
# Check Nginx upstream status
curl http://localhost/nginx_status 2>/dev/null

# Analyze access logs for distribution; adjust the field number to
# match where $upstream_addr sits in your log_format (see Step 7)
awk '{print $NF}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Or, if the log line embeds it as upstream: "...", extract by pattern
grep -oP 'upstream: "\K[^"]+' /var/log/nginx/access.log | sort | uniq -c

# Check backend server metrics
for server in 10.0.0.1 10.0.0.2 10.0.0.3; do
  echo "=== $server ==="
  ssh "$server" 'uptime; netstat -an | grep :8080 | wc -l'
done
```
Step 2: Review Upstream Configuration
```bash
# Show current upstream configuration
nginx -T 2>/dev/null | grep -A20 "upstream"

# Check which load balancing method is in use
nginx -T 2>/dev/null | grep -E "upstream|ip_hash|least_conn|hash"
```
Step 3: Fix Load Balancing Algorithm
```nginx
# Option 1: Round Robin (default) - equal distribution
upstream backend_servers {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

# Option 2: Weighted Round Robin - capacity-based distribution
upstream backend_servers {
    server 10.0.0.1:8080 weight=3;  # 3x traffic
    server 10.0.0.2:8080 weight=2;  # 2x traffic
    server 10.0.0.3:8080 weight=1;  # 1x traffic
}

# Option 3: Least Connections - for varying request durations
upstream backend_servers {
    least_conn;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

# Option 4: IP Hash - session persistence by client IP
upstream backend_servers {
    ip_hash;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

# Option 5: Consistent Hash - for cache efficiency
upstream backend_servers {
    hash $request_uri consistent;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}
```
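Whichever option you choose (they all define the same `backend_servers` group, so keep exactly one), validate and apply it without dropping connections:

```bash
# Test the configuration, then signal a graceful reload
# (on systemd hosts, "systemctl reload nginx" does the same)
nginx -t && nginx -s reload
```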
Step 4: Configure Shared Zone
```nginx
upstream backend_servers {
    # Shared memory zone so all workers share balancing state
    zone backend 64k;

    least_conn;
    server 10.0.0.1:8080 weight=3;
    server 10.0.0.2:8080 weight=2;
    server 10.0.0.3:8080 weight=1;

    # Keepalive connections to the backends
    keepalive 32;
}
```
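To confirm the zone actually made it into the running configuration (and to see how many workers would otherwise each keep private balancing state), something like:

```bash
# Both directives should appear in the dumped config; without the
# zone, every worker_processes instance balances independently
nginx -T 2>/dev/null | grep -E 'zone backend|worker_processes'
```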
Step 5: Fix Session Persistence Issues
```nginx
# Sticky cookie for session persistence
# (the sticky directive requires NGINX Plus or the third-party
# nginx-sticky-module, and belongs inside the upstream block)
upstream backend_servers {
    zone backend 64k;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
    sticky cookie srv_id expires=1h domain=.example.com path=/;
}

server {
    location / {
        proxy_pass http://backend_servers;
    }
}

# Or use the built-in hash directive for session persistence
upstream backend_servers {
    hash $cookie_sessionid consistent;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}
```
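To verify stickiness end to end, replay the same session cookie and confirm the same backend answers every time. This sketch assumes the `X-Upstream-Addr` debug header from the Advanced Diagnosis section below; `sessionid=test123` is a made-up value:

```bash
# Five requests with an identical cookie should all report the same
# upstream address if persistence is working
for i in {1..5}; do
  curl -s -o /dev/null -D - -b 'sessionid=test123' http://localhost/ \
    | grep -i x-upstream-addr
done
```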
Step 6: Configure Health Checks
```nginx
upstream backend_servers {
    zone backend 64k;
    least_conn;

    # Passive health checks: mark a server down after 3 failures,
    # retry it after 30 seconds
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8080 max_fails=3 fail_timeout=30s;
}

# Active health checks (health_check is NGINX Plus only; OpenResty
# offers similar behavior via lua-resty-upstream-healthcheck)
server {
    location / {
        proxy_pass http://backend_servers;
        health_check uri=/health interval=5s;
    }
}
```
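It also helps to probe each backend's health endpoint directly, to see the same responses the checks see. A sketch, assuming the backends expose the `/health` path used above:

```bash
# Report the raw health status of every backend
for server in 10.0.0.1 10.0.0.2 10.0.0.3; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://$server:8080/health")
  echo "$server: HTTP $code"
done
```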
Step 7: Monitor Distribution
```nginx
# Add detailed logging
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" '
                'upstream=$upstream_addr '
                'upstream_status=$upstream_status '
                'request_time=$request_time '
                'upstream_response_time=$upstream_response_time';

access_log /var/log/nginx/access.log main;
```
```bash
# Snapshot the distribution of the most recent requests
# (sort/uniq need complete input, so don't pipe them from tail -f)
tail -n 1000 /var/log/nginx/access.log | grep -oP 'upstream=\K[^ ]+' | sort | uniq -c

# Check per-server distribution across the whole log, grouped by IP
grep -oP 'upstream=\K[^ ]+' /var/log/nginx/access.log | cut -d: -f1 | sort | uniq -c
```
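To turn those raw counts into percentages, which are easier to compare against configured weights, a small awk summary over the `upstream=` field added above:

```bash
# Print each backend's absolute count and share of total requests
grep -oP 'upstream=\K[^ ]+' /var/log/nginx/access.log | sort | uniq -c \
  | awk '{count[$2] = $1; total += $1}
         END {for (s in count)
                printf "%-22s %8d %6.1f%%\n", s, count[s], 100*count[s]/total}'
```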
Advanced Diagnosis
Debug Upstream Selection
```nginx
# Add upstream address to response headers for debugging
server {
    location / {
        proxy_pass http://backend_servers;
        add_header X-Upstream-Addr $upstream_addr always;
        add_header X-Upstream-Status $upstream_status always;
    }
}
```

Check Worker State
```bash
# If using a shared zone, check upstream stats
# (this endpoint requires the NGINX Plus status API; the open-source
# stub_status module only reports connection totals)
curl http://localhost:8080/status/upstreams

# Check process state and the master's open file descriptor count
ps aux | grep nginx
ls /proc/$(cat /var/run/nginx.pid)/fd | wc -l
```
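To see whether work is spread across workers at all, compare open file descriptors per worker process, a rough proxy for active connections:

```bash
# Count open fds for each nginx worker (needs permission to read /proc)
for pid in $(pgrep -f 'nginx: worker'); do
  echo "worker $pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) open fds"
done
```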
Test Distribution Mathematically
```bash
# Send 1000 requests and tally which backend served each one, using
# the X-Upstream-Addr debug header added above
for i in {1..1000}; do
  curl -s -o /dev/null -D - http://localhost/test
done | grep -i '^x-upstream-addr' | sort | uniq -c

# Compare actual vs expected distribution for your method:
# e.g. weights 3:2:1 should yield roughly 50% / 33% / 17%
```
Common Pitfalls
- Multiple upstream blocks - Different locations may route to different upstream groups (see the audit after this list)
- Missing zone directive - Each worker maintains separate balancing state
- IP hash behind a proxy - All requests arrive from the proxy's IP and hash to the same upstream
- Weight mismatch with capacity - Weights don't reflect actual server capacity
- Sticky sessions without failover - Users stay pinned to a failed server
- Keepalive connection bias - Connection reuse favors certain servers
- Fast-failing backends with least_conn - A backend that errors immediately holds few open connections, looks "less busy", and attracts even more traffic
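To audit for the first pitfall, dump the full configuration and line up every upstream block against every `proxy_pass` target:

```bash
# List all upstream groups and proxy_pass destinations; any location
# pointing at a different group explains a skewed distribution
nginx -T 2>/dev/null | grep -nE 'upstream[[:space:]]+[a-zA-Z0-9_]+|proxy_pass'
```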
Best Practices
```nginx
http {
    # Upstream with sane defaults: shared state, capacity weights,
    # passive health checks, a backup server, and keepalive
    upstream backend_servers {
        zone backend 64k;
        least_conn;

        server 10.0.0.1:8080 weight=3 max_fails=3 fail_timeout=30s;
        server 10.0.0.2:8080 weight=2 max_fails=3 fail_timeout=30s;
        server 10.0.0.3:8080 weight=1 max_fails=3 fail_timeout=30s;
        server 10.0.0.4:8080 backup;

        keepalive 32;
        keepalive_timeout 60s;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://backend_servers;
            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # Retry the next upstream on errors and gateway failures
            proxy_next_upstream error timeout http_502 http_503 http_504;
            proxy_next_upstream_tries 2;
        }

        # Debug endpoint (stub_status reports connection totals only;
        # per-upstream stats require NGINX Plus or a third-party module)
        location /nginx_status {
            stub_status;
            allow 127.0.0.1;
            deny all;
        }
    }
}
```
Related Issues
- Nginx Load Balancer Timeout
- HAProxy Backend Down
- HAProxy Health Check Failing
- AWS ALB Target Unhealthy