## Introduction
Load balancer sticky session failures occur when session affinity mechanisms fail to route related requests to the same backend server. The result is lost session data, authentication failures, emptied shopping carts, and inconsistent user experiences. Sticky sessions (also called session affinity or persistence) ensure that requests from the same client are consistently routed to the same backend server for the duration of a session.

Common causes include: a load balancer cookie that is never set or has expired; a cookie domain/path mismatch that prevents the browser from sending the cookie; multiple load balancers without shared session state; a backend server removed from the pool while a session is active; a session timeout shorter than the user activity window; IP-based affinity broken by NAT/proxy IP changes; WebSocket connections that bypass sticky session configuration; health check failures removing a server with active sessions; connection draining not configured for rolling deployments; and client-side cookie blocking (privacy settings, ad blockers).

The fix requires diagnosing whether the issue lies in load balancer configuration, cookie handling, backend health alignment, or session replication gaps. This guide provides production-proven troubleshooting for sticky session failures across cloud load balancers, NGINX, HAProxy, and Kubernetes ingress controllers.
## Symptoms
- User logged out unexpectedly during active session
- Shopping cart contents disappear between requests
- Form submission fails with "invalid session" error
- User sees different data on consecutive page loads
- WebSocket disconnects and reconnects to different backend
- API requests fail with session validation errors
- Authentication state lost after load balancer deploy
- Some users work fine while others experience session loss
- Session loss correlates with auto-scaling events
- Health check failures cause mass session invalidation
## Common Causes
- Sticky session cookie not set by load balancer
- Cookie domain doesn't match application domain
- Cookie path restriction prevents cookie from being sent
- Cookie TTL/expiration shorter than session duration
- Load balancer cookie blocked by browser privacy settings
- IP hash affinity broken by client IP changes (mobile, NAT, proxy)
- Backend server removed from pool (health check failure, scaling down)
- Rolling deployment terminates server with active sessions
- Connection draining not configured or timeout too short
- Multiple load balancers without shared session state
- WebSocket upgrade request routed to different backend
- TLS termination at the load balancer interfering with Secure/SameSite cookie attributes
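Before digging into a specific load balancer, a quick triage step is to check whether the response headers carry any known sticky cookie at all. A minimal sketch — the cookie names (AWSALB, ROUTEID, SERVERID, srv_id) are common defaults used later in this guide, not an exhaustive list, and the sample headers here stand in for real `curl -sI` output:

```shell
#!/bin/sh
# Scan raw response headers (e.g. from `curl -sI https://app.example.com/`)
# for a known sticky-session cookie name. Names are illustrative defaults.
find_sticky_cookie() {
  grep -i '^set-cookie:' | \
    grep -oEi 'AWSALBCORS|AWSALB|ROUTEID|SERVERID|srv_id|route' | head -n1
}

# Demo with a captured header block:
printf 'HTTP/1.1 200 OK\nSet-Cookie: AWSALB=abc123; Path=/\n' | find_sticky_cookie
# prints: AWSALB
```

If this prints nothing, stickiness is likely not enabled (or the cookie uses a custom name), which narrows the search to load balancer configuration rather than cookie handling.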
## Step-by-Step Fix
### 1. Diagnose sticky session configuration
Check current load balancer settings:
```bash
# AWS ALB - Check target group attributes
aws elbv2 describe-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/abc123 \
  --query "Attributes[?Key=='stickiness.enabled' || Key=='stickiness.type' || Key=='stickiness.lb_cookie.duration_seconds']"

# Expected output:
# [
#   {"Key": "stickiness.enabled", "Value": "true"},
#   {"Key": "stickiness.type", "Value": "lb_cookie"},
#   {"Key": "stickiness.lb_cookie.duration_seconds", "Value": "3600"}
# ]

# AWS ALB - Check if stickiness cookie is being set
# Use browser DevTools or curl to inspect cookies
curl -vI https://app.example.com/ 2>&1 | grep -i "set-cookie"

# Look for AWSALB or AWSALBCORS cookie:
# Set-Cookie: AWSALB=abc123; Expires=Sun, 01 Apr 2026 15:30:00 GMT; Path=/

# AWS NLB - Check the target type
# NLB doesn't support cookie-based stickiness, only source-IP affinity
aws elbv2 describe-target-groups \
  --target-group-arn <arn> \
  --query 'TargetGroups[0].TargetType'

# For NLB, sticky sessions work at the TCP level using a source IP hash
```
Check NGINX sticky session configuration:
```bash
# NGINX configuration
# /etc/nginx/nginx.conf or /etc/nginx/conf.d/upstream.conf
grep -A 20 "upstream" /etc/nginx/nginx.conf
```

```nginx
# Cookie-based sticky sessions (NGINX Plus)
upstream backend {
    server backend1.example.com;
    server backend2.example.com;
    server backend3.example.com;

    # NGINX Plus sticky sessions
    sticky cookie srv_id expires=1h domain=.example.com path=/;

    # Or hash on an application session cookie:
    # hash $cookie_sessionid consistent;
}

# Open source NGINX - IP hash (less reliable)
upstream backend {
    ip_hash;  # Replaces round-robin with client-IP affinity
    server backend1.example.com;
    server backend2.example.com;
    server backend3.example.com;
}
```

```bash
# Test configuration
nginx -t

# Reload after changes
systemctl reload nginx
```
Check HAProxy sticky session configuration:
```bash
# HAProxy configuration
# /etc/haproxy/haproxy.cfg
grep -A 30 "backend" /etc/haproxy/haproxy.cfg
```

```haproxy
# Cookie-based persistence
backend app_servers
    balance roundrobin

    # Insert cookie with server ID; domain and lifetime go on the same line
    cookie SERVERID insert indirect nocache domain .example.com maxlife 1h

    # Alternative: stick-table persistence on an application cookie
    # stick-table type string len 64 size 100k expire 1h
    # stick on req.cook(sessionid)

    # Server definitions with cookie values
    server app1 backend1.example.com:8080 cookie app1 check
    server app2 backend2.example.com:8080 cookie app2 check
    server app3 backend3.example.com:8080 cookie app3 check

    # Optional: close sessions when a server is marked down
    on-marked-down shutdown-sessions

# IP-based persistence (alternative)
backend app_servers_ip
    balance source  # Hash based on source IP
    stick-table type ip size 100k expire 1h
    stick on src
    server app1 backend1.example.com:8080 check
    server app2 backend2.example.com:8080 check
```
Check Kubernetes ingress sticky sessions:
```bash
# Check ingress annotations for session affinity
kubectl get ingress <ingress-name> -n <namespace> -o yaml

# Look for annotations:
# nginx.ingress.kubernetes.io/affinity: "cookie"
# nginx.ingress.kubernetes.io/session-cookie-name: "route"
# nginx.ingress.kubernetes.io/session-cookie-expires: "3600"
# nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"

# For NGINX Ingress Controller
# Check configmap configuration
kubectl get configmap ingress-nginx-controller -n ingress-nginx -o yaml

# Service-level session affinity
kubectl get svc <service-name> -n <namespace> -o yaml

# Look for:
# spec:
#   sessionAffinity: ClientIP  # or None
#   sessionAffinityConfig:
#     clientIP:
#       timeoutSeconds: 10800  # 3 hours
```
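When auditing many Services at once, a small filter helps confirm which manifests actually enable ClientIP affinity. A sketch that greps the YAML — in real use you would pipe in `kubectl get svc <name> -o yaml`; the heredoc manifest below is only a stand-in:

```shell
#!/bin/sh
# Report whether a Service manifest (on stdin) enables ClientIP affinity.
has_clientip_affinity() {
  grep -qE '^[[:space:]]*sessionAffinity:[[:space:]]*ClientIP'
}

# Demo with an inline manifest; replace with `kubectl get svc my-service -o yaml`
cat <<'EOF' | has_clientip_affinity && echo "affinity: ClientIP" || echo "affinity: none"
apiVersion: v1
kind: Service
spec:
  sessionAffinity: ClientIP
EOF
# prints: affinity: ClientIP
```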
### 2. Fix cookie-based sticky sessions
AWS ALB cookie configuration:
```bash
# Enable application-based cookies (recommended over ALB cookies)
# Application cookies give you full control over name, domain, path

# Option 1: Enable ALB-generated cookie (quick but limited)
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/abc123 \
  --attributes Key=stickiness.enabled,Value=true \
               Key=stickiness.type,Value=lb_cookie \
               Key=stickiness.lb_cookie.duration_seconds,Value=3600

# Option 2: Use application-based cookie (more control)
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/abc123 \
  --attributes Key=stickiness.enabled,Value=true \
               Key=stickiness.type,Value=app_cookie \
               Key=stickiness.app_cookie.cookie_name,Value=MYSESSIONID \
               Key=stickiness.app_cookie.duration_seconds,Value=7200

# Cookie duration recommendations:
# - Short sessions (APIs): 300-900 seconds
# - Web sessions: 3600-7200 seconds
# - Remember me: 604800 seconds (7 days)

# Verify cookie settings
aws elbv2 describe-target-group-attributes \
  --target-group-arn <arn> \
  --query "Attributes[?starts_with(Key, 'stickiness')]"
```
NGINX cookie configuration:
```nginx
# /etc/nginx/conf.d/sticky-session.conf

upstream backend {
    # NGINX Plus sticky session with cookie
    sticky cookie route_id expires=1h domain=.example.com path=/ secure httponly samesite=lax;

    # Backend servers
    server 10.0.1.1:8080 weight=5;
    server 10.0.1.2:8080 weight=5;
    server 10.0.1.3:8080 weight=5 backup;

    # Shared-memory zone (required for NGINX Plus health checks)
    zone backend_zone 64k;
}

server {
    listen 443 ssl;
    server_name app.example.com;

    location / {
        proxy_pass http://backend;

        # Active health checks (NGINX Plus; configured per location)
        health_check interval=5s fails=3 passes=2;

        # Preserve host header
        proxy_set_header Host $host;

        # Forward client IP
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # WebSocket support (if needed)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

# For open source NGINX (without the sticky module)
# use IP hash as a fallback:
upstream backend {
    ip_hash;
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;
}
```
HAProxy cookie configuration:
```haproxy
# /etc/haproxy/haproxy.cfg

global
    # Enable stats socket for runtime management
    stats socket /var/run/haproxy.sock mode 660 level admin
    stats timeout 30s

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend http_front
    bind *:80
    default_backend app_servers

backend app_servers
    balance roundrobin

    # Cookie-based persistence
    cookie ROUTEID insert indirect nocache domain .example.com secure httponly

    # Optional lifetime limits for the persistence cookie
    # cookie ROUTEID insert indirect nocache maxidle 30m maxlife 1h

    # Health checks
    option httpchk GET /health
    http-check expect status 200

    # Servers with cookie values
    server app1 10.0.1.1:8080 cookie app1 check inter 5s fall 3 rise 2
    server app2 10.0.1.2:8080 cookie app2 check inter 5s fall 3 rise 2
    server app3 10.0.1.3:8080 cookie app3 check inter 5s fall 3 rise 2

    # Graceful server removal:
    # when a server is marked down, complete existing sessions
    on-marked-down shutdown-sessions

    # Alternative: use a stick table for more control
    # stick-table type string len 64 size 100k expire 1h
    # http-request set-var(txn.sess_cookie) req.cook(ROUTEID)
    # stick on var(txn.sess_cookie)
```
### 3. Fix IP-based sticky sessions
AWS NLB source IP affinity:
```bash
# Network Load Balancer supports only source-IP stickiness (no cookies)
# Enable it on the target group:
aws elbv2 modify-target-group-attributes \
  --target-group-arn <arn> \
  --attributes Key=stickiness.enabled,Value=true Key=stickiness.type,Value=source_ip

# Check NLB configuration
aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?Type=='network'].[LoadBalancerName,LoadBalancerArn]"

# Verify target group
aws elbv2 describe-target-groups \
  --query "TargetGroups[?TargetType=='ip' || TargetType=='instance']"

# Limitations of IP-based affinity:
# - Mobile users: IP changes when switching networks
# - NAT: Multiple users behind same NAT share same IP
# - Proxies: Corporate proxies mask real client IP
# - IPv6: Dual-stack clients may have varying IPs

# For better reliability, use ALB with cookie-based sessions
```
NGINX IP hash configuration:
```nginx
# Open source NGINX - IP hash sticky sessions

upstream backend {
    ip_hash;  # Consistent routing based on client IP

    # Backend servers
    server 10.0.1.1:8080 weight=3;
    server 10.0.1.2:8080 weight=3;
    server 10.0.1.3:8080 weight=3;
    server 10.0.1.4:8080 backup;  # Only used when others fail
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Preserve client IP for backend logging
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

# Note: ip_hash has limitations
# - All clients behind same NAT go to same backend
# - Mobile clients may switch backends on network change
# - IPv4 addresses are hashed on the first three octets only
```
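The NAT limitation is easy to see with a toy model. NGINX's `ip_hash` keys on the first three octets of an IPv4 address; the additive hash below is deliberately simplified and is NOT NGINX's actual function, but it shows why every client behind one /24 (for example, one corporate NAT) lands on the same backend:

```shell
#!/bin/sh
# Illustrative /24-prefix hashing (toy hash, not NGINX's real algorithm).
# Maps an IPv4 address to a backend index out of $2 backends.
bucket_for_ip() {
  ip=$1
  backends=$2
  prefix=$(echo "$ip" | cut -d. -f1-3)   # first three octets, like ip_hash
  sum=0
  for octet in $(echo "$prefix" | tr '.' ' '); do
    sum=$((sum + octet))                 # simple additive hash
  done
  echo $((sum % backends))
}

bucket_for_ip 203.0.113.10 3   # same /24 as the next line...
bucket_for_ip 203.0.113.99 3   # ...so same backend index
bucket_for_ip 10.0.2.5 3       # different /24, may differ
```

Two clients in the same /24 always hash to the same backend, so one overloaded NAT can skew the whole distribution — the main reason to prefer cookie-based affinity when possible.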
Kubernetes ClientIP affinity:
```yaml
# Service with ClientIP session affinity
apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: production
spec:
  # Session affinity based on client IP
  sessionAffinity: ClientIP

  # Configure timeout
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # 3 hours default, max 86400 (24h)

  selector:
    app: my-app

  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP

  type: ClusterIP  # or LoadBalancer
---
# For external traffic (LoadBalancer type)
apiVersion: v1
kind: Service
metadata:
  name: my-service-lb
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600

  type: LoadBalancer

  # Preserve client source IP
  externalTrafficPolicy: Local

  selector:
    app: my-app

  ports:
    - port: 80
      targetPort: 8080

# Note: externalTrafficPolicy: Local is required for
# source IP preservation with LoadBalancer type
```
### 4. Fix connection draining
AWS ALB connection draining:
```bash
# AWS ALB implements connection draining via the deregistration delay
# Configure it for graceful shutdown during deployments

# Modify target group for connection draining
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/abc123 \
  --attributes Key=deregistration_delay.timeout_seconds,Value=300

# Deregistration delay options:
# - 0-3600 seconds
# - Recommended: 300 seconds (5 minutes) for web apps
# - Recommended: 60 seconds for stateless APIs

# Verify settings
aws elbv2 describe-target-group-attributes \
  --target-group-arn <arn> \
  --query "Attributes[?Key=='deregistration_delay.timeout_seconds']"

# For Lambda targets
aws elbv2 modify-target-group-attributes \
  --target-group-arn <arn> \
  --attributes Key=lambda_multi_value_headers.enabled,Value=true

# Monitor draining status
aws elbv2 describe-target-health \
  --target-group-arn <arn>

# Target states:
# - initial: Waiting for health checks
# - healthy: Receiving traffic
# - draining: Connection draining in progress
# - unused: Target group not receiving traffic from a load balancer
# - unhealthy: Failing health checks
```
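During deployments it is common to script a wait loop that polls target health until nothing is left in the `draining` state before tearing an instance down. The sketch below stubs out the status command so it runs standalone; in real use, `stub_status` would be replaced by the `aws elbv2 describe-target-health` query shown in the comment:

```shell
#!/bin/sh
# Poll a status command until its output no longer contains "draining".
# Real status command would be something like:
#   aws elbv2 describe-target-health --target-group-arn <arn> \
#     --query 'TargetHealthDescriptions[].TargetHealth.State' --output text
wait_until_drained() {
  max_polls=$1; shift
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    if ! "$@" | grep -q draining; then
      echo "drained after $i polls"
      return 0
    fi
    i=$((i + 1))
    # sleep 10   # real polling interval, elided for the demo
  done
  echo "timed out waiting for drain"
  return 1
}

# Stub: reports "draining" on the first two calls, then "unused"
calls_file=$(mktemp)
echo 0 > "$calls_file"
stub_status() {
  n=$(cat "$calls_file")
  echo $((n + 1)) > "$calls_file"
  if [ "$n" -lt 2 ]; then echo "draining"; else echo "unused"; fi
}

wait_until_drained 10 stub_status
# prints: drained after 2 polls
rm -f "$calls_file"
```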
NGINX connection draining:
```nginx
# NGINX Plus - Graceful shutdown configuration

upstream backend {
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;

    # Zone for state sharing (required for draining)
    zone backend_zone 64k;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;

        # Next upstream on failure
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }
}

# Graceful reload (preserves existing connections)
# nginx -s reload

# Drain a specific server via the NGINX Plus API
# curl -X PATCH -d '{"drain":true}' \
#     http://localhost:8000/api/3/http/upstreams/backend/servers/1
```
HAProxy graceful shutdown:
```haproxy
# Graceful server removal configuration

backend app_servers
    balance roundrobin
    cookie ROUTEID insert indirect nocache

    # Health check configuration
    option httpchk GET /health
    http-check expect status 200
    default-server inter 5s fall 3 rise 2

    # Graceful shutdown: close sessions when a server is marked down
    on-marked-down shutdown-sessions

    server app1 10.0.1.1:8080 cookie app1 check
    server app2 10.0.1.2:8080 cookie app2 check
    server app3 10.0.1.3:8080 cookie app3 check

# Drain a server at runtime (stop new connections, finish existing)
# echo "set server app_servers/app1 state drain" | \
#     socat stdio /var/run/haproxy.sock

# Take a server out of rotation entirely
# echo "disable server app_servers/app1" | \
#     socat stdio /var/run/haproxy.sock

# Check server status
# echo "show servers state" | socat stdio /var/run/haproxy.sock

# Wait for connections to drain before stopping
# Watch current/total sessions for app1:
# watch 'echo "show stat" | socat stdio /var/run/haproxy.sock | \
#     grep app1 | cut -d, -f5,8'
```
Kubernetes graceful termination:
```yaml
# Deployment with graceful termination
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Don't reduce capacity during rollout

  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Termination grace period (must exceed the preStop sleep)
      terminationGracePeriodSeconds: 60

      containers:
        - name: app
          image: my-app:latest
          ports:
            - containerPort: 8080

          # Pre-stop hook for graceful shutdown
          lifecycle:
            preStop:
              exec:
                # Sleep to allow traffic drain before SIGTERM
                command: ["/bin/sh", "-c", "sleep 30"]
                # Or use nginx -s quit for NGINX
                # command: ["nginx", "-s", "quit"]

          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10

          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
```
### 5. Fix health check and session correlation
Configure health checks to preserve sessions:
```bash
# AWS ALB - Health check configuration
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/abc123 \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --health-check-protocol HTTP \
  --health-check-port 8080 \
  --matcher HttpStatusCode=200

# Health check best practices:
# - Use a separate liveness endpoint (/health) from readiness (/ready)
# - Liveness should check the service process is running
# - Readiness should check dependencies (DB, cache) are available
# - Interval: 30s for production, 10s for critical services
# - Timeout: 5s (fail fast if hung)
# - Unhealthy threshold: 3 (avoid flapping)
```
```yaml
# Kubernetes - Separate liveness and readiness probes
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: my-app:latest

      # Liveness probe - is the process running?
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
        timeoutSeconds: 5
        failureThreshold: 3
        successThreshold: 1

      # Readiness probe - can it receive traffic?
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 3
        successThreshold: 1

      # Startup probe - for slow-starting apps
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10  # Allows up to 300 seconds for startup
```
HAProxy health check with session awareness:
```haproxy
# Health checks tuned to avoid disrupting active sessions

backend app_servers
    balance roundrobin
    cookie ROUTEID insert indirect nocache

    # Health check configuration
    option httpchk GET /health
    http-check expect status 200
    # A 404 soft-stops the server: existing sticky clients keep their
    # server, new clients are routed elsewhere
    http-check disable-on-404

    # Default check timing for all servers
    default-server inter 5s fall 3 rise 2

    # Slow start after recovery (prevent thundering herd):
    # traffic to a recovered server ramps up over 60s
    server app1 10.0.1.1:8080 cookie app1 check slowstart 60s
    server app2 10.0.1.2:8080 cookie app2 check slowstart 60s
    server app3 10.0.1.3:8080 cookie app3 check slowstart 60s
```
### 6. Debug sticky session issues
Debug cookie-based sessions:
```bash
# Check if sticky session cookie is being set
curl -vI https://app.example.com/ 2>&1 | grep -i "set-cookie"

# Expected: Set-Cookie: AWSALB=abc123; Path=/; Domain=.example.com
# Expected: Set-Cookie: ROUTEID=app1; Path=/; Domain=.example.com

# Verify cookie is sent on subsequent requests
curl -v -b "ROUTEID=app1" https://app.example.com/ 2>&1 | \
    grep -E "Cookie:|X-Backend-Server"

# Test session persistence across multiple requests
for i in {1..10}; do
    curl -s -b "ROUTEID=app1" https://app.example.com/api/hostname
    echo ""
done

# All responses should return the same backend hostname

# Check cookie domain/path issues
# Domain must match or be a parent of the request domain
# Path must match or be a parent of the request path

# Cookie debugging in browser:
# 1. Open DevTools > Application > Cookies
# 2. Check cookie attributes:
#    - Domain: should be .example.com or app.example.com
#    - Path: should be / or a specific path
#    - Secure: should be true for HTTPS
#    - SameSite: Lax or None for cross-origin
# 3. Check cookie is sent in requests (Network tab)
```
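Part of that browser checklist can be automated. A sketch that lints a Set-Cookie header value for the attributes most often missing — these are pure string checks, and which attributes you actually require varies by deployment (e.g. HTTP-only internal services don't need Secure):

```shell
#!/bin/sh
# Lint a Set-Cookie header value for commonly missing attributes.
# Prints one warning per missing attribute, then a summary count.
lint_sticky_cookie() {
  h=$1
  warns=0
  echo "$h" | grep -qi 'secure'   || { echo "warn: missing Secure (expected for HTTPS)"; warns=$((warns + 1)); }
  echo "$h" | grep -qi 'httponly' || { echo "warn: missing HttpOnly"; warns=$((warns + 1)); }
  echo "$h" | grep -qi 'samesite' || { echo "warn: missing SameSite (browser defaults may block cross-site sends)"; warns=$((warns + 1)); }
  echo "$h" | grep -qi 'path='    || { echo "warn: missing Path"; warns=$((warns + 1)); }
  echo "$warns warnings"
}

# Demo: feed it the header value captured from `curl -vI ... | grep -i set-cookie`
lint_sticky_cookie "ROUTEID=app1; Path=/; Domain=.example.com"
```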
Check backend server distribution:
```bash
# Test load distribution with sticky sessions

# Without sticky session cookie (should round-robin)
for i in {1..20}; do
    curl -s https://app.example.com/api/hostname
    echo ""
done | sort | uniq -c

# With sticky session cookie (should all go to same backend)
for i in {1..20}; do
    curl -s -c cookies.txt -b cookies.txt https://app.example.com/api/hostname
    echo ""
done | sort | uniq -c

# Check cookie persistence over time
# Run over several minutes to verify session duration
while true; do
    timestamp=$(date +%H:%M:%S)
    backend=$(curl -s -c cookies.txt -b cookies.txt https://app.example.com/api/hostname)
    echo "$timestamp: $backend"
    sleep 10
done
```
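To quantify stickiness from a run like the loops above, count how often consecutive responses switched backends. With working affinity the count should be 0; any nonzero value pinpoints when the session moved. The `/api/hostname` endpoint is an assumption carried over from the examples above:

```shell
#!/bin/sh
# Count backend switches in a stream of hostnames (one per line),
# e.g. the captured output of the persistence-over-time loop.
count_backend_switches() {
  awk 'NR > 1 && $0 != prev { switches++ } { prev = $0 } END { print switches + 0 }'
}

# Demo with a canned sequence: app1 -> app2 -> app1 is two switches
printf 'app1\napp1\napp2\napp1\n' | count_backend_switches
# prints: 2
```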
Monitor session affinity metrics:
```bash
# AWS ALB - CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RequestCountPerTarget \
  --dimensions Name=LoadBalancer,Value=app/my-alb/abc123 \
               Name=TargetGroup,Value=targetgroup/my-tg/abc123 \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum

# NGINX - Access log analysis
# Requires a log_format that records $upstream_addr, e.g. as the last field
awk '{print $NF}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# HAProxy - Stats socket
echo "show stat" | socat stdio /var/run/haproxy.sock | \
    cut -d',' -f1,2,5,8 | column -t -s','

# Columns: pxname, svname, scur (current sessions), stot (total sessions)
# Look for uneven distribution across servers
```
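Uneven per-backend counts from the commands above can be flagged automatically. A sketch that reads `sort | uniq -c` style output (count, then backend name) and compares the busiest server against the quietest; the 2x threshold is an arbitrary starting point, not a standard:

```shell
#!/bin/sh
# Flag uneven request distribution. Input: lines of "<count> <backend>".
# $1: max/min ratio above which the distribution is reported as imbalanced.
check_balance() {
  threshold=$1
  awk -v t="$threshold" '
    { if (min == 0 || $1 < min) min = $1; if ($1 > max) max = $1 }
    END {
      if (min == 0) { print "no data"; exit 1 }
      ratio = max / min
      printf "max/min ratio: %.1f (%s)\n", ratio, (ratio > t ? "IMBALANCED" : "ok")
    }'
}

# Demo: roughly even counts across three backends
printf '  50 app1\n  48 app2\n  52 app3\n' | check_balance 2
# prints: max/min ratio: 1.1 (ok)
```

With sticky sessions some imbalance is expected (heavy users pin to one server), so alert on trends rather than single samples.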
## Prevention
- Use application-based cookies instead of load balancer cookies for more control
- Set cookie duration longer than typical session length
- Configure connection draining for rolling deployments
- Implement session replication or shared session store (Redis)
- Use health checks that don't disrupt active sessions
- Monitor sticky session distribution for imbalances
- Document session affinity requirements for each service
- Test sticky session behavior during deployment simulations
- Consider stateless architecture to eliminate sticky session dependency
- Implement session invalidation hooks for graceful degradation
## Related Errors
- **503 Service Unavailable**: No healthy backends available
- **502 Bad Gateway**: Backend connection failed
- **504 Gateway Timeout**: Backend response timeout
- **Session expired**: Application-level session timeout
- **Connection refused**: Backend not accepting connections