## Introduction
WebSocket connections drop unexpectedly when intermediaries (proxies, load balancers, firewalls) terminate them prematurely or when no keepalive mechanism is in place. Unlike typical HTTP requests, which are short-lived, WebSocket connections are long-lived bidirectional channels that can be killed by idle timeouts, network interruptions, or protocol errors. When a connection drops, real-time features such as chat, notifications, collaborative editing, and live updates fail until reconnection logic restores it.
## Symptoms
- WebSocket disconnects after consistent idle period (30s, 60s, 300s - indicates timeout)
- `WebSocket connection closed` or `Connection reset by peer` errors in console
- Works on local development but fails behind proxy/load balancer
- Mobile networks show more frequent disconnections than WiFi
- Connection drops during idle periods but stays stable during active messaging
- Issue appears after deploying behind Nginx/Envoy, enabling cloud load balancer, or network infrastructure changes
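
A consistent idle period before each disconnect is the clearest sign of an intermediary timeout. As a rough sketch, the check below takes the idle durations logged before each drop and reports whether they cluster around one value. The function name `likelyIdleTimeout`, the sample values, and the 10% tolerance are illustrative assumptions, not part of any library.

```javascript
// Hypothetical helper: given the idle time (in seconds) observed before each
// disconnect, return the median if every sample lies within `tolerance` of it,
// otherwise null (the drops look random rather than timeout-driven).
function likelyIdleTimeout(idleSeconds, tolerance = 0.1) {
  if (idleSeconds.length < 3) return null; // too few samples to judge
  const sorted = [...idleSeconds].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  const clustered = sorted.every(
    (s) => Math.abs(s - median) <= median * tolerance
  );
  return clustered ? median : null;
}

console.log(likelyIdleTimeout([59.8, 60.1, 60.3, 60.0])); // 60.1
console.log(likelyIdleTimeout([12, 340, 61, 5]));         // null
```

A result near 60, 230, or 350 seconds points at a specific proxy or load balancer setting; `null` suggests looking at network quality or server errors instead.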
## Common Causes
- Load balancer or proxy idle timeout shorter than WebSocket idle period
- Missing WebSocket heartbeat/ping-pong implementation
- Nginx/Apache proxy not configured for WebSocket upgrade
- Firewall or NAT device dropping idle TCP connections
- Cloud provider idle timeout (defaults vary by product: AWS ALB 60s, AWS NLB 350s, Azure Load Balancer 4 minutes, Azure App Service 230s, GCP backend service 30s)
- Client behind corporate proxy terminating WebSocket connections
- Server process restart without connection draining
- Network interruption without reconnection logic
## Step-by-Step Fix
### 1. Enable WebSocket debug logging
Capture disconnection events:
```javascript
// Client-side logging
const ws = new WebSocket('wss://example.com/ws');

ws.addEventListener('open', () => {
  console.log('WebSocket connected');
});

ws.addEventListener('close', (event) => {
  console.log('WebSocket closed:', {
    code: event.code,
    reason: event.reason,
    wasClean: event.wasClean
  });
  // Codes:
  // 1000: Normal closure
  // 1001: Endpoint going away
  // 1006: Abnormal closure (no close frame received; usually a network or proxy issue)
  // 1011: Server error
});

ws.addEventListener('error', (error) => {
  console.error('WebSocket error:', error);
});

// Server-side logging (Node.js example; `wss` is the WebSocket.Server
// instance from the `ws` package, created as in the server examples below)
wss.on('connection', (ws, req) => {
  const clientId = req.socket.remoteAddress;
  console.log(`Client ${clientId} connected`);

  ws.on('close', (code, reason) => {
    console.log(`Client ${clientId} closed: ${code} ${reason}`);
  });

  ws.on('error', (error) => {
    console.error(`Client ${clientId} error:`, error.message);
  });
});
```
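
When scanning logs, it can help to translate the numeric close code into a troubleshooting hint. The lookup below is a minimal sketch: the codes come from RFC 6455, while the hint wording and the helper name `describeCloseCode` are assumptions meant as starting points rather than a definitive diagnosis.

```javascript
// Illustrative mapping from close code to a first troubleshooting step.
const CLOSE_CODE_HINTS = {
  1000: 'Normal closure: no action needed',
  1001: 'Endpoint going away: server restart or page navigation',
  1006: 'Abnormal closure: no close frame; check proxy/load balancer idle timeouts and network',
  1011: 'Server error: check server logs',
};

function describeCloseCode(code) {
  return CLOSE_CODE_HINTS[code] || `Unrecognized code ${code}: check RFC 6455 and server logs`;
}

console.log(describeCloseCode(1006));
```

Calling `describeCloseCode(event.code)` inside the `close` handler puts the hint next to the raw code in the log.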
### 2. Implement WebSocket heartbeat (ping-pong)
Heartbeat prevents idle timeout disconnections:
```javascript
// Client-side heartbeat
const HEARTBEAT_INTERVAL = 30000; // 30 seconds
const PONG_TIMEOUT = 10000; // 10 seconds

class WebSocketClient {
  constructor(url) {
    this.url = url;
    this.ws = null;
    this.pingInterval = null;
    this.pongTimeout = null;
    this.retryCount = 0;
    this.connect();
  }

  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.addEventListener('open', () => {
      console.log('Connected');
      this.retryCount = 0; // Reset backoff after a successful connection
      this.startHeartbeat();
    });

    this.ws.addEventListener('message', (event) => {
      // Handle pong response; do not pass it on to the application
      if (event.data === 'pong') {
        this.handlePong();
        return;
      }
      // Handle application messages
      this.onMessage(event.data);
    });

    this.ws.addEventListener('close', () => {
      this.stopHeartbeat();
      // Auto-reconnect with backoff
      setTimeout(() => this.connect(), this.getBackoffDelay());
    });
  }

  startHeartbeat() {
    // Send ping periodically (the server must answer 'ping' with 'pong')
    this.pingInterval = setInterval(() => {
      if (this.ws.readyState === WebSocket.OPEN) {
        this.ws.send('ping');
        // Set timeout for pong response, replacing any earlier one
        clearTimeout(this.pongTimeout);
        this.pongTimeout = setTimeout(() => {
          console.warn('Pong timeout, reconnecting');
          this.ws.close();
        }, PONG_TIMEOUT);
      }
    }, HEARTBEAT_INTERVAL);
  }

  handlePong() {
    // Cancel pong timeout
    if (this.pongTimeout) {
      clearTimeout(this.pongTimeout);
      this.pongTimeout = null;
    }
  }

  stopHeartbeat() {
    if (this.pingInterval) clearInterval(this.pingInterval);
    if (this.pongTimeout) clearTimeout(this.pongTimeout);
  }

  getBackoffDelay() {
    // Exponential backoff: 1s, 2s, 4s, 8s, 16s, max 30s
    const delay = Math.min(1000 * Math.pow(2, this.retryCount), 30000);
    this.retryCount += 1;
    return delay;
  }

  onMessage(data) {
    // Application message handler
    console.log('Message received:', data);
  }
}

// Usage
const client = new WebSocketClient('wss://example.com/ws');
```
Server-side heartbeat (Node.js):
```javascript
const WebSocket = require('ws');

const wss = new WebSocket.Server({ port: 8080 });

// Ping all clients periodically
const pingInterval = setInterval(() => {
  wss.clients.forEach((ws) => {
    if (ws.isAlive === false) {
      // No pong received since the last ping, terminate connection
      return ws.terminate();
    }
    ws.isAlive = false;
    ws.ping(); // Send ping frame
  });
}, 30000);

wss.on('connection', (ws) => {
  ws.isAlive = true;

  // Mark alive on pong
  ws.on('pong', () => {
    ws.isAlive = true;
  });

  ws.on('close', () => {
    ws.isAlive = false;
  });
});

wss.on('close', () => {
  clearInterval(pingInterval);
});
```
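
The client example sends a text `'ping'` and waits for a text `'pong'`, whereas `ws.ping()` on the server sends a protocol-level ping frame, which browsers answer automatically without exposing it to JavaScript. If the two examples are used together, the server also needs to answer the text heartbeat. A minimal sketch, assuming the plain-string `'ping'`/`'pong'` convention from the client example (the helper name `handleHeartbeat` is hypothetical):

```javascript
// Returns true when the message was a heartbeat and has been answered,
// so the caller can skip normal application handling.
function handleHeartbeat(ws, data) {
  // `ws` delivers messages as Buffers by default, so normalise to a string.
  if (data.toString() !== 'ping') return false;
  ws.send('pong');
  return true;
}

// Usage inside the existing connection handler:
// ws.on('message', (data) => {
//   if (handleHeartbeat(ws, data)) return;
//   // ...application message handling...
// });
```

Either heartbeat direction is enough to keep an idle connection from timing out; running both additionally lets each side detect a dead peer on its own.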
### 3. Configure Nginx for WebSocket
Nginx must be configured to handle WebSocket upgrade:
```nginx
# WRONG: Standard HTTP proxy config drops WebSocket
location /ws/ {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    # WRONG: Upgrade and Connection headers are not forwarded,
    # so the backend never sees a WebSocket handshake
}

# CORRECT: WebSocket-aware proxy configuration
location /ws/ {
    proxy_pass http://backend;

    # Required for WebSocket upgrade
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";

    # Timeout settings (must exceed expected idle periods)
    proxy_read_timeout 86400s;  # 24 hours
    proxy_send_timeout 86400s;
    proxy_connect_timeout 60s;

    # Buffer settings
    proxy_buffering off;
    proxy_buffer_size 4k;
    proxy_buffers 8 4k;

    # Headers
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
}

# For high-traffic deployments
upstream websocket_backend {
    # Optional: max_conns limits concurrent connections per backend server
    server backend1:8080 max_fails=3 fail_timeout=30s max_conns=1000;
    server backend2:8080 max_fails=3 fail_timeout=30s max_conns=1000;

    # Use IP hash for sticky sessions (WebSocket state)
    ip_hash;

    # Or use least_conn for better load distribution
    # least_conn;
}
```
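
To confirm that the proxy actually forwards the upgrade, it can help to log what the backend receives. The sketch below inspects request headers as Node's `http` module presents them (lower-cased keys); `looksLikeWebSocketUpgrade` is a hypothetical helper name, not part of `ws` or Node.

```javascript
// True when the request headers describe a WebSocket upgrade.
function looksLikeWebSocketUpgrade(headers) {
  const upgrade = (headers['upgrade'] || '').toLowerCase();
  // Connection may carry several comma-separated tokens, e.g. "keep-alive, Upgrade"
  const connectionTokens = (headers['connection'] || '')
    .toLowerCase()
    .split(',')
    .map((token) => token.trim());
  return upgrade === 'websocket' && connectionTokens.includes('upgrade');
}

// What the backend sees with and without the proxy headers
console.log(looksLikeWebSocketUpgrade({ upgrade: 'websocket', connection: 'upgrade' })); // true
console.log(looksLikeWebSocketUpgrade({ connection: 'close' }));                         // false
```

If this returns `true` for a direct connection but `false` for the same request through Nginx, the `Upgrade` and `Connection` headers are being dropped at the proxy.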
### 4. Configure load balancer timeouts
Cloud load balancers have idle timeout limits:
```bash
# AWS Application Load Balancer
# Default idle timeout: 60 seconds
# Maximum: 4000 seconds

# Increase timeout via AWS CLI (idle timeout is a load balancer attribute):
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:region:account:loadbalancer/app/... \
  --attributes Key=idle_timeout.timeout_seconds,Value=600
```

```hcl
# Or via Terraform: idle_timeout is set on the load balancer, not the listener
resource "aws_lb" "main" {
  # ... name, subnets, security groups ...
  load_balancer_type = "application"
  idle_timeout       = 600 # 10 minutes
}

resource "aws_lb_listener" "ws" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = aws_acm_certificate.main.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.ws.arn
  }
}
```

```bash
# Azure Application Gateway
# Default timeout: 230 seconds
# Increase via ARM template or portal:
# Application Gateway > HTTP settings > Timeout > 600

# GCP Cloud Load Balancing
# Default backend service timeout: 30 seconds
# For WebSocket traffic this value limits how long a connection may stay open

# gcloud command:
gcloud compute backend-services update ws-backend \
  --global \
  --timeout=600s
```
### 5. Implement client-side reconnection logic
Robust reconnection with exponential backoff:
```javascript
class ResilientWebSocket {
  constructor(url, options = {}) {
    this.url = url;
    // Defaults first; caller-supplied options override them
    this.options = {
      maxRetries: 10,
      initialDelay: 1000,
      maxDelay: 30000,
      minDelay: 1000,
      ...options
    };
    this.retryCount = 0;
    this.ws = null;
    this.reconnectTimer = null;
    this.messageQueue = [];
    this.eventHandlers = { open: [], message: [], close: [], error: [] };

    this.connect();
  }

  connect() {
    try {
      this.ws = new WebSocket(this.url);

      this.ws.addEventListener('open', (event) => {
        console.log(`WebSocket connected (attempt ${this.retryCount + 1})`);
        this.retryCount = 0;
        this.flushMessageQueue();
        this.emit('open', event);
      });

      this.ws.addEventListener('message', (event) => {
        this.emit('message', event);
      });

      this.ws.addEventListener('error', (error) => {
        console.error('WebSocket error:', error);
        this.emit('error', error);
      });

      this.ws.addEventListener('close', (event) => {
        console.log(`WebSocket closed: ${event.code} ${event.reason}`);
        this.emit('close', event);

        // Attempt reconnection
        if (this.shouldReconnect(event.code)) {
          this.scheduleReconnect();
        }
      });
    } catch (error) {
      console.error('Connection error:', error);
      this.scheduleReconnect();
    }
  }

  shouldReconnect(code) {
    // Don't reconnect on normal closure, policy violation, or unsupported data
    if (code === 1000 || code === 1008 || code === 1003) {
      return false;
    }
    return this.retryCount < this.options.maxRetries;
  }

  scheduleReconnect() {
    const delay = this.calculateBackoff();
    console.log(`Reconnecting in ${delay}ms (attempt ${this.retryCount + 1}/${this.options.maxRetries})`);

    this.reconnectTimer = setTimeout(() => {
      this.retryCount++;
      this.connect();
    }, delay);
  }

  calculateBackoff() {
    // Exponential backoff with jitter
    const exponentialDelay = Math.min(
      this.options.initialDelay * Math.pow(2, this.retryCount),
      this.options.maxDelay
    );

    // Add jitter (±25%)
    const jitter = exponentialDelay * 0.25 * (Math.random() * 2 - 1);
    return Math.max(this.options.minDelay, exponentialDelay + jitter);
  }

  send(data) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(data);
    } else {
      // Queue message for later
      this.messageQueue.push(data);
      console.log('Message queued, will send when connected');
    }
  }

  flushMessageQueue() {
    while (this.messageQueue.length > 0 && this.ws.readyState === WebSocket.OPEN) {
      const message = this.messageQueue.shift();
      this.ws.send(message);
    }
  }

  on(event, handler) {
    if (this.eventHandlers[event]) {
      this.eventHandlers[event].push(handler);
    }
  }

  emit(event, data) {
    if (this.eventHandlers[event]) {
      this.eventHandlers[event].forEach(handler => handler(data));
    }
  }

  close() {
    this.options.maxRetries = 0; // Prevent reconnection
    clearTimeout(this.reconnectTimer); // Cancel any pending reconnect
    if (this.ws) {
      this.ws.close(1000, 'Client initiated close');
    }
  }
}

// Usage
const ws = new ResilientWebSocket('wss://example.com/ws');

ws.on('open', () => {
  console.log('Connected!');
  ws.send('Hello server');
});

ws.on('message', (event) => {
  console.log('Received:', event.data);
});

ws.on('close', (event) => {
  console.log('Disconnected:', event.code, event.reason);
});
```
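
To see what delays the backoff formula produces, the calculation can be pulled out into a standalone function with an injectable random source. This is a sketch of the same arithmetic as `calculateBackoff()`; the function name `backoffDelay` and the `random` parameter are illustrative additions.

```javascript
// Same formula as calculateBackoff(), with the random source injectable so
// the output can be reproduced. `random` must return a value in [0, 1).
function backoffDelay(
  attempt,
  { initialDelay = 1000, maxDelay = 30000, minDelay = 1000 } = {},
  random = Math.random
) {
  const exponential = Math.min(initialDelay * Math.pow(2, attempt), maxDelay);
  const jitter = exponential * 0.25 * (random() * 2 - 1); // ±25%
  return Math.max(minDelay, exponential + jitter);
}

// With jitter pinned to zero (random() === 0.5) the base schedule is visible:
const schedule = [0, 1, 2, 3, 4, 5, 6].map((n) => backoffDelay(n, {}, () => 0.5));
console.log(schedule.join(', ')); // 1000, 2000, 4000, 8000, 16000, 30000, 30000
```

In production the jitter spreads reconnect attempts so that many clients dropped at the same moment do not all return at once.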
### 6. Check firewall and NAT configuration
Firewalls drop idle TCP connections:
```bash
# Check TCP keepalive settings (apply only to sockets that enable SO_KEEPALIVE)
sysctl net.ipv4.tcp_keepalive_time
sysctl net.ipv4.tcp_keepalive_intvl
sysctl net.ipv4.tcp_keepalive_probes

# Typical values:
# tcp_keepalive_time = 7200 (2 hours)
# tcp_keepalive_intvl = 75 (75 seconds)
# tcp_keepalive_probes = 9 (9 probes)

# For WebSocket, reduce keepalive time
sudo sysctl -w net.ipv4.tcp_keepalive_time=300
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=30
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5

# Check firewall idle timeout
# Common values:
# - AWS NAT Gateway: 350 seconds
# - Azure Load Balancer / NAT Gateway: 4 minutes for TCP (default)
# - Corporate firewalls: 5-15 minutes

# Test connection persistence: open a TCP connection, stay idle, then probe it.
# Port 80 is used so a plain-text request works. The server's own keep-alive
# timeout can also close the connection, so compare runs with different sleeps.
exec 3<>/dev/tcp/example.com/80
sleep 400
printf 'HEAD / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n' >&3
timeout 5 head -n 1 <&3 | grep -q HTTP && echo "alive" || echo "dropped while idle"
exec 3>&-

# If the connection is dropped, shorten the sleep to find the idle limit
```
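
The kernel keepalive settings take effect only on sockets that opt in. In Node.js this is done per connection with `socket.setKeepAlive(enable, initialDelay)` on the underlying `net.Socket`. A minimal sketch, where the helper name `enableTcpKeepAlive` and the 60-second delay are assumptions:

```javascript
// Turn on TCP keepalive for one connection. `socket` is a net.Socket, such as
// `req.socket` in the `ws` 'connection' handler. The 60-second initial delay
// is an assumed value; pick one below the shortest firewall/NAT idle timeout.
function enableTcpKeepAlive(socket, initialDelayMs = 60000) {
  socket.setKeepAlive(true, initialDelayMs);
  return socket;
}

// Usage:
// wss.on('connection', (ws, req) => enableTcpKeepAlive(req.socket));
```

TCP keepalive probes carry no payload, so they help with NAT devices and stateful firewalls, but application-layer proxies and load balancers typically do not count them as traffic. The WebSocket heartbeat from step 2 is still needed for those hops.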
### 7. Handle server-side connection limits
Server connection limits cause drops:
```javascript
const WebSocket = require('ws');

// Note: http.globalAgent.maxSockets only limits outgoing HTTP client requests.
// Incoming WebSocket connections are bounded by the process file-descriptor
// limit (ulimit -n) and, if set, server.maxConnections.

// Per WebSocket server
const wss = new WebSocket.Server({
  port: 8080,
  maxPayload: 1024 * 1024, // 1MB max message
  clientTracking: true,
  perMessageDeflate: {
    threshold: 1024, // Only compress messages > 1KB
    zlibDeflateOptions: { chunkSize: 16 * 1024 },
    zlibInflateOptions: { chunkSize: 16 * 1024 }
  }
});

// Monitor connection count
setInterval(() => {
  console.log(`Active connections: ${wss.clients.size}`);
}, 60000);

// Limit connections per IP (prevent abuse)
// Behind a proxy, remoteAddress is the proxy's IP; use a forwarded header instead
const clientCounts = new Map();

wss.on('connection', (ws, req) => {
  const ip = req.socket.remoteAddress;
  const count = clientCounts.get(ip) || 0;

  if (count >= 10) { // Max 10 connections per IP
    ws.close(1013, 'Too many connections');
    return;
  }

  clientCounts.set(ip, count + 1);

  ws.on('close', () => {
    const remaining = (clientCounts.get(ip) || 1) - 1;
    // Remove the entry at zero so the map does not grow without bound
    if (remaining <= 0) clientCounts.delete(ip);
    else clientCounts.set(ip, remaining);
  });
});
```
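
Server restarts without connection draining are listed under Common Causes. One way to drain is to send every client a 1001 (going away) close before the process exits, so clients reconnect deliberately instead of seeing a 1006. The sketch below assumes `ws`-style client objects; the function name `drainConnections` and the 5-second grace period are illustrative.

```javascript
const OPEN = 1;   // readyState of an open socket (same value in browsers and `ws`)
const CLOSED = 3; // readyState of a fully closed socket

// Close every open client with 1001, then force-terminate any that have not
// finished closing after `graceMs`. `clients` is any iterable of ws-like
// objects, for example `wss.clients`. Returns the number of clients drained.
function drainConnections(clients, graceMs = 5000) {
  const pending = [...clients].filter((ws) => ws.readyState === OPEN);
  pending.forEach((ws) => ws.close(1001, 'Server restarting'));
  const timer = setTimeout(() => {
    pending.forEach((ws) => {
      if (ws.readyState !== CLOSED) ws.terminate();
    });
  }, graceMs);
  timer.unref(); // do not keep the process alive just for this timer
  return pending.length;
}

// Usage:
// process.on('SIGTERM', () => {
//   drainConnections(wss.clients);
//   wss.close();
// });
```

Because the reconnection logic in step 5 retries on 1001, clients that receive this close code reconnect to a healthy instance once the load balancer stops routing to the old one.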
### 8. Configure cloud-specific settings
Cloud provider specific configurations:
```yaml
# AWS API Gateway WebSocket
# Idle connection timeout: 10 minutes
# Maximum connection duration: 2 hours

# Via SAM template:
Resources:
  WebSocketApi:
    Type: AWS::ApiGatewayV2::Api
    Properties:
      Name: my-ws-api
      ProtocolType: WEBSOCKET
      RouteSelectionExpression: "$request.body.action"

  WebSocketStage:
    Type: AWS::ApiGatewayV2::Stage
    Properties:
      ApiId: !Ref WebSocketApi
      StageName: production
      DefaultRouteSettings:
        ThrottlingBurstLimit: 1000
        ThrottlingRateLimit: 500
```

Heroku closes connections that are idle for 55 seconds, so data must be sent before then. The Procfile entry is `web: node server.js`. In `app.json`, `WS_HEARTBEAT` is an application-defined variable that the server code reads to set its heartbeat interval:

```json
{
  "addons": [],
  "buildpacks": [{ "url": "heroku/nodejs" }],
  "env": {
    "WEB_CONCURRENCY": "4",
    "WS_HEARTBEAT": "30000"
  }
}
```

```bash
# Google Cloud Run WebSocket
# Timeout: 60 minutes maximum
# Must set concurrency appropriately

# Via gcloud:
gcloud run deploy my-ws-service \
  --image gcr.io/project/image \
  --platform managed \
  --timeout=3600s \
  --concurrency=80
```
### 9. Monitor WebSocket connections
Add monitoring and alerting:
```javascript
// Prometheus metrics for WebSocket
const client = require('prom-client');

const wsConnections = new client.Gauge({
  name: 'websocket_connections_active',
  help: 'Number of active WebSocket connections'
});

const wsConnectionsTotal = new client.Counter({
  name: 'websocket_connections_total',
  help: 'Total WebSocket connections created'
});

const wsMessagesReceived = new client.Counter({
  name: 'websocket_messages_received_total',
  help: 'Total WebSocket messages received'
});

// Increment wherever the server calls ws.send()
const wsMessagesSent = new client.Counter({
  name: 'websocket_messages_sent_total',
  help: 'Total WebSocket messages sent'
});

const wsDuration = new client.Histogram({
  name: 'websocket_connection_duration_seconds',
  help: 'Duration of WebSocket connections',
  buckets: [60, 300, 600, 1800, 3600, 7200, 14400]
});

// Track connections (`wss` is the WebSocket.Server instance)
wss.on('connection', (ws, req) => {
  const startTime = Date.now();
  wsConnections.inc();
  wsConnectionsTotal.inc();

  ws.on('close', () => {
    wsConnections.dec();
    const duration = (Date.now() - startTime) / 1000;
    wsDuration.observe(duration);
  });

  ws.on('message', () => {
    wsMessagesReceived.inc();
  });
});

// Alert thresholds (Prometheus):
// websocket_connections_active > 10000: Warning
// websocket_connections_active > 50000: Critical
// rate(websocket_connections_total[5m]) dropping: Connection issue
```
### 10. Test WebSocket resilience
Load test WebSocket handling:
```bash
# Install wsbench for WebSocket load testing
npm install -g wsbench

# Run load test
wsbench run -c 100 -d 60 -r 10 wss://example.com/ws

# Options (confirm against the help output of the installed version):
# -c: Concurrent connections
# -d: Duration in seconds
# -r: Messages per second per connection

# Or use artillery
npm install -g artillery
```

```yaml
# artillery-config.yml
config:
  target: "wss://example.com/ws"
  phases:
    - duration: 60
      arrivalRate: 10
scenarios:
  - engine: ws
    flow:
      - send: "ping"
      - think: 1
```

```bash
# Run:
artillery run artillery-config.yml

# Check connection stability
# - Connections should remain stable
# - No unexpected closes
# - Ping/pong should succeed consistently
```
## Prevention
- Implement heartbeat with interval < 1/2 of shortest timeout
- Configure proxy/load balancer timeouts > 5 minutes
- Use exponential backoff with jitter for reconnection
- Queue messages during disconnection for later delivery
- Monitor active connection count and drop rate
- Test WebSocket behavior under load before production
- Document timeout settings for all infrastructure components
- Use sticky sessions for stateful WebSocket backends
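
The heartbeat rule above can be turned into a small calculation once the timeouts of every hop are written down. The sketch below is one way to do that; the hop names, the timeout values, and the choice of 40% of the shortest timeout as a safety margin are all assumptions to be replaced with figures from your own infrastructure.

```javascript
// Pick a heartbeat interval below half of the shortest idle timeout on the path.
// Input: an object mapping hop name to idle timeout in seconds.
function recommendedHeartbeatMs(idleTimeoutsSec) {
  const shortest = Math.min(...Object.values(idleTimeoutsSec));
  // 40% of the shortest timeout, in milliseconds (integer arithmetic)
  return Math.floor((shortest * 1000 * 2) / 5);
}

// Placeholder values for illustration only
const timeouts = { nginx: 300, loadBalancer: 60, nat: 350 };
console.log(recommendedHeartbeatMs(timeouts)); // 24000
```

Keeping the `timeouts` object in version control doubles as the documentation of timeout settings recommended in the list above.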
## Related Errors
- **Connection closed (1006)**: Abnormal closure, network issue
- **Connection reset by peer**: Server or proxy terminated connection
- **502 Bad Gateway**: Proxy cannot connect to WebSocket backend
- **504 Gateway Timeout**: Proxy timeout waiting for WebSocket