Cassandra Client Driver Reconnection Storm - Fix After Recovery

Introduction

When a Cassandra cluster recovers from an outage, hundreds or thousands of client connections attempt to reconnect simultaneously. This reconnection storm can overwhelm the cluster's connection handling capacity, causing authentication timeouts, request queuing, and potentially triggering a second outage.

Symptoms

- Cassandra logs show connection storms with `Too many connections from /x.x.x.x`
- `nodetool tpstats` shows `MutationStage` or `ReadStage` pools with high pending tasks
- Authentication timeouts as the cluster processes thousands of simultaneous connections
- Application clients report `NoHostAvailableException` despite the cluster being up
- CPU spikes on Cassandra nodes from connection processing

Common Causes

- All application instances restarting simultaneously after a cluster outage
- Client driver reconnection policy with no backoff or jitter
- Connection pool configured to create the maximum number of connections immediately
- No rate limiting on new connections at the Cassandra level
- Load balancer forwarding all reconnections to a single node

Step-by-Step Fix

1. **Check current connection statistics**:

```bash
nodetool netstats
# Shows: Mode, Not used in 2.1+, Pool Name, Active, Pending, Completed

nodetool tpstats
# Shows thread pool status including connection handling
```

2. **Configure the client driver with exponential backoff**:

```python
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, ExponentialReconnectionPolicy

cluster = Cluster(
    contact_points=['node1', 'node2', 'node3'],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='dc1'),
    reconnection_policy=ExponentialReconnectionPolicy(
        base_delay=1.0,   # Start with 1 second
        max_delay=60.0,   # Max 60 seconds between retries
    ),
    protocol_version=4,
)
session = cluster.connect('mykeyspace')
```
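To make the effect of `base_delay` and `max_delay` concrete, here is a minimal stand-alone sketch of a doubling schedule with a cap. It does not call the driver: `backoff_delays` is a hypothetical helper written for illustration, and the driver's own schedule may differ in detail.

```python
from itertools import islice

def backoff_delays(base_delay=1.0, max_delay=60.0):
    """Yield base_delay, 2x, 4x, ... with every value capped at max_delay."""
    delay = base_delay
    while True:
        yield min(delay, max_delay)
        delay = min(delay * 2, max_delay)

# First eight delays for the settings used above (1 s base, 60 s cap).
first_eight = list(islice(backoff_delays(1.0, 60.0), 8))
print(first_eight)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With these settings a client that keeps failing settles at one attempt per minute after about a minute of retries, which bounds the steady-state reconnect rate per client.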

3. **Add jitter to prevent thundering herd**:

```python
import random
from cassandra.policies import ReconnectionPolicy

class JitteredExponentialReconnectionPolicy(ReconnectionPolicy):
    """Exponential backoff where each delay is scaled by a random factor."""

    def new_schedule(self):
        # The Python driver expects new_schedule() to return an iterable of
        # delays in seconds; a generator gives a fresh schedule per host.
        attempt = 0
        while True:
            attempt += 1
            # 2, 4, 8, ... seconds, capped at 60 (exponent capped to keep numbers small)
            base = min(2 ** min(attempt, 6), 60)
            # Scale by a random factor in [0.5, 1.0) so clients spread out
            yield base * (0.5 + random.random() * 0.5)
```
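The following is a rough simulation, not a measurement, of why jitter matters. It assumes every client notices the outage at the same instant and retries on the same 2, 4, 8, ... second schedule, then counts how many reconnect attempts land in the busiest 100 ms window. The function name, client count, and bucket size are all illustrative assumptions.

```python
import random
from collections import Counter

def peak_reconnects_per_bucket(num_clients, attempts, jitter,
                               bucket_seconds=0.1, seed=0):
    """Return the largest number of reconnect attempts in any one time bucket."""
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(num_clients):
        now = 0.0  # all clients are assumed to start retrying at t=0
        for attempt in range(1, attempts + 1):
            base = min(2 ** attempt, 60)
            # Same jitter rule as the policy: scale by a factor in [0.5, 1.0)
            factor = 0.5 + rng.random() * 0.5 if jitter else 1.0
            now += base * factor
            buckets[int(now / bucket_seconds)] += 1
    return max(buckets.values())

no_jitter = peak_reconnects_per_bucket(1000, attempts=5, jitter=False)
with_jitter = peak_reconnects_per_bucket(1000, attempts=5, jitter=True)
print("peak without jitter:", no_jitter)
print("peak with jitter:   ", with_jitter)
```

Without jitter every client retries at exactly the same moments, so the peak equals the number of clients. With jitter the same attempts are spread across many windows, and the spread widens with each retry because the random offsets accumulate.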

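4. **Limit connection pool size per host**:

```python
from cassandra.cluster import Cluster
from cassandra.policies import HostDistance

# In the Python driver, per-host pool sizes are set with setter methods, not
# constructor arguments, and only take effect with native protocol v1 or v2.
# With protocol v3 or higher the driver keeps a single connection per host
# and these setters raise UnsupportedOperation.
cluster = Cluster(contact_points=['node1', 'node2'], protocol_version=2)

cluster.set_core_connections_per_host(HostDistance.LOCAL, 2)
cluster.set_core_connections_per_host(HostDistance.REMOTE, 1)
cluster.set_max_connections_per_host(HostDistance.LOCAL, 8)
cluster.set_max_connections_per_host(HostDistance.REMOTE, 2)
```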
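A back-of-the-envelope calculation shows why these limits matter during recovery. The sketch below assumes every application instance opens the configured number of connections to each local node; the fleet size of 500 is a hypothetical figure, and `connections_per_node` is an illustrative helper rather than a driver API.

```python
def connections_per_node(app_instances, connections_per_host):
    """Connections one Cassandra node receives if every instance connects to it."""
    return app_instances * connections_per_host

instances = 500  # hypothetical fleet size

at_core = connections_per_node(instances, 2)  # core setting for LOCAL hosts
at_max = connections_per_node(instances, 8)   # max setting for LOCAL hosts
print(at_core, at_max)  # 1000 4000
```

Under these assumptions, a pool that jumps straight to its maximum asks each node to accept four times as many connections during the storm as one that starts at the core size.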

Prevention

- Configure exponential backoff with jitter in all client drivers
- Stagger application restarts after cluster recovery
- Monitor connection counts per client IP with alerting
- Use connection pooling with appropriate core/max settings
- Implement circuit breakers at the application level
- Test cluster recovery scenarios with full client load
- Document the reconnection procedure and expected recovery timeline
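As one possible shape for the application-level circuit breaker mentioned above, here is a minimal sketch using only the standard library. The class name, thresholds, and the choice of `RuntimeError` are assumptions for illustration; a production service would more likely use an established resilience library.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated errors, then allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still cooling down: do not touch the cluster at all.
                raise RuntimeError("circuit open: skipping call")
            # Cool-down elapsed: let this one call through as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In an application, queries could be wrapped as `breaker.call(session.execute, query)`, so that a struggling cluster sees fewer requests while it recovers instead of a steady stream of retries.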

