Introduction

When a Cassandra cluster recovers from an outage, hundreds or thousands of client connections attempt to reconnect simultaneously. This reconnection storm can overwhelm the cluster's connection handling capacity, causing authentication timeouts, request queuing, and potentially triggering a second outage.

Symptoms

- Cassandra logs show connection storms with `Too many connections from /x.x.x.x`
- `nodetool tpstats` shows `MutationStage` or `ReadStage` pools with high pending tasks
- Authentication timeouts as the cluster processes thousands of simultaneous connections
- Application clients report `NoHostAvailableException` despite the cluster being up
- CPU spikes on Cassandra nodes from connection processing

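
To see which clients are driving the storm, the first symptom can be tallied per source address. The sketch below assumes only that each relevant log line contains the phrase quoted above followed by the client address; the function name `count_rejections` is hypothetical, and the rest of the log line (timestamp, level, thread) is ignored.

```python
import re
from collections import Counter

# Assumed message shape, taken from the symptom quoted above.
PATTERN = re.compile(r"Too many connections from /(\S+)")


def count_rejections(lines):
    """Return a Counter mapping client address -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to `count_rejections` and printing `.most_common(10)` on the result lists the noisiest clients first.
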
Common Causes

- All application instances restarting simultaneously after a cluster outage
- Client driver reconnection policy with no backoff or jitter
- Connection pool configured to create maximum connections immediately
- No rate limiting on new connections at the Cassandra level
- Load balancer forwarding all reconnections to a single node

Step-by-Step Fix

1. **Check current connection statistics**:

```bash
# Node mode, streaming activity, and messaging pools (Pool Name, Active, Pending, Completed)
nodetool netstats

# Thread pool status per stage, including pending and blocked tasks
nodetool tpstats
```

2. **Configure the client driver with exponential backoff**:

```python
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, ExponentialReconnectionPolicy

cluster = Cluster(
    contact_points=['node1', 'node2', 'node3'],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='dc1'),
    reconnection_policy=ExponentialReconnectionPolicy(
        base_delay=1.0,   # Start with 1 second
        max_delay=60.0    # Cap at 60 seconds between retries
    ),
    protocol_version=4
)
session = cluster.connect('mykeyspace')
```
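
To get a feel for what `base_delay=1.0` and `max_delay=60.0` mean in practice, the delay sequence can be worked out without a cluster. This is a minimal sketch of plain doubling with a cap, not the driver's own code; the helper name `exponential_delays` is hypothetical, and the driver's actual schedule may differ in detail.

```python
def exponential_delays(base_delay, max_delay, attempts):
    """Return the first `attempts` delays: base_delay doubled each time, capped at max_delay."""
    if base_delay <= 0 or max_delay < base_delay:
        raise ValueError("need 0 < base_delay <= max_delay")
    delays = []
    for attempt in range(attempts):
        delays.append(min(base_delay * (2 ** attempt), max_delay))
    return delays


print(exponential_delays(1.0, 60.0, 8))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With these settings the cap is reached on the seventh attempt, and eight failed attempts add up to 183 seconds of waiting. Every client that lost its connection at the same moment follows the same sequence, which is why backoff alone does not break up the herd.
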
3. **Add jitter to prevent a thundering herd**:

In the Python driver, `ReconnectionPolicy.new_schedule()` returns an iterable of delays in seconds, so a generator is enough and no separate schedule class is needed.

```python
import random

from cassandra.policies import ReconnectionPolicy


class JitteredExponentialReconnectionPolicy(ReconnectionPolicy):
    """Exponential backoff with random jitter so clients do not retry in lockstep."""

    def new_schedule(self):
        # Each call starts a fresh schedule, so the attempt counter resets
        # whenever the driver begins reconnecting to a host.
        attempt = 0
        while True:
            attempt += 1
            base = min(2 ** attempt, 60)  # 2, 4, 8, ... capped at 60 seconds
            # Jitter: wait between 50% and 100% of the capped delay
            yield base * (0.5 + random.random() * 0.5)
```
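
The effect of the jitter can be estimated offline with a small simulation. The sketch below assumes a hypothetical worst case: every client loses its connection at the same instant and every reconnect attempt fails. It counts how many attempts land in the busiest 100 ms window, using the same 50-100% scaling as the schedule above. The function name and parameters are illustrative only.

```python
import random
from collections import Counter


def peak_attempts(clients, attempts, jitter, bucket_width=0.1, seed=0):
    """Return the largest number of reconnect attempts falling into one time bucket."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    buckets = Counter()
    for _ in range(clients):
        elapsed = 0.0
        for attempt in range(1, attempts + 1):
            delay = min(2 ** attempt, 60)
            if jitter:
                # 50-100% of the capped delay
                delay *= 0.5 + rng.random() * 0.5
            elapsed += delay
            buckets[int(elapsed / bucket_width)] += 1
    return max(buckets.values())


print("no jitter:  ", peak_attempts(1000, 5, jitter=False))
print("with jitter:", peak_attempts(1000, 5, jitter=True))
```

Without jitter the peak equals the number of clients, because all of them retry at identical offsets. With jitter the first retry is spread over a one-second window, so the peak per 100 ms bucket drops to roughly a tenth of that, and later retries are spread over progressively wider windows.
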
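4. **Limit connection pool size per host**:

The Python driver's `Cluster` constructor does not accept pool-size arguments; pool limits are set through setter methods, and they apply only to native protocol v1 and v2. With protocol v3 or later the driver keeps a single connection per host, and these setters raise `UnsupportedOperation`.

```python
from cassandra.cluster import Cluster
from cassandra.policies import HostDistance

# Pool sizing is only meaningful on legacy protocol versions (v1/v2)
cluster = Cluster(contact_points=['node1', 'node2'], protocol_version=2)

# Connections opened up front per host
cluster.set_core_connections_per_host(HostDistance.LOCAL, 2)
cluster.set_core_connections_per_host(HostDistance.REMOTE, 1)

# Upper bound the pool may grow to per host
cluster.set_max_connections_per_host(HostDistance.LOCAL, 8)
cluster.set_max_connections_per_host(HostDistance.REMOTE, 2)
```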
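
Pool limits matter because they multiply across the fleet. As a rough sizing aid, the worst-case number of client connections a single node may see can be estimated as below. This sketch assumes every application instance treats the node as local and opens its full pool; all names and the example figures are hypothetical.

```python
def worst_case_connections(app_instances, sessions_per_instance, max_connections_per_host):
    """Upper bound on client connections that one Cassandra node may receive."""
    for value in (app_instances, sessions_per_instance, max_connections_per_host):
        if value < 0:
            raise ValueError("inputs must be non-negative")
    return app_instances * sessions_per_instance * max_connections_per_host


# Example: 200 instances, one session each, up to 8 connections per local host
print(worst_case_connections(200, 1, 8))  # 1600
```

Lowering the per-host maximum from 8 to 2 in this example would cut the bound from 1600 to 400 connections per node, which also shrinks the burst a recovering node has to absorb.
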