## Introduction

Memcached clusters use client-side consistent hashing to distribute keys across nodes. When a node fails, every key that was mapped to that node becomes a cache miss at the same moment. This "cache miss storm" can overwhelm the database, because each request for an affected key falls through to the backend until the cache is repopulated.
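To make the remapping behaviour concrete, here is a minimal sketch of a consistent-hash ring. It is a toy model, not the algorithm any particular Memcached client uses: the `HashRing` class, the virtual-node count, and the `user:N` key names are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["memcached1", "memcached2", "memcached3"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.remove("memcached2")  # simulate the node failure
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Every key that moved was owned by the failed node; the rest stay put.
assert all(before[k] == "memcached2" for k in moved)
print(f"{len(moved) / len(keys):.0%} of keys remapped")
```

The point of the sketch is that the surviving nodes keep their keys, so the miss storm is limited to the failed node's share of the keyspace. That share is still large enough to hurt a database sized for a warm cache.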
## Symptoms

- Sudden spike in database query rate after a Memcached node goes down
- Application latency increasing as the cache miss rate jumps from 5% to 50% or more
- Monitoring shows one Memcached node unreachable
- `STAT get_misses` spiking on the remaining nodes as keys are remapped
- Database CPU and I/O usage spiking at the same time as the cache node failure
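One way to quantify the miss-rate symptom is to compare two snapshots of the `get_hits` and `get_misses` counters. The sketch below assumes you have already collected the counters into plain dictionaries with string keys. The sample values and the 25% alert threshold are hypothetical.

```python
def miss_rate(before: dict, after: dict) -> float:
    """Fraction of GETs that missed between two stats snapshots."""
    hits = after["get_hits"] - before["get_hits"]
    misses = after["get_misses"] - before["get_misses"]
    total = hits + misses
    return misses / total if total else 0.0


# Hypothetical counter values sampled some time apart on one node.
snapshot_t0 = {"get_hits": 950_000, "get_misses": 50_000}
snapshot_t1 = {"get_hits": 955_000, "get_misses": 55_000}

rate = miss_rate(snapshot_t0, snapshot_t1)
if rate > 0.25:  # alert threshold is an assumption; tune it for your workload
    print(f"miss rate {rate:.0%} over the last interval")
```

Using the difference between snapshots matters because the counters are cumulative since the server started. A long-running node can show a healthy lifetime ratio while the current interval is mostly misses.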
## Common Causes

- Single Memcached node failure in a cluster without redundancy
- Consistent hashing remapping all keys from the failed node
- No cache warming procedure after node replacement
- Application not implementing cache miss protection
- No health check to detect node failure and stop sending requests
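The missing health check from the last bullet could be approximated with a plain TCP probe, as in the sketch below. The helper names `is_alive` and `healthy_servers` are hypothetical. A successful connect only shows that the port accepts connections, not that the node is serving requests correctly.

```python
import socket


def is_alive(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_servers(servers, timeout: float = 0.5):
    """Filter a list of (host, port) tuples down to the reachable ones."""
    return [(h, p) for h, p in servers if is_alive(h, p, timeout)]
```

A periodic job could call `healthy_servers` on the configured node list and rebuild the client only when the result changes.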
## Step-by-Step Fix

1. **Detect and remove the failed node from the client configuration**:

```python
from pymemcache.client.hash import HashClient

# Remove the failed node from the server list.
# (host, port) tuples are used here because they work across pymemcache versions.
client = HashClient(
    [('memcached1', 11211), ('memcached3', 11211)],  # Removed memcached2
    use_pooling=True,
    timeout=0.5,
    connect_timeout=0.5,
    dead_timeout=60,  # Don't retry a failed node for 60 seconds
)
```
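2. **Implement cache miss throttling to protect the database**:

```python
import threading


class ThrottledCache:
    """Allow only one thread per key to hit the database on a cache miss."""

    def __init__(self, client, wait_timeout=2.0):
        self.client = client
        self.wait_timeout = wait_timeout
        self.inflight = {}  # key -> Event that is set when the fetch finishes
        self.lock = threading.Lock()

    def get(self, key, db_fetch_fn, ttl=300):
        value = self.client.get(key)
        if value is not None:
            return value

        with self.lock:
            event = self.inflight.get(key)
            is_leader = event is None
            if is_leader:
                event = self.inflight[key] = threading.Event()

        if not is_leader:
            # Another thread is fetching; wait outside the lock, then re-read.
            event.wait(self.wait_timeout)
            value = self.client.get(key)
            if value is not None:
                return value
            # The other fetch failed or timed out; fall back to the database.
            return db_fetch_fn()

        try:
            value = db_fetch_fn()
            self.client.set(key, value, expire=ttl)
            return value
        finally:
            with self.lock:
                self.inflight.pop(key, None)
            event.set()
```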
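Per-key throttling collapses duplicate fetches for the same key, but a node failure produces misses on many different keys at once. A complementary sketch, assuming a global cap on concurrent database fetches is acceptable for your workload, is shown below. The `DatabaseGate` class and its default limits are hypothetical.

```python
import threading


class DatabaseGate:
    """Cap how many cache-miss fetches may hit the database at once."""

    def __init__(self, max_concurrent=10, acquire_timeout=1.0):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._acquire_timeout = acquire_timeout

    def fetch(self, db_fetch_fn):
        # Wait briefly for a free slot; fail fast instead of queueing forever.
        if not self._slots.acquire(timeout=self._acquire_timeout):
            raise TimeoutError("database gate is saturated")
        try:
            return db_fetch_fn()
        finally:
            self._slots.release()
```

Callers would wrap their database fetch in `gate.fetch(...)` and decide how to handle `TimeoutError`, for example by returning a stale or default value.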
3. **Warm the cache on the replacement node**:

```python
# Pre-populate the most frequently accessed keys
hot_keys = ['config:main', 'user:popular', 'session:active']  # ...
for key in hot_keys:
    value = fetch_from_database(key)
    client.set(key, value, expire=3600)
```

4. **Replace the failed node and rebalance**:
```bash
# Start a new Memcached node
memcached -m 4096 -c 10000 -p 11211 -d

# Add it back to the client configuration
# Note: Consistent hashing means most keys will still map correctly
# Only keys that were on the failed node will be redistributed
```
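Once the replacement node is back in the client configuration, the warming step can itself put load on the database if the hot-key list is long. The sketch below loads keys in small batches with a pause between them. The `warm_cache` helper, the batch size, and the pause are assumptions. It relies on the client exposing `set_many(values, expire=...)`, which pymemcache clients provide.

```python
import time


def warm_cache(client, keys, fetch_fn, ttl=3600, batch_size=100, pause=0.05):
    """Load keys in small batches, pausing so the database is not flooded."""
    loaded = 0
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        values = {}
        for key in batch:
            value = fetch_fn(key)
            if value is not None:  # skip keys the database no longer has
                values[key] = value
        if values:
            client.set_many(values, expire=ttl)
            loaded += len(values)
        time.sleep(pause)  # give the database room between batches
    return loaded
```

The return value is the number of keys written, which can be logged to confirm that warming finished before traffic is shifted back.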