## Introduction

Memcached clusters use client-side consistent hashing to distribute keys across nodes. When a node fails, all keys that were mapped to that node become cache misses simultaneously. This "cache miss storm" can overwhelm the database as every application request falls through to the backend until the cache is repopulated.
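The remapping behavior is easy to see with a small consistent-hash ring. This is a sketch with hypothetical node names, not the exact algorithm any particular client library uses: removing one node remaps only the keys that lived on it, leaving keys on the surviving nodes untouched.

```python
import bisect
import hashlib

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=100):
    """Place each node at many points on the ring (virtual nodes)."""
    return sorted((_hash(f"{node}#{i}"), node)
                  for node in nodes for i in range(vnodes))

def node_for(key, ring):
    """Walk clockwise from the key's hash to the first node point."""
    points = [p for p, _ in ring]
    i = bisect.bisect_right(points, _hash(key)) % len(ring)
    return ring[i][1]

nodes = ['memcached1', 'memcached2', 'memcached3']
keys = [f"key:{i}" for i in range(3000)]

full = build_ring(nodes)
degraded = build_ring([n for n in nodes if n != 'memcached2'])  # memcached2 failed

before = {k: node_for(k, full) for k in keys}
after = {k: node_for(k, degraded) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys remapped")  # roughly the third that lived on memcached2
```

Every remapped key was one that hashed to the failed node; with naive modulo hashing, nearly all keys would have moved instead.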

## Symptoms

- Sudden spike in database query rate after a Memcached node goes down
- Application latency increasing as the cache miss rate jumps from 5% to 50%+
- Monitoring shows one Memcached node unreachable
- `STAT get_misses` spiking on remaining nodes as keys are remapped
- Database CPU and I/O usage spiking, correlating with the cache node failure
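A miss-rate figure like the 50%+ above has to be computed from deltas, because `get_hits` and `get_misses` in the `stats` output are cumulative counters that only ever increase. A sketch of the arithmetic, using made-up snapshot values:

```python
def miss_rate(prev, curr):
    """Cache miss rate between two cumulative 'stats' snapshots."""
    hits = curr['get_hits'] - prev['get_hits']
    misses = curr['get_misses'] - prev['get_misses']
    total = hits + misses
    return misses / total if total else 0.0

# Hypothetical snapshots taken 60 seconds apart, during a node failure
before = {'get_hits': 950_000, 'get_misses': 50_000}
after = {'get_hits': 1_130_000, 'get_misses': 270_000}
print(f"{miss_rate(before, after):.0%}")  # prints "55%"
```

Alerting on the raw counter value instead of the windowed rate is a common monitoring mistake; the counters grow forever on a healthy cluster too.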

## Common Causes

- Single Memcached node failure in a cluster without redundancy
- Consistent hashing remapping all keys from the failed node
- No cache warming procedure after node replacement
- Application not implementing cache miss protection
- No health check to detect node failure and stop sending requests
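For the last cause, one lightweight health check is a TCP round trip with memcached's text-protocol `version` command, which a live node answers with a `VERSION` line. A minimal sketch using only the standard library, with hypothetical hostnames:

```python
import socket

def is_healthy(host, port, timeout=0.5):
    """Probe a node with memcached's text-protocol 'version' command."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"version\r\n")
            return sock.recv(64).startswith(b"VERSION")
    except OSError:  # refused, timed out, or unresolvable
        return False

# Drop unreachable nodes before (re)building the hashing client
nodes = ['memcached1:11211', 'memcached2:11211', 'memcached3:11211']
live = []
for node in nodes:
    host, port = node.rsplit(':', 1)
    if is_healthy(host, int(port)):
        live.append(node)
```

In production this would run on a timer and trigger a client rebuild only when the live set actually changes, to avoid churning connections.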

## Step-by-Step Fix

1. **Detect and remove the failed node from the client configuration**:

   ```python
   from pymemcache.client.hash import HashClient

   # Remove the failed node (memcached2) from the server list
   client = HashClient(
       ['memcached1:11211', 'memcached3:11211'],
       use_pooling=True,
       timeout=0.5,
       connect_timeout=0.5,
       dead_timeout=60,  # don't retry a failed node for 60 seconds
   )
   ```

2. **Implement cache miss throttling to protect the database**:

   ```python
   import threading
   import time

   class ThrottledCache:
       def __init__(self, client):
           self.client = client
           self.inflight = {}
           self.lock = threading.Lock()

       def get(self, key, db_fetch_fn, ttl=300):
           value = self.client.get(key)
           if value is not None:
               return value

           with self.lock:
               fetching = key in self.inflight
               if not fetching:
                   self.inflight[key] = True

           if fetching:
               # Another thread is already fetching; wait briefly outside
               # the lock, then retry the cache instead of hitting the DB.
               time.sleep(0.1)
               return self.client.get(key)

           try:
               value = db_fetch_fn()
               self.client.set(key, value, expire=ttl)
               return value
           finally:
               with self.lock:
                   self.inflight.pop(key, None)
   ```
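Coalescing code like this is easy to get subtly wrong, so it is worth exercising with real threads before deploying. A self-contained sketch that repeats the throttling logic against an in-memory stub client (all names hypothetical), counting how often the "database" is actually hit:

```python
import threading
import time

class ThrottledCache:
    """Request-coalescing wrapper; repeated here so the sketch runs standalone."""
    def __init__(self, client):
        self.client = client
        self.inflight = {}
        self.lock = threading.Lock()

    def get(self, key, db_fetch_fn, ttl=300):
        value = self.client.get(key)
        if value is not None:
            return value
        with self.lock:
            fetching = key in self.inflight
            if not fetching:
                self.inflight[key] = True
        if fetching:
            time.sleep(0.1)  # wait for the in-flight fetch, then retry cache
            return self.client.get(key)
        try:
            value = db_fetch_fn()
            self.client.set(key, value, expire=ttl)
            return value
        finally:
            with self.lock:
                self.inflight.pop(key, None)

class StubClient:
    """In-memory stand-in for a Memcached client (demo only)."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value, expire=0):
        self.data[key] = value

db_calls = []
def slow_fetch():
    db_calls.append(1)   # count database hits
    time.sleep(0.05)     # simulate a slow query
    return 'row-data'

cache = ThrottledCache(StubClient())
threads = [threading.Thread(target=cache.get, args=('user:42', slow_fetch))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(db_calls))  # coalescing keeps this far below the 10 concurrent requests
```

Without the `inflight` bookkeeping, all ten threads would miss the cache at once and each would query the database, which is exactly the stampede this step is meant to prevent.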

3. **Warm the cache on the replacement node**:

   ```python
   # Pre-populate the most frequently accessed keys
   hot_keys = ['config:main', 'user:popular', 'session:active', ...]
   for key in hot_keys:
       value = fetch_from_database(key)
       client.set(key, value, expire=3600)
   ```
4. **Replace the failed node and rebalance**:

   ```bash
   # Start a new Memcached node
   memcached -m 4096 -c 10000 -p 11211 -d

   # Add it back to the client configuration.
   # Note: consistent hashing means most keys will still map correctly;
   # only keys that were on the failed node will be redistributed.
   ```

## Prevention

- Use consistent hashing with virtual nodes (vnodes) to minimize key remapping
- Implement cache miss throttling (request coalescing) in all applications
- Monitor Memcached node health with automated failure detection
- Pre-warm new nodes with hot keys before adding them to the client pool
- Use multiple smaller nodes instead of fewer large nodes to limit blast radius
- Implement circuit breakers to stop hitting the database when the cache miss rate is high
- Consider Memcached with auto-discovery or a service mesh for dynamic node management
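The circuit-breaker point can be sketched as a small class that trips when the recent miss rate crosses a threshold. This is a simplified, non-thread-safe illustration with hypothetical parameters, not a production implementation:

```python
import time

class CacheCircuitBreaker:
    """Open the circuit when the recent cache miss rate is too high."""
    def __init__(self, threshold=0.5, window=100, cooldown=30):
        self.threshold = threshold   # open at or above this miss rate
        self.window = window         # number of recent lookups to track
        self.cooldown = cooldown     # seconds to stay open
        self.results = []            # True = miss, False = hit
        self.open_until = 0.0

    def record(self, miss):
        self.results.append(miss)
        self.results = self.results[-self.window:]
        if (len(self.results) == self.window and
                sum(self.results) / self.window >= self.threshold):
            self.open_until = time.monotonic() + self.cooldown

    def allow_db_query(self):
        """False while open: serve stale data or an error page instead."""
        return time.monotonic() >= self.open_until

breaker = CacheCircuitBreaker(threshold=0.5, window=10, cooldown=30)
for _ in range(10):
    breaker.record(miss=True)    # simulate a miss storm
print(breaker.allow_db_query())  # prints "False": shed load instead of stampeding
```

While the breaker is open, requests get a degraded but fast response; the database is protected long enough for the cache to repopulate.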