## Introduction

Cassandra's hinted handoff mechanism stores write operations on a coordinator node when the target replica is temporarily down. When the hinted handoff queue fills up, whether from extended node downtime or high write rates, the coordinator can no longer buffer writes and returns errors to the client.
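To make this behaviour concrete, here is a minimal Python sketch of how a coordinator might treat a single write. It is a simplified model, not Cassandra's actual code: `Replica` and `plan_write` are hypothetical names, and it models only two rules, that hints are stored for replicas that have been down no longer than the hint window and that stored hints do not count toward the acknowledgements the consistency level requires.

```python
from dataclasses import dataclass

# Default max_hint_window_in_ms (3 hours)
MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000


@dataclass
class Replica:
    """Hypothetical stand-in for a replica as seen by the coordinator."""
    name: str
    is_up: bool
    down_for_ms: int = 0  # how long the replica has been unreachable


def plan_write(replicas, required_acks, window_ms=MAX_HINT_WINDOW_MS):
    """Return (can_succeed, hinted_for) for one write in this toy model."""
    live = [r for r in replicas if r.is_up]
    # Hints are only stored for replicas still inside the hint window.
    hinted_for = [r.name for r in replicas
                  if not r.is_up and r.down_for_ms <= window_ms]
    # The write succeeds only if enough live replicas can acknowledge it.
    can_succeed = len(live) >= required_acks
    return can_succeed, hinted_for
```

With three replicas, one of them down for a minute, and two required acknowledgements, `plan_write` reports that the write can succeed and that a hint is kept for the down replica. Once that replica has been down longer than the window, no hint is kept for it.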

## Symptoms

- `HintedHandoffManager` logs show `queue size exceeded max hints window`
- Write operations return `UnavailableException` even though some replicas are up
- `nodetool tpstats` shows pending tasks in the hints dispatcher pool, meaning hints are waiting for delivery
- Coordinator nodes experience disk space pressure from hint files
- Write latency increases as hint storage I/O competes with regular writes

## Common Causes

- Replica node down longer than `max_hint_window_in_ms` (default 3 hours)
- `max_hints_delivery_threads` too low, unable to drain hints fast enough
- Hinted handoff directory on a small filesystem
- Multiple nodes down simultaneously, creating hints on all coordinators
- High write rate generating hints faster than they can be delivered
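Several of these causes come down to simple arithmetic: hints accumulate at roughly the write rate times the mutation size for as long as the replica is down, capped by the hint window. The sketch below is a rough back-of-the-envelope estimate under that assumption. The function names and the `delivery_bytes_per_sec` parameter are hypothetical and do not correspond to Cassandra settings.

```python
def estimate_hint_backlog(writes_per_sec, avg_mutation_bytes, downtime_sec,
                          window_sec=3 * 60 * 60):
    """Estimate the bytes of hints one coordinator stores for one down replica.

    Hints stop accumulating once the replica has been down longer than the
    hint window, so the effective duration is capped at window_sec.
    """
    hinted_sec = min(downtime_sec, window_sec)
    return writes_per_sec * avg_mutation_bytes * hinted_sec


def estimate_drain_seconds(backlog_bytes, delivery_bytes_per_sec):
    """Estimate how long replaying the backlog takes at a given delivery rate."""
    return backlog_bytes / delivery_bytes_per_sec
```

For example, 1,000 hinted writes per second at 500 bytes each over a full 3-hour window comes to about 5.4 GB on the coordinator. At an assumed delivery rate of 1 MB/s, draining that backlog would take about 90 minutes.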

## Step-by-Step Fix

1. **Check hinted handoff status**:

```bash
# Confirm that hinted handoff is enabled and running on this node
nodetool statushandoff

# Find which replicas own a given partition key
nodetool getendpoints mykeyspace mytable <partition_key>

# Check for pending hint files and their total size
ls -la /var/lib/cassandra/hints/
du -sh /var/lib/cassandra/hints/

# Check the system log for handoff messages
grep -i "hint" /var/log/cassandra/system.log | tail -20
```

2. **Increase hinted handoff delivery capacity**:

```yaml
# /etc/cassandra/cassandra.yaml
max_hint_window_in_ms: 21600000   # 6 hours (default is 10800000, i.e. 3 hours)
max_hints_delivery_threads: 4     # Increase from the default of 2
hints_flush_period_in_ms: 10000   # Flush hints to disk every 10 s
max_hints_file_size_in_mb: 128    # Limit individual hint file size
```
3. **Restart hint delivery for a specific node**:

```bash
# Pause and resume hint delivery (run against the coordinator that holds the hints)
nodetool pausehandoff
nodetool resumehandoff

# Or disable and re-enable hints for an entire datacenter
nodetool disablehintsfordc <datacenter>
nodetool enablehintsfordc <datacenter>
```

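4. **Clean up stale hints**:

```bash
# Stop Cassandra on the coordinator
sudo systemctl stop cassandra

# Remove hint files older than max_hint_window_in_ms (3 hours = 180 minutes),
# together with their .crc32 checksum files.
# Deleted hints are never delivered, so repair the affected replica afterwards.
find /var/lib/cassandra/hints/ \( -name "*.hints" -o -name "*.crc32" \) -mmin +180 -delete

# Restart Cassandra
sudo systemctl start cassandra
```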

5. **Adjust write consistency level for availability**:

```python
# In the application, use LOCAL_QUORUM instead of ALL or EACH_QUORUM
from cassandra import ConsistencyLevel

# `session` is an existing cassandra.cluster.Session
session.default_consistency_level = ConsistencyLevel.LOCAL_QUORUM

# For time-series data where some loss is acceptable, ONE is an alternative:
# session.default_consistency_level = ConsistencyLevel.ONE
```
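The directory checks in steps 1 and 4 can also be scripted as a read-only dry run before anything is deleted. The following is a minimal sketch, assuming hint files are named `<host-id>-<timestamp>-<version>.hints`; verify that layout against your own hints directory. `summarize_hints` is a hypothetical helper, not part of any Cassandra tooling.

```python
import time
from collections import defaultdict
from pathlib import Path


def summarize_hints(hints_dir, max_age_sec=3 * 60 * 60, now=None):
    """Return ({host_id: total_bytes}, [stale_paths]) for *.hints files.

    A file counts as stale when its modification time is older than
    max_age_sec, which defaults to the 3-hour hint window.
    """
    now = time.time() if now is None else now
    per_host = defaultdict(int)
    stale = []
    for path in sorted(Path(hints_dir).glob("*.hints")):
        # Assumed name layout: <host-id>-<timestamp>-<version>.hints
        # The host ID is a UUID containing dashes, so split from the right.
        host_id = path.name.rsplit("-", 2)[0]
        stat = path.stat()
        per_host[host_id] += stat.st_size
        if now - stat.st_mtime > max_age_sec:
            stale.append(path)
    return dict(per_host), stale
```

Calling `summarize_hints("/var/lib/cassandra/hints/")` returns the bytes queued per target host and the files that the `find` command in step 4 would match, without modifying anything.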

## Prevention

- Monitor hint directory size with alerting
- Set `max_hint_window_in_ms` based on maximum expected node downtime
- Keep hinted handoff directory on a separate disk from data files
- Ensure `max_hints_delivery_threads` matches the I/O capacity
- Monitor node health to detect and recover down nodes quickly
- Use appropriate consistency levels that match availability requirements
- Implement application-level retry logic for write failures
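As an illustration of the last point, here is a minimal sketch of application-level retry with exponential backoff and jitter. `write_with_retry` and `write_fn` are hypothetical names: `write_fn` stands for any zero-argument callable that performs the write. The sketch assumes the write is idempotent, so repeating it is safe.

```python
import random
import time


def write_with_retry(write_fn, retries=3, base_delay=0.1, max_delay=2.0,
                     retry_on=(Exception,), sleep=time.sleep):
    """Call write_fn, retrying on the given exceptions with backoff.

    Makes at most retries + 1 attempts and re-raises the last error.
    """
    attempt = 0
    while True:
        try:
            return write_fn()
        except retry_on:
            if attempt >= retries:
                raise
            # Exponential backoff capped at max_delay, with random jitter
            # so that many clients do not retry at the same moment.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))
            attempt += 1
```

With the DataStax Python driver, one option is to pass `retry_on=(cassandra.Unavailable, cassandra.WriteTimeout)` and wrap the call as `lambda: session.execute(statement)`, so that only availability-related failures are retried.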