## Introduction

Cassandra uses the gossip protocol to exchange node state information across the cluster. When the phi accrual failure detector incorrectly convicts a healthy node (due to GC pauses, network jitter, or resource contention), the node is marked down in the ring, requests are routed away from it, and unnecessary repair and streaming operations follow.
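Rather than a binary timeout, the detector computes a continuously growing suspicion value, phi, from heartbeat inter-arrival times. A minimal sketch of that calculation, simplified from Cassandra's exponential-arrival model (the real `FailureDetector` keeps a bounded sliding window of recent intervals; the 1 s mean interval below is an illustrative assumption):

```python
# Sketch of the phi accrual calculation used by Cassandra's failure
# detector (simplified: the real implementation maintains a sliding
# window of observed heartbeat inter-arrival times).
import math

PHI_FACTOR = 1.0 / math.log(10.0)  # converts e-based to base-10 log

def phi(time_since_last_heartbeat: float, mean_interval: float) -> float:
    """Suspicion level: grows linearly with silence, scaled by the
    historical mean heartbeat interval (exponential arrival model)."""
    return PHI_FACTOR * time_since_last_heartbeat / mean_interval

# Heartbeats normally arrive about every 1 s; a 20 s GC pause pushes
# phi past the default threshold of 8, so the pausing node is convicted.
print(round(phi(20.0, 1.0), 2))  # prints 8.69 (> 8, so marked down)
```

This is why a single long stop-the-world pause is enough to convict a node: phi keeps climbing for as long as the JVM stays silent.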
## Symptoms

- `nodetool status` shows a node as `DN` (Down) when it is actually running
- Cassandra logs show the node being marked dead or the phi threshold exceeded
- Node flaps between UP and DOWN states repeatedly
- Unnecessary streaming/repair operations are triggered
- Applications see increased latency as requests are routed away from the flapping node
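Flapping is easiest to spot by polling `nodetool status` and watching for state changes. A small parser sketch; the two-letter `UN`/`DN` prefix (Up/Down + Normal/Leaving/Joining/Moving) is standard, but the exact column layout below is a canned example, not guaranteed output:

```python
# Sketch: extract Down nodes from `nodetool status` output.
# Assumes the standard two-letter status prefix: U/D + N/L/J/M.
SAMPLE = """\
Datacenter: dc1
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns   Rack
UN  10.0.0.1   256.5 KiB  256     33.3%  rack1
DN  10.0.0.2   301.2 KiB  256     33.3%  rack1
UN  10.0.0.3   287.9 KiB  256     33.4%  rack1
"""

def down_nodes(status_output: str) -> list[str]:
    """Return addresses of nodes reported Down (DN/DL/DJ/DM)."""
    down = []
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and len(parts[0]) == 2 and parts[0][0] == "D":
            down.append(parts[1])
    return down

print(down_nodes(SAMPLE))  # prints ['10.0.0.2']
```

Run on a schedule, a node whose address keeps entering and leaving this list is flapping.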
## Common Causes

- Long GC pauses exceeding the gossip phi threshold
- Network latency spikes between nodes (cross-AZ, cross-region)
- CPU saturation preventing timely gossip message processing
- `phi_convict_threshold` set too low for the network environment
- Clock drift between cluster nodes
## Step-by-Step Fix

1. **Check gossip state for the affected node:**

   ```bash
   # On any healthy node
   nodetool describecluster
   nodetool status

   # Check gossip info; look for the affected node's heartbeat
   # generation and status
   nodetool gossipinfo
   ```
2. **Check for GC pauses on the affected node:**

   ```bash
   # Check GC logs
   grep "GC pause" /var/log/cassandra/gc.log | tail -20

   # Check system.log for gossip-related messages
   grep -iE "gossip|mark.*dead|phi" /var/log/cassandra/system.log | tail -20
   ```
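To correlate pauses with down events, it helps to pull out only the pauses long enough to matter. A sketch using the classic JVM safepoint log line ("Total time for which application threads were stopped"); this format is an assumption that holds for older GC logging flags, and unified logging (Java 9+) prints differently:

```python
# Sketch: flag stop-the-world pauses long enough to trip the failure
# detector. The log-line format is the classic JVM safepoint message;
# adjust the regex for unified logging (Java 9+).
import re

PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(?P<secs>\d+\.\d+) seconds")

def long_pauses(log_text: str, threshold_secs: float = 5.0) -> list[float]:
    """Return all stop-the-world pause durations above threshold_secs."""
    return [float(m.group("secs"))
            for m in PAUSE_RE.finditer(log_text)
            if float(m.group("secs")) > threshold_secs]

sample = (
    "2024-01-01T00:00:01 Total time for which application threads "
    "were stopped: 0.0421 seconds\n"
    "2024-01-01T00:05:17 Total time for which application threads "
    "were stopped: 12.3876 seconds\n")
print(long_pauses(sample))  # prints [12.3876]
```

Any pause in the tens of seconds lines up directly with the phi math above: long enough to convict the node at the default threshold.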
3. **Adjust the phi convict threshold:**

   ```yaml
   # /etc/cassandra/cassandra.yaml
   # Default is 8. Increase for unreliable networks (cross-AZ or
   # cross-region links); going much above 12 delays detection of
   # real failures.
   phi_convict_threshold: 12

   # Also let request timeouts account for cross-node latency if needed
   # (this option was renamed internode_timeout in Cassandra 4.1+)
   cross_node_timeout: true
   ```
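Under the exponential arrival model of the phi accrual detector, the threshold maps directly to how long a node may stay silent before conviction: t ≈ phi × ln 10 × mean_interval. A quick check of what raising the default from 8 to 12 buys, assuming the roughly 1 s gossip heartbeat interval:

```python
# Sketch: seconds of silence tolerated before phi crosses the threshold
# (inverse of phi = t / (mean_interval * ln 10)). The 1 s mean heartbeat
# interval is an assumption based on the once-per-second gossip round.
import math

def max_silence(phi_threshold: float, mean_interval: float = 1.0) -> float:
    return phi_threshold * math.log(10.0) * mean_interval

print(round(max_silence(8), 1))   # prints 18.4 (default threshold)
print(round(max_silence(12), 1))  # prints 27.6 (raised threshold)
```

So raising the threshold from 8 to 12 buys roughly nine extra seconds of tolerated pause, at the cost of slower detection of genuinely dead nodes.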
4. **Restart gossip on the affected node:**

   ```bash
   # On the affected node, cycle the gossip service
   nodetool disablegossip
   nodetool enablegossip

   # Verify the node rejoins
   nodetool status
   ```
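Rejoining is not instant, so automation should poll rather than check once. A sketch of a wait loop; the status-fetching callable is injected so the loop is testable without a live cluster, and in production it would shell out to `nodetool status`:

```python
# Sketch: wait for a node to report UN (Up/Normal) after gossip is
# re-enabled. fetch_status is injected (e.g. a wrapper around
# subprocess.run(["nodetool", "status"], ...) in real use).
import time
from typing import Callable

def wait_until_up(address: str, fetch_status: Callable[[], str],
                  timeout: float = 60.0, poll: float = 1.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for line in fetch_status().splitlines():
            parts = line.split()
            if len(parts) >= 2 and parts[1] == address and parts[0] == "UN":
                return True
        time.sleep(poll)
    return False

# Example with a canned response:
print(wait_until_up("10.0.0.2", lambda: "UN  10.0.0.2  ..."))  # prints True
```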
5. **If the node is completely isolated, restart Cassandra:**

   ```bash
   sudo systemctl restart cassandra

   # After restart, run a primary-range repair to ensure data consistency
   nodetool repair -pr
   ```