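Introduction

Cassandra uses a gossip protocol to exchange node state information across the cluster. Each node runs a phi accrual failure detector over the gossip heartbeats it receives and convicts a peer once that peer's phi value exceeds `phi_convict_threshold`. When the detector incorrectly marks a healthy node as down (because of GC pauses, network jitter, or resource contention), the rest of the cluster stops routing requests to it and stores hints on its behalf. The node stays in the ring, but every false conviction adds unnecessary hint replay and repair work.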
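
To make the threshold concrete, the sketch below estimates phi from two numbers: the mean gap between heartbeats from a peer and the time since the last one. It assumes heartbeat gaps follow an exponential distribution, which is a simplification and may differ from the detector's actual implementation. The function name `phi_estimate` is hypothetical.

```bash
# Hypothetical helper: estimate phi for a peer under an exponential
# inter-arrival model (a simplification, not Cassandra's own code).
#   $1 = mean interval between heartbeats from the peer, in ms
#   $2 = time elapsed since the last heartbeat from the peer, in ms
phi_estimate() {
  LC_ALL=C awk -v mean="$1" -v elapsed="$2" 'BEGIN {
    # phi = -log10(P(silence lasts this long)) = elapsed / (mean * ln 10)
    printf "%.2f\n", elapsed / (mean * log(10))
  }'
}
```

Under this model, with a mean interval of 1000 ms, `phi_estimate 1000 20000` prints `8.69`. Roughly 18.4 seconds of silence therefore crosses the default threshold of 8, while a threshold of 12 tolerates about 27.6 seconds. This is why a multi-second GC pause combined with a few delayed heartbeats can be enough to convict a healthy node.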

Symptoms

- `nodetool status` shows a node as `DN` (Down) when it is actually running
- Cassandra logs show the node being marked dead (for example, a Gossiper line ending in `is now DOWN`) or the phi threshold being exceeded
- Node flaps between UP and DOWN states repeatedly
- Unnecessary streaming/repair operations triggered
- Application sees increased latency as requests are routed away from the flapping node
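
Because a flapping node can look healthy at the moment you check, it may help to sample the status repeatedly. The following is a minimal sketch that filters `nodetool status` output for nodes whose status column starts with `D`. The function name `down_nodes` is hypothetical, and the sketch assumes the usual column layout in which the address is the second field.

```bash
# Hypothetical helper: read `nodetool status` output on stdin and print the
# address of every node reported as down (DN, DL, DJ, or DM).
# Assumes the address is the second whitespace-separated column.
down_nodes() {
  awk '$1 ~ /^D[NLJM]$/ { print $2 }'
}
```

One way to use it is in a loop such as `while sleep 10; do date; nodetool status | down_nodes; done`, which shows whether the same address appears and disappears over time.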

Common Causes

- Long GC pauses exceeding the gossip phi threshold
- Network latency spikes between nodes (cross-AZ, cross-region)
- CPU saturation preventing timely gossip message processing
- `phi_convict_threshold` set too low for the network environment
- Clock drift between cluster nodes

Step-by-Step Fix

1. **Check gossip state for the affected node**:
   ```bash
   # On any healthy node
   nodetool describecluster
   nodetool status

   # Check gossip info
   nodetool gossipinfo
   # Look for the affected node's heartbeat generation and status
   ```

2. **Check for GC pauses on the affected node**:
   ```bash
   # Check GC logs (case-insensitive, because the wording varies by JVM version)
   grep -i "pause" /var/log/cassandra/gc.log | tail -20

   # Check system.log for gossip-related messages
   # -E is required so that | is treated as alternation
   grep -iE "gossip|mark.*dead|phi" /var/log/cassandra/system.log | tail -20
   ```

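3. **Adjust the phi convict threshold**:
   ```yaml
   # /etc/cassandra/cassandra.yaml (changes take effect after a restart)
   # Default is 8. Increase for unreliable networks.
   phi_convict_threshold: 12

   # Optional and separate from failure detection: a boolean that lets nodes
   # account for time spent in transit when timing out requests. Enable only
   # with synchronized clocks (NTP). Cassandra 4.1+ names it internode_timeout.
   cross_node_timeout: true
   ```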

4. **Restart gossip on the affected node**:
   ```bash
   # On the affected node, restart gossip by disabling and re-enabling it
   nodetool disablegossip
   nodetool enablegossip

   # Verify the node rejoins
   nodetool status
   ```

5. **If the node is completely isolated, restart Cassandra**:
   ```bash
   sudo systemctl restart cassandra

   # After restart, run repair to ensure data consistency
   nodetool repair -pr  # Primary range repair
   ```
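
For the GC check in the steps above, it can be useful to list only the pauses long enough to matter. The sketch below filters `system.log` text for GCInspector entries above a limit in milliseconds. It assumes those entries contain text of the form `GC in <N>ms`; verify this against your own logs, since the format can vary between versions. The function name `long_gc_pauses` is hypothetical.

```bash
# Hypothetical helper: read system.log text on stdin and print GCInspector
# lines whose reported pause exceeds a limit in milliseconds (default 200).
# Assumes each such line contains text of the form "GC in <N>ms".
long_gc_pauses() {
  awk -v limit="${1:-200}" '
    /GCInspector/ && match($0, /GC in [0-9]+ms/) {
      # "GC in " is 6 characters and "ms" is 2, so the digits sit in between
      ms = substr($0, RSTART + 6, RLENGTH - 8)
      if (ms + 0 > limit) print
    }'
}
```

A typical invocation would be `long_gc_pauses 1000 < /var/log/cassandra/system.log | tail -20`. Comparing the timestamps of the reported pauses with the times at which peers marked the node down helps confirm or rule out GC as the cause.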

Prevention

- Set `phi_convict_threshold` to 10-12 for cross-AZ deployments
- Monitor GC pauses and keep them under 200 ms
- Use dedicated network interfaces for inter-node gossip traffic
- Ensure NTP is running on all nodes with sub-second clock accuracy
- Monitor each node's gossip heartbeat and status values with `nodetool gossipinfo`
- Deploy nodes in the same region/availability zone when possible
- Set up alerting on node state changes in monitoring dashboards
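
As a starting point for alerting on state changes, the sketch below counts how often each peer was marked down in a span of `system.log`. It assumes Gossiper lines of the form `InetAddress /<address> is now DOWN`; check that your version logs this format before relying on it. The function name `count_down_events` is hypothetical.

```bash
# Hypothetical helper: read system.log text on stdin and print
# "<count> <address>" for each peer that was marked DOWN, most frequent first.
# Assumes lines of the form "... InetAddress /<address> is now DOWN".
count_down_events() {
  awk '
    /is now DOWN/ {
      for (i = 1; i < NF; i++) {
        if ($i == "InetAddress") {
          addr = $(i + 1)
          sub(/^\//, "", addr)   # drop the leading slash
          count[addr]++
        }
      }
    }
    END { for (addr in count) print count[addr], addr }
  ' | sort -rn
}
```

Running `count_down_events < /var/log/cassandra/system.log` on several nodes and comparing the results can show whether one peer is being convicted by everyone (pointing at that node) or whether one observer is convicting many peers (pointing at the observer's own GC or network).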