## Introduction

Cassandra repair synchronizes data across replicas using Merkle trees. When repair sessions time out (due to large data ranges, network issues, or resource contention), the anti-entropy process fails, leaving replicas inconsistent. Repeated repair failures can lead to permanent data divergence across the cluster.

## Symptoms

- `nodetool repair` fails with `Repair session timed out`
- `AntiEntropyService` logs show session timeout errors
- Repair completes only partially; some ranges remain unrepaired
- `nodetool netstats` shows repair streaming stuck
- Data inconsistencies detected by application-level validation
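To confirm the timeout symptom quickly, searching the Cassandra log is usually fastest. A minimal sketch: `find_repair_timeouts` is a helper name of our choosing, and the default log path `/var/log/cassandra/system.log` is an assumption that depends on your install.

```shell
# find_repair_timeouts [logfile]: print the most recent repair-timeout
# log lines. The default path is an assumption; adjust for your install.
find_repair_timeouts() {
  local log=${1:-/var/log/cassandra/system.log}
  grep -Ei 'repair.*(timed out|timeout)' "$log" | tail -n 20
}
```

Run it after a failed `nodetool repair` to see which sessions timed out and on which stage.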

## Common Causes

- Repair range too large for the `rpc_timeout` window
- Network latency between data centers exceeding the timeout
- Insufficient compaction throughput during repair (compaction competes with repair for disk I/O and CPU)
- `repair_session_max_tree_depth` too high for the data size
- Running repair during peak load when resources are constrained
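To see why tree depth matters, a rough back-of-envelope helps: a Merkle tree of depth D has 2^D leaves, so each leaf covers about partitions / 2^D partitions. The partition count below is hypothetical.

```shell
# Estimate how many partitions each Merkle leaf covers at a given depth.
# A deeper tree gives finer-grained comparison but costs more memory
# and build time per repair session.
partitions=100000000   # hypothetical: 100M partitions in the repaired range
depth=20               # Merkle tree depth
echo $(( partitions / (1 << depth) ))   # partitions per leaf
```

With these numbers each leaf covers about 95 partitions; at depth 16 it would be roughly 1,500, which makes overstreaming per mismatched leaf correspondingly larger.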

## Step-by-Step Fix

1. **Run repair on a smaller token range**:

   ```bash
   # List the tokens owned by this node
   nodetool info -T | grep "Token"

   # Repair a specific token range
   nodetool repair -st <start_token> -et <end_token> mykeyspace mytable

   # Or repair one keyspace at a time, primary range only
   nodetool repair mykeyspace -pr
   ```

2. **Increase repair timeout settings**:

   ```yaml
   # /etc/cassandra/cassandra.yaml
   rpc_timeout_in_ms: 10000             # Pre-2.0 clusters only; increase from 5000
   read_request_timeout_in_ms: 10000    # Increase for repair reads
   range_request_timeout_in_ms: 30000   # Increase for range queries

   # Repair-specific setting: shallower Merkle trees use less memory
   # per session at the cost of coarser comparison
   repair_session_max_tree_depth: 18
   ```

3. **Use subrange repair for large tables**:

   ```bash
   # List the token ranges for the keyspace
   nodetool describering mykeyspace

   # Repair in chunks (token values here are illustrative; real
   # Murmur3 tokens span -2^63 to 2^63 - 1)
   for start in $(seq 0 100 900); do
     end=$((start + 100))
     nodetool repair mykeyspace mytable -st $start -et $end
     sleep 60  # Wait between chunks to limit load
   done
   ```
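For arbitrary token ranges, the chunking loop can be generalized. A sketch under stated assumptions: `split_range` is a hypothetical helper of our own, not a nodetool feature.

```shell
# split_range START END N: print N contiguous "start end" subranges
# covering START..END. Bash arithmetic is 64-bit signed, so negative
# Murmur3 tokens work (avoid spans wider than 2^63, which overflow).
split_range() {
  local start=$1 end=$2 n=$3
  local span=$(( (end - start) / n ))
  local s=$start e i
  for (( i = 1; i <= n; i++ )); do
    e=$(( i == n ? end : s + span ))
    echo "$s $e"
    s=$e
  done
}

# Example: feed each subrange to nodetool repair
# split_range 0 1000000 10 | while read -r st et; do
#   nodetool repair mykeyspace -st "$st" -et "$et"
# done
```

The last subrange is pinned to END so rounding in the integer division never drops tokens at the top of the range.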

4. **Reduce repair resource contention**:

   ```bash
   # Throttle compaction so repair gets more I/O (value is MB/s)
   nodetool setcompactionthroughput 8

   # Or pause autocompaction during repair (if I/O is the bottleneck)
   nodetool disableautocompaction mykeyspace mytable
   nodetool repair mykeyspace mytable -pr
   nodetool enableautocompaction mykeyspace mytable
   ```
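When lowering compaction throughput by hand, it is easy to forget to restore it after a failed repair. A sketch of a safer wrapper, with the caveats that `repair_with_throttle` is our own name, the 16 MB/s fallback assumes the stock default, and the `getcompactionthroughput` output format varies by Cassandra version:

```shell
# repair_with_throttle KEYSPACE MBPS: throttle compaction, run a
# primary-range repair, then restore the previous throughput even if
# the repair fails. The body runs in a subshell so the EXIT trap fires
# when the function finishes, without leaking into the caller's shell.
repair_with_throttle() (
  keyspace=$1 throttle=$2
  # Grab the first number in the output defensively; formats differ
  # across versions. Fall back to the stock 16 MB/s default.
  current=$(nodetool getcompactionthroughput | grep -oE '[0-9]+' | head -n 1)
  trap 'nodetool setcompactionthroughput "${current:-16}"' EXIT
  nodetool setcompactionthroughput "$throttle"
  nodetool repair "$keyspace" -pr
)
```

Usage: `repair_with_throttle mykeyspace 8` throttles to 8 MB/s for the duration of the repair and then puts the old value back.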

5. **Use Reaper for automated repair management**:

   ```bash
   # Install Reaper (https://cassandra-reaper.io/)
   # It manages repair sessions, handles retries, and provides a web UI

   # Configure incremental repair; Reaper handles subrange repairs
   # and automatic scheduling
   ```

## Prevention

- Schedule repairs during low-traffic periods
- Prefer incremental repair over full repair where supported (note that `-pr` restricts repair to the node's primary range; it does not make a repair incremental)
- Deploy Reaper for automated, monitored repair management
- Monitor repair duration and adjust timeout settings accordingly
- Run repairs more frequently to reduce the amount of data per session
- Ensure all nodes have consistent schema before running repair
- Test repair procedures in staging with production-sized data
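To pin repairs to a low-traffic window, a cron entry is the simplest option short of Reaper. A sketch only: the user, keyspace, paths, and schedule below are all illustrative.

```
# /etc/cron.d/cassandra-repair (illustrative)
# Primary-range repair every Sunday at 03:00, logged for later review
0 3 * * 0 cassandra /usr/bin/nodetool repair -pr mykeyspace >> /var/log/cassandra/repair-cron.log 2>&1
```

On multi-node clusters, stagger the minute/hour per node so repairs do not all run at once.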