Introduction Cassandra repair synchronizes data across replicas using Merkle trees. When repair sessions time out—due to large data ranges, network issues, or resource contention—the anti-entropy process fails, leaving replicas inconsistent. Repeated repair failures can lead to permanent data divergence across the cluster.
Symptoms - `nodetool repair` fails with `Repair session timed out` - `AntiEntropyService` logs show session timeout errors - Repair completes partially, some ranges remain unrepaired - `nodetool netstats` shows repair streaming stuck - Data inconsistencies detected by application-level validation
Common Causes - Repair range too large for the `rpc_timeout` window - Network latency between data centers exceeding timeout - Insufficient compaction throughput during repair (compaction competes with repair) - `repair_session_max_tree_depth` too high for the data size - Running repair during peak load when resources are constrained
Step-by-Step Fix 1. **Run repair on a smaller token range": ```bash # Get the token range for the node nodetool info | grep "Token"
# Repair a specific token range nodetool repair -st <start_token> -et <end_token> mykeyspace mytable
# Or repair one keyspace at a time nodetool repair mykeyspace -pr # Primary range only ```
- 1.**Increase repair timeout settings":
- 2.```yaml
- 3.# /etc/cassandra/cassandra.yaml
- 4.rpc_timeout_in_ms: 10000 # Increase from 5000
- 5.read_request_timeout_in_ms: 10000 # Increase for repair reads
- 6.range_request_timeout_in_ms: 30000 # Increase for range queries
# Repair-specific settings repair_session_max_tree_depth: 18 # Reduce from default 20 ```
- 1.**Use subrange repair for large tables":
- 2.```bash
- 3.# Split repair into multiple sub-ranges
- 4.# First, get the token range
- 5.nodetool describecluster | grep "Token ranges"
# Repair in chunks for start in $(seq 0 100 900); do end=$((start + 100)) nodetool repair mykeyspace mytable -st $start -et $end sleep 60 # Wait between repairs done ```
- 1.**Reduce repair resource contention":
- 2.```bash
- 3.# Reduce compaction throughput during repair
- 4.nodetool setcompactionthroughput 8
# Pause compaction during repair (if I/O is the bottleneck) nodetool disableautocompaction mykeyspace mytable nodetool repair mykeyspace mytable -pr nodetool enableautocompaction mykeyspace mytable ```
- 1.**Use Reaper for automated repair management":
- 2.```bash
- 3.# Install Reaper (https://cassandra-reaper.io/)
- 4.# It manages repair sessions, handles retries, and provides a web UI
# Configure incremental repair # Reaper handles subrange repairs and automatic scheduling ```