## Introduction

Cassandra repair synchronizes data across replicas using Merkle trees. When repair sessions time out (due to large data ranges, network issues, or resource contention), the anti-entropy process fails, leaving replicas inconsistent. Repeated repair failures can lead to permanent data divergence across the cluster.

## Symptoms

- `nodetool repair` fails with `Repair session timed out`
- `AntiEntropyService` logs show session timeout errors
- Repair completes only partially; some ranges remain unrepaired
- `nodetool netstats` shows repair streaming stuck
- Data inconsistencies detected by application-level validation
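To confirm the timeout symptom quickly, searching the Cassandra log is usually fastest. A minimal sketch: `find_repair_timeouts` is a helper name of our choosing, and the default log path `/var/log/cassandra/system.log` is an assumption that depends on your install.

```shell
# find_repair_timeouts [logfile]: print the most recent repair-timeout
# log lines. The default path is an assumption; adjust for your install.
find_repair_timeouts() {
  local log=${1:-/var/log/cassandra/system.log}
  grep -Ei 'repair.*(timed out|timeout)' "$log" | tail -n 20
}
```

Run it after a failed `nodetool repair` to see which sessions timed out and on which stage.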

## Common Causes

- Repair range too large for the `rpc_timeout` window
- Network latency between data centers exceeding the timeout
- Insufficient compaction throughput during repair (compaction competes with repair for disk I/O and CPU)
- `repair_session_max_tree_depth` too high for the data size
- Running repair during peak load when resources are constrained
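To see why tree depth matters, a rough back-of-envelope helps: a Merkle tree of depth D has 2^D leaves, so each leaf covers about partitions / 2^D partitions. The partition count below is hypothetical.

```shell
# Estimate how many partitions each Merkle leaf covers at a given depth.
# A deeper tree gives finer-grained comparison but costs more memory
# and build time per repair session.
partitions=100000000   # hypothetical: 100M partitions in the repaired range
depth=20               # Merkle tree depth
echo $(( partitions / (1 << depth) ))   # partitions per leaf
```

With these numbers each leaf covers about 95 partitions; at depth 16 it would be roughly 1,500, which makes overstreaming per mismatched leaf correspondingly larger.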

## Step-by-Step Fix

1. **Run repair on a smaller token range**:

   ```bash
   # List the tokens owned by this node
   nodetool info -T | grep "Token"

   # Repair a specific token range
   nodetool repair -st <start_token> -et <end_token> mykeyspace mytable

   # Or repair one keyspace at a time, primary range only
   nodetool repair mykeyspace -pr
   ```

2. **Increase repair timeout settings**:

   ```yaml
   # /etc/cassandra/cassandra.yaml
   rpc_timeout_in_ms: 10000             # Pre-2.0 clusters only; increase from 5000
   read_request_timeout_in_ms: 10000    # Increase for repair reads
   range_request_timeout_in_ms: 30000   # Increase for range queries

   # Repair-specific setting: shallower Merkle trees use less memory
   # per session at the cost of coarser comparison
   repair_session_max_tree_depth: 18
   ```

3. **Use subrange repair for large tables**:

   ```bash
   # List the token ranges for the keyspace
   nodetool describering mykeyspace

   # Repair in chunks (token values here are illustrative; real
   # Murmur3 tokens span -2^63 to 2^63 - 1)
   for start in $(seq 0 100 900); do
     end=$((start + 100))
     nodetool repair mykeyspace mytable -st $start -et $end
     sleep 60  # Wait between chunks to limit load
   done
   ```
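For arbitrary token ranges, the chunking loop can be generalized. A sketch under stated assumptions: `split_range` is a hypothetical helper of our own, not a nodetool feature.

```shell
# split_range START END N: print N contiguous "start end" subranges
# covering START..END. Bash arithmetic is 64-bit signed, so negative
# Murmur3 tokens work (avoid spans wider than 2^63, which overflow).
split_range() {
  local start=$1 end=$2 n=$3
  local span=$(( (end - start) / n ))
  local s=$start e i
  for (( i = 1; i <= n; i++ )); do
    e=$(( i == n ? end : s + span ))
    echo "$s $e"
    s=$e
  done
}

# Example: feed each subrange to nodetool repair
# split_range 0 1000000 10 | while read -r st et; do
#   nodetool repair mykeyspace -st "$st" -et "$et"
# done
```

The last subrange is pinned to END so rounding in the integer division never drops tokens at the top of the range.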

4. **Reduce repair resource contention**:

   ```bash
   # Throttle compaction so repair gets more I/O (value is MB/s)
   nodetool setcompactionthroughput 8

   # Or pause autocompaction during repair (if I/O is the bottleneck)
   nodetool disableautocompaction mykeyspace mytable
   nodetool repair mykeyspace mytable -pr
   nodetool enableautocompaction mykeyspace mytable
   ```
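When lowering compaction throughput by hand, it is easy to forget to restore it after a failed repair. A sketch of a safer wrapper, with the caveats that `repair_with_throttle` is our own name, the 16 MB/s fallback assumes the stock default, and the `getcompactionthroughput` output format varies by Cassandra version:

```shell
# repair_with_throttle KEYSPACE MBPS: throttle compaction, run a
# primary-range repair, then restore the previous throughput even if
# the repair fails. The body runs in a subshell so the EXIT trap fires
# when the function finishes, without leaking into the caller's shell.
repair_with_throttle() (
  keyspace=$1 throttle=$2
  # Grab the first number in the output defensively; formats differ
  # across versions. Fall back to the stock 16 MB/s default.
  current=$(nodetool getcompactionthroughput | grep -oE '[0-9]+' | head -n 1)
  trap 'nodetool setcompactionthroughput "${current:-16}"' EXIT
  nodetool setcompactionthroughput "$throttle"
  nodetool repair "$keyspace" -pr
)
```

Usage: `repair_with_throttle mykeyspace 8` throttles to 8 MB/s for the duration of the repair and then puts the old value back.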

5. **Use Reaper for automated repair management**:

   ```bash
   # Install Reaper (https://cassandra-reaper.io/)
   # It manages repair sessions, handles retries, and provides a web UI

   # Configure incremental repair; Reaper handles subrange repairs
   # and automatic scheduling
   ```

## Prevention

- Schedule repairs during low-traffic periods
- Prefer incremental repair over full repair where supported (note that `-pr` restricts repair to the node's primary range; it does not make a repair incremental)
- Deploy Reaper for automated, monitored repair management
- Monitor repair duration and adjust timeout settings accordingly
- Run repairs more frequently to reduce the amount of data per session
- Ensure all nodes have consistent schema before running repair
- Test repair procedures in staging with production-sized data
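To pin repairs to a low-traffic window, a cron entry is the simplest option short of Reaper. A sketch only: the user, keyspace, paths, and schedule below are all illustrative.

```
# /etc/cron.d/cassandra-repair (illustrative)
# Primary-range repair every Sunday at 03:00, logged for later review
0 3 * * 0 cassandra /usr/bin/nodetool repair -pr mykeyspace >> /var/log/cassandra/repair-cron.log 2>&1
```

On multi-node clusters, stagger the minute/hour per node so repairs do not all run at once.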