What's Actually Happening
Consul KV store operations fail or time out when the Raft consensus layer cannot elect a leader or when the servers cannot communicate with each other.
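A fast first check is the status API on any agent, which tells you whether the cluster currently has a leader (this assumes a local agent listening on the default HTTP port 8500):

```bash
# Returns the leader's address, or an empty string if there is no leader
curl -s http://localhost:8500/v1/status/leader

# Lists the Raft peers this agent knows about
curl -s http://localhost:8500/v1/status/peers
```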
The Error You'll See
KV read failure:
```bash
$ consul kv get config/app
Error: failed to get key: Unexpected response code: 500 (rpc error making call: The cluster is still initializing. The Consul servers are still electing a leader)
```
KV write timeout:
```bash
$ consul kv put config/app value
Error: timed out waiting for write to be replicated
```
No leader:
```bash
$ consul operator raft list-peers
Error: raft: no leader elected
```
Why This Happens
1. No Raft leader - leader election failed or is still in progress
2. Quorum loss - a majority of servers is unavailable (see the quick check after this list)
3. Network partition - servers cannot communicate with each other
4. Disk I/O issues - Raft log writes are too slow
5. Server overload - high CPU or memory pressure
6. ACL restrictions - insufficient permissions for KV access
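Quorum requires floor(n/2) + 1 voting servers, so a 3-node cluster survives one server failure and a 5-node cluster survives two. A minimal sketch for comparing alive servers against that threshold (it only parses the standard `consul members` columns):

```bash
# Count server members and the alive subset, then compare against the quorum size
total=$(consul members | awk 'NR > 1 && $4 == "server"' | wc -l)
alive=$(consul members | awk 'NR > 1 && $3 == "alive" && $4 == "server"' | wc -l)
quorum=$(( total / 2 + 1 ))
echo "servers: $total  alive: $alive  quorum needed: $quorum"
[ "$alive" -ge "$quorum" ] && echo "quorum OK" || echo "QUORUM LOST"
```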
Step 1: Check Consul Status
```bash
# Check Consul members
consul members

# Output:
# Node      Address        Status  Type    Build   Protocol  DC   Segment
# consul-1  10.0.0.1:8301  alive   server  1.15.0  2         dc1  <all>
# consul-2  10.0.0.2:8301  failed  server  1.15.0  2         dc1  <all>
# consul-3  10.0.0.3:8301  alive   server  1.15.0  2         dc1  <all>

# Check the leader
consul operator raft list-peers

# Output shows the leader:
# Node      ID   Address        State     Voter  RaftProtocol
# consul-1  xxx  10.0.0.1:8300  leader    true   3
# consul-2  yyy  10.0.0.2:8300  follower  true   3
# consul-3  zzz  10.0.0.3:8300  follower  true   3

# If a dead server is still listed as a peer, remove it while the cluster
# has a leader, then restart that node so it rejoins cleanly
# (if there is no leader at all, use the recovery procedure in Step 5):
consul operator raft remove-peer -id=<node-id>

# Check catalog status
consul catalog nodes
```
Step 2: Check Server Connectivity
```bash
# Check network connectivity between servers
ping 10.0.0.2
ping 10.0.0.3

# Check the server RPC / Raft port (default 8300)
nc -zv 10.0.0.2 8300
nc -zv 10.0.0.3 8300

# Check the LAN gossip port (8301)
nc -zv 10.0.0.2 8301

# Check the WAN gossip port (8302) for multi-DC setups
nc -zv 10.0.0.2 8302

# Check firewall rules
iptables -L -n | grep 8300
iptables -L -n | grep 8301

# Allow Consul ports
iptables -I INPUT -p tcp --dport 8300 -j ACCEPT
iptables -I INPUT -p tcp --dport 8301 -j ACCEPT
iptables -I INPUT -p udp --dport 8301 -j ACCEPT

# Check for partitions across federated network areas (Consul Enterprise)
consul operator area list
```
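Rather than probing servers one at a time, a small loop can check the relevant ports on every peer. The IP list matches the example cluster above; substitute your own server addresses:

```bash
# Probe the Raft/RPC (8300) and LAN gossip (8301) ports on each server
for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
  for port in 8300 8301; do
    nc -zv -w 2 "$host" "$port" || echo "BLOCKED: $host:$port"
  done
done
```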
Step 3: Check Raft Health
```bash
# Check Raft configuration
consul operator raft list-peers

# Per-server detail (LastContact, LastTerm, LastIndex) is available from
# the autopilot health endpoint (see the sketch after this block)

# Check Raft data on disk
ls -la /opt/consul/data/raft/

# Note: raft.db is a BoltDB file, not SQLite, so don't open it with sqlite3;
# inspect Raft state through consul debug or consul snapshot instead

# Check snapshots
ls -la /opt/consul/data/raft/snapshots/

# If the snapshots are corrupted, you may need to restore from a backup

# Check Raft leadership stability
consul monitor -log-level=debug | grep raft

# Look for:
# - raft: elected leader
# - raft: heartbeat timeout
# - raft: failed to contact
```
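The autopilot health endpoint is where the per-server LastContact, LastTerm, and LastIndex values live; it also reports overall failure tolerance. This sketch assumes a local agent on port 8500 and jq installed:

```bash
# Overall cluster health plus per-server Raft detail
curl -s http://localhost:8500/v1/operator/autopilot/health | jq

# Current autopilot settings (e.g. dead-server cleanup)
consul operator autopilot get-config
```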
Step 4: Restart Failed Servers
```bash
# Check which servers are down
consul members | grep failed

# Restart the failed server
ssh consul-2 "systemctl restart consul"

# Wait for it to rejoin
consul members

# If the server won't rejoin, check its logs:
journalctl -u consul -n 50

# Common issues:
# - Wrong bootstrap configuration
# - Data corruption
# - Network blocked

# Force a rejoin if auto-join failed:
consul join 10.0.0.1

# Check whether quorum is restored
consul operator raft list-peers
# A majority of servers must be alive (2 of 3, 3 of 5)
```
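To make rejoining automatic after future restarts, give each server a retry_join list in its configuration. A minimal sketch; the config path and addresses are examples matching this article's 3-node cluster:

```bash
# Add a retry_join stanza so a restarted server rejoins on its own
cat << 'EOF' > /etc/consul.d/join.hcl
retry_join = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
EOF

# Restart the agent to pick up the change
systemctl restart consul
```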
Step 5: Recover Lost Quorum
```bash
# If the majority of servers is lost, manual recovery is required

# A 3-node cluster with only 1 server alive
# cannot achieve quorum automatically

# Option 1: Bootstrap a new cluster (CAUTION: loses existing data)
# Stop the remaining server
systemctl stop consul

# Remove the old Raft data
rm -rf /opt/consul/data/raft/*

# Bootstrap with a single server
consul agent -server -bootstrap-expect=1 -data-dir=/opt/consul/data

# Then bring up the other servers and join them to the new cluster:
consul join 10.0.0.1

# Option 2: Restore from a snapshot
consul snapshot restore backup.snap

# Option 3: Recover quorum on the surviving servers with peers.json
# (see the sketch after this block), then remove the permanently failed
# peer once a leader exists:
consul operator raft remove-peer -id=<failed-peer-id>

# Start replacement servers pointing at the recovered cluster:
consul agent -server -retry-join=10.0.0.1 -data-dir=/opt/consul/data
```
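When more than one server survives but quorum is still lost, the documented outage-recovery path is a `peers.json` file rather than re-bootstrapping. A sketch for Raft protocol v3; the node IDs and addresses are placeholders, so take the real values from each server's `node-id` file and its advertised RPC address:

```bash
# Stop Consul on every surviving server first
systemctl stop consul

# Each server's ID is stored in its data directory
cat /opt/consul/data/node-id

# On every surviving server, list the intended voting members (placeholders here)
cat > /opt/consul/data/raft/peers.json << 'EOF'
[
  { "id": "aaaaaaaa-1111-2222-3333-444444444444", "address": "10.0.0.1:8300", "non_voter": false },
  { "id": "cccccccc-1111-2222-3333-444444444444", "address": "10.0.0.3:8300", "non_voter": false }
]
EOF

# Restart Consul; the file is consumed on startup and then removed
systemctl start consul
consul operator raft list-peers
```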
Step 6: Check Disk Performance
```bash
# Check the disk where Consul data is stored
df -h /opt/consul

# Check disk I/O
iostat -x 1 10

# Check for slow writes affecting Raft
journalctl -u consul | grep -iE "slow|latency"

# Check the Consul data directory size
du -sh /opt/consul/data

# If the disk is full, free up space or expand the disk
df -h /opt/consul

# Use SSDs in production:
# slow HDDs cause Raft heartbeat timeouts

# Capture a Consul debug bundle and analyze it for issues
consul debug -duration=30s -output=/tmp/consul-debug
```
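If faster disks aren't immediately available, Raft timing can be relaxed through the performance stanza so heartbeats tolerate slower I/O. This is a sketch with an example path and value: 1 is the recommended production setting, and larger values (up to 10) trade slower failure detection for more tolerance of slow hardware:

```bash
# Temporarily relax Raft timing on each server (requires a restart)
cat << 'EOF' > /etc/consul.d/performance.hcl
performance {
  raft_multiplier = 5
}
EOF
systemctl restart consul
```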
Step 7: Check ACL Configuration
```bash
# Check whether ACLs are enabled
consul acl policy list

# If ACLs are enabled, check the token's permissions
consul acl token read -id=<token-accessor-id>

# Check that its policy grants KV permissions
consul acl policy read -name=<policy-name>

# Required rules for KV access:
# key_prefix "" { policy = "write" }
# Or for specific keys:
# key_prefix "config/" { policy = "read" }

# Create a token with KV permissions:
consul acl token create -policy-name=<kv-policy>

# Use the token in requests:
consul kv get -token=<token> config/app

# Or set a default token in the agent config (consul.hcl):
# acl { tokens { default = "<token-with-kv-access>" } }
```
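To tie the rules above together, you can create a dedicated KV policy from an HCL rules file and attach it to a new token. The policy name, prefix, and file path here are examples:

```bash
# Write the KV rules to a file
cat > /tmp/kv-config-policy.hcl << 'EOF'
key_prefix "config/" {
  policy = "write"
}
EOF

# Create the policy and a token that carries it
consul acl policy create -name=kv-config -rules=@/tmp/kv-config-policy.hcl
consul acl token create -description="app KV token" -policy-name=kv-config

# Use the new token's SecretID for KV operations
consul kv get -token=<secret-id> config/app
```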
Step 8: Check Consul Logs
```bash
# Check Consul logs
journalctl -u consul -n 100 --no-pager

# Look for specific errors:
journalctl -u consul | grep -iE "raft|leader|timeout|error"

# Common errors:
# - raft: failed to contact peer
# - raft: leadership lost
# - rpc: call failed
# - raft: no leader

# Enable debug logging in consul.hcl:
# log_level = "DEBUG"

# Restart Consul
systemctl restart consul

# Watch logs in real time:
consul monitor -log-level=debug
```
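A rough way to judge leadership stability is to count election-related events over a window. The grep patterns below are examples; the exact wording varies between Consul and raft library versions:

```bash
# Count election-related log lines in the last hour (0 or 1 is healthy)
journalctl -u consul --since "1 hour ago" \
  | grep -icE "entering (candidate|leader) state|heartbeat timeout"
```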
Step 9: Verify KV Operations
```bash
# Test a KV read after recovery
consul kv get config/app

# Test a KV write
consul kv put config/test value
# Output: Success! Data written to: config/test

# Test a KV delete
consul kv delete config/test

# Test recursive reads
consul kv get -recurse config/

# List the keys in the KV tree
consul kv get -keys config/

# Use the HTTP API:
curl http://localhost:8500/v1/kv/config/app

# Write via the HTTP API:
curl -X PUT http://localhost:8500/v1/kv/config/app -d "value"
```
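For a scripted smoke test, the sketch below writes, reads back, and deletes a temporary key, exiting non-zero if the round trip fails. The key name and value are arbitrary examples:

```bash
#!/bin/bash
# kv_roundtrip.sh - example KV smoke test after recovery
set -euo pipefail

KEY="healthcheck/kv-roundtrip"
VALUE="ok-$(date +%s)"

consul kv put "$KEY" "$VALUE" > /dev/null
READ_BACK=$(consul kv get "$KEY")
consul kv delete "$KEY" > /dev/null

if [ "$READ_BACK" = "$VALUE" ]; then
  echo "KV round trip OK"
else
  echo "KV round trip FAILED: expected '$VALUE', got '$READ_BACK'" >&2
  exit 1
fi
```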
Step 10: Monitor Consul Health
```bash
# Create a monitoring script
cat << 'EOF' > /usr/local/bin/monitor_consul.sh
#!/bin/bash

echo "=== Consul Members ==="
consul members

echo ""
echo "=== Raft Peers ==="
consul operator raft list-peers

echo ""
echo "=== Leader Status ==="
consul operator raft list-peers | grep leader || echo "No leader!"

echo ""
echo "=== KV Test ==="
consul kv put test/monitor check && consul kv get test/monitor && consul kv delete test/monitor

echo ""
echo "=== Node Health Checks ==="
consul catalog nodes -detailed

echo ""
echo "=== Service Health ==="
consul catalog services
EOF

chmod +x /usr/local/bin/monitor_consul.sh

# Prometheus metrics:
# Consul exposes metrics at /v1/agent/metrics
curl http://localhost:8500/v1/agent/metrics | jq

# Key metrics:
# consul_raft_leader       - whether this node is the leader
# consul_raft_peers        - number of Raft peers
# consul_raft_apply        - transactions applied
# consul_rpc_request_error - RPC errors

# Example Prometheus alert rule:
# - alert: ConsulNoLeader
#   expr: consul_raft_leader == 0 and consul_server == 1
#   for: 1m
#   labels:
#     severity: critical
#   annotations:
#     summary: "Consul cluster has no leader"
```
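To run the monitoring script on a schedule, a cron entry (or a systemd timer) is enough; the five-minute interval and log path below are arbitrary example choices:

```bash
# Run the monitor every 5 minutes and append its output to a log file
cat << 'EOF' > /etc/cron.d/monitor-consul
*/5 * * * * root /usr/local/bin/monitor_consul.sh >> /var/log/consul-monitor.log 2>&1
EOF
```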
Consul KV Store Checklist
| Check | Command | Expected |
|---|---|---|
| Leader elected | consul operator raft list-peers | Has leader |
| Quorum members | consul members | Majority of servers alive |
| Network ports | nc -zv <server> 8300 | Connected |
| Disk space | df -h /opt/consul | > 20% free |
| ACL permissions | consul acl token read | KV access granted |
| KV operations | consul kv put / consul kv get | Success |
Verify the Fix
```bash
# After resolving the KV store issue

# 1. Check the leader
consul operator raft list-peers
# Shows the leader node

# 2. Test a KV write
consul kv put config/app value
# Success! Data written to: config/app

# 3. Test a KV read
consul kv get config/app
# Returns the value

# 4. Check all servers
consul members | grep server
# All servers alive

# 5. Test multiple operations
consul kv get -recurse config/
# Lists all keys under config/

# 6. Monitor Raft stability
consul monitor | grep raft
# No leadership changes
```
Related Issues
- [Fix Consul Agent Not Starting](/articles/fix-consul-agent-not-starting)
- [Fix Consul Service Not Registering](/articles/fix-consul-service-not-registering)
- [Fix Consul DNS Resolution Wrong](/articles/fix-consul-dns-resolution-wrong)