What's Actually Happening

Consul KV store operations fail or time out when Raft cannot elect or reach a leader, or when the servers cannot communicate with each other.

The Error You'll See

KV read failure:

```bash
$ consul kv get config/app
Error: failed to get key: Unexpected response code: 500 (rpc error making call: The cluster is still initializing. The Consul servers are still electing a leader)
```

KV write timeout:

```bash
$ consul kv put config/app value
Error: timed out waiting for write to be replicated
```

No leader:

```bash
$ consul operator raft list-peers
Error: raft: no leader elected
```

Why This Happens

  1. No Raft leader - Leader election failed or is ongoing
  2. Quorum loss - A majority of servers is unavailable
  3. Network partition - Servers cannot communicate
  4. Disk I/O issues - Raft log writes are slow
  5. Server overload - High CPU/memory pressure
  6. ACL restrictions - Insufficient permissions for KV access
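Causes 1-3 all reduce to the same arithmetic: Raft commits a write only once a majority of servers, floor(N/2) + 1, acknowledge it. A minimal sketch of what that means for common cluster sizes:

```bash
#!/bin/bash
# Raft quorum for an N-server cluster: floor(N/2) + 1 votes.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Failures the cluster can absorb before losing quorum.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

quorum 3           # → 2: a 3-node cluster needs 2 servers up
quorum 5           # → 3: a 5-node cluster needs 3 servers up
fault_tolerance 3  # → 1
fault_tolerance 5  # → 2
```

This is why 3-node clusters tolerate exactly one failed server, and why the two-failure scenario in Step 5 requires manual recovery.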

Step 1: Check Consul Status

```bash
# Check Consul members
consul members

# Output:
# Node      Address        Status  Type    Build   Protocol  DC   Segment
# consul-1  10.0.0.1:8301  alive   server  1.15.0  2         dc1  <all>
# consul-2  10.0.0.2:8301  failed  server  1.15.0  2         dc1  <all>
# consul-3  10.0.0.3:8301  alive   server  1.15.0  2         dc1  <all>

# Check the leader
consul operator raft list-peers

# Output shows the leader:
# Node      ID   Address        State     Voter  RaftProtocol
# consul-1  xxx  10.0.0.1:8300  leader    true   3
# consul-2  yyy  10.0.0.2:8300  follower  true   3
# consul-3  zzz  10.0.0.3:8300  follower  true   3

# If a dead peer is blocking election, remove it:
consul operator raft remove-peer -id=<id>
# Then restart the node so it rejoins

# Check catalog status
consul catalog nodes
```
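For scripting, the HTTP API is easier than parsing CLI tables: `/v1/status/leader` returns the leader's address as a JSON string, or an empty string when there is none. A small sketch (`has_leader` is a hypothetical helper that just classifies that response):

```bash
#!/bin/bash
# /v1/status/leader returns "ip:port" (a JSON string), or "" with no leader.
# has_leader is a hypothetical helper that classifies that response.
has_leader() {
  local leader="$1"          # e.g. "$(curl -s localhost:8500/v1/status/leader)"
  leader="${leader//\"/}"    # strip the surrounding JSON quotes
  if [ -n "$leader" ]; then
    echo "leader: $leader"
  else
    echo "NO LEADER"
  fi
}

has_leader '"10.0.0.1:8300"'   # → leader: 10.0.0.1:8300
has_leader '""'                # → NO LEADER
```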

Step 2: Check Server Connectivity

```bash
# Check network connectivity between servers
ping 10.0.0.2
ping 10.0.0.3

# Check the server RPC/Raft port (default 8300)
nc -zv 10.0.0.2 8300
nc -zv 10.0.0.3 8300

# Check the LAN gossip port (8301, TCP and UDP)
nc -zv 10.0.0.2 8301

# Check the WAN gossip port (8302) for multi-DC setups
nc -zv 10.0.0.2 8302

# Check firewall rules
iptables -L -n | grep 8300
iptables -L -n | grep 8301

# Allow Consul ports
iptables -I INPUT -p tcp --dport 8300 -j ACCEPT
iptables -I INPUT -p tcp --dport 8301 -j ACCEPT
iptables -I INPUT -p udp --dport 8301 -j ACCEPT

# Check network areas (Consul Enterprise only)
consul operator area list
```
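`nc` is not always installed on server hosts; bash's built-in `/dev/tcp` works as a fallback. A sketch (the hosts and ports are the examples from above, and `check_port` is a hypothetical helper):

```bash
#!/bin/bash
# Check TCP reachability of Consul's server ports without nc, using bash's
# built-in /dev/tcp redirection. Hosts below are example addresses.
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed"
  fi
}

for port in 8300 8301 8302; do
  check_port 10.0.0.2 "$port"
done
```

Note this only tests TCP; gossip also needs UDP 8301/8302, which `/dev/tcp` cannot probe.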

Step 3: Check Raft Health

```bash
# Check Raft configuration
consul operator raft list-peers

# Check per-server Raft health (Consul 1.9+)
consul operator autopilot state

# Output includes, per server:
# LastContact: time since last contact with the leader
# LastTerm:    current Raft term
# LastIndex:   log position

# Check Raft data on disk
ls -la /opt/consul/data/raft/

# raft.db is a BoltDB file, not SQLite -- don't open it with other tools;
# just sanity-check its size and ownership:
ls -lh /opt/consul/data/raft/raft.db

# Check snapshots
ls -la /opt/consul/data/raft/snapshots/
# If snapshots are corrupted, you may need to restore from backup

# Check Raft leadership stability
consul monitor -log-level=debug | grep raft

# Look for:
# - raft: elected leader
# - raft: heartbeat timeout
# - raft: failed to contact
```

Step 4: Restart Failed Servers

```bash
# Check which servers are down
consul members | grep failed

# Restart the failed server
ssh consul-2 "systemctl restart consul"

# Wait for it to rejoin
consul members

# If the server won't rejoin, check its logs:
journalctl -u consul -n 50

# Common issues:
# - Wrong bootstrap configuration
# - Data corruption
# - Network blocked

# Force a rejoin if auto-join failed:
consul join 10.0.0.1

# Check whether quorum is restored
consul operator raft list-peers
# A majority of servers must be alive (2 of 3, 3 of 5)
```
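The quorum check can also be scripted by parsing the `consul members` table. The `quorum_ok` helper below is hypothetical; the column positions assume the standard output format shown in Step 1:

```bash
#!/bin/bash
# Decide whether quorum is intact from `consul members` output on stdin:
# count server-type rows, count the alive ones, compare to floor(total/2)+1.
quorum_ok() {
  awk '$4 == "server" { total++; if ($3 == "alive") alive++ }
       END {
         need = int(total / 2) + 1
         printf "alive=%d need=%d %s\n", alive, need,
                (alive >= need ? "QUORUM OK" : "QUORUM LOST")
       }'
}

# Canned sample standing in for a live `consul members` call:
quorum_ok <<'SAMPLE'
Node      Address        Status  Type    Build   Protocol  DC   Segment
consul-1  10.0.0.1:8301  alive   server  1.15.0  2         dc1  <all>
consul-2  10.0.0.2:8301  failed  server  1.15.0  2         dc1  <all>
consul-3  10.0.0.3:8301  alive   server  1.15.0  2         dc1  <all>
SAMPLE
# → alive=2 need=2 QUORUM OK
```

Against a live agent, pipe the real output: `consul members | quorum_ok`.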

Step 5: Recover Lost Quorum

```bash
# If a majority of servers is lost, manual recovery is required

# For a 3-node cluster with only 1 server alive:
# quorum (2 of 3) cannot be reached automatically

# Option 1: Bootstrap a new cluster (CAUTION: loses all KV data)
# Stop the remaining server
systemctl stop consul

# Remove the old Raft data
rm -rf /opt/consul/data/raft/*

# Bootstrap with a single server
consul agent -server -bootstrap-expect=1 -data-dir=/opt/consul/data

# Then bring up additional servers and have them join:
consul agent -server -retry-join=10.0.0.1 -data-dir=/opt/consul/data

# Option 2: Restore from a snapshot (requires a working leader first)
consul snapshot restore backup.snap

# Option 3: peers.json outage recovery (keeps existing data)
# Stop ALL surviving servers, write raft/peers.json in each server's
# data directory listing the intended membership, then start them again;
# the servers adopt that configuration and delete the file on startup.
```
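For manual quorum recovery, Consul's documented outage-recovery procedure uses a `raft/peers.json` file: with every server stopped, place it in each server's data directory and restart. A sketch of its contents for Raft protocol v3 (node IDs and addresses are placeholders; each server's own ID is stored in `<data-dir>/node-id`):

```json
[
  { "id": "<node-1-raft-id>", "address": "10.0.0.1:8300", "non_voter": false },
  { "id": "<node-3-raft-id>", "address": "10.0.0.3:8300", "non_voter": false }
]
```

On startup the servers adopt this membership and remove the file; only list servers you intend to keep in the cluster.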

Step 6: Check Disk Performance

```bash
# Check the disk where Consul data is stored
df -h /opt/consul

# Check disk I/O
iostat -x 1 10

# Check for slow writes affecting Raft
journalctl -u consul | grep -iE "slow|latency"

# Check Consul data directory size
du -sh /opt/consul/data

# If the disk is full, free up space or expand it:
df -h /opt/consul

# Use SSDs in production:
# HDD latency causes Raft heartbeat timeouts

# Capture a performance snapshot:
consul debug -duration=30s -output=/tmp/consul-debug
# Analyze the debug bundle for issues
```
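A crude way to quantify "disk too slow for Raft" is to time synchronous writes on the data volume, since Raft fsyncs every log append. A sketch using plain `dd` (the target path is an example; on a real server, point it at the volume holding `/opt/consul/data`):

```bash
#!/bin/bash
# Rough probe of synchronous write latency. Slow O_DSYNC writes become slow
# Raft commits and, eventually, heartbeat timeouts. TARGET is an example path.
TARGET=${TARGET:-/tmp/consul-fsync-test}

# 256 x 4 KiB synchronous writes; dd's summary line reports the throughput
dd if=/dev/zero of="$TARGET" bs=4k count=256 oflag=dsync 2>&1 | tail -n 1
rm -f "$TARGET"
```

On SSDs this typically sustains tens of MB/s even with `dsync`; single-digit numbers on the Raft volume are a red flag.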

Step 7: Check ACL Configuration

```bash
# Check whether ACLs are enabled
consul acl policy list

# If ACLs are enabled, check the token's permissions
consul acl token read -id=<token-id>

# Check that the policy grants KV permissions:
consul acl policy read -name=<policy-name>

# Required rules for KV, e.g. full access:
# key_prefix "" { policy = "write" }
# Or read-only on specific keys:
# key_prefix "config/" { policy = "read" }

# Create a token with KV permissions:
consul acl token create -policy-name=<kv-policy>

# Use the token in requests:
consul kv get config/app -token=<token>

# Or set the agent's default token in consul.hcl
# (prefer a least-privilege token over the management token):
# acl {
#   tokens {
#     default = "<token>"
#   }
# }
```

Step 8: Check Consul Logs

```bash
# Check Consul logs
journalctl -u consul -n 100 --no-pager

# Look for specific errors:
journalctl -u consul | grep -iE "raft|leader|timeout|error"

# Common errors:
# - raft: failed to contact peer
# - raft: leadership lost
# - rpc: call failed
# - raft: no leader

# Enable debug logging in consul.hcl:
# log_level = "DEBUG"

# Restart Consul
systemctl restart consul

# Watch logs in real time:
consul monitor -log-level=debug
```

Step 9: Verify KV Operations

```bash
# Test a KV read after recovery
consul kv get config/app

# Test a KV write
consul kv put config/test value
# Output: Success! Data written to: config/test

# Test a KV delete
consul kv delete config/test

# Test recursive reads
consul kv get -recurse config/

# List keys only
consul kv get -keys config/

# Use the HTTP API:
curl http://localhost:8500/v1/kv/config/app

# Write via the HTTP API:
curl -X PUT http://localhost:8500/v1/kv/config/app -d "value"
```
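One gotcha with the HTTP API: KV values come back base64-encoded inside a JSON array, so a raw `curl` does not show the stored value. A sketch of the decode step, assuming `jq` is available (the JSON below is a canned sample standing in for a live `curl -s localhost:8500/v1/kv/config/app` response):

```bash
#!/bin/bash
# The KV HTTP API returns [{"Key": ..., "Value": "<base64>", ...}];
# jq extracts the field and base64 restores the raw value.
decode_kv() {
  jq -r '.[0].Value' | base64 -d
}

echo '[{"LockIndex":0,"Key":"config/app","Flags":0,"Value":"aGVsbG8=","CreateIndex":10,"ModifyIndex":10}]' | decode_kv
# → hello
```

Against a live agent: `curl -s localhost:8500/v1/kv/config/app | decode_kv`, or skip JSON entirely with `?raw`: `curl -s localhost:8500/v1/kv/config/app?raw`.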

Step 10: Monitor Consul Health

```bash
# Create a monitoring script
cat << 'EOF' > /usr/local/bin/monitor_consul.sh
#!/bin/bash

echo "=== Consul Members ==="
consul members

echo ""
echo "=== Raft Peers ==="
consul operator raft list-peers

echo ""
echo "=== Leader Status ==="
consul operator raft list-peers | grep leader || echo "No leader!"

echo ""
echo "=== KV Test ==="
consul kv put test/monitor check && consul kv get test/monitor && consul kv delete test/monitor

echo ""
echo "=== Node Health Checks ==="
consul catalog nodes -detailed

echo ""
echo "=== Service Health ==="
consul catalog services
EOF

chmod +x /usr/local/bin/monitor_consul.sh

# Prometheus metrics:
# Consul exposes metrics at /v1/agent/metrics
curl http://localhost:8500/v1/agent/metrics | jq

# Key metrics:
# consul_raft_leader       - whether this node is the leader
# consul_raft_peers        - number of Raft peers
# consul_raft_apply        - transactions applied
# consul_rpc_request_error - RPC errors
```

Prometheus alert rule for the no-leader case:

```yaml
- alert: ConsulNoLeader
  expr: consul_raft_leader == 0 and consul_server == 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Consul cluster has no leader"
```

Consul KV Store Checklist

| Check | Command | Expected |
|---|---|---|
| Leader elected | `consul operator raft list-peers` | Has leader |
| Quorum members | `consul members` | >= majority |
| Network ports | `nc -zv <host> 8300` | Connected |
| Disk space | `df -h` | > 20% free |
| ACL permissions | `consul acl token read` | KV access |
| KV operations | `consul kv put/get` | Success |

Verify the Fix

```bash
# After resolving the KV store issue

# 1. Check the leader
consul operator raft list-peers
# Shows a leader node

# 2. Test a KV write
consul kv put config/app value
# Success! Data written to: config/app

# 3. Test a KV read
consul kv get config/app
# Returns the value

# 4. Check all servers
consul members | grep server
# All servers alive

# 5. Test multiple operations
consul kv get -recurse config/
# Lists all keys

# 6. Monitor Raft stability
consul monitor | grep raft
# No leadership changes
```
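During verification, reads can still fail transiently while a fresh leader settles in. A small hedge is a generic retry wrapper (the `retry` helper below is illustrative, not a Consul feature):

```bash
#!/bin/bash
# retry N CMD...: run CMD up to N times with a 1s pause between attempts.
# Handy while a new leader is settling, e.g.: retry 5 consul kv get config/app
retry() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

retry 3 echo "cluster reachable"   # → cluster reachable
```

If an operation only succeeds with retries, the cluster is limping: go back to Step 3 and check leadership stability before calling it fixed.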

  • [Fix Consul Agent Not Starting](/articles/fix-consul-agent-not-starting)
  • [Fix Consul Service Not Registering](/articles/fix-consul-service-not-registering)
  • [Fix Consul DNS Resolution Wrong](/articles/fix-consul-dns-resolution-wrong)