What's Actually Happening
Consul snapshot saves fail when the cluster has no Raft leader, the token lacks sufficient privileges, or the destination disk is full. Snapshots capture the entire server state (KV data, ACLs, prepared queries, sessions) and are your primary tool for disaster recovery and cluster restoration.
The Error You'll See
Snapshot save failure:
```bash
$ consul snapshot save backup.snap
Error: failed to save snapshot: raft: no leader
```
Permission denied:
```bash
$ consul snapshot save backup.snap
Error: failed to save snapshot: Permission denied
```
Write error:
```bash
$ consul snapshot save /backup/consul.snap
Error: failed to save snapshot: write error: no space left on device
```
Why This Happens
1. No Raft leader - snapshots are served by the leader
2. ACL restrictions - the token lacks management privileges
3. Disk space - no room for the snapshot file
4. Quorum issues - too few servers for consensus
5. Large state - the KV store exceeds the client timeout
6. Network issues - the leader is unreachable
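The error text usually points at the cause directly. As a sketch (the message fragments and the mapping to steps are illustrative, based on the errors shown above), a small shell helper can triage a failed save:

```bash
# Hypothetical triage helper: match a `consul snapshot save` error
# message to its likely cause. Patterns are illustrative.
classify_snapshot_error() {
  case "$1" in
    *"no leader"*)          echo "no Raft leader - check quorum (Step 1)" ;;
    *"Permission denied"*|*"ACL not found"*)
                            echo "token lacks snapshot permission (Step 2)" ;;
    *"no space left"*)      echo "disk full at snapshot path (Step 3)" ;;
    *timeout*|*deadline*)   echo "state too large for client timeout (Step 5)" ;;
    *"connection refused"*) echo "cannot reach the leader (Step 9)" ;;
    *)                      echo "unknown - check 'consul monitor' output" ;;
  esac
}

classify_snapshot_error "Error: failed to save snapshot: raft: no leader"
# -> no Raft leader - check quorum (Step 1)
```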
Step 1: Check Leader Status
```bash
# Check for a leader
consul operator raft list-peers

# Output:
# Node      ID    Address        State    Voter
# consul-1  xxx   10.0.0.1:8300  leader   true

# If there is no leader, check cluster membership:
consul members

# Count live servers:
consul members | grep server | wc -l

# A majority is needed for leader election:
# 3 servers: need 2 alive
# 5 servers: need 3 alive

# If servers are down, restore quorum first by removing dead peers
# (replacement servers rejoin with `consul join`):
consul operator raft remove-peer -id=<failed-server-id>

# Retry the snapshot once a leader is elected:
consul snapshot save backup.snap
```
Step 2: Check ACL Permissions
```bash
# Check whether ACLs are enabled
consul acl policy list

# Inspect the token's policies
consul acl token read -id=<your-token>

# The snapshot API requires a token with management privileges;
# there is no narrower "snapshot" ACL rule to grant.
consul snapshot save -token=<management-token> backup.snap

# Or export the token once for the session:
export CONSUL_HTTP_TOKEN=<management-token>
consul snapshot save backup.snap

# Avoid setting a management token as the agent's default token in
# consul.hcl; pass it explicitly or via CONSUL_HTTP_TOKEN instead.
```
Step 3: Check Disk Space
```bash
# Check disk space at the snapshot destination
df -h /backup

# Snapshots contain the full Raft state:
# - KV store data
# - ACL policies and tokens
# - Prepared queries
# - Sessions

# Check Consul's data directory size as a rough upper bound:
du -sh /opt/consul/data

# Get a rough feel for the KV payload size:
consul kv export | wc -c

# Create a backup directory with enough space:
mkdir -p /backup/consul
df -h /backup/consul

# Save directly to the roomier location:
consul snapshot save /backup/consul/snapshot.snap

# Or stream to a remote server without touching local disk:
curl -s http://localhost:8500/v1/snapshot | ssh backup-server "cat > /backup/consul.snap"
```
Step 4: Check Server Health
```bash
# Check that all servers are alive
consul members

# Output:
# Node      Address        Status  Type
# consul-1  10.0.0.1:8301  alive   server
# consul-2  10.0.0.2:8301  alive   server
# consul-3  10.0.0.3:8301  alive   server

# Check server load
ssh consul-1 "top -b -n 1 | head"

# Check the Consul process on each server
ssh consul-1 "ps aux | grep [c]onsul"

# Check Raft state
consul operator raft list-peers

# An overloaded leader may not respond; confirm it answers:
curl http://<leader>:8500/v1/status/leader

# Restart an overloaded server if needed:
ssh consul-2 "systemctl restart consul"
```
Step 5: Increase Snapshot Timeout
```bash
# Large KV stores can exceed the HTTP client's timeout.
# Take the snapshot through the HTTP API, where the client
# timeout is under your control:
curl --max-time 120 "http://localhost:8500/v1/snapshot" \
  --header "X-Consul-Token: <token>" \
  --output backup.snap

# For very large clusters, allow more time:
curl --max-time 300 "http://localhost:8500/v1/snapshot" --output backup.snap

# Or take a (possibly stale) snapshot from the local server
# without forwarding through the leader:
consul snapshot save -stale backup.snap

# Verify the result:
ls -lh backup.snap
```
Step 6: Verify Snapshot Integrity
```bash
# After a successful save, confirm the snapshot is valid

# Check the file size (an empty file means the save failed):
ls -lh backup.snap

# Inspect the snapshot's metadata without restoring it:
consul snapshot inspect backup.snap
# Output includes the snapshot ID, size, Raft index, and term

# Test a restore on a throwaway test cluster (never production):
consul snapshot restore backup.snap
# Output: Restored snapshot

# Verify the restored data:
consul kv get -recurse
consul acl policy list
consul members

# Compare checksums across backup generations:
sha256sum backup.snap backup.snap.previous
```
Step 7: Automate Snapshot Backups
```bash
# Create a backup script
cat << 'EOF' > /usr/local/bin/consul-backup.sh
#!/bin/bash
BACKUP_DIR="/backup/consul"
DATE=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_FILE="${BACKUP_DIR}/consul-${DATE}.snap"
TOKEN="${CONSUL_MANAGEMENT_TOKEN}"

# Refuse to run without a leader
if ! consul operator raft list-peers 2>/dev/null | grep -q leader; then
  echo "ERROR: no leader, cannot create snapshot"
  exit 1
fi

# Create the snapshot
if ! consul snapshot save -token="${TOKEN}" "${SNAPSHOT_FILE}"; then
  echo "ERROR: snapshot failed"
  exit 1
fi
echo "Snapshot saved: ${SNAPSHOT_FILE}"

# Verify the file is non-empty
SIZE=$(stat -c%s "${SNAPSHOT_FILE}")
if [ "${SIZE}" -eq 0 ]; then
  echo "ERROR: snapshot is empty"
  rm -f "${SNAPSHOT_FILE}"
  exit 1
fi
echo "Snapshot valid: ${SIZE} bytes"

# Remove old backups (keep 7 days)
find "${BACKUP_DIR}" -name "*.snap" -mtime +7 -delete

# Copy to remote storage
scp "${SNAPSHOT_FILE}" backup-server:/backup/consul/
EOF

chmod +x /usr/local/bin/consul-backup.sh

# Schedule a daily backup at 02:00:
cat << 'EOF' > /etc/cron.d/consul-backup
0 2 * * * root /usr/local/bin/consul-backup.sh >> /var/log/consul-backup.log 2>&1
EOF
```
Step 8: Handle Snapshot Restore
```bash
# Full disaster recovery from a snapshot.
# Restore goes through the HTTP API, so a server must be
# running and have elected a leader BEFORE you restore.

# Stop all Consul servers
systemctl stop consul

# On each server, remove the old data
rm -rf /opt/consul/data/*

# Start the first server as a single-node cluster:
consul agent -server -bootstrap-expect=1 \
  -data-dir=/opt/consul/data \
  -bind=10.0.0.1

# Wait for it to elect itself leader:
consul operator raft list-peers

# Now restore the snapshot into the running server:
consul snapshot restore backup.snap
# Output: Restored snapshot

# Start the remaining servers and join them:
consul agent -server -join=10.0.0.1 \
  -data-dir=/opt/consul/data \
  -bind=10.0.0.2

# Verify the restore:
consul kv get -recurse
consul members
consul acl policy list
```
Step 9: Check Network Connectivity
```bash
# Find the leader (strip the quotes from the JSON string)
LEADER=$(curl -s http://localhost:8500/v1/status/leader | tr -d '"')
echo "Leader: $LEADER"   # e.g. 10.0.0.1:8300

# Check network reachability
ping -c 3 ${LEADER%%:*}

# Check port 8500 (HTTP API)
nc -zv ${LEADER%%:*} 8500

# Check port 8300 (server RPC / Raft)
nc -zv ${LEADER%%:*} 8300

# Check the firewall
iptables -L -n | grep -e 8500 -e 8300

# Allow the API port if blocked:
iptables -I INPUT -p tcp --dport 8500 -j ACCEPT

# Test the snapshot endpoint directly.
# Note: the leader address uses port 8300; the HTTP API is on 8500.
curl "http://${LEADER%%:*}:8500/v1/snapshot" --output test.snap

# Verify the file
ls -lh test.snap
```
Step 10: Monitor Backup Health
```bash
# Create a backup monitoring script
cat << 'EOF' > /usr/local/bin/check-consul-backup.sh
#!/bin/bash
BACKUP_DIR="/backup/consul"

# Find the most recent snapshot
LAST_BACKUP=$(ls -t ${BACKUP_DIR}/*.snap 2>/dev/null | head -1)
if [ -z "$LAST_BACKUP" ]; then
  echo "WARNING: no backup found"
  exit 1
fi

# Check backup age in hours
BACKUP_AGE=$(( ($(date +%s) - $(stat -c%Y "$LAST_BACKUP")) / 3600 ))
if [ "$BACKUP_AGE" -gt 24 ]; then
  echo "WARNING: last backup is ${BACKUP_AGE} hours old"
fi

# Check backup size
BACKUP_SIZE=$(stat -c%s "$LAST_BACKUP")
if [ "$BACKUP_SIZE" -lt 100 ]; then
  echo "ERROR: backup too small: ${BACKUP_SIZE} bytes"
  exit 1
fi

echo "OK: latest backup ${LAST_BACKUP}, size ${BACKUP_SIZE} bytes, age ${BACKUP_AGE}h"
EOF

chmod +x /usr/local/bin/check-consul-backup.sh
```

Prometheus alert for the backup (assumes you export a `consul_backup_age_hours` metric, e.g. via the node_exporter textfile collector):

```yaml
- alert: ConsulBackupMissing
  expr: consul_backup_age_hours > 24
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Consul snapshot backup missing or old"
```
Consul Snapshot Backup Checklist
| Check | Command | Expected |
|---|---|---|
| Leader exists | raft list-peers | Has leader |
| ACL permissions | acl token read | Management token |
| Disk space | df -h | > snapshot size |
| Snapshot file | ls -lh | > 0 bytes |
| Restore test | snapshot restore | Success |
| Backup age | stat -c%Y | < 24 hours |
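The file-level rows of the checklist can be rolled into one helper. Consul snapshot files are gzip-compressed archives, so a quick local sanity check looks for a non-empty file with the gzip magic bytes (a sketch; the function name is illustrative):

```bash
# Local sanity check for a snapshot file before relying on it.
# A valid Consul snapshot is non-empty and starts with the gzip
# magic bytes 1f 8b (snapshots are gzip-compressed archives).
verify_snapshot_file() {
  local snap="$1"
  [ -s "$snap" ] || { echo "FAIL: missing or empty: $snap"; return 1; }
  local magic
  magic=$(od -An -tx1 -N2 "$snap" | tr -d ' \n')
  [ "$magic" = "1f8b" ] || { echo "FAIL: not gzip data: $snap"; return 1; }
  echo "OK: $snap ($(stat -c%s "$snap") bytes, gzip header present)"
}

# Demo on a throwaway gzip file (stands in for a real snapshot):
demo=$(mktemp)
printf 'state' | gzip > "$demo"
verify_snapshot_file "$demo"
```

This catches the common failure mode of a zero-byte file left behind by an interrupted save; it does not replace `consul snapshot inspect` or a restore test on a scratch cluster.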
Verify the Fix
```bash
# After resolving the snapshot issue

# 1. Create a snapshot
consul snapshot save backup.snap
# Saved and verified snapshot

# 2. Verify the file size
ls -lh backup.snap
# File exists with content

# 3. Test a restore (on a test cluster)
consul snapshot restore backup.snap
# Restored snapshot

# 4. Check the backup schedule produced output
ls -la /backup/consul/*.snap
# Recent backup exists

# 5. Verify the leader is stable
consul operator raft list-peers
# Leader present

# 6. Check the backup logs
tail /var/log/consul-backup.log
# No errors
```
Related Issues
- [Fix Consul KV Store Not Responding](/articles/fix-consul-kv-store-not-responding)
- [Fix Consul Agent Not Starting](/articles/fix-consul-agent-not-starting)
- [Fix Consul Service Not Registering](/articles/fix-consul-service-not-registering)