What's Actually Happening

Velero Kubernetes backup operations fail, preventing cluster state and persistent volume data from being backed up to object storage.

The Error You'll See

Backup failed:

```bash
$ velero backup describe my-backup

Phase:           Failed
Errors:          1
Warnings:        3
Failure Reason:  error executing PVAction for persistent volume
```

Storage error:

```bash
$ velero backup logs my-backup

ERROR: error accessing backup storage location: AccessDenied
ERROR: failed to upload backup: connection refused
```

Snapshot failure:

```bash
ERROR: failed to create volume snapshot: VolumeSnapshotClass not found
ERROR: snapshot creation timeout after 10m
```

Why This Happens

  1. Storage access - S3/GCS/Azure Blob access denied
  2. Volume snapshot - CSI snapshot failed or unsupported
  3. Backup location - BackupStorageLocation not configured
  4. Credentials expired - Cloud credentials need refresh
  5. Network issues - Cannot reach object storage
  6. Pod errors - Restic/Velero pods reporting errors
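
The error strings shown earlier map fairly cleanly onto these causes. A minimal triage sketch — the message-to-cause mapping below is illustrative, not exhaustive, and first match wins:

```bash
#!/usr/bin/env bash
# Map a Velero error line to a likely cause (illustrative patterns only).
triage() {
  case "$1" in
    *AccessDenied*|*"access denied"*)       echo "storage-access" ;;
    *VolumeSnapshotClass*|*snapshot*)       echo "volume-snapshot" ;;
    *"backup storage location"*)            echo "backup-location" ;;
    *ExpiredToken*|*InvalidAccessKeyId*)    echo "credentials" ;;
    *"connection refused"*|*timeout*)       echo "network" ;;
    *)                                      echo "unknown" ;;
  esac
}

triage "error accessing backup storage location: AccessDenied"        # storage-access
triage "failed to create volume snapshot: VolumeSnapshotClass not found"  # volume-snapshot
triage "failed to upload backup: connection refused"                  # network
```

In practice you would feed it lines from `velero backup logs my-backup` and use the result to pick which step below to start from.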

Step 1: Check Backup Status

```bash
# List backups
velero backup get

# Describe the failed backup
velero backup describe my-backup --details

# View backup logs
velero backup logs my-backup

# Check backup storage location
velero backup-location get

# Check snapshot location
velero snapshot-location get

# Check Velero pods
kubectl get pods -n velero

# Check Velero logs
kubectl logs -n velero deployment/velero
```
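
These status checks are easy to script. A minimal sketch that flags non-Completed backups from `velero backup get -o json` output — shown here against a canned sample so it runs without a cluster (on a live cluster, pipe the real command in instead):

```bash
# Hypothetical `velero backup get -o json` output
sample='{"items":[
  {"metadata":{"name":"my-backup"},"status":{"phase":"Failed"}},
  {"metadata":{"name":"nightly"},"status":{"phase":"Completed"}}
]}'

# Print every backup whose phase is not Completed
echo "$sample" | jq -r '.items[]
  | select(.status.phase != "Completed")
  | "\(.metadata.name): \(.status.phase)"'
# my-backup: Failed
```

The same `jq` filter works verbatim on live output: `velero backup get -o json | jq -r '...'`.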

Step 2: Check Backup Storage Location

```bash
# List backup locations
velero backup-location get

# Describe the location
velero backup-location describe default

# Check the location config
kubectl get backupstoragelocation default -n velero -o yaml
```

The location should look like this:

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-velero-bucket
    prefix: backups
  config:
    region: us-east-1
  accessMode: ReadWrite
```

```bash
# Check the bucket exists
aws s3 ls s3://my-velero-bucket

# Test bucket access
aws s3 cp test.txt s3://my-velero-bucket/test.txt
```

Step 3: Check Cloud Credentials

```bash
# Check the credentials secret
kubectl get secret cloud-credentials -n velero -o yaml

# Decode the credentials
kubectl get secret cloud-credentials -n velero -o jsonpath='{.data.cloud}' | base64 -d

# For AWS, the decoded file should contain:
# [default]
# aws_access_key_id = AKIA...
# aws_secret_access_key = ...

# Test the credentials from a machine with the aws CLI installed
AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=xxx aws s3 ls s3://my-velero-bucket

# Update the credentials
kubectl create secret generic cloud-credentials \
  --namespace velero \
  --from-file cloud=/path/to/credentials \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart Velero so it picks up the new secret
kubectl rollout restart deployment/velero -n velero
```
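
If `aws s3 ls` fails with AccessDenied, the IAM identity behind the credentials likely lacks the S3 permissions Velero needs. A minimal S3-side policy sketch, using this article's example bucket name — the Velero AWS plugin documentation lists the authoritative set, which also includes EC2 permissions if you use native volume snapshots:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::my-velero-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-velero-bucket"
    }
  ]
}
```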

Step 4: Check Volume Snapshots

```bash
# List snapshot locations
velero snapshot-location get

# Check VolumeSnapshotClass
kubectl get volumesnapshotclass

# Verify the CSI driver supports snapshots
kubectl get csidriver

# Check a VolumeSnapshotClass exists for your storage
kubectl get volumesnapshotclass -o yaml
```

Example VolumeSnapshotClass:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: velero-snapclass
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Delete
```

```bash
# Check the CSI snapshot controller is running
kubectl get pods -n kube-system | grep snapshot

# Manual snapshot debugging
kubectl get volumesnapshots -A
kubectl describe volumesnapshot <name> -n <namespace>
```
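
A common cause of `VolumeSnapshotClass not found` is that no snapshot class matches the PVC's CSI driver. A small sketch of that check, fed hypothetical driver lists so it runs offline — on a real cluster you would populate them from `kubectl get volumesnapshotclass -o jsonpath='{.items[*].driver}'` and the provisioner of the PVC's StorageClass:

```bash
# Return success if any known VolumeSnapshotClass driver matches
# the PVC's CSI driver.
has_snapclass() {
  local driver="$1"; shift
  local d
  for d in "$@"; do
    [ "$d" = "$driver" ] && return 0
  done
  return 1
}

# Hypothetical cluster state; unquoted on purpose so it word-splits
snapclass_drivers="ebs.csi.aws.com"

if has_snapclass "ebs.csi.aws.com" $snapclass_drivers; then
  echo "snapshot class present"
else
  echo "no snapshot class for driver"
fi
```

If the check fails, create a VolumeSnapshotClass like the example above, with `driver` set to your PVC's provisioner.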

Step 5: Check Restic DaemonSet

```bash
# For restic-based backups, check the DaemonSet
kubectl get daemonset -n velero

# Check restic pods
kubectl get pods -n velero -l name=velero-restic

# Check restic logs
kubectl logs -n velero daemonset/velero-restic

# Check restic repositories
velero repo get

# Check repo maintenance
velero repo describe default

# If the restic repo is corrupted, prune it
velero repo prune default

# Or recreate the repo
velero repo delete default
kubectl delete secret velero-restic-credentials -n velero

# Inspect restic pod issues
kubectl describe pod -n velero -l name=velero-restic
```

Step 6: Check Namespace Inclusions

```bash
# Check the backup spec
kubectl get backup my-backup -n velero -o yaml
```

Included namespaces:

```yaml
spec:
  includedNamespaces:
    - default
    - app-namespace
  excludedNamespaces:
    - kube-system
    - velero
```

For all namespaces:

```yaml
spec:
  includedNamespaces:
    - '*'
  excludedNamespaces:
    - velero  # Don't back up Velero itself
```

Resource inclusions:

```yaml
spec:
  includedResources:
    - pods
    - persistentvolumeclaims
    - secrets
    - configmaps
  excludedResources:
    - events
    - pods.log
```

```bash
# Update the backup spec
kubectl patch backup my-backup -n velero --type=merge -p '{"spec":{"includedNamespaces":["*"]}}'
```
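
The fragments above combine into a complete Backup manifest. A sketch, reusing this article's example names — field values are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: my-backup
  namespace: velero
spec:
  includedNamespaces:
    - '*'
  excludedNamespaces:
    - velero
  excludedResources:
    - events
  snapshotVolumes: true
  ttl: 720h0m0s   # keep for 30 days
```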

Step 7: Check Velero Logs

```bash
# Follow Velero deployment logs
kubectl logs -n velero deployment/velero -f

# Look for specific errors
kubectl logs -n velero deployment/velero | grep -i error

# Check for storage errors (-E enables the | alternation)
kubectl logs -n velero deployment/velero | grep -iE "access|denied|bucket"

# Check for snapshot errors
kubectl logs -n velero deployment/velero | grep -iE "snapshot|csi"

# Enable debug logging
kubectl set env deployment/velero -n velero LOG_LEVEL=debug

# Restart to apply
kubectl rollout restart deployment/velero -n velero

# Check debug logs
kubectl logs -n velero deployment/velero -f | grep -i debug
```

Step 8: Test Backup Manually

```bash
# Create a simple test backup (no volume snapshots)
velero backup create test-backup \
  --include-namespaces default \
  --snapshot-volumes=false

# Check status
velero backup describe test-backup --details

# View logs
velero backup logs test-backup

# Check files in S3
aws s3 ls s3://my-velero-bucket/backups/test-backup/

# Verify backup contents
velero backup download test-backup
tar -tzf test-backup.tar.gz

# If the test backup succeeds, the issue is with a specific workload.
# Test with that namespace:
velero backup create ns-backup \
  --include-namespaces myapp \
  --snapshot-volumes=true
```
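
Checking the test backup can also be scripted by parsing the `Phase:` line of `velero backup describe`. A sketch against a canned sample so it runs anywhere — live, you would pipe the real describe output through `phase_of`:

```bash
# Extract the backup phase from `velero backup describe` output.
phase_of() {
  awk -F': *' '/^Phase:/ {print $2; exit}'
}

# Hypothetical describe output; live usage:
#   velero backup describe test-backup | phase_of
sample="Name:    test-backup
Phase:   Completed
Errors:  0"

echo "$sample" | phase_of
# Completed
```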

Step 9: Check Resource Limits

```bash
# Check Velero resource limits
kubectl describe deployment velero -n velero | grep -A 10 "Containers:"

# Increase resources
kubectl patch deployment velero -n velero --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "512Mi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "500m"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1"}
]'

# Increase restic resources
kubectl patch daemonset velero-restic -n velero --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}
]'

# Check for OOMKilled containers
kubectl get pods -n velero -o jsonpath='{.items[*].status.containerStatuses[?(@.lastState.terminated.reason=="OOMKilled")].name}'
```

Step 10: Schedule and Monitor

```bash
# Create a backup schedule (daily at 02:00)
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces "*" \
  --exclude-namespaces "velero,kube-system" \
  --snapshot-volumes=true

# Check the schedule
velero schedule get

# Monitor backup status
velero backup get

# Create a monitoring script
cat << 'EOF' > /usr/local/bin/monitor-velero.sh
#!/bin/bash

echo "=== Velero Backups ==="
velero backup get

echo ""
echo "=== Failed Backups ==="
velero backup get | grep Failed

echo ""
echo "=== Backup Storage ==="
velero backup-location get

echo ""
echo "=== Restic Repos ==="
velero repo get

echo ""
echo "=== Recent Backup Logs ==="
LATEST=$(velero backup get -o json | jq -r '.items[0].metadata.name')
velero backup logs "$LATEST" | tail -20
EOF

chmod +x /usr/local/bin/monitor-velero.sh

# Prometheus metrics
kubectl port-forward -n velero svc/velero 8085:8085 &
curl http://localhost:8085/metrics | grep velero_backup

# Key metrics:
# velero_backup_attempt_total
# velero_backup_success_total
# velero_backup_failure_total
# velero_backup_duration_seconds
```

Alert for backup failures:

```yaml
- alert: VeleroBackupFailed
  expr: velero_backup_failure_total > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Velero backup failed"
```
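
The `velero schedule create` command above can equivalently be expressed as a Schedule manifest, which is easier to keep in version control. A sketch under the same names:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"   # daily at 02:00
  template:               # an embedded Backup spec
    includedNamespaces:
      - '*'
    excludedNamespaces:
      - velero
      - kube-system
    snapshotVolumes: true
```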

Velero Backup Failure Checklist

| Check | Command | Expected |
|-------|---------|----------|
| Backup status | `velero backup get` | Completed |
| Storage location | `velero backup-location get` | Available |
| Credentials | `kubectl get secret cloud-credentials` | Valid |
| CSI driver | `kubectl get csidriver` | Present |
| Restic pods | `kubectl get ds -n velero` | Running |
| Bucket access | `aws s3 ls` | Listed |

Verify the Fix

```bash
# After fixing the backup issue:

# 1. Create a test backup
velero backup create verify-backup --include-namespaces default
# Phase: Completed

# 2. Check backup details
velero backup describe verify-backup
# No errors

# 3. Verify in storage
aws s3 ls s3://my-velero-bucket/backups/verify-backup/
# Backup files present

# 4. Test restore
velero restore create --from-backup verify-backup
# Restore completed

# 5. Check the schedule is working
velero schedule get
# Next run scheduled

# 6. Monitor ongoing
/usr/local/bin/monitor-velero.sh
# All backups successful
```

  • [Fix Kubernetes Etcd Backup Failed](/articles/fix-etcd-wal-corrupted)
  • [Fix Consul Snapshot Backup Failed](/articles/fix-consul-snapshot-backup-failed)
  • [Fix MinIO Bucket Not Accessible](/articles/fix-minio-bucket-not-accessible)