What's Actually Happening
Velero Kubernetes backup operations fail, preventing cluster state and persistent volume data from being backed up to object storage.
The Error You'll See
Backup failed:
```bash
$ velero backup describe my-backup
Phase:           Failed
Errors:          1
Warnings:        3
Failure Reason:  error executing PVAction for persistent volume
```
Storage error:
```bash
$ velero backup logs my-backup
ERROR: error accessing backup storage location: AccessDenied
ERROR: failed to upload backup: connection refused
```
Snapshot failure:
```bash
ERROR: failed to create volume snapshot: VolumeSnapshotClass not found
ERROR: snapshot creation timeout after 10m
```
Why This Happens
1. Storage access: S3/GCS/Azure Blob access denied
2. Volume snapshot: CSI snapshot failed or unsupported
3. Backup location: BackupStorageLocation not configured or unavailable
4. Credentials expired: cloud credentials need refresh
5. Network issues: Velero cannot reach object storage
6. Namespace errors: restic/Velero errors in pod-level backups
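For cause 1, AccessDenied usually means the identity Velero uses lacks S3 permissions on the backup bucket. Below is a minimal sketch of the kind of policy described in the velero-plugin-for-aws documentation; the bucket name `my-velero-bucket` and the `put-user-policy` invocation are examples, so verify the permission list against the plugin version you actually run:

```shell
# Write a minimal S3 policy for Velero (sketch; bucket name is an example).
# Permission list follows the velero-plugin-for-aws README; double-check it
# against your plugin version before applying.
cat > velero-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::my-velero-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::my-velero-bucket"
    }
  ]
}
EOF
echo "wrote velero-s3-policy.json"

# Attach it (example):
#   aws iam put-user-policy --user-name velero \
#     --policy-name velero-s3 --policy-document file://velero-s3-policy.json
```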
Step 1: Check Backup Status
```bash
# List backups:
velero backup get

# Describe the failed backup:
velero backup describe my-backup --details

# View backup logs:
velero backup logs my-backup

# Check backup storage location:
velero backup-location get

# Check snapshot location:
velero snapshot-location get

# Check Velero pods:
kubectl get pods -n velero

# Check Velero logs:
kubectl logs -n velero deployment/velero
```
Step 2: Check Backup Storage Location
```bash
# List backup locations:
velero backup-location get

# Describe the location:
velero backup-location describe default

# Check the location config:
kubectl get backupstoragelocation default -n velero -o yaml
```

The location should look something like:

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-velero-bucket
    prefix: backups
  config:
    region: us-east-1
  accessMode: ReadWrite
```

```bash
# Check the bucket exists:
aws s3 ls s3://my-velero-bucket

# Test write access to the bucket:
aws s3 cp test.txt s3://my-velero-bucket/test.txt
```
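The `PHASE` column of `velero backup-location get` is the quickest health signal. A small sketch that flags any location not reporting `Available`; the sample output and its column layout are hypothetical (they vary slightly between Velero versions), and in a real cluster you would pipe the command itself into the function:

```shell
# Flag backup storage locations whose phase is not "Available".
# check_bsl reads `velero backup-location get` tabular output on stdin.
# Column positions are an assumption; adjust $4 for your Velero version.
check_bsl() {
  awk 'NR > 1 && $4 != "Available" { print "PROBLEM: " $1 " phase=" $4; bad = 1 }
       END { exit bad }'
}

# Hypothetical sample output for demonstration:
sample='NAME      PROVIDER   BUCKET/PREFIX      PHASE         LAST-VALIDATED
default   aws        my-velero-bucket   Unavailable   2024-01-01'

echo "$sample" | check_bsl || echo "at least one location needs attention"

# Real usage: velero backup-location get | check_bsl
```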
Step 3: Check Cloud Credentials
```bash
# Check the credentials secret:
kubectl get secret cloud-credentials -n velero -o yaml

# Decode the credentials:
kubectl get secret cloud-credentials -n velero -o jsonpath='{.data.cloud}' | base64 -d

# For AWS, the decoded file should look like:
#   [default]
#   aws_access_key_id = AKIA...
#   aws_secret_access_key = ...

# Test the credentials locally with the aws CLI
# (the Velero container image does not ship the aws CLI):
AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=xxx aws s3 ls s3://my-velero-bucket

# Update the credentials:
kubectl create secret generic cloud-credentials \
  --namespace velero \
  --from-file cloud=/path/to/credentials \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart Velero to pick up the new secret:
kubectl rollout restart deployment/velero -n velero
```
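A malformed credentials file is a common cause of AccessDenied: a missing `[default]` profile header or a truncated key breaks the plugin silently. A standalone sanity check for the fields the AWS plugin expects (the sample file below contains fake placeholder values):

```shell
# Sanity-check an AWS-style credentials file for the fields Velero's
# AWS plugin expects. The sample contents are fake placeholders.
cat > /tmp/cloud-sample <<'EOF'
[default]
aws_access_key_id = AKIAEXAMPLEKEY
aws_secret_access_key = examplesecret
EOF

check_creds() {
  local f=$1 bad=0
  grep -q '^\[default\]' "$f"            || { echo "missing [default] profile"; bad=1; }
  grep -q '^aws_access_key_id' "$f"      || { echo "missing aws_access_key_id"; bad=1; }
  grep -q '^aws_secret_access_key' "$f"  || { echo "missing aws_secret_access_key"; bad=1; }
  [ "$bad" -eq 0 ] && echo "credentials file looks well-formed"
  return "$bad"
}

check_creds /tmp/cloud-sample

# In a real cluster, run the same check against the live secret:
#   kubectl get secret cloud-credentials -n velero \
#     -o jsonpath='{.data.cloud}' | base64 -d > /tmp/cloud && check_creds /tmp/cloud
```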
Step 4: Check Volume Snapshots
```bash
# List snapshot locations:
velero snapshot-location get

# Check VolumeSnapshotClass:
kubectl get volumesnapshotclass

# Verify the CSI driver supports snapshots:
kubectl get csidriver

# Inspect the VolumeSnapshotClass config:
kubectl get volumesnapshotclass -o yaml
```

Velero's CSI integration only uses a VolumeSnapshotClass labelled for it, for example:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: velero-snapclass
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Delete
```

```bash
# Check the CSI snapshot controller is running:
kubectl get pods -n kube-system | grep snapshot

# Manual snapshot debugging:
kubectl get volumesnapshots -A
kubectl describe volumesnapshot <name> -n <namespace>
```
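To isolate snapshot failures from Velero itself, create a VolumeSnapshot by hand against the same class and a PVC in the affected namespace; if this also hangs or errors, the problem is in the CSI layer, not in Velero. A sketch (the names `test-snap` and `my-pvc` are placeholders, and the class name must match one from the listing above):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-snap                              # placeholder name
  namespace: default
spec:
  volumeSnapshotClassName: velero-snapclass    # must match an existing class
  source:
    persistentVolumeClaimName: my-pvc          # placeholder PVC
```

Apply it and watch for `readyToUse: true`: `kubectl apply -f snap.yaml && kubectl get volumesnapshot test-snap -w`.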
Step 5: Check Restic DaemonSet
```bash
# For restic-based (pod volume) backups, check the daemonset
# (named "restic" in older releases, "node-agent" in Velero >= 1.10):
kubectl get daemonset -n velero

# Check restic pods:
kubectl get pods -n velero -l name=restic

# Check restic logs:
kubectl logs -n velero daemonset/restic

# Check restic repositories:
velero repo get

# Velero runs restic maintenance (prune) automatically; if a repository
# shows NotReady or is corrupted, delete its repository object so Velero
# re-initializes it (the CRD is resticrepositories.velero.io on older
# versions, backuprepositories.velero.io on Velero >= 1.10):
kubectl get resticrepositories -n velero
kubectl delete resticrepository <repo-name> -n velero
kubectl delete secret velero-restic-credentials -n velero

# Restic pod issues:
kubectl describe pod -n velero -l name=restic
```
Step 6: Check Namespace Inclusions
```bash
# Check the backup spec:
kubectl get backup my-backup -n velero -o yaml
```

Key fields to check. Included namespaces:

```yaml
spec:
  includedNamespaces:
    - default
    - app-namespace
  excludedNamespaces:
    - kube-system
    - velero
```

For all namespaces:

```yaml
spec:
  includedNamespaces:
    - '*'
  excludedNamespaces:
    - velero        # don't back up Velero itself
```

Resource inclusions:

```yaml
spec:
  includedResources:
    - pods
    - persistentvolumeclaims
    - secrets
    - configmaps
  excludedResources:
    - events
```

```bash
# Backup specs are immutable once submitted; to change the scope,
# create a new backup instead of patching the old one:
velero backup create my-backup-v2 \
  --include-namespaces '*' \
  --exclude-namespaces velero
```
Step 7: Check Velero Logs
```bash
# Follow Velero deployment logs:
kubectl logs -n velero deployment/velero -f

# Look for specific errors:
kubectl logs -n velero deployment/velero | grep -i error

# Check for storage errors:
kubectl logs -n velero deployment/velero | grep -Ei "access|denied|bucket"

# Check for snapshot errors:
kubectl logs -n velero deployment/velero | grep -Ei "snapshot|csi"

# Enable debug logging by adding --log-level=debug to the server args:
kubectl edit deployment/velero -n velero
#   args:
#     - server
#     - --log-level=debug

# Follow the (now much more verbose) logs:
kubectl logs -n velero deployment/velero -f
```
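When a backup produces many errors, it helps to bucket them by subsystem before chasing individual lines. A small sketch that classifies an error stream; the sample log lines are invented to mirror the errors shown earlier:

```shell
# Classify Velero error lines into rough buckets: storage, snapshot, other.
# Real usage: kubectl logs -n velero deployment/velero | classify_errors
classify_errors() {
  grep -i 'error' | awk '
    tolower($0) ~ /accessdenied|bucket|s3|connection refused/ { s++; next }
    tolower($0) ~ /snapshot|csi|volumesnapshotclass/          { n++; next }
                                                              { o++ }
    END { printf "storage=%d snapshot=%d other=%d\n", s, n, o }'
}

# Invented sample log lines for demonstration:
printf '%s\n' \
  'ERROR: error accessing backup storage location: AccessDenied' \
  'ERROR: failed to create volume snapshot: VolumeSnapshotClass not found' \
  'ERROR: unable to list pods in namespace foo' \
  | classify_errors
```

A storage-heavy count points at Step 2/3 (location and credentials); a snapshot-heavy count points at Step 4.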
Step 8: Test Backup Manually
```bash
# Create a simple test backup without volume snapshots:
velero backup create test-backup \
  --include-namespaces default \
  --snapshot-volumes=false

# Check status:
velero backup describe test-backup --details

# View logs:
velero backup logs test-backup

# Check the files landed in S3:
aws s3 ls s3://my-velero-bucket/backups/test-backup/

# Verify backup contents:
velero backup download test-backup -o test-backup.tar.gz
tar -tzf test-backup.tar.gz

# If the test backup succeeds, the issue is with a specific workload.
# Re-test with that namespace and snapshots enabled:
velero backup create ns-backup \
  --include-namespaces myapp \
  --snapshot-volumes=true
```
Step 9: Check Resource Limits
```bash
# Check Velero resource limits:
kubectl describe deployment velero -n velero | grep -A 10 "Containers:"

# Increase resources:
kubectl patch deployment velero -n velero --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "512Mi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "500m"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1"}
]'

# Raise the restic memory limit:
kubectl patch daemonset restic -n velero --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}
]'

# Check for OOMKilled containers:
kubectl get pods -n velero -o jsonpath='{.items[*].status.containerStatuses[?(@.lastState.terminated.reason=="OOMKilled")].name}'
```
Step 10: Schedule and Monitor
```bash
# Create a daily backup schedule (02:00 cluster time):
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces '*' \
  --exclude-namespaces velero,kube-system \
  --snapshot-volumes=true

# Check the schedule:
velero schedule get

# Monitor backup status:
velero backup get

# Create a monitoring script:
cat << 'EOF' > /usr/local/bin/monitor-velero.sh
#!/bin/bash
echo "=== Velero Backups ==="
velero backup get

echo ""
echo "=== Failed Backups ==="
velero backup get | grep Failed

echo ""
echo "=== Backup Storage ==="
velero backup-location get

echo ""
echo "=== Restic Repos ==="
velero repo get

echo ""
echo "=== Recent Backup Logs ==="
LATEST=$(velero backup get -o json | jq -r '.items[0].metadata.name')
velero backup logs "$LATEST" | tail -20
EOF
chmod +x /usr/local/bin/monitor-velero.sh

# Prometheus metrics:
kubectl port-forward -n velero svc/velero 8085:8085 &
curl http://localhost:8085/metrics | grep velero_backup

# Key metrics:
#   velero_backup_attempt_total
#   velero_backup_success_total
#   velero_backup_failure_total
#   velero_backup_duration_seconds
```

Alert on backup failures (use `increase()` so the alert resolves rather than firing forever on a cumulative counter):

```yaml
- alert: VeleroBackupFailed
  expr: increase(velero_backup_failure_total[1h]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Velero backup failed"
```
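Beyond failure counts, it is worth checking staleness: a schedule that silently stops producing backups fires no failure metric at all. A standalone sketch of the age check (GNU `date` is assumed; in a real cluster the timestamp would come from the Backup objects via `jq`, as in the commented lines):

```shell
# Report whether the newest backup is older than a threshold in hours.
# Usage: check_backup_age <RFC3339-timestamp> <max-age-hours>
check_backup_age() {
  local last_epoch age_h
  last_epoch=$(date -u -d "$1" +%s)             # GNU date assumed
  age_h=$(( ( $(date -u +%s) - last_epoch ) / 3600 ))
  if [ "$age_h" -gt "$2" ]; then
    echo "STALE: last backup ${age_h}h ago (limit ${2}h)"
  else
    echo "OK: last backup ${age_h}h ago"
  fi
}

# Real usage (jq assumed; RFC3339 timestamps sort lexicographically, so
# `max` picks the newest):
#   LAST=$(velero backup get -o json \
#     | jq -r '[.items[].status.startTimestamp] | max')
#   check_backup_age "$LAST" 25

# Demonstration with a fixed example timestamp:
check_backup_age "2024-01-01T02:00:00Z" 25
```

A limit of 25h gives a daily schedule an hour of slack before alerting.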
Velero Backup Failure Checklist
| Check | Command | Expected |
|---|---|---|
| Backup status | velero backup get | Completed |
| Storage location | velero backup-location get | Available |
| Credentials | kubectl get secret cloud-credentials -n velero | Valid |
| CSI driver | kubectl get csidriver | Present |
| Restic pods | kubectl get ds -n velero | Running |
| Bucket access | aws s3 ls s3://my-velero-bucket | Listed |
Verify the Fix
```bash
# After fixing the backup issue:

# 1. Create a test backup
velero backup create verify-backup --include-namespaces default
# Phase: Completed

# 2. Check backup details
velero backup describe verify-backup
# No errors

# 3. Verify in storage
aws s3 ls s3://my-velero-bucket/backups/verify-backup/
# Backup files present

# 4. Test a restore
velero restore create --from-backup verify-backup
# Restore completed

# 5. Check the schedule is working
velero schedule get
# Next run scheduled

# 6. Monitor ongoing
/usr/local/bin/monitor-velero.sh
# All backups successful
```
Related Issues
- [Fix Kubernetes Etcd Backup Failed](/articles/fix-etcd-wal-corrupted)
- [Fix Consul Snapshot Backup Failed](/articles/fix-consul-snapshot-backup-failed)
- [Fix MinIO Bucket Not Accessible](/articles/fix-minio-bucket-not-accessible)