What's Actually Happening

Velero Kubernetes backup operations fail, preventing cluster state and persistent volume data from being backed up to object storage.

The Error You'll See

Backup failed:

```bash
$ velero backup describe my-backup

Phase:           Failed
Errors:          1
Warnings:        3
Failure Reason:  error executing PVAction for persistent volume
```

Storage error:

```bash
$ velero backup logs my-backup

ERROR: error accessing backup storage location: AccessDenied
ERROR: failed to upload backup: connection refused
```

Snapshot failure:

```bash
ERROR: failed to create volume snapshot: VolumeSnapshotClass not found
ERROR: snapshot creation timeout after 10m
```

Why This Happens

  1. Storage access - S3/GCS/Azure Blob access denied
  2. Volume snapshot - CSI snapshot failed or unsupported
  3. Backup location - BackupStorageLocation not configured
  4. Credentials expired - Cloud credentials need refresh
  5. Network issues - Cannot reach object storage
  6. Pod errors - Restic/Velero pods reporting errors
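
The error strings shown earlier map fairly cleanly onto these causes. A minimal triage sketch — the message-to-cause mapping below is illustrative, not exhaustive, and first match wins:

```bash
#!/usr/bin/env bash
# Map a Velero error line to a likely cause (illustrative patterns only).
triage() {
  case "$1" in
    *AccessDenied*|*"access denied"*)       echo "storage-access" ;;
    *VolumeSnapshotClass*|*snapshot*)       echo "volume-snapshot" ;;
    *"backup storage location"*)            echo "backup-location" ;;
    *ExpiredToken*|*InvalidAccessKeyId*)    echo "credentials" ;;
    *"connection refused"*|*timeout*)       echo "network" ;;
    *)                                      echo "unknown" ;;
  esac
}

triage "error accessing backup storage location: AccessDenied"        # storage-access
triage "failed to create volume snapshot: VolumeSnapshotClass not found"  # volume-snapshot
triage "failed to upload backup: connection refused"                  # network
```

In practice you would feed it lines from `velero backup logs my-backup` and use the result to pick which step below to start from.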

Step 1: Check Backup Status

```bash
# List backups
velero backup get

# Describe the failed backup
velero backup describe my-backup --details

# View backup logs
velero backup logs my-backup

# Check backup storage location
velero backup-location get

# Check snapshot location
velero snapshot-location get

# Check Velero pods
kubectl get pods -n velero

# Check Velero logs
kubectl logs -n velero deployment/velero
```
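
These status checks are easy to script. A minimal sketch that flags non-Completed backups from `velero backup get -o json` output — shown here against a canned sample so it runs without a cluster (on a live cluster, pipe the real command in instead):

```bash
# Hypothetical `velero backup get -o json` output
sample='{"items":[
  {"metadata":{"name":"my-backup"},"status":{"phase":"Failed"}},
  {"metadata":{"name":"nightly"},"status":{"phase":"Completed"}}
]}'

# Print every backup whose phase is not Completed
echo "$sample" | jq -r '.items[]
  | select(.status.phase != "Completed")
  | "\(.metadata.name): \(.status.phase)"'
# my-backup: Failed
```

The same `jq` filter works verbatim on live output: `velero backup get -o json | jq -r '...'`.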

Step 2: Check Backup Storage Location

```bash
# List backup locations
velero backup-location get

# Describe the location
velero backup-location describe default

# Check the location config
kubectl get backupstoragelocation default -n velero -o yaml
```

The location should look like this:

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-velero-bucket
    prefix: backups
  config:
    region: us-east-1
  accessMode: ReadWrite
```

```bash
# Check the bucket exists
aws s3 ls s3://my-velero-bucket

# Test bucket access
aws s3 cp test.txt s3://my-velero-bucket/test.txt
```

Step 3: Check Cloud Credentials

```bash
# Check the credentials secret
kubectl get secret cloud-credentials -n velero -o yaml

# Decode the credentials
kubectl get secret cloud-credentials -n velero -o jsonpath='{.data.cloud}' | base64 -d

# For AWS, the decoded file should contain:
# [default]
# aws_access_key_id = AKIA...
# aws_secret_access_key = ...

# Test the credentials from a machine with the aws CLI installed
AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=xxx aws s3 ls s3://my-velero-bucket

# Update the credentials
kubectl create secret generic cloud-credentials \
  --namespace velero \
  --from-file cloud=/path/to/credentials \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart Velero so it picks up the new secret
kubectl rollout restart deployment/velero -n velero
```
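
If `aws s3 ls` fails with AccessDenied, the IAM identity behind the credentials likely lacks the S3 permissions Velero needs. A minimal S3-side policy sketch, using this article's example bucket name — the Velero AWS plugin documentation lists the authoritative set, which also includes EC2 permissions if you use native volume snapshots:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::my-velero-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-velero-bucket"
    }
  ]
}
```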

Step 4: Check Volume Snapshots

```bash
# List snapshot locations
velero snapshot-location get

# Check VolumeSnapshotClass
kubectl get volumesnapshotclass

# Verify the CSI driver supports snapshots
kubectl get csidriver

# Check a VolumeSnapshotClass exists for your storage
kubectl get volumesnapshotclass -o yaml
```

Example VolumeSnapshotClass:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: velero-snapclass
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Delete
```

```bash
# Check the CSI snapshot controller is running
kubectl get pods -n kube-system | grep snapshot

# Manual snapshot debugging
kubectl get volumesnapshots -A
kubectl describe volumesnapshot <name> -n <namespace>
```
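
A common cause of `VolumeSnapshotClass not found` is that no snapshot class matches the PVC's CSI driver. A small sketch of that check, fed hypothetical driver lists so it runs offline — on a real cluster you would populate them from `kubectl get volumesnapshotclass -o jsonpath='{.items[*].driver}'` and the provisioner of the PVC's StorageClass:

```bash
# Return success if any known VolumeSnapshotClass driver matches
# the PVC's CSI driver.
has_snapclass() {
  local driver="$1"; shift
  local d
  for d in "$@"; do
    [ "$d" = "$driver" ] && return 0
  done
  return 1
}

# Hypothetical cluster state; unquoted on purpose so it word-splits
snapclass_drivers="ebs.csi.aws.com"

if has_snapclass "ebs.csi.aws.com" $snapclass_drivers; then
  echo "snapshot class present"
else
  echo "no snapshot class for driver"
fi
```

If the check fails, create a VolumeSnapshotClass like the example above, with `driver` set to your PVC's provisioner.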

Step 5: Check Restic DaemonSet

```bash
# For restic-based backups, check the DaemonSet
kubectl get daemonset -n velero

# Check restic pods
kubectl get pods -n velero -l name=velero-restic

# Check restic logs
kubectl logs -n velero daemonset/velero-restic

# Check restic repositories
velero repo get

# Check repo maintenance
velero repo describe default

# If the restic repo is corrupted, prune it
velero repo prune default

# Or recreate the repo
velero repo delete default
kubectl delete secret velero-restic-credentials -n velero

# Inspect restic pod issues
kubectl describe pod -n velero -l name=velero-restic
```

Step 6: Check Namespace Inclusions

```bash
# Check the backup spec
kubectl get backup my-backup -n velero -o yaml
```

Included namespaces:

```yaml
spec:
  includedNamespaces:
    - default
    - app-namespace
  excludedNamespaces:
    - kube-system
    - velero
```

For all namespaces:

```yaml
spec:
  includedNamespaces:
    - '*'
  excludedNamespaces:
    - velero  # Don't back up Velero itself
```

Resource inclusions:

```yaml
spec:
  includedResources:
    - pods
    - persistentvolumeclaims
    - secrets
    - configmaps
  excludedResources:
    - events
    - pods.log
```

```bash
# Update the backup spec
kubectl patch backup my-backup -n velero --type=merge -p '{"spec":{"includedNamespaces":["*"]}}'
```
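
The fragments above combine into a complete Backup manifest. A sketch, reusing this article's example names — field values are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: my-backup
  namespace: velero
spec:
  includedNamespaces:
    - '*'
  excludedNamespaces:
    - velero
  excludedResources:
    - events
  snapshotVolumes: true
  ttl: 720h0m0s   # keep for 30 days
```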

Step 7: Check Velero Logs

```bash
# Follow Velero deployment logs
kubectl logs -n velero deployment/velero -f

# Look for specific errors
kubectl logs -n velero deployment/velero | grep -i error

# Check for storage errors (-E enables the | alternation)
kubectl logs -n velero deployment/velero | grep -iE "access|denied|bucket"

# Check for snapshot errors
kubectl logs -n velero deployment/velero | grep -iE "snapshot|csi"

# Enable debug logging
kubectl set env deployment/velero -n velero LOG_LEVEL=debug

# Restart to apply
kubectl rollout restart deployment/velero -n velero

# Check debug logs
kubectl logs -n velero deployment/velero -f | grep -i debug
```

Step 8: Test Backup Manually

```bash
# Create a simple test backup (no volume snapshots)
velero backup create test-backup \
  --include-namespaces default \
  --snapshot-volumes=false

# Check status
velero backup describe test-backup --details

# View logs
velero backup logs test-backup

# Check files in S3
aws s3 ls s3://my-velero-bucket/backups/test-backup/

# Verify backup contents
velero backup download test-backup
tar -tzf test-backup.tar.gz

# If the test backup succeeds, the issue is with a specific workload.
# Test with that namespace:
velero backup create ns-backup \
  --include-namespaces myapp \
  --snapshot-volumes=true
```
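
Checking the test backup can also be scripted by parsing the `Phase:` line of `velero backup describe`. A sketch against a canned sample so it runs anywhere — live, you would pipe the real describe output through `phase_of`:

```bash
# Extract the backup phase from `velero backup describe` output.
phase_of() {
  awk -F': *' '/^Phase:/ {print $2; exit}'
}

# Hypothetical describe output; live usage:
#   velero backup describe test-backup | phase_of
sample="Name:    test-backup
Phase:   Completed
Errors:  0"

echo "$sample" | phase_of
# Completed
```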

Step 9: Check Resource Limits

```bash
# Check Velero resource limits
kubectl describe deployment velero -n velero | grep -A 10 "Containers:"

# Increase resources
kubectl patch deployment velero -n velero --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "512Mi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "500m"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1"}
]'

# Increase restic resources
kubectl patch daemonset velero-restic -n velero --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}
]'

# Check for OOMKilled containers
kubectl get pods -n velero -o jsonpath='{.items[*].status.containerStatuses[?(@.lastState.terminated.reason=="OOMKilled")].name}'
```

Step 10: Schedule and Monitor

```bash
# Create a backup schedule (daily at 02:00)
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces "*" \
  --exclude-namespaces "velero,kube-system" \
  --snapshot-volumes=true

# Check the schedule
velero schedule get

# Monitor backup status
velero backup get

# Create a monitoring script
cat << 'EOF' > /usr/local/bin/monitor-velero.sh
#!/bin/bash

echo "=== Velero Backups ==="
velero backup get

echo ""
echo "=== Failed Backups ==="
velero backup get | grep Failed

echo ""
echo "=== Backup Storage ==="
velero backup-location get

echo ""
echo "=== Restic Repos ==="
velero repo get

echo ""
echo "=== Recent Backup Logs ==="
LATEST=$(velero backup get -o json | jq -r '.items[0].metadata.name')
velero backup logs "$LATEST" | tail -20
EOF

chmod +x /usr/local/bin/monitor-velero.sh

# Prometheus metrics
kubectl port-forward -n velero svc/velero 8085:8085 &
curl http://localhost:8085/metrics | grep velero_backup

# Key metrics:
# velero_backup_attempt_total
# velero_backup_success_total
# velero_backup_failure_total
# velero_backup_duration_seconds
```

Alert for backup failures:

```yaml
- alert: VeleroBackupFailed
  expr: velero_backup_failure_total > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Velero backup failed"
```
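
The `velero schedule create` command above can equivalently be expressed as a Schedule manifest, which is easier to keep in version control. A sketch under the same names:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"   # daily at 02:00
  template:               # an embedded Backup spec
    includedNamespaces:
      - '*'
    excludedNamespaces:
      - velero
      - kube-system
    snapshotVolumes: true
```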

Velero Backup Failure Checklist

| Check | Command | Expected |
|-------|---------|----------|
| Backup status | `velero backup get` | Completed |
| Storage location | `velero backup-location get` | Available |
| Credentials | `kubectl get secret cloud-credentials` | Valid |
| CSI driver | `kubectl get csidriver` | Present |
| Restic pods | `kubectl get ds -n velero` | Running |
| Bucket access | `aws s3 ls` | Listed |

Verify the Fix

```bash
# After fixing the backup issue:

# 1. Create a test backup
velero backup create verify-backup --include-namespaces default
# Phase: Completed

# 2. Check backup details
velero backup describe verify-backup
# No errors

# 3. Verify in storage
aws s3 ls s3://my-velero-bucket/backups/verify-backup/
# Backup files present

# 4. Test restore
velero restore create --from-backup verify-backup
# Restore completed

# 5. Check the schedule is working
velero schedule get
# Next run scheduled

# 6. Monitor ongoing
/usr/local/bin/monitor-velero.sh
# All backups successful
```

  • [Fix Kubernetes Etcd Backup Failed](/articles/fix-etcd-wal-corrupted)
  • [Fix Consul Snapshot Backup Failed](/articles/fix-consul-snapshot-backup-failed)
  • [Fix MinIO Bucket Not Accessible](/articles/fix-minio-bucket-not-accessible)