Introduction

Kubernetes CrashLoopBackOff occurs when a pod repeatedly crashes and restarts, with Kubernetes backing off for progressively longer intervals before each restart. The pod enters a failure cycle: the container starts, crashes (exits with a non-zero code), waits out an exponential backoff (10s, 20s, 40s, capped at 5 minutes), then restarts. This state indicates a fundamental application or configuration issue preventing successful startup. Common causes include application bugs, missing configuration, failed health checks, resource exhaustion, and dependency unavailability.

Symptoms

  • kubectl get pods shows CrashLoopBackOff or Error status
  • Pod restart count increases continuously
  • Container logs show startup errors or panics
  • Events show repeated Started container followed by Killing container
  • Issue appears after deploy, configuration change, or dependency outage
  • Different pods in the same deployment may show different states (Running, CrashLoopBackOff)

Common Causes

  • Application crash during startup (code error, unhandled exception)
  • Missing or invalid ConfigMap/Secret values
  • Environment variables not set or incorrect
  • Resource limits too low (OOMKilled)
  • Liveness probe failing before application ready
  • Port binding conflicts or address already in use
  • Database or external dependency unavailable
  • Image pull errors or missing binaries in container

Step-by-Step Fix

### 1. Check pod status and restart count

Get detailed pod status:

```bash
# Check pod status
kubectl get pods -n <namespace>

# Output:
# NAME                    READY   STATUS             RESTARTS   AGE
# myapp-5d4f6c7b8-x9y2z   0/1     CrashLoopBackOff   15         45m

# Get detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# Key sections to check:
# - State: Waiting/CrashLoopBackOff
# - Last State: Terminated (shows exit code)
# - Reason: Error, OOMKilled, Completed
# - Exit Code: 0 (success), 1-255 (error codes)
# - Restart Count: Number of restarts
```

Exit code meanings:

  • 0: Clean exit (but the container may have completed unexpectedly)
  • 1: Application error (unhandled exception, panic)
  • 125: Container runtime error (image not found, permission denied)
  • 126: Command cannot execute (permission denied)
  • 127: Command not found
  • 137: OOMKilled (128 + SIGKILL=9), memory limit exceeded
  • 143: SIGTERM (128 + 15), graceful shutdown
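The 128+signal convention is easy to verify in any local shell; this sketch kills a child shell with SIGKILL and reads back the status:

```shell
# A child shell kills itself with SIGKILL (signal 9); the parent
# observes exit status 128 + 9 = 137, the same code Kubernetes
# reports for an OOMKilled container.
sh -c 'kill -KILL $$'
echo $?   # 137
```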

### 2. Check container logs

View application logs for error details:

```bash
# Get current container logs
kubectl logs <pod-name> -n <namespace>

# Get logs from previous (crashed) instance
kubectl logs <pod-name> -n <namespace> --previous

# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace>

# For multi-container pods
kubectl logs <pod-name> -n <namespace> -c <container-name>
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous
```

Log analysis patterns:

```bash
# Search for errors
kubectl logs <pod-name> -n <namespace> --previous | grep -iE "error|exception|fatal|panic"

# Check startup sequence
kubectl logs <pod-name> -n <namespace> --previous | head -50

# Check database connectivity errors
kubectl logs <pod-name> -n <namespace> --previous | grep -iE "database|connection|sql|redis"
```

### 3. Check pod events

Kubernetes events reveal scheduling and lifecycle issues:

```bash
# Get events for specific pod
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

# Get all namespace events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Or with describe
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events:"
```

Key event types:

  • Scheduled: Pod assigned to node
  • Pulled: Container image pulled successfully
  • Created: Container created
  • Started: Container started
  • Killing: Container being killed (check reason)
  • BackOff: Restart backed off
  • Unhealthy: Probe failed

### 4. Check for OOMKilled (Out of Memory)

Memory limit exhaustion is a common cause:

```bash
# Check if container was OOMKilled
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Should output: OOMKilled

# Check memory limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.limits.memory}'

# Check actual memory usage before crash
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Last State:"

# If OOMKilled, increase memory limits
kubectl edit deployment <deployment-name> -n <namespace>
```

Then raise the limit in the container spec:

```yaml
spec:
  containers:
    - name: <container-name>
      resources:
        limits:
          memory: 512Mi   # Increase from 256Mi
        requests:
          memory: 256Mi
```

### 5. Check liveness and readiness probe configuration

Probe failures can cause restart loops:

```bash
# Check probe configuration
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].livenessProbe}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].readinessProbe}'

# Or with describe
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Liveness:"
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Readiness:"
```

Common probe issues:

```yaml
# WRONG: Probe fires before application ready
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5    # Too short for app startup
  periodSeconds: 10
  failureThreshold: 3       # Fails after 30 seconds total

# CORRECT: Allow time for startup
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 60   # Wait 60s before first probe
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3       # Fail after 3 consecutive failures
  successThreshold: 1

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```

Temporarily disable probes for debugging:

```bash
kubectl edit deployment <deployment-name> -n <namespace>
# Comment out livenessProbe and readinessProbe
# Apply and observe if pod stays running
```
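An alternative to removing the probes: a startupProbe suspends liveness checking until the application responds for the first time, which separates slow startup from a genuine hang. A sketch (path and port are assumptions, match them to your app):

```yaml
startupProbe:
  httpGet:
    path: /healthz    # assumed health endpoint
    port: 8080        # assumed app port
  periodSeconds: 5
  failureThreshold: 30   # tolerates up to ~150s of startup time
```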

### 6. Check ConfigMap and Secret references

Missing configuration causes startup failures:

```bash
# Check which ConfigMaps/Secrets pod references
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].envFrom}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].env}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].volumeMounts}'

# Check if ConfigMaps exist
kubectl get configmap -n <namespace>

# Check if Secrets exist
kubectl get secret -n <namespace>

# Validate ConfigMap content
kubectl get configmap <configmap-name> -n <namespace> -o yaml

# Check for missing environment variables
kubectl describe pod <pod-name> -n <namespace> | grep -A20 "Environment:"
```

Common issues:

  • ConfigMap/Secret doesn't exist in the namespace
  • Key names don't match what the application expects
  • ConfigMap mounted as a file while the application expects an environment variable
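A fail-fast pattern for the first two issues is to reference keys explicitly, so a missing ConfigMap or key surfaces as CreateContainerConfigError instead of an obscure crash later. A sketch with hypothetical names:

```yaml
env:
  - name: DATABASE_URL
    valueFrom:
      configMapKeyRef:
        name: app-config     # hypothetical ConfigMap; must exist in this namespace
        key: database-url    # key name must match exactly
        # optional: false is the default: the container will not start
        # without this key, surfacing the problem immediately
```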

### 7. Check application dependencies

The application may crash while waiting for dependencies. Note that kubectl exec needs a running container, so for a pod that dies immediately, run these checks from another running pod in the namespace or from a debug pod (see step 9):

```bash
# Check if database is reachable from pod
kubectl exec <pod-name> -n <namespace> -it -- nc -zv <db-host> 5432

# Or with timeout
kubectl exec <pod-name> -n <namespace> -it -- timeout 5 bash -c "cat < /dev/null > /dev/tcp/<db-host>/5432"

# Check if Redis/cache is available
kubectl exec <pod-name> -n <namespace> -it -- redis-cli -h <redis-host> ping

# Check DNS resolution
kubectl exec <pod-name> -n <namespace> -it -- nslookup <service-name>

# Test HTTP endpoints
kubectl exec <pod-name> -n <namespace> -it -- curl -v http://<dependency-service>/health
```

### 8. Check for port conflicts

Application may fail to bind to port:

```bash
# Check what port application is trying to bind
kubectl logs <pod-name> -n <namespace> --previous | grep -iE "bind|listen|port|address"

# Common error: "address already in use"
# Means another process in container has the port

# Check container for multiple processes
kubectl exec <pod-name> -n <namespace> -it -- ps aux

# Check listening ports
kubectl exec <pod-name> -n <namespace> -it -- netstat -tlnp

# Check if port is already bound
kubectl exec <pod-name> -n <namespace> -it -- ss -tlnp | grep :8080
```

Verify container port matches application:

```yaml
# Deployment should expose correct port
spec:
  containers:
    - name: app
      ports:
        - containerPort: 8080   # Must match what app binds to
```

The application's own config (app-config.yaml) should agree:

```yaml
server:
  port: 8080   # Same as containerPort
```

### 9. Debug with interactive shell

Get shell access to debug container:

```bash
# Run interactive shell (if image has shell)
kubectl run -it debug-pod -n <namespace> --image=<same-image> --rm --restart=Never -- /bin/sh

# Or override entrypoint for debugging
kubectl run -it debug-pod -n <namespace> --image=<same-image> --rm --restart=Never -- /bin/bash

# Mount same volumes
kubectl run -it debug-pod -n <namespace> \
  --image=<same-image> \
  --rm --restart=Never \
  --overrides='{"spec":{"volumes":[{"name":"config","configMap":{"name":"<configmap>"}}],"containers":[{"name":"debug","image":"<image>","volumeMounts":[{"name":"config","mountPath":"/config"}]}]}}' \
  -- /bin/sh
```

Check application binary and permissions:

```bash
# Check if binary exists
ls -la /app/myapp

# Check permissions
ls -la /app/

# Check if binary can execute
/app/myapp --version

# Check for missing libraries
ldd /app/myapp 2>&1 | grep "not found"
```

### 10. Enable debug logging

Increase application log verbosity:

```bash
# Add debug environment variable
kubectl edit deployment <deployment-name> -n <namespace>
```

Then add to the container spec:

```yaml
spec:
  containers:
    - name: app
      env:
        - name: DEBUG
          value: "true"
        - name: LOG_LEVEL
          value: "debug"
```

For Java applications:

```yaml
spec:
  containers:
    - name: app
      env:
        - name: JAVA_OPTS
          value: "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
```

For Go applications (with zap/logrus):

```yaml
spec:
  containers:
    - name: app
      env:
        - name: LOG_FORMAT
          value: "json"
        - name: LOG_LEVEL
          value: "debug"
```

Prevention

  • Set appropriate resource requests and limits based on load testing
  • Configure liveness probes with adequate initialDelaySeconds
  • Use readiness probes to prevent traffic before fully ready
  • Implement graceful shutdown with SIGTERM handling
  • Add startup probes for slow-starting applications
  • Use PodDisruptionBudget to prevent simultaneous restarts
  • Implement proper health check endpoints (/healthz, /ready)
  • Monitor restart count as a leading indicator

```yaml
# Production-ready probe configuration
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 30   # Allow up to 150 seconds for startup
```
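The graceful-shutdown bullet above can be sketched as a minimal shell entrypoint (the binary path is hypothetical) that traps SIGTERM, forwards it to the app, and exits 0 so shutdowns read as clean rather than exit 143:

```shell
#!/bin/sh
# Minimal entrypoint sketch: run the app as a child process, forward
# SIGTERM to it on shutdown, and exit 0 after it stops.
trap 'kill -TERM "$child" 2>/dev/null; wait "$child"; exit 0' TERM

/app/myapp &        # hypothetical application binary
child=$!
wait "$child"
```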

Related Statuses

  • **OOMKilled**: Container exceeded memory limit
  • **ImagePullBackOff**: Container image cannot be pulled
  • **Error**: Container exited with non-zero code
  • **Pending**: Pod cannot be scheduled to a node