# Argo Workflow Failed: Complete Troubleshooting Guide
Argo Workflows is a Kubernetes-native workflow orchestrator. When workflows fail, debugging requires understanding both Argo's workflow definitions and Kubernetes pod behavior. Failures can stem from template issues, resource constraints, pod execution problems, or cluster-level errors.
Let me walk through the most common Argo Workflow failures and how to fix each one.
## Understanding Argo Workflow States
Workflow states indicate where problems might be:
| State | Meaning | Common Cause |
|---|---|---|
| Pending | Waiting to start | Resource constraints, admission issues |
| Running | Currently executing | Normal state |
| Succeeded | Completed successfully | None |
| Failed | At least one step failed | Task error, pod crash |
| Error | Workflow couldn't run | Template error, infrastructure issue |
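One way to react to terminal states automatically is an exit handler: `spec.onExit` names a template that runs after the workflow finishes, and the `{{workflow.status}}` variable resolves to the final phase. A minimal sketch (the `notify` template name is illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: with-exit-handler-
spec:
  entrypoint: main
  onExit: notify  # Runs whether the workflow Succeeded, Failed, or Errored
  templates:
    - name: main
      container:
        image: alpine
        command: [echo, working]
    - name: notify
      container:
        image: alpine
        command: [echo, "workflow finished with status {{workflow.status}}"]
```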
## Fix 1: Workflow Template Syntax Errors
The workflow YAML has validation errors.
Symptoms:
- "workflow is invalid"
- Template not found
- Admission webhook rejection
Diagnosis:
```bash
# Validate workflow before submission
argo lint workflow.yaml

# Check workflow template exists
kubectl get workflowtemplates -n argo

# Check cluster workflow templates
kubectl get clusterworkflowtemplates
```
Solution A: Fix YAML syntax:
```yaml
# Common issues

# WRONG - missing spec: entrypoint and templates sit at the top level
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-workflow
entrypoint: main        # spec: is required at the top level
templates:
  - name: main
    container:
      image: alpine
      command: [echo, hello]

# CORRECT - complete spec
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-workflow
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.18  # Pin the image tag
        command: [echo]
        args: [hello]
```
Solution B: Reference templates correctly:
```yaml
# Using WorkflowTemplate reference
spec:
  entrypoint: main
  workflowTemplateRef:
    name: my-template  # Must exist as a WorkflowTemplate
```
Solution C: Fix template names:
```yaml
spec:
  entrypoint: main-steps  # Must match a template name below
  templates:
    - name: main-steps    # Matching name
      steps:
        - - name: step1
            template: hello  # Must match another template
    - name: hello
      container:
        image: alpine
        command: [echo, hello]
```
## Fix 2: Pod Execution Failures
Workflow pods fail to run.
Symptoms:
- Pod shows "Error" or "Failed"
- Container exits with non-zero code
- OOMKilled status
Diagnosis:
```bash
# Get workflow details
argo get my-workflow -n argo

# Find workflow pods
kubectl get pods -n argo -l workflows.argoproj.io/workflow=my-workflow

# Check pod logs
kubectl logs my-workflow-pod-123 -n argo

# Check pod status
kubectl describe pod my-workflow-pod-123 -n argo

# Check specific step container
kubectl logs my-workflow-pod-123 -c main -n argo
```
Solution A: Fix container command:
```yaml
templates:
  - name: build
    container:
      image: node:20
      command: [sh, -c]  # Use shell for complex commands
      args:
        - |
          npm install
          npm run build
```
Solution B: Fix failing script:
```yaml
# Script template with error handling
templates:
  - name: deploy
    script:
      image: python:3.11
      command: [python]  # Script templates need a command to run the source
      source: |
        import subprocess
        try:
            subprocess.run(['./deploy.sh'], check=True)
        except subprocess.CalledProcessError as e:
            print(f"Deploy failed: {e}")
            raise
```
Solution C: Check exit codes:
```bash
# In pod logs, look for:
# "exit code 127" - command not found
# "exit code 1"   - general failure
# "exit code 137" - killed by SIGKILL (usually OOMKilled)
# "exit code 139" - segmentation fault (SIGSEGV)
```

For OOMKilled pods, increase memory:

```yaml
templates:
  - name: memory-intensive
    container:
      image: node:20
      resources:
        requests:
          memory: "1Gi"
        limits:
          memory: "2Gi"
```
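These exit codes come from the shell and kernel, not from Argo (a process killed by signal N exits with 128 + N), so you can reproduce them locally:

```shell
# 127: command not found
sh -c 'no-such-command' 2>/dev/null
echo "exit: $?"   # exit: 127

# 1: general failure
sh -c 'exit 1'
echo "exit: $?"   # exit: 1

# 137: killed by SIGKILL (128 + 9), what an OOMKilled container reports
sh -c 'kill -KILL $$' 2>/dev/null
echo "exit: $?"   # exit: 137
```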
## Fix 3: Resource Constraints and Scheduling
Pods can't be scheduled or crash from resource limits.
Symptoms:
- "Insufficient cpu"
- "Insufficient memory"
- "0/1 nodes are available"
- OOMKilled status
Diagnosis:
```bash
# Check node resources
kubectl describe nodes

# Check pending pod events
kubectl describe pod pending-pod -n argo

# Check workflow resource requests
argo get my-workflow -n argo
```
Solution A: Set resource requests:
```yaml
templates:
  - name: build
    container:
      image: node:20
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
```
Solution B: Use resource templates:
```yaml
# Patch resources into every pod at the workflow level
spec:
  podSpecPatch: |
    containers:
      - name: main
        resources:
          limits:
            memory: "4Gi"
```

Workflow-level cleanup and deadline settings also keep resource usage bounded:

```yaml
spec:
  podGC:
    strategy: OnWorkflowSuccess
  activeDeadlineSeconds: 3600
```
Solution C: Configure node selectors:
```yaml
templates:
  - name: gpu-task
    container:
      image: tensorflow/tensorflow:latest-gpu
    nodeSelector:  # Template-level field, a sibling of container
      accelerator: nvidia-tesla-k80
```
## Fix 4: Workflow Timeout Issues
Workflows or steps exceed time limits.
Symptoms:
- "workflow exceeded activeDeadlineSeconds"
- Step timeout
- Workflow stuck in Running state
Solution A: Set workflow timeout:
```yaml
spec:
  activeDeadlineSeconds: 3600  # 1 hour max for the whole workflow

  # For individual steps
  templates:
    - name: long-task
      container:
        image: alpine
      activeDeadlineSeconds: 300  # 5 minutes for this step
```
Solution B: Handle stuck workflows:
```bash
# Stop stuck workflow (still runs exit handlers)
argo stop my-workflow -n argo

# Terminate stuck workflow (skips exit handlers)
argo terminate my-workflow -n argo

# Delete workflow
argo delete my-workflow -n argo
```
Solution C: Configure retry:
```yaml
templates:
  - name: flaky-task
    container:
      image: alpine
      # busybox wget; plain alpine has no curl by default
      command: [sh, -c, "wget -qO- https://flaky-api.com"]
    retryStrategy:
      limit: 3  # Retry up to 3 times
      backoff:
        duration: "5s"
        factor: 2
        maxDuration: "1m"
```
## Fix 5: Artifact Handling Failures
Artifacts fail to upload or download.
Symptoms:
- "failed to save artifact"
- "failed to load artifact"
- Artifact not found
Solution A: Configure artifact repository:
```yaml
# In workflow-controller-configmap
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  artifactRepository: |
    s3:
      bucket: my-bucket
      endpoint: s3.amazonaws.com
      accessKeySecret:
        name: aws-credentials
        key: accessKey
      secretKeySecret:
        name: aws-credentials
        key: secretKey
```
Solution B: Define input/output artifacts:
```yaml
templates:
  - name: generate-artifact
    container:
      image: alpine
      command: [sh, -c]
      args:
        - |
          echo "artifact content" > /tmp/result.txt
    outputs:
      artifacts:
        - name: result
          path: /tmp/result.txt
          s3:
            bucket: my-bucket
            key: result.txt

  - name: consume-artifact
    inputs:
      artifacts:
        - name: result
          path: /tmp/input.txt
          s3:
            bucket: my-bucket
            key: result.txt
```
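When the artifact repository is configured, steps can also hand artifacts to each other directly with `from`, without spelling out bucket keys for each side. A sketch reusing the two templates above:

```yaml
templates:
  - name: main
    steps:
      - - name: generate
          template: generate-artifact
      - - name: consume
          template: consume-artifact
          arguments:
            artifacts:
              - name: result
                from: "{{steps.generate.outputs.artifacts.result}}"
```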
Solution C: Use inline artifacts:
```yaml
templates:
  - name: pass-data
    container:
      image: alpine
      command: [sh, -c]
      args:
        - |
          cat <<EOF > /tmp/config.yaml
          key: value
          EOF
    outputs:
      artifacts:
        - name: config
          path: /tmp/config.yaml
```
## Fix 6: Parameter and Input Issues
Parameters don't pass correctly between steps.
Symptoms:
- {{inputs.parameters.xxx}} not resolved
- Parameter value empty or wrong
- Template rendering errors
Solution A: Define parameters correctly:
```yaml
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: message
        value: "hello world"
  templates:
    - name: main
      inputs:
        parameters:
          - name: message
      container:
        image: alpine
        command: [echo]
        args: ["{{inputs.parameters.message}}"]
```
Solution B: Pass parameters between templates:
```yaml
templates:
  - name: main
    steps:
      - - name: generate
          template: generate-value
      - - name: use-value
          template: print-value
          arguments:
            parameters:
              - name: value
                # Named output parameters are referenced via outputs.parameters
                value: "{{steps.generate.outputs.parameters.result}}"

  - name: generate-value
    container:
      image: alpine
      command: [sh, -c]
      args: ["echo 'generated-value' > /tmp/result"]
    outputs:
      parameters:
        - name: result
          valueFrom:
            path: /tmp/result

  - name: print-value
    inputs:
      parameters:
        - name: value
    container:
      image: alpine
      command: [echo]
      args: ["{{inputs.parameters.value}}"]
```
Solution C: Use global parameters:
```yaml
spec:
  entrypoint: use-global  # Must match the template name
  arguments:
    parameters:
      - name: global-value
        value: "shared"
  templates:
    - name: use-global
      container:
        image: alpine
        command: [echo]
        args: ["{{workflow.parameters.global-value}}"]
```
## Fix 7: DAG and Steps Execution Errors
Workflow DAG or steps don't execute correctly.
Symptoms:
- Steps run in wrong order
- Dependencies not respected
- DAG validation errors
Solution A: Fix DAG dependencies:
```yaml
templates:
  - name: dag-workflow
    dag:
      tasks:
        - name: task-a
          template: process
        - name: task-b
          template: process
          dependencies: [task-a]  # Must complete before task-b
        - name: task-c
          template: process
          dependencies: [task-a, task-b]
```
Solution B: Fix steps sequence:
```yaml
templates:
  - name: steps-workflow
    steps:
      # Steps in the same inner list run in parallel
      - - name: step-1a
          template: process
        - name: step-1b
          template: process
      # A new inner list runs only after the previous group finishes;
      # steps templates order by nesting and have no dependencies field
      - - name: step-2
          template: process
```
Solution C: Handle task failures:
```yaml
templates:
  - name: dag-with-fallback
    dag:
      tasks:
        - name: risky-task
          template: risky-process
          continueOn:
            failed: true  # Continue even if this fails
        - name: fallback
          template: fallback-process
          dependencies: [risky-task]
```
## Fix 8: Service Account and RBAC Issues
Workflows fail due to permission errors.
Symptoms:
- "cannot create resource"
- "User cannot list resource"
- RBAC denied errors
Solution A: Create workflow service account:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: workflow-sa
  namespace: argo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-role
  namespace: argo
rules:
  - apiGroups: [""]
    resources: [pods, pods/log]
    verbs: [create, get, list, watch, delete]
  - apiGroups: [argoproj.io]
    resources: [workflows, workflowtemplates]
    verbs: [create, get, list, watch, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-binding
  namespace: argo
subjects:
  - kind: ServiceAccount
    name: workflow-sa
    namespace: argo  # Required for ServiceAccount subjects
roleRef:
  kind: Role
  name: workflow-role
  apiGroup: rbac.authorization.k8s.io
```
Solution B: Reference service account in workflow:
```yaml
spec:
  serviceAccountName: workflow-sa
  entrypoint: main
```
Solution C: Check RBAC permissions:
```bash
# Test permissions
kubectl auth can-i create pods --as=system:serviceaccount:argo:workflow-sa -n argo
kubectl auth can-i list workflows --as=system:serviceaccount:argo:workflow-sa -n argo
```
## Fix 9: Volume and PVC Issues
Workflows fail to mount volumes.
Symptoms:
- "PersistentVolumeClaim not found"
- "FailedMount"
- Volume mount timeout
Solution A: Create PVC for workflow:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workflow-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  entrypoint: use-volume
  volumes:
    - name: workdir
      persistentVolumeClaim:
        claimName: workflow-pvc
  templates:
    - name: use-volume
      container:
        image: alpine
        volumeMounts:
          - name: workdir
            mountPath: /workdir
```
Solution B: Use existing PV:
```yaml
spec:
  volumes:
    - name: shared-data
      persistentVolumeClaim:
        claimName: existing-pvc
```
Solution C: Use ephemeral volume:
```yaml
spec:
  volumes:
    - name: cache-volume
      emptyDir:
        sizeLimit: 500Mi
```
## Fix 10: Parallelism and Resource Exhaustion
Too many parallel tasks exhaust resources.
Symptoms:
- Cluster overloaded
- Pods pending
- Slow execution
Solution A: Limit parallelism:
```yaml
spec:
  entrypoint: many-tasks
  parallelism: 5  # Max 5 concurrent pods
  templates:
    - name: many-tasks
      dag:
        tasks:
          - name: task-{{item}}
            template: process
            withParam: "[1,2,3,4,5,6,7,8,9,10]"
```
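Beyond per-workflow `parallelism`, Argo also supports synchronization limits: a semaphore backed by a ConfigMap caps how many workflows (or templates) hold a lock at once across the namespace. A sketch, assuming a ConfigMap named `semaphore-config`; note that field names vary by Argo version (newer releases accept a `semaphores` list):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config
data:
  workflow: "2"  # At most 2 workflows hold this lock at once
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphore-config
        key: workflow
```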
Solution B: Clean up completed pods to free resources:
```yaml
spec:
  podGC:
    strategy: OnWorkflowSuccess  # Delete pods once the workflow succeeds
  ttlStrategy:
    secondsAfterCompletion: 300  # Delete the workflow 5 minutes after completion
```
Solution C: Distribute across nodes:
```yaml
templates:
  - name: distributed-task
    container:
      image: alpine
    affinity:  # Template-level field, a sibling of container
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: workflows.argoproj.io/workflow
                    operator: In
                    values:
                      - my-workflow
              topologyKey: kubernetes.io/hostname
```
## Quick Reference: Argo Errors
| Error | Cause | Solution |
|---|---|---|
| Template invalid | YAML syntax | Use argo lint |
| Pod failed | Container error | Check logs, fix command |
| Insufficient resources | Node limits | Set resources, add nodes |
| Workflow timeout | Deadline exceeded | Increase activeDeadlineSeconds |
| Artifact failed | S3/config issue | Configure artifact repository |
| Parameter empty | Wrong syntax | Use {{inputs.parameters.xxx}} |
| RBAC denied | Missing permissions | Create service account, role |
| Volume mount failed | PVC missing | Create PVC, reference correctly |
## Debugging Commands
```bash
# Validate workflow
argo lint workflow.yaml

# Submit workflow
argo submit workflow.yaml -n argo

# Get workflow status
argo get my-workflow -n argo

# Watch workflow
argo watch my-workflow -n argo

# Get workflow logs
argo logs my-workflow -n argo

# List workflow pods
kubectl get pods -n argo -l workflows.argoproj.io/workflow=my-workflow

# Describe pod
kubectl describe pod pod-name -n argo

# Get pod container logs
kubectl logs pod-name -c main -n argo

# Stop workflow
argo stop my-workflow -n argo

# Delete workflow
argo delete my-workflow -n argo

# List workflows
argo list -n argo
```