# Argo Workflow Failed: Complete Troubleshooting Guide

Argo Workflows is a Kubernetes-native workflow orchestrator. When workflows fail, debugging requires understanding both Argo's workflow definitions and Kubernetes pod behavior. Failures can stem from template issues, resource constraints, pod execution problems, or cluster-level errors.

Let me walk through the most common Argo Workflow failures and how to fix each one.

## Understanding Argo Workflow States

Workflow states indicate where problems might be:

| State | Meaning | Common Cause |
|---|---|---|
| Pending | Waiting to start | Resource constraints, admission issues |
| Running | Currently executing | Normal state |
| Succeeded | Completed successfully | None |
| Failed | At least one step failed | Task error, pod crash |
| Error | Workflow couldn't run | Template error, infrastructure issue |
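To see at a glance which workflows need attention, the STATUS column of `argo list` output can be filtered with standard shell tools. A minimal sketch; the heredoc below is sample data standing in for real `argo list -n argo` output, since no cluster is assumed here:

```shell
# Filter an `argo list`-style table for anything that didn't succeed.
# The workflow names and statuses below are illustrative sample data.
cat <<'EOF' | awk 'NR > 1 && $2 != "Succeeded" {print $1, $2}'
NAME        STATUS     AGE
wf-build    Succeeded  5m
wf-deploy   Failed     3m
wf-test     Error      1m
EOF
```

Against a live cluster, pipe `argo list -n argo` into the same `awk` filter.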

## Fix 1: Workflow Template Syntax Errors

The workflow YAML has validation errors.

Symptoms:
- "workflow is invalid"
- Template not found
- Admission webhook rejection

Diagnosis:

```bash
# Validate workflow before submission
argo lint workflow.yaml

# Check workflow template exists
kubectl get workflowtemplates -n argo

# Check cluster workflow templates
kubectl get clusterworkflowtemplates
```

Solution A: Fix YAML syntax:

```yaml
# Common issues

# WRONG - entrypoint and templates are not nested under spec
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-workflow
entrypoint: main      # spec: is required at the top level
templates:
- name: main
  container:
    image: alpine
    command: [echo, hello]

# CORRECT - complete spec
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-workflow
spec:
  entrypoint: main
  templates:
  - name: main
    container:
      image: alpine:3.18
      command: [echo]
      args: [hello]
```

Solution B: Reference templates correctly:

```yaml
# Using WorkflowTemplate reference
spec:
  entrypoint: main
  workflowTemplateRef:
    name: my-template  # Must exist as WorkflowTemplate
```

Solution C: Fix template names:

```yaml
spec:
  entrypoint: main-steps  # Must match a template name below
  templates:
  - name: main-steps  # Matching name
    steps:
    - - name: step1
        template: hello  # Must match another template
  - name: hello
    container:
      image: alpine
      command: [echo, hello]
```
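`argo lint` catches mismatched names, but when the CLI isn't installed the same check can be sketched with plain shell. The file path and the simplified field layout below are assumptions for illustration only:

```shell
# Sanity-check that spec.entrypoint names a template that actually exists.
# Sample workflow fragment (hypothetical) written to a temp file:
cat > /tmp/wf-check.yaml <<'EOF'
spec:
  entrypoint: main-steps
  templates:
  - name: main-steps
  - name: hello
EOF

entry=$(awk '/entrypoint:/ {print $2}' /tmp/wf-check.yaml)
if grep -q "name: $entry" /tmp/wf-check.yaml; then
  echo "entrypoint '$entry' resolves"
else
  echo "entrypoint '$entry' has no matching template"
fi
```

This is only a rough grep-level check; prefer `argo lint` for real validation.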

## Fix 2: Pod Execution Failures

Workflow pods fail to run.

Symptoms:
- Pod shows "Error" or "Failed"
- Container exits with non-zero code
- OOMKilled status

Diagnosis:

```bash
# Get workflow details
argo get my-workflow -n argo

# Find workflow pods
kubectl get pods -n argo -l workflows.argoproj.io/workflow=my-workflow

# Check pod logs
kubectl logs my-workflow-pod-123 -n argo

# Check pod status
kubectl describe pod my-workflow-pod-123 -n argo

# Check specific step container
kubectl logs my-workflow-pod-123 -c main -n argo
```

Solution A: Fix container command:

```yaml
templates:
- name: build
  container:
    image: node:20
    command: [sh, -c]  # Use shell for complex commands
    args:
    - |
      npm install
      npm run build
```

Solution B: Fix failing script:

```yaml
# Script template with error handling
templates:
- name: deploy
  script:
    image: python:3.11
    command: [python]  # script templates require a command
    source: |
      import subprocess
      try:
          subprocess.run(['deploy.sh'], check=True)
      except subprocess.CalledProcessError as e:
          print(f"Deploy failed: {e}")
          raise
```

Solution C: Check exit codes:

```bash
# In pod logs, look for:
# "exit code 127" - command not found
# "exit code 1"   - general failure
# "exit code 137" - container killed by SIGKILL (commonly OOMKilled)
# "exit code 139" - segmentation fault
```

For OOM, increase resources:

```yaml
templates:
- name: memory-intensive
  container:
    image: node:20
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "2Gi"
```
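These exit codes can be reproduced locally without a cluster, which helps confirm what a code in the pod status actually means:

```shell
# Reproduce the common exit codes on any POSIX shell
sh -c 'no-such-command' 2>/dev/null; echo "missing command -> $?"   # 127
sh -c 'exit 1';                      echo "general failure -> $?"   # 1
sh -c 'kill -9 $$'      2>/dev/null; echo "SIGKILL (OOM)   -> $?"   # 137 = 128 + signal 9
```

The 128 + signal-number convention is why OOMKilled pods report 137 (SIGKILL) and segfaults report 139 (SIGSEGV).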

## Fix 3: Resource Constraints and Scheduling

Pods can't be scheduled or crash from resource limits.

Symptoms:
- "Insufficient cpu"
- "Insufficient memory"
- "0/1 nodes are available"
- OOMKilled status

Diagnosis:

```bash
# Check node resources
kubectl describe nodes

# Check pending pod events
kubectl describe pod pending-pod -n argo

# Check workflow resource requests
argo get my-workflow -n argo
```

Solution A: Set resource requests:

```yaml
templates:
- name: build
  container:
    image: node:20
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
```

Solution B: Patch pod specs at the workflow level:

```yaml
# Workflow-level resource limits via podSpecPatch
spec:
  podSpecPatch: |
    containers:
    - name: main
      resources:
        limits:
          memory: "4Gi"
```

```yaml
# Or set workflow-level cleanup and deadline defaults
spec:
  podGC:
    strategy: OnWorkflowSuccess
  activeDeadlineSeconds: 3600
```

Solution C: Configure node selectors:

```yaml
templates:
- name: gpu-task
  container:
    image: tensorflow/tensorflow:latest-gpu
  nodeSelector:
    accelerator: nvidia-tesla-k80
```

## Fix 4: Workflow Timeout Issues

Workflows or steps exceed time limits.

Symptoms:
- "workflow exceeded activeDeadlineSeconds"
- Step timeout
- Workflow stuck in Running state

Solution A: Set workflow timeout:

```yaml
spec:
  activeDeadlineSeconds: 3600  # 1 hour max for the whole workflow

  # For individual steps, set the deadline at the template level
  templates:
  - name: long-task
    activeDeadlineSeconds: 300  # 5 minutes for this step
    container:
      image: alpine
```

Solution B: Handle stuck workflows:

```bash
# Stop stuck workflow (exit handlers still run)
argo stop my-workflow -n argo

# Terminate stuck workflow immediately (skips exit handlers)
argo terminate my-workflow -n argo

# Delete workflow
argo delete my-workflow -n argo
```

Solution C: Configure retry:

```yaml
templates:
- name: flaky-task
  container:
    image: alpine
    command: [sh, -c, "wget -qO- https://flaky-api.com"]  # busybox wget; alpine has no curl
  retryStrategy:
    limit: 3  # Retry 3 times
    backoff:
      duration: "5s"
      factor: 2
      maxDuration: "1m"
```
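With `duration: "5s"` and `factor: 2`, the wait between attempts doubles each time, capped by `maxDuration`. A quick sketch of the schedule this strategy produces:

```shell
# Delays produced by duration=5s, factor=2, limit=3 (all under the 1m cap)
delay=5
for attempt in 1 2 3; do
  echo "attempt $attempt: wait ${delay}s before retrying"
  delay=$((delay * 2))
done
```

If the doubled delay ever exceeded `maxDuration`, Argo would wait `maxDuration` instead.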

## Fix 5: Artifact Handling Failures

Artifacts fail to upload or download.

Symptoms:
- "failed to save artifact"
- "failed to load artifact"
- Artifact not found

Solution A: Configure artifact repository:

```yaml
# In workflow-controller-configmap
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  artifactRepository: |
    s3:
      bucket: my-bucket
      endpoint: s3.amazonaws.com
      accessKeySecret:
        name: aws-credentials
        key: accessKey
      secretKeySecret:
        name: aws-credentials
        key: secretKey
```

Solution B: Define input/output artifacts:

```yaml
templates:
- name: generate-artifact
  container:
    image: alpine
    command: [sh, -c]
    args:
    - |
      echo "artifact content" > /tmp/result.txt
  outputs:
    artifacts:
    - name: result
      path: /tmp/result.txt
      s3:
        bucket: my-bucket
        key: result.txt

- name: consume-artifact
  inputs:
    artifacts:
    - name: result
      path: /tmp/input.txt
      s3:
        bucket: my-bucket
        key: result.txt
```

Solution C: Use inline artifacts:

```yaml
templates:
- name: pass-data
  container:
    image: alpine
    command: [sh, -c]
    args:
    - |
      cat <<EOF > /tmp/config.yaml
      key: value
      EOF
  outputs:
    artifacts:
    - name: config
      path: /tmp/config.yaml
```

## Fix 6: Parameter and Input Issues

Parameters don't pass correctly between steps.

Symptoms:
- `{{inputs.parameters.xxx}}` not resolved
- Parameter value empty or wrong
- Template rendering errors

Solution A: Define parameters correctly:

```yaml
spec:
  entrypoint: main
  arguments:
    parameters:
    - name: message
      value: "hello world"

  templates:
  - name: main
    inputs:
      parameters:
      - name: message
    container:
      image: alpine
      command: [echo]
      args: ["{{inputs.parameters.message}}"]
```

Solution B: Pass parameters between templates:

```yaml
templates:
- name: main
  steps:
  - - name: generate
      template: generate-value
  - - name: use-value
      template: print-value
      arguments:
        parameters:
        - name: value
          value: "{{steps.generate.outputs.parameters.result}}"

- name: generate-value
  container:
    image: alpine
    command: [sh, -c]
    args: ["echo 'generated-value' > /tmp/result"]
  outputs:
    parameters:
    - name: result
      valueFrom:
        path: /tmp/result

- name: print-value
  inputs:
    parameters:
    - name: value
  container:
    image: alpine
    command: [echo]
    args: ["{{inputs.parameters.value}}"]
```

Note the reference is `{{steps.generate.outputs.parameters.result}}`; the shorter `{{steps.generate.outputs.result}}` only works for script templates, which capture stdout as `result` automatically.
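What `valueFrom.path` does under the hood is simple: the step writes a file, and the controller reads it back as the parameter value. A local sketch of that handoff; the `/tmp/argo-demo` path is purely illustrative:

```shell
mkdir -p /tmp/argo-demo

# The generate-value step writes its result to a file...
echo 'generated-value' > /tmp/argo-demo/result

# ...and valueFrom.path amounts to this read, whose contents become
# the output parameter that downstream steps receive.
value=$(cat /tmp/argo-demo/result)
echo "passed parameter: $value"
```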

Solution C: Use global parameters:

```yaml
spec:
  entrypoint: use-global
  arguments:
    parameters:
    - name: global-value
      value: "shared"

  templates:
  - name: use-global
    container:
      image: alpine
      command: [echo]
      args: ["{{workflow.parameters.global-value}}"]
```

## Fix 7: DAG and Steps Execution Errors

Workflow DAG or steps don't execute correctly.

Symptoms:
- Steps run in wrong order
- Dependencies not respected
- DAG validation errors

Solution A: Fix DAG dependencies:

```yaml
templates:
- name: dag-workflow
  dag:
    tasks:
    - name: task-a
      template: process
    - name: task-b
      template: process
      dependencies: [task-a]  # task-a must complete before task-b
    - name: task-c
      template: process
      dependencies: [task-a, task-b]
```

Solution B: Fix steps sequence:

```yaml
templates:
- name: steps-workflow
  steps:
  - - name: step-1a  # Steps in the same inner list run in parallel
      template: process
    - name: step-1b
      template: process
  - - name: step-2   # A new inner list runs after the previous group finishes
      template: process
```

Steps groups are sequential by position; `dependencies` is only valid in DAG templates, so step-2 already waits for both step-1a and step-1b.

Solution C: Handle task failures:

```yaml
templates:
- name: dag-with-fallback
  dag:
    tasks:
    - name: risky-task
      template: risky-process
      continueOn:
        failed: true  # Continue even if this fails
    - name: fallback
      template: fallback-process
      dependencies: [risky-task]
```
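The semantics of `continueOn.failed` mirror the shell's `||`: the failure is absorbed and the dependent task still runs. A local sketch of that control flow:

```shell
# risky-task fails, but the failure is absorbed rather than aborting the flow
sh -c 'exit 1' || echo "risky-task failed, continuing"

# fallback still runs because its dependency completed (even unsuccessfully)
echo "fallback ran"
```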

## Fix 8: Service Account and RBAC Issues

Workflows fail due to permission errors.

Symptoms:
- "cannot create resource"
- "User cannot list resource"
- RBAC denied errors

Solution A: Create workflow service account:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: workflow-sa
  namespace: argo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-role
  namespace: argo
rules:
- apiGroups: [""]
  resources: [pods, pods/log]
  verbs: [create, get, list, watch, delete]
- apiGroups: [argoproj.io]
  resources: [workflows, workflowtemplates]
  verbs: [create, get, list, watch, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-binding
  namespace: argo
subjects:
- kind: ServiceAccount
  name: workflow-sa
  namespace: argo
roleRef:
  kind: Role
  name: workflow-role
  apiGroup: rbac.authorization.k8s.io
```

Solution B: Reference service account in workflow:

```yaml
spec:
  serviceAccountName: workflow-sa
  entrypoint: main
```

Solution C: Check RBAC permissions:

```bash
# Test permissions
kubectl auth can-i create pods --as=system:serviceaccount:argo:workflow-sa -n argo
kubectl auth can-i list workflows --as=system:serviceaccount:argo:workflow-sa -n argo
```

## Fix 9: Volume and PVC Issues

Workflows fail to mount volumes.

Symptoms:
- "PersistentVolumeClaim not found"
- "FailedMount"
- Volume mount timeout

Solution A: Create PVC for workflow:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workflow-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  entrypoint: use-volume
  volumes:
  - name: workdir
    persistentVolumeClaim:
      claimName: workflow-pvc
  templates:
  - name: use-volume
    container:
      image: alpine
      volumeMounts:
      - name: workdir
        mountPath: /workdir
```

Solution B: Use existing PV:

```yaml
spec:
  volumes:
  - name: shared-data
    persistentVolumeClaim:
      claimName: existing-pvc
```

Solution C: Use ephemeral volume:

```yaml
spec:
  volumes:
  - name: cache-volume
    emptyDir:
      sizeLimit: 500Mi
```

## Fix 10: Parallelism and Resource Exhaustion

Too many parallel tasks exhaust resources.

Symptoms:
- Cluster overloaded
- Pods pending
- Slow execution

Solution A: Limit parallelism:

```yaml
spec:
  parallelism: 5  # Max 5 concurrent tasks

  templates:
  - name: many-tasks
    dag:
      tasks:
      - name: task-{{item}}
        template: process
        withParam: "[1,2,3,4,5,6,7,8,9,10]"
```
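The effect of `parallelism: 5` over ten `withParam` items can be approximated locally with `xargs -P`, which caps concurrent processes the same way: all items eventually run, but never more than five at once.

```shell
# Fan out 10 items with at most 5 running concurrently;
# wc -l confirms all ten items complete.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 \
  | xargs -P 5 -I{} sh -c 'echo "processed {}"' \
  | wc -l
```

As in Argo, the completion order of the parallel items is not guaranteed; only the concurrency cap and total count are.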

Solution B: Clean up completed pods and workflows:

```yaml
spec:
  podGC:
    strategy: OnWorkflowSuccess  # Delete pods once the workflow succeeds
  ttlStrategy:
    secondsAfterCompletion: 300  # Delete the workflow 5 minutes after it finishes
```

Solution C: Distribute across nodes:

```yaml
templates:
- name: distributed-task
  container:
    image: alpine
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: workflows.argoproj.io/workflow
              operator: In
              values:
              - my-workflow
          topologyKey: kubernetes.io/hostname
```

## Quick Reference: Argo Errors

| Error | Cause | Solution |
|---|---|---|
| Template invalid | YAML syntax | Use `argo lint` |
| Pod failed | Container error | Check logs, fix command |
| Insufficient resources | Node limits | Set resources, add nodes |
| Workflow timeout | Deadline exceeded | Increase `activeDeadlineSeconds` |
| Artifact failed | S3/config issue | Configure artifact repository |
| Parameter empty | Wrong syntax | Use `{{inputs.parameters.xxx}}` |
| RBAC denied | Missing permissions | Create service account, role |
| Volume mount failed | PVC missing | Create PVC, reference correctly |

## Debugging Commands

```bash
# Validate workflow
argo lint workflow.yaml

# Submit workflow
argo submit workflow.yaml -n argo

# Get workflow status
argo get my-workflow -n argo

# Watch workflow
argo watch my-workflow -n argo

# Get workflow logs
argo logs my-workflow -n argo

# List workflow pods
kubectl get pods -n argo -l workflows.argoproj.io/workflow=my-workflow

# Describe pod
kubectl describe pod pod-name -n argo

# Get pod container logs
kubectl logs pod-name -c main -n argo

# Stop workflow
argo stop my-workflow -n argo

# Delete workflow
argo delete my-workflow -n argo

# List workflows
argo list -n argo
```