# Argo Workflow Failed: Complete Troubleshooting Guide

Argo Workflows is a Kubernetes-native workflow orchestrator. When workflows fail, debugging requires understanding both Argo's workflow definitions and Kubernetes pod behavior. Failures can stem from template issues, resource constraints, pod execution problems, or cluster-level errors.

Let me walk through the most common Argo Workflow failures and how to fix each one.

## Understanding Argo Workflow States

Workflow states indicate where problems might be:

| State | Meaning | Common Cause |
|---|---|---|
| Pending | Waiting to start | Resource constraints, admission issues |
| Running | Currently executing | Normal state |
| Succeeded | Completed successfully | None |
| Failed | At least one step failed | Task error, pod crash |
| Error | Workflow couldn't run | Template error, infrastructure issue |
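To see at a glance which workflows need attention, the STATUS column of `argo list` output can be filtered with standard shell tools. A minimal sketch; the heredoc below is sample data standing in for real `argo list -n argo` output, since no cluster is assumed here:

```shell
# Filter an `argo list`-style table for anything that didn't succeed.
# The workflow names and statuses below are illustrative sample data.
cat <<'EOF' | awk 'NR > 1 && $2 != "Succeeded" {print $1, $2}'
NAME        STATUS     AGE
wf-build    Succeeded  5m
wf-deploy   Failed     3m
wf-test     Error      1m
EOF
```

Against a live cluster, pipe `argo list -n argo` into the same `awk` filter.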

## Fix 1: Workflow Template Syntax Errors

The workflow YAML has validation errors.

Symptoms:
- "workflow is invalid"
- Template not found
- Admission webhook rejection

Diagnosis:

```bash
# Validate workflow before submission
argo lint workflow.yaml

# Check workflow template exists
kubectl get workflowtemplates -n argo

# Check cluster workflow templates
kubectl get clusterworkflowtemplates
```

Solution A: Fix YAML syntax:

```yaml
# Common issues

# WRONG - entrypoint and templates are not nested under spec
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-workflow
entrypoint: main      # spec: is required at the top level
templates:
- name: main
  container:
    image: alpine
    command: [echo, hello]

# CORRECT - complete spec
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-workflow
spec:
  entrypoint: main
  templates:
  - name: main
    container:
      image: alpine:3.18
      command: [echo]
      args: [hello]
```

Solution B: Reference templates correctly:

```yaml
# Using WorkflowTemplate reference
spec:
  entrypoint: main
  workflowTemplateRef:
    name: my-template  # Must exist as WorkflowTemplate
```

Solution C: Fix template names:

```yaml
spec:
  entrypoint: main-steps  # Must match a template name below
  templates:
  - name: main-steps  # Matching name
    steps:
    - - name: step1
        template: hello  # Must match another template
  - name: hello
    container:
      image: alpine
      command: [echo, hello]
```
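`argo lint` catches mismatched names, but when the CLI isn't installed the same check can be sketched with plain shell. The file path and the simplified field layout below are assumptions for illustration only:

```shell
# Sanity-check that spec.entrypoint names a template that actually exists.
# Sample workflow fragment (hypothetical) written to a temp file:
cat > /tmp/wf-check.yaml <<'EOF'
spec:
  entrypoint: main-steps
  templates:
  - name: main-steps
  - name: hello
EOF

entry=$(awk '/entrypoint:/ {print $2}' /tmp/wf-check.yaml)
if grep -q "name: $entry" /tmp/wf-check.yaml; then
  echo "entrypoint '$entry' resolves"
else
  echo "entrypoint '$entry' has no matching template"
fi
```

This is only a rough grep-level check; prefer `argo lint` for real validation.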

## Fix 2: Pod Execution Failures

Workflow pods fail to run.

Symptoms:
- Pod shows "Error" or "Failed"
- Container exits with non-zero code
- OOMKilled status

Diagnosis:

```bash
# Get workflow details
argo get my-workflow -n argo

# Find workflow pods
kubectl get pods -n argo -l workflows.argoproj.io/workflow=my-workflow

# Check pod logs
kubectl logs my-workflow-pod-123 -n argo

# Check pod status
kubectl describe pod my-workflow-pod-123 -n argo

# Check specific step container
kubectl logs my-workflow-pod-123 -c main -n argo
```

Solution A: Fix container command:

```yaml
templates:
- name: build
  container:
    image: node:20
    command: [sh, -c]  # Use shell for complex commands
    args:
    - |
      npm install
      npm run build
```

Solution B: Fix failing script:

```yaml
# Script template with error handling
templates:
- name: deploy
  script:
    image: python:3.11
    command: [python]  # script templates require a command
    source: |
      import subprocess
      try:
          subprocess.run(['deploy.sh'], check=True)
      except subprocess.CalledProcessError as e:
          print(f"Deploy failed: {e}")
          raise
```

Solution C: Check exit codes:

```bash
# In pod logs, look for:
# "exit code 127" - command not found
# "exit code 1"   - general failure
# "exit code 137" - container killed by SIGKILL (commonly OOMKilled)
# "exit code 139" - segmentation fault
```

For OOM, increase resources:

```yaml
templates:
- name: memory-intensive
  container:
    image: node:20
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "2Gi"
```
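These exit codes can be reproduced locally without a cluster, which helps confirm what a code in the pod status actually means:

```shell
# Reproduce the common exit codes on any POSIX shell
sh -c 'no-such-command' 2>/dev/null; echo "missing command -> $?"   # 127
sh -c 'exit 1';                      echo "general failure -> $?"   # 1
sh -c 'kill -9 $$'      2>/dev/null; echo "SIGKILL (OOM)   -> $?"   # 137 = 128 + signal 9
```

The 128 + signal-number convention is why OOMKilled pods report 137 (SIGKILL) and segfaults report 139 (SIGSEGV).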

## Fix 3: Resource Constraints and Scheduling

Pods can't be scheduled or crash from resource limits.

Symptoms:
- "Insufficient cpu"
- "Insufficient memory"
- "0/1 nodes are available"
- OOMKilled status

Diagnosis:

```bash
# Check node resources
kubectl describe nodes

# Check pending pod events
kubectl describe pod pending-pod -n argo

# Check workflow resource requests
argo get my-workflow -n argo
```

Solution A: Set resource requests:

```yaml
templates:
- name: build
  container:
    image: node:20
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
```

Solution B: Patch pod specs at the workflow level:

```yaml
# Workflow-level resource limits via podSpecPatch
spec:
  podSpecPatch: |
    containers:
    - name: main
      resources:
        limits:
          memory: "4Gi"
```

```yaml
# Or set workflow-level cleanup and deadline defaults
spec:
  podGC:
    strategy: OnWorkflowSuccess
  activeDeadlineSeconds: 3600
```

Solution C: Configure node selectors:

```yaml
templates:
- name: gpu-task
  container:
    image: tensorflow/tensorflow:latest-gpu
  nodeSelector:
    accelerator: nvidia-tesla-k80
```

## Fix 4: Workflow Timeout Issues

Workflows or steps exceed time limits.

Symptoms:
- "workflow exceeded activeDeadlineSeconds"
- Step timeout
- Workflow stuck in Running state

Solution A: Set workflow timeout:

```yaml
spec:
  activeDeadlineSeconds: 3600  # 1 hour max for the whole workflow

  # For individual steps, set the deadline at the template level
  templates:
  - name: long-task
    activeDeadlineSeconds: 300  # 5 minutes for this step
    container:
      image: alpine
```

Solution B: Handle stuck workflows:

```bash
# Stop stuck workflow (exit handlers still run)
argo stop my-workflow -n argo

# Terminate stuck workflow immediately (skips exit handlers)
argo terminate my-workflow -n argo

# Delete workflow
argo delete my-workflow -n argo
```

Solution C: Configure retry:

```yaml
templates:
- name: flaky-task
  container:
    image: alpine
    command: [sh, -c, "wget -qO- https://flaky-api.com"]  # busybox wget; alpine has no curl
  retryStrategy:
    limit: 3  # Retry 3 times
    backoff:
      duration: "5s"
      factor: 2
      maxDuration: "1m"
```
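With `duration: "5s"` and `factor: 2`, the wait between attempts doubles each time, capped by `maxDuration`. A quick sketch of the schedule this strategy produces:

```shell
# Delays produced by duration=5s, factor=2, limit=3 (all under the 1m cap)
delay=5
for attempt in 1 2 3; do
  echo "attempt $attempt: wait ${delay}s before retrying"
  delay=$((delay * 2))
done
```

If the doubled delay ever exceeded `maxDuration`, Argo would wait `maxDuration` instead.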

## Fix 5: Artifact Handling Failures

Artifacts fail to upload or download.

Symptoms:
- "failed to save artifact"
- "failed to load artifact"
- Artifact not found

Solution A: Configure artifact repository:

```yaml
# In workflow-controller-configmap
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  artifactRepository: |
    s3:
      bucket: my-bucket
      endpoint: s3.amazonaws.com
      accessKeySecret:
        name: aws-credentials
        key: accessKey
      secretKeySecret:
        name: aws-credentials
        key: secretKey
```

Solution B: Define input/output artifacts:

```yaml
templates:
- name: generate-artifact
  container:
    image: alpine
    command: [sh, -c]
    args:
    - |
      echo "artifact content" > /tmp/result.txt
  outputs:
    artifacts:
    - name: result
      path: /tmp/result.txt
      s3:
        bucket: my-bucket
        key: result.txt

- name: consume-artifact
  inputs:
    artifacts:
    - name: result
      path: /tmp/input.txt
      s3:
        bucket: my-bucket
        key: result.txt
```

Solution C: Use inline artifacts:

```yaml
templates:
- name: pass-data
  container:
    image: alpine
    command: [sh, -c]
    args:
    - |
      cat <<EOF > /tmp/config.yaml
      key: value
      EOF
  outputs:
    artifacts:
    - name: config
      path: /tmp/config.yaml
```

## Fix 6: Parameter and Input Issues

Parameters don't pass correctly between steps.

Symptoms:
- `{{inputs.parameters.xxx}}` not resolved
- Parameter value empty or wrong
- Template rendering errors

Solution A: Define parameters correctly:

```yaml
spec:
  entrypoint: main
  arguments:
    parameters:
    - name: message
      value: "hello world"

  templates:
  - name: main
    inputs:
      parameters:
      - name: message
    container:
      image: alpine
      command: [echo]
      args: ["{{inputs.parameters.message}}"]
```

Solution B: Pass parameters between templates:

```yaml
templates:
- name: main
  steps:
  - - name: generate
      template: generate-value
  - - name: use-value
      template: print-value
      arguments:
        parameters:
        - name: value
          value: "{{steps.generate.outputs.parameters.result}}"

- name: generate-value
  container:
    image: alpine
    command: [sh, -c]
    args: ["echo 'generated-value' > /tmp/result"]
  outputs:
    parameters:
    - name: result
      valueFrom:
        path: /tmp/result

- name: print-value
  inputs:
    parameters:
    - name: value
  container:
    image: alpine
    command: [echo]
    args: ["{{inputs.parameters.value}}"]
```

Note the reference is `{{steps.generate.outputs.parameters.result}}`; the shorter `{{steps.generate.outputs.result}}` only works for script templates, which capture stdout as `result` automatically.
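What `valueFrom.path` does under the hood is simple: the step writes a file, and the controller reads it back as the parameter value. A local sketch of that handoff; the `/tmp/argo-demo` path is purely illustrative:

```shell
mkdir -p /tmp/argo-demo

# The generate-value step writes its result to a file...
echo 'generated-value' > /tmp/argo-demo/result

# ...and valueFrom.path amounts to this read, whose contents become
# the output parameter that downstream steps receive.
value=$(cat /tmp/argo-demo/result)
echo "passed parameter: $value"
```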

Solution C: Use global parameters:

```yaml
spec:
  entrypoint: use-global
  arguments:
    parameters:
    - name: global-value
      value: "shared"

  templates:
  - name: use-global
    container:
      image: alpine
      command: [echo]
      args: ["{{workflow.parameters.global-value}}"]
```

## Fix 7: DAG and Steps Execution Errors

Workflow DAG or steps don't execute correctly.

Symptoms:
- Steps run in wrong order
- Dependencies not respected
- DAG validation errors

Solution A: Fix DAG dependencies:

```yaml
templates:
- name: dag-workflow
  dag:
    tasks:
    - name: task-a
      template: process
    - name: task-b
      template: process
      dependencies: [task-a]  # task-a must complete before task-b
    - name: task-c
      template: process
      dependencies: [task-a, task-b]
```

Solution B: Fix steps sequence:

```yaml
templates:
- name: steps-workflow
  steps:
  - - name: step-1a  # Steps in the same inner list run in parallel
      template: process
    - name: step-1b
      template: process
  - - name: step-2   # A new inner list runs after the previous group finishes
      template: process
```

Steps groups are sequential by position; `dependencies` is only valid in DAG templates, so step-2 already waits for both step-1a and step-1b.

Solution C: Handle task failures:

```yaml
templates:
- name: dag-with-fallback
  dag:
    tasks:
    - name: risky-task
      template: risky-process
      continueOn:
        failed: true  # Continue even if this fails
    - name: fallback
      template: fallback-process
      dependencies: [risky-task]
```
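The semantics of `continueOn.failed` mirror the shell's `||`: the failure is absorbed and the dependent task still runs. A local sketch of that control flow:

```shell
# risky-task fails, but the failure is absorbed rather than aborting the flow
sh -c 'exit 1' || echo "risky-task failed, continuing"

# fallback still runs because its dependency completed (even unsuccessfully)
echo "fallback ran"
```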

## Fix 8: Service Account and RBAC Issues

Workflows fail due to permission errors.

Symptoms:
- "cannot create resource"
- "User cannot list resource"
- RBAC denied errors

Solution A: Create workflow service account:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: workflow-sa
  namespace: argo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-role
  namespace: argo
rules:
- apiGroups: [""]
  resources: [pods, pods/log]
  verbs: [create, get, list, watch, delete]
- apiGroups: [argoproj.io]
  resources: [workflows, workflowtemplates]
  verbs: [create, get, list, watch, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-binding
  namespace: argo
subjects:
- kind: ServiceAccount
  name: workflow-sa
  namespace: argo
roleRef:
  kind: Role
  name: workflow-role
  apiGroup: rbac.authorization.k8s.io
```

Solution B: Reference service account in workflow:

```yaml
spec:
  serviceAccountName: workflow-sa
  entrypoint: main
```

Solution C: Check RBAC permissions:

```bash
# Test permissions
kubectl auth can-i create pods --as=system:serviceaccount:argo:workflow-sa -n argo
kubectl auth can-i list workflows --as=system:serviceaccount:argo:workflow-sa -n argo
```

## Fix 9: Volume and PVC Issues

Workflows fail to mount volumes.

Symptoms:
- "PersistentVolumeClaim not found"
- "FailedMount"
- Volume mount timeout

Solution A: Create PVC for workflow:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workflow-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  entrypoint: use-volume
  volumes:
  - name: workdir
    persistentVolumeClaim:
      claimName: workflow-pvc
  templates:
  - name: use-volume
    container:
      image: alpine
      volumeMounts:
      - name: workdir
        mountPath: /workdir
```

Solution B: Use existing PV:

```yaml
spec:
  volumes:
  - name: shared-data
    persistentVolumeClaim:
      claimName: existing-pvc
```

Solution C: Use ephemeral volume:

```yaml
spec:
  volumes:
  - name: cache-volume
    emptyDir:
      sizeLimit: 500Mi
```

## Fix 10: Parallelism and Resource Exhaustion

Too many parallel tasks exhaust resources.

Symptoms:
- Cluster overloaded
- Pods pending
- Slow execution

Solution A: Limit parallelism:

```yaml
spec:
  parallelism: 5  # Max 5 concurrent tasks

  templates:
  - name: many-tasks
    dag:
      tasks:
      - name: task-{{item}}
        template: process
        withParam: "[1,2,3,4,5,6,7,8,9,10]"
```
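The effect of `parallelism: 5` over ten `withParam` items can be approximated locally with `xargs -P`, which caps concurrent processes the same way: all items eventually run, but never more than five at once.

```shell
# Fan out 10 items with at most 5 running concurrently;
# wc -l confirms all ten items complete.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 \
  | xargs -P 5 -I{} sh -c 'echo "processed {}"' \
  | wc -l
```

As in Argo, the completion order of the parallel items is not guaranteed; only the concurrency cap and total count are.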

Solution B: Clean up completed pods and workflows:

```yaml
spec:
  podGC:
    strategy: OnWorkflowSuccess  # Delete pods once the workflow succeeds
  ttlStrategy:
    secondsAfterCompletion: 300  # Delete the workflow 5 minutes after it finishes
```

Solution C: Distribute across nodes:

```yaml
templates:
- name: distributed-task
  container:
    image: alpine
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: workflows.argoproj.io/workflow
              operator: In
              values:
              - my-workflow
          topologyKey: kubernetes.io/hostname
```

## Quick Reference: Argo Errors

| Error | Cause | Solution |
|---|---|---|
| Template invalid | YAML syntax | Use `argo lint` |
| Pod failed | Container error | Check logs, fix command |
| Insufficient resources | Node limits | Set resources, add nodes |
| Workflow timeout | Deadline exceeded | Increase `activeDeadlineSeconds` |
| Artifact failed | S3/config issue | Configure artifact repository |
| Parameter empty | Wrong syntax | Use `{{inputs.parameters.xxx}}` |
| RBAC denied | Missing permissions | Create service account, role |
| Volume mount failed | PVC missing | Create PVC, reference correctly |

## Debugging Commands

```bash
# Validate workflow
argo lint workflow.yaml

# Submit workflow
argo submit workflow.yaml -n argo

# Get workflow status
argo get my-workflow -n argo

# Watch workflow
argo watch my-workflow -n argo

# Get workflow logs
argo logs my-workflow -n argo

# List workflow pods
kubectl get pods -n argo -l workflows.argoproj.io/workflow=my-workflow

# Describe pod
kubectl describe pod pod-name -n argo

# Get pod container logs
kubectl logs pod-name -c main -n argo

# Stop workflow
argo stop my-workflow -n argo

# Delete workflow
argo delete my-workflow -n argo

# List workflows
argo list -n argo
```