Fix CI/CD Pipeline Failures - Complete Deep Dive Guide

Introduction

CI/CD pipeline failures occur when automated build, test, or deployment workflows break because of configuration errors, resource constraints, dependency issues, authentication failures, or infrastructure problems. Modern CI/CD platforms (GitHub Actions, GitLab CI, Jenkins, CircleCI, and others) orchestrate workflows across multiple stages, and each stage is a potential failure point.

Common causes include unavailable or misconfigured runners and agents, package dependencies that fail to install, test failures that block deployment, artifact upload or download failures, jobs that exceed their timeout, expired or invalid secrets, exhausted disk space on runners, concurrent job limits, branch protection rules that block deployment, and environment-specific configuration mismatches.

Fixing them requires understanding the CI/CD architecture, the workflow configuration, and the available debugging tools and recovery procedures. This guide covers troubleshooting for GitHub Actions, GitLab CI, and Jenkins.

Symptoms

Pipeline fails immediately with runner lost or agent unavailable
Error: Process completed with exit code 1
npm ERR! Could not resolve dependency
fatal: unable to access repository: SSL certificate problem
Error: No space left on device during build
Error: The operation was canceled (timeout)
403 Forbidden when pushing to registry
Error: Secrets are not available in pull requests from forks
Pipeline stuck in queued state indefinitely
Artifact expired or Artifact not found
Concurrency group canceled previous run
Deployment blocked by environment protection rules

Common Causes

Self-hosted runner or agent offline or crashed
GitHub Actions/GitLab CI service outage
package.json, requirements.txt, or Gemfile has conflicting versions
NPM/Maven/PyPI registry unavailable or rate limited
Test suite has flaky tests failing intermittently
Docker build exceeds the job time limit (for example, CircleCI cancels a step after 10 minutes without output by default)
Large artifacts exceeding storage limits
SSH keys, tokens, or certificates expired
Branch protection requiring reviews before merge
Environment variables or secrets not configured
Workspace/disk space exhausted on runner
Parallel job conflicts (database locks, port conflicts)
Webhook delivery failures preventing pipeline trigger

Step-by-Step Fix

### 1. Diagnose pipeline failures

Check pipeline logs:

```bash
# GitHub Actions
# Navigate to: Repository > Actions > Workflow Run > Job
# Expand each step to see output

# Key sections:
# - Set up job: Runner assignment, workspace cleanup
# - Checkout: Repository clone status
# - Setup [language]: Runtime installation
# - Dependencies: Install status
# - Build/Test: Compilation and test output
# - Deploy: Deployment commands

# Print full logs with the GitHub CLI
gh run view <run-id> --log

# View recent runs
gh run list --limit 10

# Check specific run
gh run view <run-id>

# GitLab CI
# Navigate to: Pipeline > Job > Trace
# Or use CLI:
glab ci trace <job-name>

# Download job artifacts from the latest pipeline on a ref
glab job artifact <ref> <job-name>

# Jenkins
# Navigate to: Job > Build # > Console Output
# Or use API:
curl -u user:token http://jenkins.example.com/job/myjob/123/consoleText
```

Common error patterns:

```yaml
# GitHub Actions error patterns:

# Runner lost
# Error: The self-hosted runner: runner-1 lost communication with GitHub.
# Cause: Runner process crashed, network interruption
# Fix: Restart runner, check network connectivity

# Timeout
# Error: The operation was canceled.
# Cause: Job exceeded timeout-minutes
# Fix: Optimize slow steps, increase timeout

# Dependency failure
# npm ERR! ERESOLVE unable to resolve dependency tree
# Cause: Conflicting package versions
# Fix: Update package.json, use npm ci with lock file

# Artifact failure
# Error: Artifact not found - {name: 'build-output'}
# Cause: Artifact expired (default 90 days) or wrong name
# Fix: Use correct artifact name, download before expiration

# Secrets in fork PR
# Error: Secrets are not available in pull requests from fork repositories
# Cause: Security restriction
# Fix: Use workflow approval or avoid secrets in fork PR workflows
```
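The error patterns above lend themselves to a first-pass triage script. The following is a minimal sketch, assuming you have the job log as plain text (for example from `gh run view <run-id> --log`); the category names and the `classify_failure` helper are illustrative, not part of any CI platform.

```python
import re

# Regexes are based on the messages quoted in this guide; category names are hypothetical.
FAILURE_PATTERNS = [
    ("runner_lost", re.compile(r"lost communication|runner lost|agent unavailable", re.I)),
    ("timeout", re.compile(r"operation was canceled|timed out|timeout", re.I)),
    ("dependency", re.compile(r"ERESOLVE|could not resolve dependency", re.I)),
    ("disk_full", re.compile(r"no space left on device", re.I)),
    ("artifact", re.compile(r"artifact (not found|expired)", re.I)),
    ("fork_secrets", re.compile(r"secrets are not available in pull requests", re.I)),
]


def classify_failure(log_text):
    """Return the first matching failure category, or 'unknown'."""
    for category, pattern in FAILURE_PATTERNS:
        if pattern.search(log_text):
            return category
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("npm ERR! ERESOLVE unable to resolve dependency tree"))
```

A generic message such as `Process completed with exit code 1` falls through to `unknown`, which is a signal to read the step output directly.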

### 2. Fix GitHub Actions issues

Runner configuration:

```yaml
# .github/workflows/ci.yml

# Using GitHub-hosted runners
jobs:
  build:
    runs-on: ubuntu-latest  # or windows-latest, macos-latest
    # runs-on: ubuntu-22.04  # Specific version

    # Runner group (for organization runners)
    # runs-on:
    #   group: my-runner-group
    #   labels:
    #     - self-hosted
    #     - linux

    # Job-level timeout (default: 360 minutes / 6 hours)
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      # Let a non-critical step fail without failing the job
      # (continue-on-error does not retry the step)
      - name: Optional step
        run: ./optional-check.sh  # placeholder script name
        continue-on-error: true

      # Conditional execution
      - if: github.ref == 'refs/heads/main'
        run: ./deploy.sh
```

Self-hosted runner setup:

```bash
# Download and configure runner
# GitHub > Settings > Actions > Runners > New self-hosted runner

# On runner machine:
mkdir actions-runner && cd actions-runner
curl -O -L https://github.com/actions/runner/releases/download/v2.311.0/actions-runner-linux-x64-2.311.0.tar.gz
tar xzf ./actions-runner-linux-x64-2.311.0.tar.gz

# Configure
./config.sh --url https://github.com/org/repo --token TOKEN

# Start runner
./run.sh

# Run as service (systemd)
cat > /etc/systemd/system/actions-runner.service << 'EOF'
[Unit]
Description=GitHub Actions Runner
After=network.target

[Service]
Type=simple
User=runner
WorkingDirectory=/home/runner/actions-runner
ExecStart=/home/runner/actions-runner/run.sh
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable actions-runner
systemctl start actions-runner

# Check runner status
systemctl status actions-runner

# Troubleshoot runner
# Check runner logs
tail -f /home/runner/actions-runner/_diag/Runner_*.log

# Verify connectivity
curl -I https://github.com
curl -I https://pipelines.actions.githubusercontent.com
```

Caching dependencies:

```yaml
# Speed up builds and reduce failures with caching
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Node.js caching
      - name: Cache node modules
        uses: actions/cache@v4
        with:
          path: ~/.npm
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-

      # Python caching
      - name: Cache pip packages
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      # Docker layer caching
      - name: Cache Docker layers
        uses: actions/cache@v4
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-
```
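To see why the cache key embeds a hash of the lock file, the sketch below models the idea in plain Python. It is an illustration only: `cache_key` and `restore` are hypothetical helpers, SHA-256 over the raw bytes stands in for `hashFiles`, and the fallback rule (newest entry matching a restore prefix) is a simplified model of `restore-keys`, not the actual cache service logic.

```python
import hashlib
from typing import List, Optional


def cache_key(os_name: str, tool: str, lockfile: bytes) -> str:
    """Build a key like 'Linux-node-<sha256 of the lock file>'."""
    digest = hashlib.sha256(lockfile).hexdigest()
    return f"{os_name}-{tool}-{digest}"


def restore(stored_keys: List[str], key: str, restore_prefixes: List[str]) -> Optional[str]:
    """Pick the cache entry to restore, or None on a full miss.

    stored_keys is assumed to be ordered oldest to newest.
    """
    if key in stored_keys:
        return key  # exact hit: dependencies unchanged
    for prefix in restore_prefixes:
        matches = [k for k in stored_keys if k.startswith(prefix)]
        if matches:
            return matches[-1]  # partial hit: newest entry with this prefix
    return None


if __name__ == "__main__":
    old = cache_key("Linux", "node", b"lock file v1")
    new = cache_key("Linux", "node", b"lock file v2")
    # The lock file changed, so the exact key misses and the prefix falls back to the old cache.
    print(restore([old], new, ["Linux-node-"]) == old)
```

Any change to the lock file produces a new key, so a stale dependency cache is never treated as an exact hit, while the prefix fallback still gives the install step a warm starting point.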

Handle flaky tests:

```yaml
# Retry flaky tests
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Run tests with retry
      - name: Run tests with retry
        uses: nick-fields/retry@v2
        with:
          timeout_minutes: 10
          max_attempts: 3
          command: npm test

      # Or separate flaky tests
      - name: Run stable tests
        run: npm run test:stable

      - name: Run flaky tests (allowed to fail)
        run: npm run test:flaky
        continue-on-error: true
```
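Where a retry action is not available, the same behavior can be scripted. This is a minimal sketch of retry with exponential backoff; the `retry` helper and its parameters are illustrative names, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time


def retry(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func until it succeeds or max_attempts is reached.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises
    the last exception if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Fails on the first call, succeeds on the second.
        state["calls"] += 1
        if state["calls"] < 2:
            raise RuntimeError("transient failure")
        return "passed"

    print(retry(flaky, base_delay=0.1))
```

Retries hide flakiness rather than fix it, so it is worth logging how often a retry was needed.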

### 3. Fix GitLab CI issues

Runner configuration:

```yaml
# .gitlab-ci.yml

# Specify runner tags
build:
  tags:
    - docker
    - linux
  script:
    - docker build -t myapp .

# Using specific runner
staging-deploy:
  tags:
    - staging-runner
  script:
    - ./deploy.sh staging

# Timeout configuration
# (job renamed here because a job name can be defined only once per file)
build-with-timeout:
  timeout: 30m  # Job-level timeout
  script:
    - npm run build

# Retry configuration
test:
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
  script:
    - npm test
```

```bash
# Install gitlab-runner
# Debian/Ubuntu
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | bash
apt-get install gitlab-runner

# RHEL/CentOS
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.rpm.sh | bash
yum install gitlab-runner

# Register runner
gitlab-runner register

# Input:
# GitLab URL: https://gitlab.com/
# Registration token: Get from GitLab CI/CD settings
# Runner description: my-runner
# Tags: docker,linux
# Executor: docker

# Run as service
systemctl enable gitlab-runner
systemctl start gitlab-runner

# Check runner status
gitlab-runner status
gitlab-runner list

# Troubleshoot
journalctl -u gitlab-runner -f
```

Artifact management:

```yaml
# Configure artifacts
build:
  script:
    - npm run build
  artifacts:
    paths:
      - dist/
      - build/
    exclude:
      - dist/**/*.map
    expire_in: 1 week  # Default: 30 days
    name: "build-$CI_COMMIT_SHA"
    reports:
      junit: test-results.xml
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml

# Download artifacts from other jobs
deploy:
  needs:
    - build
  script:
    - ls -la dist/  # Build artifacts available
    - ./deploy.sh

# Download artifacts from the latest pipeline on a specific ref
deploy-production:
  script:
    # Assumes the glab CLI is already installed in the job image
    - glab job artifact main build
    - ./deploy.sh
```

### 4. Fix Jenkins pipeline issues

Declarative pipeline:

```groovy
// Jenkinsfile
pipeline {
    agent {
        // Use specific label
        label 'docker-agent'

        // Or use Kubernetes pod
        // kubernetes {
        //     yaml '''
        //     spec:
        //       containers:
        //       - name: node
        //         image: node:18
        //         command:
        //         - cat
        //         tty: true
        //     '''
        // }
    }

    options {
        timeout(time: 30, unit: 'MINUTES')
        disableConcurrentBuilds()  // One at a time
        timestamps()               // Add timestamps to logs
        retry(3)                   // Retry entire pipeline
    }

    environment {
        NODE_VERSION = '18'
        NPM_CONFIG_CACHE = "${WORKSPACE}/.npm-cache"
    }

    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }

        stage('Build') {
            steps {
                retry(2) {  // Retry these steps
                    sh 'npm ci'
                    sh 'npm run build'
                }
            }
        }

        stage('Test') {
            steps {
                // Continue even if tests fail
                catchError(buildResult: 'UNSTABLE', stageResult: 'UNSTABLE') {
                    sh 'npm test'
                }
            }
            post {
                always {
                    // Archive test results
                    junit 'test-results/*.xml'
                }
            }
        }

        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                sh './deploy.sh'
            }
        }
    }

    post {
        always {
            // Clean workspace
            cleanWs()
        }
        failure {
            // Notify on failure
            emailext subject: 'Build Failed',
                     body: "Check ${BUILD_URL}",
                     to: 'team@example.com'
        }
    }
}
```

Jenkins agent configuration:

```bash
# Check agent status
# Jenkins UI > Manage Jenkins > Nodes

# Agent connection issues:
# 1. Check agent JNLP connection
# 2. Verify agent.jar running
# 3. Check firewall allows JNLP port (default 50000)

# Restart agent
# On agent machine:
systemctl restart jenkins-agent

# Or relaunch from Jenkins UI
# Nodes > [agent] > Relaunch

# Agent disk space issues
# Check workspace
du -sh /var/jenkins/workspace/* | sort -hr | head -20

# Delete builds older than 30 days
# Groovy, run from Jenkins UI > Script Console (not a shell command):
#   Jenkins.instance.getAllItems(Job).each { job ->
#       job.builds.findAll { it.getTime() < new Date() - 30 }.each { it.delete() }
#   }

# Disk space threshold
# Manage Jenkins > Configure System > Node Properties
# Set: Delete workspace before build starts
```
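Disk exhaustion is easier to handle when a job fails fast before the build starts. The sketch below is one way to do that from a pre-build step; `has_free_space` and the 5 GB threshold are illustrative assumptions, and the `usage` parameter exists only so the check can be exercised without a real disk.

```python
import shutil
import sys


def has_free_space(path, min_free_gb, usage=shutil.disk_usage):
    """Return True if the filesystem holding path has at least min_free_gb free."""
    free_gb = usage(path).free / 1024 ** 3
    return free_gb >= min_free_gb


if __name__ == "__main__":
    # Hypothetical threshold; tune it to the size of your builds.
    if not has_free_space(".", min_free_gb=5):
        print("Not enough free disk space on this runner", file=sys.stderr)
        sys.exit(1)
```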

Jenkins credentials:

```groovy
// Access credentials in pipeline
pipeline {
    agent any
    stages {
        stage('Deploy') {
            steps {
                // Username/password
                withCredentials([usernamePassword(
                    credentialsId: 'docker-credentials',
                    usernameVariable: 'DOCKER_USER',
                    passwordVariable: 'DOCKER_PASS'
                )]) {
                    // Pass the password on stdin so it does not appear in the process list
                    sh 'echo "$DOCKER_PASS" | docker login -u "$DOCKER_USER" --password-stdin'
                }

                // SSH key
                withCredentials([sshUserPrivateKey(
                    credentialsId: 'deploy-key',
                    keyFileVariable: 'SSH_KEY'
                )]) {
                    sh 'ssh -i "$SSH_KEY" deploy@server "./deploy.sh"'
                }

                // Secret text (API token)
                withCredentials([string(
                    credentialsId: 'api-token',
                    variable: 'API_TOKEN'
                )]) {
                    sh './deploy.sh --token "$API_TOKEN"'
                }
            }
        }
    }
}

// Check credentials exist
// Jenkins UI > Manage Jenkins > Credentials
```
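Missing or empty secrets often surface only as a confusing failure deep inside a deploy script. A small preflight check makes the cause explicit. This is a sketch under the assumption that credentials are exposed as environment variables; `missing_secrets` is a hypothetical helper and the variable names are the ones used in the pipeline above.

```python
import os
import sys


def missing_secrets(required, env=None):
    """Return the names in required that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets(["DOCKER_USER", "DOCKER_PASS", "API_TOKEN"])
    if missing:
        # Print names only, never values.
        print("Missing secrets: " + ", ".join(missing), file=sys.stderr)
        sys.exit(1)
```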

### 5. Fix dependency installation failures

npm/yarn failures:

```yaml
# GitHub Actions - Node.js
- name: Setup Node.js
  uses: actions/setup-node@v4
  with:
    node-version: '18'
    cache: 'npm'

- name: Install dependencies
  run: |
    # Use ci for clean install from lock file
    npm ci

    # Or use legacy-peer-deps for conflicting packages
    # npm ci --legacy-peer-deps

    # Clean cache if corrupted
    # npm cache clean --force
    # npm ci

# Handle ERESOLVE errors
- name: Install (with fallback)
  run: |
    npm ci || npm ci --legacy-peer-deps || npm install
```
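The `a || b || c` fallback chain above can also be written as a script that records which command finally succeeded, which helps when deciding whether the strict install is still viable. This is a minimal sketch; `first_successful` is a hypothetical helper, and the example commands mirror the npm chain shown above.

```python
import subprocess
import sys


def first_successful(commands):
    """Run each command in order and return the index of the first that exits 0.

    Returns None if every command fails.
    """
    for index, cmd in enumerate(commands):
        if subprocess.run(cmd).returncode == 0:
            return index
    return None


if __name__ == "__main__":
    chain = [
        ["npm", "ci"],
        ["npm", "ci", "--legacy-peer-deps"],
        ["npm", "install"],
    ]
    winner = first_successful(chain)
    if winner is None:
        sys.exit("All install strategies failed")
    print("Installed with: " + " ".join(chain[winner]))
```

If anything other than the first command wins, treat that as a warning that the lock file and `package.json` have drifted apart.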

Python/pip failures:

```yaml
# GitHub Actions - Python
- name: Setup Python
  uses: actions/setup-python@v4
  with:
    python-version: '3.11'
    cache: 'pip'

- name: Install dependencies
  run: |
    # Upgrade pip first
    python -m pip install --upgrade pip

    # Install from requirements
    pip install -r requirements.txt

    # Or with cache
    # pip install --cache-dir=.pip-cache -r requirements.txt

# Handle SSL certificate errors
# (temporary workaround only: --trusted-host skips certificate verification for these hosts)
- name: Install (SSL workaround)
  run: |
    pip install --trusted-host pypi.org \
      --trusted-host files.pythonhosted.org \
      -r requirements.txt
```

Docker build failures:

```yaml
# Optimize Docker builds
- name: Build Docker image
  run: |
    # Use buildx for better caching
    docker buildx create --use

    # Build with cache (the cache export flags require buildx)
    docker buildx build \
      --cache-from type=registry,ref=myapp:cache \
      --cache-to type=registry,ref=myapp:cache,mode=max \
      -t myapp:latest \
      .

# Handle Docker rate limits
# Use mirror or authenticate
- name: Login to Docker Hub
  uses: docker/login-action@v3
  with:
    username: ${{ secrets.DOCKER_USER }}
    password: ${{ secrets.DOCKER_PASS }}

# Multi-stage builds to reduce size
# Dockerfile
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "dist/index.js"]
```

### 6. Fix timeout issues

Optimize slow pipelines:

```yaml
# GitHub Actions - parallelize
jobs:
  test:
    strategy:
      matrix:
        node: [16, 18, 20]
        os: [ubuntu-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm test

  # Run independent jobs in parallel
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit

  # Increase timeout
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 60  # Default: 360
    steps:
      - uses: actions/checkout@v4
      - run: ./slow-build.sh

  # Skip unnecessary jobs
  deploy:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh
```

Debug slow steps:

```yaml
# Add timing to steps
- name: Build with timing
  run: |
    time npm run build

# Split work into named steps so each one reports its own duration
- uses: actions/checkout@v4
- name: Setup Node
  uses: actions/setup-node@v4
- name: Install
  run: npm ci
- name: Build
  run: npm run build

# Check which step is slowest
# GitHub Actions: Look at step duration in UI
# Or list per-step start and completion times with the GitHub CLI:
#   gh run view <run-id> --json jobs
```

### 7. Monitor CI/CD health

GitHub Actions monitoring:

```bash
# Check workflow runs via API
gh api /repos/{owner}/{repo}/actions/workflows/<workflow-id>/runs

# Get failed runs
gh run list --status failure --limit 10

# Workflow run duration (compare the start and last-update timestamps)
gh run view <run-id> --json startedAt,updatedAt

# Set up alerts for failures
# GitHub > Settings > Notifications
# Or use external monitoring
```

Prometheus metrics for Jenkins:

```yaml
# Install Prometheus plugin in Jenkins
# Manage Jenkins > Manage Plugins > Prometheus

# Metrics endpoint
# http://jenkins.example.com/prometheus

# Key metrics:
# default_jenkins_builds_duration_milliseconds_summary
# default_jenkins_builds_last_build_result_ordinal
# default_jenkins_builds_queued_duration_milliseconds
# default_jenkins_queue_size_value
# default_jenkins_nodes_executors_available

# Prometheus alerting rules
groups:
  - name: cicd_health
    rules:
      - alert: JenkinsBuildFailureRate
        # Share of jobs whose last build failed (result ordinal 2 = FAILURE)
        expr: |
          count(default_jenkins_builds_last_build_result_ordinal == 2)
            /
          count(default_jenkins_builds_last_build_result_ordinal)
            > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 20% of Jenkins jobs have a failed last build"

      - alert: JenkinsQueueSize
        expr: default_jenkins_queue_size_value > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Jenkins queue size growing"

      # Metric names below depend on the GitHub Actions exporter in use
      - alert: GitHubActionsFailureRate
        expr: |
          sum(rate(github_actions_workflow_runs_failed[5m]))
            /
          sum(rate(github_actions_workflow_runs_completed[5m]))
            > 0.3
        for: 10m
        labels:
          severity: warning
```

Prevention

Pin action/dependency versions to avoid breaking changes
Use caching for dependencies and build artifacts
Set appropriate timeouts based on historical run times
Implement retry logic for flaky tests and network operations
Regular cleanup of old builds, artifacts, and workspaces
Monitor runner disk space and resource utilization
Use matrix builds for parallel testing
Document common failure patterns and solutions
Set up alerts for pipeline failure rate increases
Test pipeline changes in staging before production

Related Errors

**403 Forbidden**: Authentication/authorization failure
**404 Not Found**: Resource (repo, artifact, workflow) doesn't exist
**500 Internal Server Error**: CI/CD platform server error
**502 Bad Gateway**: CI/CD platform proxy issue
**503 Service Unavailable**: CI/CD platform temporarily unavailable
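These status codes split into two groups: client-side errors that need a configuration fix, and platform-side errors that usually clear on their own. A script that calls a CI/CD or registry API can use that split to decide whether retrying makes sense. This is a sketch; `should_retry` and `advice` are hypothetical helpers, and the wording comes from the list above.

```python
# Platform-side errors that are normally transient.
RETRYABLE = {500, 502, 503}

# Client-side errors that retrying will not fix.
FIX_REQUIRED = {
    403: "Check tokens, permissions, and secret expiry",
    404: "Check the repo, artifact, or workflow name",
}


def should_retry(status):
    """Return True if the HTTP status suggests a transient platform problem."""
    return status in RETRYABLE


def advice(status):
    """Return a short next step for the given HTTP status."""
    if should_retry(status):
        return "Retry with backoff and check the platform status page"
    return FIX_REQUIRED.get(status, "Inspect the response body")


if __name__ == "__main__":
    for code in (403, 404, 500, 502, 503):
        print(code, advice(code))
```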
