Introduction

CI/CD pipeline failures occur when an automated build, test, or deployment workflow stops because of configuration errors, resource constraints, dependency problems, authentication failures, or infrastructure faults. Modern CI/CD platforms (GitHub Actions, GitLab CI, Jenkins, CircleCI, and others) orchestrate workflows across multiple stages, and each stage is a potential failure point. Typical causes include unavailable or misconfigured runners and agents, dependencies that fail to install, failing tests that block deployment, artifact upload or download errors, jobs that exceed their timeout, expired or invalid secrets, exhausted disk space on runners, concurrent job limits, branch protection rules that block deployment, and environment-specific configuration mismatches.

Fixing these failures requires an understanding of the platform's architecture, its workflow configuration, and the available debugging and recovery tools. This guide covers troubleshooting for GitHub Actions, GitLab CI, and Jenkins.

Symptoms

  • Pipeline fails immediately with runner lost or agent unavailable
  • Error: Process completed with exit code 1
  • npm ERR! Could not resolve dependency
  • fatal: unable to access repository: SSL certificate problem
  • Error: No space left on device during build
  • Error: The operation was canceled (timeout)
  • 403 Forbidden when pushing to registry
  • Error: Secrets are not available in pull requests from forks
  • Pipeline stuck in queued state indefinitely
  • Artifact expired or Artifact not found
  • Concurrency group canceled previous run
  • Deployment blocked by environment protection rules

Common Causes

  • Runner self-hosted agent offline or crashed
  • GitHub Actions/GitLab CI service outage
  • package.json, requirements.txt, or Gemfile has conflicting versions
  • NPM/Maven/PyPI registry unavailable or rate limited
  • Test suite has flaky tests failing intermittently
  • Docker build exceeds the job or step time limit (for example, CircleCI stops a step after 10 minutes without output by default)
  • Large artifacts exceeding storage limits
  • SSH keys, tokens, or certificates expired
  • Branch protection requiring reviews before merge
  • Environment variables or secrets not configured
  • Workspace/disk space exhausted on runner
  • Parallel job conflicts (database locks, port conflicts)
  • Webhook delivery failures preventing pipeline trigger

Step-by-Step Fix

### 1. Diagnose pipeline failures

Check pipeline logs:

```bash
# GitHub Actions
# Navigate to: Repository > Actions > Workflow Run > Job
# Expand each step to see output

# Key sections:
# - Set up job: Runner assignment, workspace cleanup
# - Checkout: Repository clone status
# - Setup [language]: Runtime installation
# - Dependencies: Install status
# - Build/Test: Compilation and test output
# - Deploy: Deployment commands

# Print full logs with the GitHub CLI
gh run view <run-id> --log

# View recent runs
gh run list --limit 10

# Check specific run
gh run view <run-id>

# GitLab CI
# Navigate to: Pipeline > Job > Trace
# Or use CLI:
glab ci trace <job-name>

# Download job artifacts (older glab releases use: glab ci artifact <ref> <job-name>)
glab job artifact <ref> <job-name>

# Jenkins
# Navigate to: Job > Build # > Console Output
# Or use API:
curl -u user:token http://jenkins.example.com/job/myjob/123/consoleText
```

Common error patterns:

```yaml
# GitHub Actions error patterns:

# Runner lost
# Error: The self-hosted runner: runner-1 lost communication with GitHub.
# Cause: Runner process crashed, network interruption
# Fix: Restart runner, check network connectivity

# Timeout
# Error: The operation was canceled.
# Cause: Job exceeded timeout-minutes
# Fix: Optimize slow steps, increase timeout

# Dependency failure
# npm ERR! ERESOLVE unable to resolve dependency tree
# Cause: Conflicting package versions
# Fix: Update package.json, use npm ci with lock file

# Artifact failure
# Error: Artifact not found - {name: 'build-output'}
# Cause: Artifact expired (default 90 days) or wrong name
# Fix: Use correct artifact name, download before expiration

# Secrets in fork PR
# Error: Secrets are not available in pull requests from fork repositories
# Cause: Security restriction
# Fix: Use workflow approval or avoid secrets in fork PR workflows
```
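The error patterns above lend themselves to automated triage of a downloaded log. The following is a minimal sketch, assuming the log has already been saved as text (for example from `gh run view <run-id> --log`). The `ERROR_PATTERNS` table, the category names, and the `classify_log` helper are hypothetical and only cover messages quoted in this guide.

```python
import re

# Hypothetical lookup table: (pattern, category, suggested fix).
# The patterns come from the error messages quoted in this guide.
ERROR_PATTERNS = [
    (re.compile(r"lost communication with"), "runner_lost",
     "Restart the runner and check network connectivity"),
    (re.compile(r"The operation was canceled"), "timeout",
     "Optimize slow steps or increase the timeout"),
    (re.compile(r"ERESOLVE|Could not resolve dependency"), "dependency",
     "Reconcile package versions and install from the lock file"),
    (re.compile(r"Artifact (not found|expired)", re.IGNORECASE), "artifact",
     "Check the artifact name and retention period"),
    (re.compile(r"No space left on device"), "disk_full",
     "Free disk space on the runner"),
    (re.compile(r"Secrets are not available"), "fork_secrets",
     "Require workflow approval or avoid secrets in fork PR workflows"),
]


def classify_log(text):
    """Return (line_number, category, fix) for every line matching a known pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern, category, fix in ERROR_PATTERNS:
            if pattern.search(line):
                findings.append((number, category, fix))
                break  # first matching pattern wins for this line
    return findings


if __name__ == "__main__":
    sample = "\n".join([
        "Run npm ci",
        "npm ERR! ERESOLVE unable to resolve dependency tree",
        "Error: No space left on device",
    ])
    for number, category, fix in classify_log(sample):
        print(f"line {number}: {category} -> {fix}")
```

A table like this is only as good as its patterns, so it is best treated as a starting point that grows as new failure messages are documented.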

### 2. Fix GitHub Actions issues

Runner configuration:

```yaml
# .github/workflows/ci.yml

# Using GitHub-hosted runners
jobs:
  build:
    runs-on: ubuntu-latest  # or windows-latest, macos-latest
    # runs-on: ubuntu-22.04  # Specific version

    # Runner group (for organization runners)
    # runs-on:
    #   group: my-runner-group
    #   labels:
    #     - self-hosted
    #     - linux

    # Job-level timeout (default: 360 minutes / 6 hours)
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      # Let the job continue if this step fails (this does not retry the step)
      - uses: actions/checkout@v4
        continue-on-error: true

      # Conditional execution
      - if: github.ref == 'refs/heads/main'
        run: ./deploy.sh
```

Self-hosted runner setup:

```bash
# Download and configure runner
# GitHub > Settings > Actions > Runners > New self-hosted runner

# On runner machine:
mkdir actions-runner && cd actions-runner
curl -O -L https://github.com/actions/runner/releases/download/v2.311.0/actions-runner-linux-x64-2.311.0.tar.gz
tar xzf ./actions-runner-linux-x64-2.311.0.tar.gz

# Configure
./config.sh --url https://github.com/org/repo --token TOKEN

# Start runner
./run.sh

# Run as service (systemd)
cat > /etc/systemd/system/actions-runner.service << 'EOF'
[Unit]
Description=GitHub Actions Runner
After=network.target

[Service]
Type=simple
User=runner
WorkingDirectory=/home/runner/actions-runner
ExecStart=/home/runner/actions-runner/run.sh
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable actions-runner
systemctl start actions-runner

# Check runner status
systemctl status actions-runner

# Troubleshoot runner
# Check runner logs
tail -f /home/runner/actions-runner/_diag/Runner_*.log

# Verify connectivity
curl -I https://github.com
curl -I https://pipelines.actions.githubusercontent.com
```

Caching dependencies:

```yaml
# Speed up builds and reduce failures with caching
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Node.js caching
      - name: Cache node modules
        uses: actions/cache@v4
        with:
          path: ~/.npm
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-

      # Python caching
      - name: Cache pip packages
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      # Docker layer caching
      - name: Cache Docker layers
        uses: actions/cache@v4
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-
```
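The cache keys above combine the runner OS with a hash of the lock file, so a dependency change produces a new key while `restore-keys` still allows a partial hit. The sketch below models that behavior in simplified form. It is not GitHub's implementation: `cache_key`, `lookup`, and the use of SHA-256 over a single file are assumptions made for illustration.

```python
import hashlib


def cache_key(os_name, tool, lockfile_bytes):
    """Build a key that changes whenever the lock file content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{os_name}-{tool}-{digest}"


def lookup(stored_keys, key, restore_prefixes):
    """Return the key to restore: an exact hit first, else the newest prefix match.

    stored_keys is assumed to be ordered oldest to newest.
    """
    if key in stored_keys:
        return key
    for prefix in restore_prefixes:
        matches = [k for k in stored_keys if k.startswith(prefix)]
        if matches:
            return matches[-1]
    return None


if __name__ == "__main__":
    old_key = cache_key("Linux", "node", b'{"lodash": "4.17.20"}')
    new_key = cache_key("Linux", "node", b'{"lodash": "4.17.21"}')
    stored = [old_key]
    # The lock file changed, so there is no exact hit; the prefix falls back to the old cache.
    print(lookup(stored, new_key, ["Linux-node-"]) == old_key)
```

The fallback matters for build time: a stale cache restored through a prefix still saves most downloads, and the job then saves a fresh entry under the new key.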

Handle flaky tests:

```yaml
# Retry flaky tests
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Run tests with retry
      - name: Run tests with retry
        uses: nick-fields/retry@v2
        with:
          timeout_minutes: 10
          max_attempts: 3
          command: npm test

      # Or separate flaky tests
      - name: Run stable tests
        run: npm run test:stable

      - name: Run flaky tests (allowed to fail)
        run: npm run test:flaky
        continue-on-error: true
```

### 3. Fix GitLab CI issues

Runner configuration:

```yaml
# .gitlab-ci.yml

# Specify runner tags
build:
  tags:
    - docker
    - linux
  script:
    - docker build -t myapp .

# Using specific runner
staging-deploy:
  tags:
    - staging-runner
  script:
    - ./deploy.sh staging

# Timeout configuration
# (separate snippet: merge into the build job above rather than defining build twice)
build:
  timeout: 30m  # Job-level timeout
  script:
    - npm run build

# Retry configuration
test:
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
  script:
    - npm test
```
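The retry settings above re-run a whole job. For a single network-bound command inside a script, a bounded retry with exponential backoff is a common alternative. This is a minimal sketch under that assumption; the `retry` helper and its parameters are hypothetical and not part of any CI platform.

```python
import time


def retry(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2x, 4x, ... between attempts.
    The exception from the final attempt is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to imitate a registry that recovers.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("registry unavailable")
        return "installed"

    print(retry(flaky, base_delay=0.1))
```

Retrying hides real defects if it is applied to everything, so it is best limited to operations that are known to fail transiently, such as registry downloads.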

Register GitLab runner:

```bash
# Install gitlab-runner
# Debian/Ubuntu
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | bash
apt-get install gitlab-runner

# RHEL/CentOS
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.rpm.sh | bash
yum install gitlab-runner

# Register runner
gitlab-runner register

# Input:
# GitLab URL: https://gitlab.com/
# Registration token: Get from GitLab CI/CD settings
#   (newer GitLab versions issue a runner authentication token when you create the runner in the UI)
# Runner description: my-runner
# Tags: docker,linux
# Executor: docker

# Run as service
systemctl enable gitlab-runner
systemctl start gitlab-runner

# Check runner status
gitlab-runner status
gitlab-runner list

# Troubleshoot
journalctl -u gitlab-runner -f
```

Artifact management:

```yaml
# Configure artifacts
build:
  script:
    - npm run build
  artifacts:
    paths:
      - dist/
      - build/
    exclude:
      - dist/**/*.map
    expire_in: 1 week  # Default: 30 days
    name: "build-$CI_COMMIT_SHA"
    reports:
      junit: test-results.xml
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml

# Download artifacts from other jobs
deploy:
  needs:
    - build
  script:
    - ls -la dist/  # Build artifacts available
    - ./deploy.sh

# Download artifacts from the latest pipeline on a specific ref
deploy-production:
  script:
    # Assumes the glab CLI is already installed in the job image and authenticated
    - glab job artifact main build
    - ./deploy.sh
```

### 4. Fix Jenkins pipeline issues

Declarative pipeline:

```groovy
// Jenkinsfile
pipeline {
    agent {
        // Use specific label
        label 'docker-agent'

        // Or use Kubernetes pod
        // kubernetes {
        //     yaml '''
        //     spec:
        //       containers:
        //       - name: node
        //         image: node:18
        //         command:
        //         - cat
        //         tty: true
        //     '''
        // }
    }

    options {
        timeout(time: 30, unit: 'MINUTES')
        disableConcurrentBuilds()  // One at a time
        timestamps()               // Add timestamps to logs
        retry(3)                   // Retry entire pipeline
    }

    environment {
        NODE_VERSION = '18'
        NPM_CONFIG_CACHE = "${WORKSPACE}/.npm-cache"
    }

    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }

        stage('Build') {
            steps {
                retry(2) {  // Retry this stage
                    sh 'npm ci'
                    sh 'npm run build'
                }
            }
        }

        stage('Test') {
            steps {
                // Continue even if tests fail
                catchError(buildResult: 'UNSTABLE', stageResult: 'UNSTABLE') {
                    sh 'npm test'
                }
            }
            post {
                always {
                    // Archive test results
                    junit 'test-results/*.xml'
                }
            }
        }

        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                sh './deploy.sh'
            }
        }
    }

    post {
        always {
            // Clean workspace
            cleanWs()
        }
        failure {
            // Notify on failure
            emailext subject: 'Build Failed',
                     body: "Check ${BUILD_URL}",
                     to: 'team@example.com'
        }
    }
}
```

Jenkins agent configuration:

```bash
# Check agent status
# Jenkins UI > Manage Jenkins > Nodes

# Agent connection issues:
# 1. Check agent JNLP connection
# 2. Verify agent.jar running
# 3. Check firewall allows JNLP port (default 50000)

# Restart agent
# On agent machine:
systemctl restart jenkins-agent

# Or relaunch from Jenkins UI
# Nodes > [agent] > Relaunch

# Agent disk space issues
# Check workspace
du -sh /var/jenkins/workspace/* | sort -hr | head -20

# Delete builds older than 30 days (this is Groovy, not shell)
# Paste into Jenkins UI > Manage Jenkins > Script Console:
#   def cutoff = new Date() - 30
#   Jenkins.instance.getAllItems(Job.class).each { job ->
#       job.builds.findAll { it.getTime() < cutoff }.each { it.delete() }
#   }

# Disk space threshold
# Manage Jenkins > Configure System > Node Properties
# Set: Delete workspace before build starts
```
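Disk exhaustion is easier to handle when a job checks free space before the build starts instead of failing halfway through. The sketch below shows one possible pre-build check. The 10% threshold and the `disk_status` and `check_path` helpers are assumptions chosen for illustration.

```python
import shutil


def disk_status(total_bytes, free_bytes, min_free_ratio=0.10):
    """Return ("ok" | "low", free_ratio) for the given capacity figures."""
    ratio = free_bytes / total_bytes
    status = "low" if ratio < min_free_ratio else "ok"
    return status, ratio


def check_path(path=".", min_free_ratio=0.10):
    """Measure the filesystem that holds path and classify its free space."""
    usage = shutil.disk_usage(path)
    return disk_status(usage.total, usage.free, min_free_ratio)


if __name__ == "__main__":
    status, ratio = check_path(".")
    print(f"workspace disk: {status} ({ratio:.0%} free)")
    if status == "low":
        # A non-zero exit fails the step early with a clear message.
        raise SystemExit("Not enough free disk space to start the build")
```

Run as the first step of a job, a check like this turns a confusing mid-build `No space left on device` into an immediate, clearly labeled failure.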

Jenkins credentials:

```groovy
// Access credentials in pipeline
pipeline {
    agent any
    stages {
        stage('Deploy') {
            steps {
                // Username/password
                withCredentials([usernamePassword(
                    credentialsId: 'docker-credentials',
                    usernameVariable: 'DOCKER_USER',
                    passwordVariable: 'DOCKER_PASS'
                )]) {
                    // Pass the password on stdin so it does not appear in the process list
                    sh 'echo "$DOCKER_PASS" | docker login -u "$DOCKER_USER" --password-stdin'
                }

                // SSH key
                withCredentials([sshUserPrivateKey(
                    credentialsId: 'deploy-key',
                    keyFileVariable: 'SSH_KEY'
                )]) {
                    sh 'ssh -i $SSH_KEY deploy@server "./deploy.sh"'
                }

                // Secret text (API token)
                withCredentials([string(
                    credentialsId: 'api-token',
                    variable: 'API_TOKEN'
                )]) {
                    sh './deploy.sh --token $API_TOKEN'
                }
            }
        }
    }
}

// Check credentials exist
// Jenkins UI > Manage Jenkins > Credentials
```
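Missing secrets often surface only when a deploy command fails with an opaque authentication error. A pre-flight check can report them by name before anything runs. This is a minimal sketch: the variable names match the credential bindings used above, and the `missing_settings` helper is hypothetical.

```python
import os


def missing_settings(required, env=None):
    """Return the names in required that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    required = ["DOCKER_USER", "DOCKER_PASS", "API_TOKEN"]
    missing = missing_settings(required)
    if missing:
        # Only names are printed, never values, so the log stays free of secrets.
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All required settings are present")
```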

### 5. Fix dependency installation failures

npm/yarn failures:

```yaml
# GitHub Actions - Node.js
- name: Setup Node.js
  uses: actions/setup-node@v4
  with:
    node-version: '18'
    cache: 'npm'

- name: Install dependencies
  run: |
    # Use ci for clean install from lock file
    npm ci

    # Or use legacy-peer-deps for conflicting packages
    # npm ci --legacy-peer-deps

    # Clean cache if corrupted
    # npm cache clean --force
    # npm ci

# Handle ERESOLVE errors
- name: Install (with fallback)
  run: |
    npm ci || npm ci --legacy-peer-deps || npm install
```
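The `||` chain above tries progressively more permissive install strategies and stops at the first success. The same logic can be sketched in a script that also reports which strategy worked. The `run_with_fallback` helper is hypothetical, and the demo uses stand-in commands so that it runs without npm installed.

```python
import subprocess
import sys


def run_with_fallback(commands):
    """Run each command in order and return the first one that exits with 0."""
    for command in commands:
        completed = subprocess.run(command)
        if completed.returncode == 0:
            return command
    raise RuntimeError("all install strategies failed")


if __name__ == "__main__":
    # Stand-in commands; in a pipeline these would be
    # ["npm", "ci"], ["npm", "ci", "--legacy-peer-deps"], ["npm", "install"].
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    passing = [sys.executable, "-c", "print('installed')"]
    used = run_with_fallback([failing, passing])
    print("succeeded with:", used)
```

Logging the strategy that succeeded is the main benefit over the shell one-liner: a build that silently falls through to `npm install` no longer reproduces the lock file, which is worth knowing.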

Python/pip failures:

```yaml
# GitHub Actions - Python
- name: Setup Python
  uses: actions/setup-python@v4
  with:
    python-version: '3.11'
    cache: 'pip'

- name: Install dependencies
  run: |
    # Upgrade pip first
    python -m pip install --upgrade pip

    # Install from requirements
    pip install -r requirements.txt

    # Or with cache
    # pip install --cache-dir=.pip-cache -r requirements.txt

# Handle SSL certificate errors
# (temporary workaround; prefer fixing the runner's CA certificates)
- name: Install (SSL workaround)
  run: |
    pip install --trusted-host pypi.org \
      --trusted-host files.pythonhosted.org \
      -r requirements.txt
```

Docker build failures:

```yaml
# Optimize Docker builds
- name: Build Docker image
  run: |
    # Use buildx for better caching
    docker buildx create --use

    # Build with cache (registry cache export needs buildx, so call it explicitly;
    # --load keeps the built image in the local Docker image store)
    docker buildx build \
      --cache-from type=registry,ref=myapp:cache \
      --cache-to type=registry,ref=myapp:cache,mode=max \
      --load \
      -t myapp:latest \
      .

# Handle Docker rate limits
# Use mirror or authenticate
- name: Login to Docker Hub
  uses: docker/login-action@v3
  with:
    username: ${{ secrets.DOCKER_USER }}
    password: ${{ secrets.DOCKER_PASS }}

# Multi-stage builds to reduce size
# Dockerfile (shown as comments because this block is YAML):
#   FROM node:18 AS builder
#   WORKDIR /app
#   COPY package*.json ./
#   RUN npm ci
#   COPY . .
#   RUN npm run build
#
#   FROM node:18-alpine
#   WORKDIR /app
#   COPY --from=builder /app/dist ./dist
#   COPY --from=builder /app/node_modules ./node_modules
#   CMD ["node", "dist/index.js"]
```

### 6. Fix timeout issues

Optimize slow pipelines:

```yaml
# GitHub Actions - parallelize
jobs:
  test:
    strategy:
      matrix:
        node: [16, 18, 20]
        os: [ubuntu-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm test

  # Run independent jobs in parallel
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit

  # Increase timeout
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 60  # Default: 360
    steps:
      - run: ./slow-build.sh

  # Skip unnecessary jobs
  deploy:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    steps:
      - run: ./deploy.sh
```
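A `timeout-minutes` value is easier to justify when it is derived from past run durations instead of guessed. One possible rule, shown here as a sketch, takes a high percentile of recent durations and adds headroom. The 95th percentile, the 1.5x headroom, and the `suggest_timeout_minutes` helper are assumptions, not platform recommendations.

```python
import math


def suggest_timeout_minutes(durations_minutes, percentile=0.95, headroom=1.5, minimum=5):
    """Suggest a timeout from historical run durations given in minutes.

    Uses the nearest-rank percentile, multiplies it by headroom,
    rounds up, and never returns less than minimum.
    """
    if not durations_minutes:
        raise ValueError("at least one historical duration is required")
    ordered = sorted(durations_minutes)
    rank = max(1, math.ceil(percentile * len(ordered)))
    baseline = ordered[rank - 1]
    return max(minimum, math.ceil(baseline * headroom))


if __name__ == "__main__":
    recent_runs = [10, 12, 11, 14, 30]  # hypothetical durations in minutes
    print(suggest_timeout_minutes(recent_runs))
```

A timeout set this way is tight enough to stop hung jobs quickly while leaving room for the slowest legitimate runs.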

Debug slow steps:

```yaml
# Add timing to steps
- name: Build with timing
  run: |
    time npm run build

# Split work into named steps so each one reports its own duration
- uses: actions/checkout@v4
- name: Setup Node
  uses: actions/setup-node@v4
- name: Install
  run: npm ci
- name: Build
  run: npm run build

# Check which step is slowest
# GitHub Actions: Look at step duration in UI
# Or read per-step timestamps with the CLI: gh run view <run-id> --json jobs
```
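Step durations can also be ranked from the command line instead of read off the UI. The sketch below parses JSON in the shape that `gh run view <run-id> --json jobs` is assumed to produce (jobs containing steps with `startedAt` and `completedAt` timestamps). The sample data and the `step_durations` helper are hypothetical.

```python
import json
from datetime import datetime


def parse_time(value):
    """Parse an ISO 8601 timestamp such as 2024-01-01T10:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def step_durations(run_json):
    """Return (seconds, job_name, step_name) tuples, slowest first."""
    rows = []
    for job in json.loads(run_json).get("jobs", []):
        for step in job.get("steps", []):
            if not step.get("startedAt") or not step.get("completedAt"):
                continue  # step has not run or is still running
            elapsed = parse_time(step["completedAt"]) - parse_time(step["startedAt"])
            rows.append((elapsed.total_seconds(), job["name"], step["name"]))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    sample = json.dumps({"jobs": [{"name": "build", "steps": [
        {"name": "Install", "startedAt": "2024-01-01T10:00:00Z",
         "completedAt": "2024-01-01T10:02:30Z"},
        {"name": "Build", "startedAt": "2024-01-01T10:02:30Z",
         "completedAt": "2024-01-01T10:03:00Z"},
    ]}]})
    for seconds, job, step in step_durations(sample):
        print(f"{seconds:6.0f}s  {job} / {step}")
```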

### 7. Monitor CI/CD health

GitHub Actions monitoring:

```bash
# Check workflow runs via API
# gh fills in {owner} and {repo} from the current repository; replace {workflow_id} yourself
gh api /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs

# Get failed runs
gh run list --status failure --limit 10

# Workflow run duration (difference between the two timestamps)
gh run view <run-id> --json startedAt,updatedAt

# Set up alerts for failures
# GitHub > Settings > Notifications
# Or use external monitoring
```
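A failure rate is a more useful health signal than a list of failed runs. The sketch below computes it from JSON in the shape that `gh run list --json conclusion` is assumed to return, skipping runs that have not finished. The `failure_rate` helper and the sample data are hypothetical.

```python
import json


def failure_rate(runs_json):
    """Return the fraction of finished runs whose conclusion is "failure"."""
    runs = json.loads(runs_json)
    finished = [run for run in runs if run.get("conclusion")]
    if not finished:
        return 0.0
    failed = sum(1 for run in finished if run["conclusion"] == "failure")
    return failed / len(finished)


if __name__ == "__main__":
    sample = json.dumps([
        {"conclusion": "success"},
        {"conclusion": "failure"},
        {"conclusion": ""},          # still in progress, ignored
        {"conclusion": "success"},
    ])
    print(f"failure rate: {failure_rate(sample):.0%}")
```

Tracking this number per workflow over time makes it clear whether a failure is an isolated event or part of a trend.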

Prometheus metrics for Jenkins:

```yaml
# Install Prometheus plugin in Jenkins
# Manage Jenkins > Manage Plugins > Prometheus

# Metrics endpoint
# http://jenkins.example.com/prometheus

# Key metrics:
# default_jenkins_builds_duration_milliseconds_summary
# default_jenkins_builds_last_build_result_ordinal
# default_jenkins_builds_queued_duration_milliseconds
# default_jenkins_queue_size_value
# default_jenkins_nodes_executors_available

# Prometheus alerting rules
groups:
  - name: cicd_health
    rules:
      - alert: JenkinsBuildFailureRate
        # Share of jobs whose last build failed (result ordinal 2 = FAILURE)
        expr: |
          count(default_jenkins_builds_last_build_result_ordinal == 2)
          /
          count(default_jenkins_builds_last_build_result_ordinal)
          > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 20% of Jenkins jobs have a failed last build"

      - alert: JenkinsQueueSize
        expr: default_jenkins_queue_size_value > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Jenkins queue size growing"

      # Metric names below depend on the GitHub Actions exporter in use; treat them as placeholders
      - alert: GitHubActionsFailureRate
        expr: |
          sum(rate(github_actions_workflow_runs_failed[5m]))
          /
          sum(rate(github_actions_workflow_runs_completed[5m]))
          > 0.3
        for: 10m
        labels:
          severity: warning
```

Prevention

  • Pin action/dependency versions to avoid breaking changes
  • Use caching for dependencies and build artifacts
  • Set appropriate timeouts based on historical run times
  • Implement retry logic for flaky tests and network operations
  • Regular cleanup of old builds, artifacts, and workspaces
  • Monitor runner disk space and resource utilization
  • Use matrix builds for parallel testing
  • Document common failure patterns and solutions
  • Set up alerts for pipeline failure rate increases
  • Test pipeline changes in staging before production

Related HTTP Status Codes

  • **403 Forbidden**: Authentication/authorization failure
  • **404 Not Found**: Resource (repo, artifact, workflow) doesn't exist
  • **500 Internal Server Error**: CI/CD platform server error
  • **502 Bad Gateway**: CI/CD platform proxy issue
  • **503 Service Unavailable**: CI/CD platform temporarily unavailable
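
The status codes above call for different reactions: a 403 or 404 needs a configuration fix, while a 5xx response usually clears up on its own. A script that calls a CI/CD platform API could encode that distinction as follows. This is a sketch, and the action names and the `next_action` helper are hypothetical.

```python
# Hypothetical mapping from the status codes listed in this guide to a next action.
ACTIONS = {
    403: "fix_credentials",   # authentication/authorization failure
    404: "fix_reference",     # repo, artifact, or workflow does not exist
    500: "retry_later",       # platform server error
    502: "retry_later",       # platform proxy issue
    503: "retry_later",       # platform temporarily unavailable
}


def next_action(status_code):
    """Return the suggested reaction for an HTTP status code from a CI/CD API."""
    if 200 <= status_code < 300:
        return "ok"
    return ACTIONS.get(status_code, "investigate")


if __name__ == "__main__":
    for code in (200, 403, 404, 503, 418):
        print(code, next_action(code))
```

Only the `retry_later` cases are worth retrying automatically; repeating a request that fails with 403 or 404 wastes time and can trigger rate limits.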