## Introduction
CI/CD pipeline failures occur when automated build, test, or deployment workflows break because of configuration errors, resource constraints, dependency issues, authentication failures, or infrastructure problems. Modern CI/CD platforms (GitHub Actions, GitLab CI, Jenkins, CircleCI, and others) orchestrate workflows across multiple stages, and each stage is a potential failure point. Common causes include unavailable or misconfigured runners and agents, package dependencies that fail to install, test failures that block deployment, artifact upload or download failures, long-running jobs that exceed their timeout, expired or invalid secrets, exhausted disk space on runners, concurrent job limits, branch protection rules that block deployment, and environment-specific configuration mismatches. Fixing them requires an understanding of CI/CD architecture, workflow configuration, debugging tools, and recovery procedures. This guide provides troubleshooting steps for CI/CD failures across GitHub Actions, GitLab CI, and Jenkins deployments.
## Symptoms
- Pipeline fails immediately with `runner lost` or `agent unavailable`
- `Error: Process completed with exit code 1`
- `npm ERR! Could not resolve dependency`
- `fatal: unable to access repository: SSL certificate problem`
- `Error: No space left on device` during build
- `Error: The operation was canceled` (timeout)
- `403 Forbidden` when pushing to registry
- `Error: Secrets are not available in pull requests from forks`
- Pipeline stuck in `queued` state indefinitely
- `Artifact expired` or `Artifact not found`
- `Concurrency group canceled` previous run
- Deployment blocked by environment protection rules
## Common Causes
- Self-hosted runner or agent offline or crashed
- GitHub Actions/GitLab CI service outage
- package.json, requirements.txt, or Gemfile has conflicting versions
- NPM/Maven/PyPI registry unavailable or rate limited
- Test suite has flaky tests failing intermittently
- Docker build exceeds the job or step time limit (for example, CircleCI stops a command after 10 minutes without output by default)
- Large artifacts exceeding storage limits
- SSH keys, tokens, or certificates expired
- Branch protection requiring reviews before merge
- Environment variables or secrets not configured
- Workspace/disk space exhausted on runner
- Parallel job conflicts (database locks, port conflicts)
- Webhook delivery failures preventing pipeline trigger
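
Some of the causes above, in particular environment variables or secrets that were never configured, can be caught before the real work starts. The following is a minimal sketch of a pre-flight check; the function name `require_env` and the variable names in the usage note are hypothetical and not part of any CI platform.

```bash
# Hypothetical pre-flight helper: fail fast when required variables
# are unset or empty. Returns 0 only if every named variable has a value.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A job could call `require_env DEPLOY_TOKEN REGISTRY_URL || exit 1` as its first step, so a missing secret shows up as one clear message instead of a failure deep inside the deploy script.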
## Step-by-Step Fix
### 1. Diagnose pipeline failures
Check pipeline logs:
```bash
# GitHub Actions
# Navigate to: Repository > Actions > Workflow Run > Job
# Expand each step to see output

# Key sections:
# - Set up job: Runner assignment, workspace cleanup
# - Checkout: Repository clone status
# - Setup [language]: Runtime installation
# - Dependencies: Install status
# - Build/Test: Compilation and test output
# - Deploy: Deployment commands

# Download full logs with the GitHub CLI
gh run view <run-id> --log

# View recent runs
gh run list --limit 10

# Check specific run
gh run view <run-id>

# GitLab CI
# Navigate to: Pipeline > Job > Trace
# Or use CLI:
glab ci trace <job-name>

# Download job artifacts (from the last pipeline on the given ref)
glab job artifact <ref> <job-name>

# Jenkins
# Navigate to: Job > Build #<number> > Console Output
# Or use API:
curl -u user:token http://jenkins.example.com/job/myjob/123/consoleText
```
Common error patterns:
```yaml
# GitHub Actions error patterns:

# Runner lost
# Error: The self-hosted runner: runner-1 lost communication with GitHub.
# Cause: Runner process crashed, network interruption
# Fix: Restart runner, check network connectivity

# Timeout
# Error: The operation was canceled.
# Cause: Job exceeded timeout-minutes
# Fix: Optimize slow steps, increase timeout

# Dependency failure
# npm ERR! ERESOLVE unable to resolve dependency tree
# Cause: Conflicting package versions
# Fix: Update package.json, use npm ci with lock file

# Artifact failure
# Error: Artifact not found - {name: 'build-output'}
# Cause: Artifact expired (default 90 days) or wrong name
# Fix: Use correct artifact name, download before expiration

# Secrets in fork PR
# Error: Secrets are not available in pull requests from fork repositories
# Cause: Security restriction
# Fix: Use workflow approval or avoid secrets in fork PR workflows
```
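
The patterns above lend themselves to a first-pass triage script. The sketch below maps a single log line to a coarse category; the function name `classify_ci_error` and the category labels are hypothetical, and the match strings are taken from the messages listed in this guide.

```bash
# Hypothetical triage helper: map one log line to a coarse failure category.
# Unrecognized lines fall through to "unknown".
classify_ci_error() {
  case "$1" in
    *"lost communication"*)
      echo "runner-lost" ;;
    *"The operation was canceled"*)
      echo "timeout-or-cancel" ;;
    *ERESOLVE*|*"Could not resolve dependency"*)
      echo "dependency" ;;
    *"Artifact not found"*|*"Artifact expired"*)
      echo "artifact" ;;
    *"Secrets are not available"*)
      echo "fork-secrets" ;;
    *"No space left on device"*)
      echo "disk-full" ;;
    *)
      echo "unknown" ;;
  esac
}
```

One way to use it is to feed it the failed-step log, for example `gh run view <run-id> --log-failed | while IFS= read -r line; do classify_ci_error "$line"; done | sort | uniq -c`, and look at which category dominates.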
### 2. Fix GitHub Actions issues
Runner configuration:
```yaml
# .github/workflows/ci.yml

# Using GitHub-hosted runners
jobs:
  build:
    runs-on: ubuntu-latest  # or windows-latest, macos-latest
    # runs-on: ubuntu-22.04  # Specific version

    # Runner group (for organization runners)
    # runs-on:
    #   group: my-runner-group
    #   labels:
    #     - self-hosted
    #     - linux

    # Add timeout (default: 360 minutes / 6 hours)
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      # Tolerate a failing step (this does not retry it)
      - uses: actions/checkout@v4
        continue-on-error: true  # Continue even if this step fails

      # Conditional execution
      - if: github.ref == 'refs/heads/main'
        run: ./deploy.sh
```
Self-hosted runner setup:
```bash
# Download and configure runner
# GitHub > Settings > Actions > Runners > New self-hosted runner

# On runner machine:
mkdir actions-runner && cd actions-runner
curl -O -L https://github.com/actions/runner/releases/download/v2.311.0/actions-runner-linux-x64-2.311.0.tar.gz
tar xzf ./actions-runner-linux-x64-2.311.0.tar.gz

# Configure
./config.sh --url https://github.com/org/repo --token TOKEN

# Start runner
./run.sh

# Run as service (systemd); run the following as root
cat > /etc/systemd/system/actions-runner.service << 'EOF'
[Unit]
Description=GitHub Actions Runner
After=network.target

[Service]
Type=simple
User=runner
WorkingDirectory=/home/runner/actions-runner
ExecStart=/home/runner/actions-runner/run.sh
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable actions-runner
systemctl start actions-runner

# Check runner status
systemctl status actions-runner

# Troubleshoot runner
# Check runner logs
tail -f /home/runner/actions-runner/_diag/Runner_*.log

# Verify connectivity
curl -I https://github.com
curl -I https://pipelines.actions.githubusercontent.com
```
Caching dependencies:
```yaml
# Speed up builds and reduce failures with caching
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Node.js caching
      - name: Cache node modules
        uses: actions/cache@v4
        with:
          path: ~/.npm
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-

      # Python caching
      - name: Cache pip packages
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      # Docker layer caching
      - name: Cache Docker layers
        uses: actions/cache@v4
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-
```
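
The cache keys above combine a fixed prefix with a digest of the lock file, so the cache is reused until the dependencies change. The sketch below illustrates that idea locally; `cache_key` is a hypothetical helper, and the digest it produces is not the same value GitHub's `hashFiles()` would compute.

```bash
# Illustrative only: build a cache key from a prefix and a lock file digest.
# The key stays stable while the file is unchanged and changes when it does.
cache_key() {
  prefix=$1
  lockfile=$2
  # Prefer sha256sum (GNU coreutils); fall back to shasum (macOS, Perl).
  if command -v sha256sum >/dev/null 2>&1; then
    digest=$(sha256sum "$lockfile" | cut -c1-16)
  else
    digest=$(shasum -a 256 "$lockfile" | cut -c1-16)
  fi
  echo "${prefix}-${digest}"
}
```

Running `cache_key Linux-node package-lock.json` before and after a dependency bump shows why an unchanged lock file produces a cache hit and an edited one falls back to the `restore-keys` prefix.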
Handle flaky tests:
```yaml
# Retry flaky tests
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Run tests with retry
      - name: Run tests with retry
        uses: nick-fields/retry@v2
        with:
          timeout_minutes: 10
          max_attempts: 3
          command: npm test

      # Or separate flaky tests
      - name: Run stable tests
        run: npm run test:stable

      - name: Run flaky tests (allowed to fail)
        run: npm run test:flaky
        continue-on-error: true
```
### 3. Fix GitLab CI issues
Runner configuration:
```yaml
# .gitlab-ci.yml

# Specify runner tags
build:
  tags:
    - docker
    - linux
  script:
    - docker build -t myapp .

# Using specific runner
staging-deploy:
  tags:
    - staging-runner
  script:
    - ./deploy.sh staging

# Timeout configuration (job names must be unique within one file)
build-with-timeout:
  timeout: 30m  # Job-level timeout
  script:
    - npm run build

# Retry configuration
test:
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
  script:
    - npm test
```
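
The `retry` keyword above reruns the whole job for the listed failure types. When only one network-bound command inside `script:` is unreliable, a small shell wrapper is an alternative. This is a sketch under that assumption; `retry_with_backoff` is a hypothetical name, not a GitLab feature.

```bash
# Hypothetical wrapper: run a command up to $1 times, sleeping $2 seconds
# before the first retry and doubling the delay after each failure.
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while true; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "command failed after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 3 5 npm ci` would try the install up to three times, waiting 5 and then 10 seconds between attempts. Genuine test failures should not be wrapped this way, since retries would hide them.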
Register GitLab runner:
```bash
# Install gitlab-runner
# Debian/Ubuntu
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | bash
apt-get install gitlab-runner

# RHEL/CentOS
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.rpm.sh | bash
yum install gitlab-runner

# Register runner
gitlab-runner register

# Input:
# GitLab URL: https://gitlab.com/
# Registration token: Get from GitLab CI/CD settings
#   (newer GitLab versions use a runner authentication token created in the UI)
# Runner description: my-runner
# Tags: docker,linux
# Executor: docker

# Run as service
systemctl enable gitlab-runner
systemctl start gitlab-runner

# Check runner status
gitlab-runner status
gitlab-runner list

# Troubleshoot
journalctl -u gitlab-runner -f
```
Artifact management:
```yaml
# Configure artifacts
build:
  script:
    - npm run build
  artifacts:
    paths:
      - dist/
      - build/
    exclude:
      - dist/**/*.map
    expire_in: 1 week  # Default: 30 days
    name: "build-$CI_COMMIT_SHA"
    reports:
      junit: test-results.xml
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml

# Download artifacts from other jobs
deploy:
  needs:
    - build
  script:
    - ls -la dist/  # Build artifacts available
    - ./deploy.sh

# Download artifacts from specific pipeline
# (assumes the glab CLI is installed on the runner image)
deploy-production:
  script:
    - glab job artifact main build
    - ./deploy.sh
```
### 4. Fix Jenkins pipeline issues
Declarative pipeline:
```groovy
// Jenkinsfile
pipeline {
    agent {
        // Use specific label
        label 'docker-agent'

        // Or use Kubernetes pod
        // kubernetes {
        //     yaml '''
        //     spec:
        //       containers:
        //       - name: node
        //         image: node:18
        //         command:
        //         - cat
        //         tty: true
        //     '''
        // }
    }

    options {
        timeout(time: 30, unit: 'MINUTES')
        disableConcurrentBuilds()  // One at a time
        timestamps()               // Add timestamps to logs
        retry(3)                   // Retry entire pipeline
    }

    environment {
        NODE_VERSION = '18'
        NPM_CONFIG_CACHE = "${WORKSPACE}/.npm-cache"
    }

    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }

        stage('Build') {
            steps {
                retry(2) {  // Retry the build steps
                    sh 'npm ci'
                    sh 'npm run build'
                }
            }
        }

        stage('Test') {
            steps {
                // Continue even if tests fail
                catchError(buildResult: 'UNSTABLE', stageResult: 'UNSTABLE') {
                    sh 'npm test'
                }
            }
            post {
                always {
                    // Archive test results
                    junit 'test-results/*.xml'
                }
            }
        }

        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                sh './deploy.sh'
            }
        }
    }

    post {
        always {
            // Clean workspace
            cleanWs()
        }
        failure {
            // Notify on failure
            emailext subject: 'Build Failed',
                     body: "Check ${BUILD_URL}",
                     to: 'team@example.com'
        }
    }
}
```
Jenkins agent configuration:
```bash
# Check agent status
# Jenkins UI > Manage Jenkins > Nodes

# Agent connection issues:
# 1. Check agent JNLP connection
# 2. Verify agent.jar running
# 3. Check firewall allows JNLP port (default 50000)

# Restart agent
# On agent machine:
systemctl restart jenkins-agent

# Or relaunch from Jenkins UI
# Nodes > [agent] > Relaunch

# Agent disk space issues
# Check workspace
du -sh /var/jenkins/workspace/* | sort -hr | head -20

# Disk space threshold
# Manage Jenkins > Configure System > Node Properties
# Set: Delete workspace before build starts
```

Clean old builds (Jenkins UI > Script Console):

```groovy
// Deletes build records older than 30 days.
// This removes builds, not agent workspaces; test on a non-production controller first.
def cutoff = new Date() - 30
Jenkins.instance.getAllItems(hudson.model.Job).each { job ->
    job.builds.findAll { it.getTime().before(cutoff) }.each { build ->
        build.delete()
    }
}
```
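
Disk exhaustion is easier to handle when a job checks free space before it starts writing. The sketch below shows one way to do that with `df`; the function names and the threshold are hypothetical, and the parsing assumes the filesystem name contains no spaces.

```bash
# Hypothetical pre-build guard: print the usage percentage for a path.
disk_usage_pct() {
  # df -P prints POSIX-format output; field 5 of line 2 is the "Capacity" column.
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeed only while usage on the path is below the threshold percentage.
check_disk() {
  path=$1
  threshold=$2
  usage=$(disk_usage_pct "$path")
  if [ "$usage" -ge "$threshold" ]; then
    echo "disk usage on $path is ${usage}% (limit ${threshold}%)" >&2
    return 1
  fi
  return 0
}
```

A build script could start with `check_disk /var/jenkins 90 || exit 1`, so a full agent fails with an explicit message rather than a `No space left on device` error midway through the build.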
Jenkins credentials:
```groovy
// Access credentials in pipeline
pipeline {
    agent any
    stages {
        stage('Deploy') {
            steps {
                // Username/password
                withCredentials([usernamePassword(
                    credentialsId: 'docker-credentials',
                    usernameVariable: 'DOCKER_USER',
                    passwordVariable: 'DOCKER_PASS'
                )]) {
                    // Pass the password on stdin so it does not appear in the process list
                    sh 'echo "$DOCKER_PASS" | docker login -u "$DOCKER_USER" --password-stdin'
                }

                // SSH key
                withCredentials([sshUserPrivateKey(
                    credentialsId: 'deploy-key',
                    keyFileVariable: 'SSH_KEY'
                )]) {
                    sh 'ssh -i "$SSH_KEY" deploy@server "./deploy.sh"'
                }

                // Secret text (API token)
                withCredentials([string(
                    credentialsId: 'api-token',
                    variable: 'API_TOKEN'
                )]) {
                    sh './deploy.sh --token "$API_TOKEN"'
                }
            }
        }
    }
}

// Check credentials exist:
// Jenkins UI > Manage Jenkins > Credentials
```
### 5. Fix dependency installation failures
npm/yarn failures:
```yaml
# GitHub Actions - Node.js
- name: Setup Node.js
  uses: actions/setup-node@v4
  with:
    node-version: '18'
    cache: 'npm'

- name: Install dependencies
  run: |
    # Use ci for clean install from lock file
    npm ci

    # Or use legacy-peer-deps for conflicting packages
    # npm ci --legacy-peer-deps

    # Clean cache if corrupted
    # npm cache clean --force
    # npm ci

# Handle ERESOLVE errors
# Note: the final npm install fallback can rewrite the lock file
- name: Install (with fallback)
  run: |
    npm ci || npm ci --legacy-peer-deps || npm install
```
Python/pip failures:
```yaml
# GitHub Actions - Python
- name: Setup Python
  uses: actions/setup-python@v4
  with:
    python-version: '3.11'
    cache: 'pip'

- name: Install dependencies
  run: |
    # Upgrade pip first
    python -m pip install --upgrade pip

    # Install from requirements
    pip install -r requirements.txt

    # Or with cache
    # pip install --cache-dir=.pip-cache -r requirements.txt

# Handle SSL certificate errors
# (--trusted-host skips certificate verification; use as a temporary workaround only)
- name: Install (SSL workaround)
  run: |
    pip install --trusted-host pypi.org \
      --trusted-host files.pythonhosted.org \
      -r requirements.txt
```
Docker build failures:
```yaml
# Optimize Docker builds
- name: Build Docker image
  run: |
    # Use buildx for better caching
    docker buildx create --use

    # Build with cache (registry cache import/export requires buildx)
    docker buildx build \
      --cache-from type=registry,ref=myapp:cache \
      --cache-to type=registry,ref=myapp:cache,mode=max \
      -t myapp:latest \
      --load \
      .

# Handle Docker rate limits
# Use mirror or authenticate (place this step before the build)
- name: Login to Docker Hub
  uses: docker/login-action@v3
  with:
    username: ${{ secrets.DOCKER_USER }}
    password: ${{ secrets.DOCKER_PASS }}
```

Multi-stage builds to reduce size (Dockerfile):

```dockerfile
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "dist/index.js"]
```
### 6. Fix timeout issues
Optimize slow pipelines:
```yaml
# GitHub Actions - parallelize
jobs:
  test:
    strategy:
      matrix:
        node: [16, 18, 20]
        os: [ubuntu-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npm test

  # Run independent jobs in parallel
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit

  # Set an explicit job timeout
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 60  # Default: 360
    steps:
      - run: ./slow-build.sh

  # Skip unnecessary jobs
  deploy:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    steps:
      - run: ./deploy.sh
```
Debug slow steps:
```yaml
# Add timing to steps
- name: Build with timing
  run: |
    time npm run build

# Split work into separate steps so each gets its own duration
- uses: actions/checkout@v4
- name: Setup Node
  uses: actions/setup-node@v4
- name: Install
  run: npm ci
- name: Build
  run: npm run build

# Check which step is slowest
# GitHub Actions: Look at step duration in UI
# Or inspect per-step timestamps from the CLI:
#   gh run view <run-id> --json jobs
```
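
When several commands share one `run:` block, the UI only reports the duration of the whole step. A small wrapper can record the time per command instead. This is a minimal sketch; `timed_step` and its log format are hypothetical, and the resolution is whole seconds because it relies on `date +%s`.

```bash
# Hypothetical wrapper: run a command, log its wall-clock duration to stderr,
# and return the command's own exit status.
timed_step() {
  name=$1
  shift
  start=$(date +%s)
  status=0
  "$@" || status=$?
  end=$(date +%s)
  echo "step=$name duration_s=$((end - start)) exit=$status" >&2
  return "$status"
}
```

Inside a script this reads as `timed_step install npm ci` followed by `timed_step build npm run build`, and the resulting `step=... duration_s=...` lines can be pulled out of the job log with `grep`.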
### 7. Monitor CI/CD health
GitHub Actions monitoring:
```bash
# Check workflow runs via API
# ({owner} and {repo} are filled in by gh; replace <workflow-id> yourself)
gh api "/repos/{owner}/{repo}/actions/workflows/<workflow-id>/runs"

# Get failed runs
gh run list --status failure --limit 10

# Workflow run duration (difference between the two timestamps)
gh run view <run-id> --json startedAt,updatedAt

# Set up alerts for failures
# GitHub > Settings > Notifications
# Or use external monitoring
```
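
The run list above can be reduced to a single failure percentage for a quick health check. The sketch below assumes one run conclusion per line on standard input; `failure_rate` is a hypothetical helper, and runs without a conclusion yet (blank lines) are ignored.

```bash
# Hypothetical helper: read one run conclusion per line (for example
# "success" or "failure") and print the failure percentage as an integer.
failure_rate() {
  awk '
    $0 == "failure" { failed++ }
    $0 != ""        { total++ }
    END {
      if (total == 0) { print 0; exit }
      printf "%d\n", (failed * 100) / total
    }
  '
}
```

It can be fed from the GitHub CLI with `gh run list --limit 50 --json conclusion --jq '.[].conclusion' | failure_rate`, and a scheduled job could compare the result against a threshold that suits the team.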
Prometheus metrics for Jenkins:
```yaml
# Install Prometheus plugin in Jenkins
# Manage Jenkins > Manage Plugins > Prometheus

# Metrics endpoint
# http://jenkins.example.com/prometheus

# Key metrics:
# default_jenkins_builds_duration_milliseconds_summary
# default_jenkins_builds_last_build_result_ordinal
# default_jenkins_builds_queued_duration_milliseconds
# default_jenkins_queue_size_value
# default_jenkins_nodes_executors_available

# Prometheus alerting rules
groups:
  - name: cicd_health
    rules:
      - alert: JenkinsBuildFailureRate
        # Share of jobs whose last build failed (result ordinal 2 = FAILURE)
        expr: |
          count(default_jenkins_builds_last_build_result_ordinal == 2)
          /
          count(default_jenkins_builds_last_build_result_ordinal)
          > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 20% of Jenkins jobs have a failed last build"

      - alert: JenkinsQueueSize
        expr: default_jenkins_queue_size_value > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Jenkins queue size growing"

      # Metric names below depend on the GitHub Actions exporter in use
      - alert: GitHubActionsFailureRate
        expr: |
          sum(rate(github_actions_workflow_runs_failed[5m]))
          /
          sum(rate(github_actions_workflow_runs_completed[5m]))
          > 0.3
        for: 10m
        labels:
          severity: warning
```
## Prevention
- Pin action/dependency versions to avoid breaking changes
- Use caching for dependencies and build artifacts
- Set appropriate timeouts based on historical run times
- Implement retry logic for flaky tests and network operations
- Clean up old builds, artifacts, and workspaces regularly
- Monitor runner disk space and resource utilization
- Use matrix builds for parallel testing
- Document common failure patterns and solutions
- Set up alerts for pipeline failure rate increases
- Test pipeline changes in staging before production
## Related Errors
- **403 Forbidden**: Authentication/authorization failure
- **404 Not Found**: Resource (repo, artifact, workflow) doesn't exist
- **500 Internal Server Error**: CI/CD platform server error
- **502 Bad Gateway**: CI/CD platform proxy issue
- **503 Service Unavailable**: CI/CD platform temporarily unavailable