What's Actually Happening
Nomad fails to allocate tasks to client nodes, or allocated tasks fail to start. Jobs remain in the pending state, or allocations are marked as failed.
The Error You'll See
Allocation failed:

```bash
$ nomad job status myjob
Status      = running
Allocations = 2 failed, 1 running

ID      Node ID  Task Group  Version  Status  Failed
abc123  node-1   web         1        failed  true
```
Task failed:

```bash
$ nomad alloc status abc123

Task States:
Name  State   Started   Finished  Message
web   failed  12:00:00  12:00:01  Failed to start task

Last Error:
failed to start: docker driver: container exited immediately
```
No eligible clients:

```bash
$ nomad job eval myjob
Warning: job has no eligible clients for allocation
```
Why This Happens
1. Constraint mismatch - no node matches the job's constraints
2. Insufficient resources - not enough CPU/memory on clients
3. Driver issues - task driver not available or misconfigured
4. Network problems - port conflicts or connectivity issues
5. Artifact fetch failure - cannot download task artifacts
6. Task configuration - invalid task specification
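When a job is stuck pending, `nomad job status` prints a "Placement Failure" section whose wording maps onto the causes above. As a rough sketch (the sample lines below are illustrative, not copied from a live cluster), those lines can be triaged automatically:

```shell
#!/usr/bin/env bash
# Illustrative triage: map "Placement Failure" lines from `nomad job status`
# onto the common causes listed above. The sample text is made up.
classify() {
  case "$1" in
    *Constraint*) echo "constraint mismatch" ;;
    *exhausted*)  echo "insufficient resources" ;;
    *driver*)     echo "driver unavailable" ;;
    *)            echo "check allocation events" ;;
  esac
}

sample='Constraint "${attr.kernel.name} = linux" filtered 2 nodes
Resources exhausted on 1 nodes
Dimension "memory" exhausted on 1 nodes'

while IFS= read -r line; do
  printf '%s -> %s\n' "$line" "$(classify "$line")"
done <<< "$sample"
```

The patterns are deliberately loose; the exact phrasing varies between Nomad versions, so treat this as a starting point rather than a parser.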
Step 1: Check Job Status
```bash
# Get job status
nomad job status myjob

# Get allocation details
nomad alloc status <alloc-id>

# Check task logs
nomad logs <alloc-id> web

# Check stderr
nomad logs -stderr <alloc-id> web

# Check allocation events
nomad alloc status -verbose <alloc-id>

# Check the evaluation
nomad eval status <eval-id>

# Check the job spec
nomad job inspect myjob
```
Step 2: Check Client Availability

```bash
# List all clients
nomad node status

# Check client status (includes eligibility and drain status)
nomad node status <node-id>

# Mark a client as eligible for scheduling
nomad node eligibility -enable <node-id>

# Check node resources
nomad node status -verbose <node-id> | grep -A 10 "Node Resources"

# Check allocated resources
nomad node status -verbose <node-id> | grep -A 10 "Allocated Resources"

# Check client drivers
nomad node status <node-id> | grep -A 5 "Drivers"
```
Step 3: Check Constraints

```hcl
# Job constraints:
job "myjob" {
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  constraint {
    attribute = "${attr.cpu.arch}"
    value     = "amd64"
  }

  # Node class constraint:
  constraint {
    attribute = "${meta.class}"
    value     = "worker"
  }
}

# Check client meta attributes:
#   nomad node status <node-id> | grep -A 10 "Meta"

# Common constraint issues:
# 1. Missing attribute on client:
#    the client doesn't have meta.class = "worker"
# 2. Wrong attribute value:
#    the constraint expects amd64 but the node is arm64

# Fix by adding meta to the client config (client.hcl):
#   client {
#     meta {
#       class = "worker"
#     }
#   }
```
Step 4: Check Resource Requirements

```hcl
# Task resources:
task "web" {
  driver = "docker"

  resources {
    cpu    = 500  # MHz
    memory = 512  # MB

    network {
      mbits = 10
      port "http" {}
    }
  }
}

# Check whether clients have enough resources:
#   nomad node status -verbose <node-id>
#
# Output shows:
#   CPU:    2000 MHz total, 1500 MHz allocated
#   Memory: 4096 MB total, 3000 MB allocated

# If resources are exhausted:
# 1. Reduce task requirements:
#      resources {
#        cpu    = 200  # lower CPU
#        memory = 256  # lower memory
#      }
# 2. Add more clients (scale out the Nomad cluster)
```
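The total/allocated summary lines can be turned into free-capacity numbers directly. A small sketch (the `sample` values mirror the output shown above; the awk field positions assume exactly that `<total> <unit> total, <alloc> <unit> allocated` layout):

```shell
#!/usr/bin/env bash
# Sketch: derive unallocated capacity from the summary lines printed by
# `nomad node status -verbose <node-id>`. Field positions assume the
# "NAME: <total> <unit> total, <alloc> <unit> allocated" format.
free_of() {
  echo "$1" | awk -F'[ ,]+' '{ print $2 - $5 }'
}

sample='CPU: 2000 MHz total, 1500 MHz allocated
Memory: 4096 MB total, 3000 MB allocated'

while IFS= read -r line; do
  printf '%s free: %s\n' "${line%%:*}" "$(free_of "$line")"
done <<< "$sample"
# -> CPU free: 500
# -> Memory free: 1096
```

If the free figure is below what the task's `resources` block requests, the allocation cannot be placed on that node.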
Step 5: Check Task Driver

```bash
# Check available drivers on a client
nomad node status <node-id> | grep -A 20 "Drivers"

# Output:
# Driver  Detected  Healthy  Message
# docker  true      true     Driver running
# exec    true      true     Driver running
# qemu    false     false    Driver not found

# Driver configuration in client.hcl:
#   client {
#     options = {
#       "driver.allowlist" = "docker,exec"
#     }
#   }

# Restart the client after a driver config change
systemctl restart nomad

# Check driver logs
journalctl -u nomad | grep -i driver

# Test the driver manually with a minimal job
nomad job run test-docker.nomad
```
Step 6: Check Artifacts

```hcl
# Artifact configuration:
task "web" {
  driver = "docker"

  artifact {
    source      = "https://releases.example.com/app-v1.tar.gz"
    destination = "local/app.tar.gz"
    mode        = "file"
  }
}

# Test the artifact URL:
#   curl -I https://releases.example.com/app-v1.tar.gz

# Check artifact download logs:
#   nomad logs <alloc-id> web | grep -i artifact

# Common artifact issues:
# 1. URL not accessible
# 2. Authentication required
# 3. Certificate issues

# Add artifact headers for auth:
artifact {
  source = "https://private.example.com/app.tar.gz"
  headers = {
    Authorization = "Bearer token123"
  }
}

# Skip TLS verification (not recommended for prod):
artifact {
  source = "https://internal.example.com/app.tar.gz"
  mode   = "file"
  options = {
    "skip_verify" = "true"
  }
}
```
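The artifact block can also pin a checksum (`options { checksum = "sha256:<hex>" }`), which turns silent corruption into an explicit fetch failure. A local sketch of the same check, useful for confirming what you published matches what Nomad will verify (the temp file here stands in for a downloaded artifact):

```shell
#!/usr/bin/env bash
# Sketch: verify a file against a sha256, mirroring the artifact block's
# checksum option. The temp file stands in for a fetched artifact.
verify_sha256() {
  local got
  got="$(sha256sum "$1" | awk '{print $1}')"
  if [ "$got" = "$2" ]; then echo "ok"; else echo "MISMATCH"; fi
}

tmp="$(mktemp)"
printf 'app payload\n' > "$tmp"
expected="$(sha256sum "$tmp" | awk '{print $1}')"
verify_sha256 "$tmp" "$expected"   # -> ok
rm -f "$tmp"
```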
Step 7: Check Network Configuration

```hcl
# Network resources:
resources {
  network {
    mbits = 10
    port "http" {
      static = 8080  # static port
    }
    port "admin" {}  # dynamic port
  }
}

# Port conflicts occur when:
# 1. Multiple tasks use the same static port
# 2. The port is already in use on the client

# Use dynamic ports instead:
#   port "http" {}  # Nomad assigns the port

# Check allocated ports:
#   nomad alloc status <alloc-id> | grep -A 5 "Network"

# Docker task using a labeled port:
config {
  image = "nginx"
  ports = ["http"]
}

# Port mapping (older task-level network syntax):
config {
  image = "nginx"
  port_map {
    http = 80
  }
}
```
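Before pinning a static port, it is worth checking whether it is already bound on the client. A bash-only sketch using `/dev/tcp` (assumes bash; port 8080 is just the example static port from above):

```shell
#!/usr/bin/env bash
# Sketch: test whether a port is already bound on this host using bash's
# built-in /dev/tcp redirection (no external tools needed).
port_in_use() {
  # Succeeds only if something accepts a TCP connection on the port.
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

if port_in_use 8080; then
  echo "port 8080: in use - pick another static port or use a dynamic one"
else
  echo "port 8080: free"
fi
```

`ss -ltn | grep :8080` gives the same answer where `ss` is available.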
Step 8: Check Task Configuration

```hcl
# Common Docker task issues:
task "web" {
  driver = "docker"

  config {
    image = "nginx:latest"
    # Check the image exists: docker pull nginx:latest

    # Check the command:
    command = "/bin/sh"
    args    = ["-c", "nginx -g 'daemon off;'"]

    # Check volumes:
    volumes = [
      "local/config:/etc/nginx/conf.d",
    ]

    # Check the user:
    user = "nginx"
  }

  # Check environment (env is a task-level block, not driver config):
  env {
    NODE_ENV = "production"
  }
}

# Validate the job spec:
#   nomad job validate myjob.nomad

# Dry-run the placement:
#   nomad job plan myjob.nomad

# Inspect the parsed task config:
#   nomad job inspect myjob | jq '.Job.TaskGroups[].Tasks[].Config'
```
Step 9: Check Client Logs

```bash
# Follow Nomad client logs
journalctl -u nomad -f

# Look for specific errors
journalctl -u nomad | grep -Ei "error|failed|allocation"

# Check task driver logs
journalctl -u nomad | grep -Ei "driver|docker|exec"

# Check allocation events on the client
nomad alloc status -verbose <alloc-id>

# Check client health
nomad node status <node-id>

# Restart the Nomad client
systemctl restart nomad

# Confirm the client reconnected
nomad node status <node-id>
```
Step 10: Monitor Allocations

```bash
# Create a monitoring script:
cat << 'EOF' > /usr/local/bin/monitor-nomad.sh
#!/bin/bash
echo "=== Job Status ==="
nomad job status

echo ""
echo "=== Failed Allocations ==="
nomad job status -verbose | grep -i failed

echo ""
echo "=== Node Status ==="
nomad node status

echo ""
echo "=== Cluster Metrics ==="
nomad operator metrics | grep -E "nomad_(allocations|jobs|nodes)"

echo ""
echo "=== Pending Evaluations ==="
nomad eval list -json | jq '.[] | select(.Status == "pending")'

echo ""
echo "=== Resource Usage ==="
nomad node status -self -verbose | grep -A 5 "Allocated"
EOF

chmod +x /usr/local/bin/monitor-nomad.sh

# Metrics endpoint (JSON by default; append ?format=prometheus for Prometheus):
curl http://localhost:4646/v1/metrics | jq

# Key metrics:
#   nomad_client_allocations
#   nomad_client_unallocated_cpu
#   nomad_client_unallocated_memory

# Example Prometheus alert rule:
#   - alert: NomadAllocationFailed
#     expr: rate(nomad_client_allocations_failed_total[5m]) > 0
#     for: 2m
#     labels:
#       severity: warning
#     annotations:
#       summary: "Nomad allocation failures detected"
```
Nomad Allocation Failed Checklist
| Check | Command | Expected |
|---|---|---|
| Job status | nomad job status | Running |
| Allocation | nomad alloc status | Running |
| Constraints | job inspect | Match clients |
| Resources | node status | Available |
| Drivers | node status | Healthy |
| Network | alloc status | No conflicts |
Verify the Fix

```bash
# After fixing the allocation issue

# 1. Re-run the job
nomad job run myjob.nomad                # job registered

# 2. Check the allocation
nomad job status myjob                   # Status: running

# 3. Verify the task is running
nomad alloc status <alloc-id>            # Task State: running

# 4. Check logs
nomad logs <alloc-id> web                # application logs

# 5. Verify health
nomad alloc status -verbose <alloc-id>   # Task Healthy: true

# 6. Monitor stability
nomad job status myjob                   # no new failures
```
Related Issues
- [Fix Nomad Job Pending Forever](/articles/fix-nomad-job-pending-forever)
- [Fix Nomad Client Drained No Allocations](/articles/fix-nomad-client-drained-no-allocations)
- [Fix Vault Secret Rotation Failed](/articles/fix-vault-secret-rotation-failed)