# Docker Swarm Service Failed: Complete Troubleshooting Guide

Your Docker Swarm service isn't running properly. Tasks show as "Failed", the service won't start, or replicas aren't distributing correctly. Swarm adds orchestration complexity on top of Docker, making troubleshooting more challenging but also providing more diagnostic tools.

Common error messages and symptoms:

```bash
docker service ps myservice
ID     NAME           NODE     DESIRED STATE  CURRENT STATE          ERROR
abc    myservice.1    node1    Running        Failed 2 minutes ago   "task: non-zero exit (1)"
```

Or:

```bash
no suitable node (2 nodes not available due to insufficient resources)
```

Or:

```bash
service not found
network not found
```

## Understanding Swarm Architecture

Docker Swarm has key components:

  • Manager nodes: Orchestrate services, handle API
  • Worker nodes: Run tasks (containers)
  • Services: Definition of what to run (image, replicas, etc.)
  • Tasks: Individual container instances of a service
  • Routing mesh: Handles service discovery and load balancing

## Quick Diagnosis

### Check Service Status

```bash
docker service ls
docker service ps <service_name>
```

### Check Swarm Nodes

```bash
docker node ls
docker node inspect <node_id>
```

### Check Service Logs

```bash
docker service logs <service_name>
docker service logs --tail 100 <service_name>
```

### Check Task Details

```bash
docker service ps --no-trunc <service_name>
```

The `--no-trunc` flag shows full error messages.

## Common Causes and Fixes

### Cause 1: Task Exits Immediately

The task's container starts, then exits with a non-zero code.

Symptoms:

```bash
CURRENT STATE: Failed 5 seconds ago
ERROR: "task: non-zero exit (1)"
```

Diagnosis:

```bash
# Check task details
docker service ps --no-trunc <service_name>

# Check logs
docker service logs <service_name>

# Find which node it ran on
docker service ps <service_name> --format "{{.Node}}"
```

#### Fix 1: Check application logs

```bash
# Logs show why the app crashed
docker service logs <service_name> --tail 50
```

Common issues:

- Missing environment variables
- Configuration errors
- Dependencies unavailable
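A missing variable often surfaces only as a bare exit code. One way to make the cause obvious in `docker service logs` is to validate required configuration in the entrypoint before starting the service; a minimal sketch (`DATABASE_URL` and the service command are hypothetical examples):

```shell
#!/bin/sh
# Entry-point sketch: fail fast with a clear message when required
# configuration is missing, instead of a bare "non-zero exit (1)".
require_env() {
  eval "_val=\${$1:-}"
  if [ -z "$_val" ]; then
    echo "fatal: required env var $1 is not set" >&2
    return 1
  fi
}

# In a real entrypoint, end with something like:
#   require_env DATABASE_URL || exit 1
#   exec my-service --listen :8080
```

With this in place, the task's error in `docker service ps --no-trunc` and the log line in `docker service logs` name the missing variable directly.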

#### Fix 2: Check container configuration

```bash
docker service inspect <service_name> --format '{{json .Spec}}' | jq
```

Verify:

- Image name is correct
- Environment variables are set
- Ports are configured
- Networks are attached

#### Fix 3: Update service with correct configuration

```bash
docker service update <service_name> \
  --env-add REQUIRED_VAR=value \
  --args "correct-command"
```

### Cause 2: No Suitable Node Available

Swarm can't find a node that meets the service's constraints.

Symptoms:

```bash
no suitable node (insufficient resources)
no suitable node (missing plugin)
no suitable node (node not available)
```

Diagnosis:

```bash
# Check node status
docker node ls

# Check node resources
docker node inspect <node> --format '{{json .Description.Resources}}'

# Check node availability
docker node inspect <node> --format '{{.Spec.Availability}}'
```

#### Fix 1: Check node availability

Nodes might be paused or drained:

```bash
# Check availability
docker node ls --format "table {{.ID}}\t{{.Status}}\t{{.Availability}}"

# Activate a drained node
docker node update --availability active <node>

# Check for offline nodes
docker node ls | grep Down
```

#### Fix 2: Check resource constraints

Note that scheduling decisions are based on resource *reservations*, so a reservation larger than any node's capacity leaves tasks unschedulable:

```bash
# Service might require resources nodes don't have
docker service inspect <service> --format '{{json .Spec.TaskTemplate.Resources}}'

# Clear excessive reservations (0 removes them)
docker service update <service> --reserve-memory 0 --reserve-cpu 0
```

#### Fix 3: Check placement constraints

```bash
# Service might have constraints no node matches
docker service inspect <service> --format '{{json .Spec.TaskTemplate.Placement}}'

# Remove a constraint
docker service update <service> --constraint-rm "node.role==worker"
```

#### Fix 4: Add more nodes

```bash
# On a manager, print the worker join command
docker swarm join-token worker
# Run the printed command on the new node
```

### Cause 3: Network Issues

The service's network doesn't exist or isn't attached correctly.

Symptoms:

```bash
network <network_name> not found
failed to allocate network
```

Diagnosis:

```bash
# Check the network exists (overlay networks have swarm scope)
docker network ls | grep swarm

# Check service networks
docker service inspect <service> --format '{{json .Spec.TaskTemplate.Networks}}'
```

#### Fix 1: Create the missing overlay network

```bash
docker network create --driver overlay --attachable mynetwork
docker service update <service> --network-add mynetwork
```

#### Fix 2: Remove and recreate the network

```bash
docker service update <service> --network-rm broken-network
docker network rm broken-network
docker network create --driver overlay broken-network
docker service update <service> --network-add broken-network
```

#### Fix 3: Check network connectivity

```bash
# From a container attached to the same overlay network
docker exec <container-on-network> ping <service-name>
```

Services should be reachable by service name on overlay networks.

### Cause 4: Image Not Available on Nodes

Nodes can't pull the specified image.

Symptoms:

```bash
failed to pull image: access denied
failed to pull image: not found
```

Diagnosis:

```bash
# Check the image specification
docker service inspect <service> --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}'

# Try pulling manually on a node
docker pull <image>
```

#### Fix 1: Ensure the image exists

```bash
# Check the image exists in the registry
docker pull <image>
```

#### Fix 2: Authenticate for private images

`docker config` objects are mounted into running containers and don't affect image pulls, so registry credentials have to reach the nodes through Swarm itself. Log in on a manager, then pass `--with-registry-auth` when creating or updating the service:

```bash
# Log in to the registry on a manager node
docker login myregistry.com

# Forward the credentials to agent nodes at creation
docker service create --with-registry-auth --name myservice myregistry.com/myimage

# Or for an existing service
docker service update --with-registry-auth <service>
```

#### Fix 3: Use locally present images

```bash
# Ensure the image is on all nodes; docker node ls returns IDs,
# so reach each node directly (hostnames here are examples,
# assuming SSH access to each node)
for node in node1 node2 node3; do
  ssh "$node" docker pull <image>
done
```

### Cause 5: Service Rollout Failure

A rolling update fails on the new tasks.

Symptoms:

- Service shows the old version still running
- New tasks continuously fail
- Rollback not happening

Diagnosis:

```bash
# Check task states
docker service ps <service>

# Show only the tasks that should be running
docker service ps <service> --filter "desired-state=running"
```

#### Fix 1: Force rollback

```bash
docker service rollback <service>
```

#### Fix 2: Adjust rollout settings

```bash
docker service update <service> \
  --update-delay 30s \
  --update-failure-action rollback \
  --update-parallelism 1
```

#### Fix 3: Resume a paused update

With the default `--update-failure-action pause`, a failed update stops and waits for intervention:

```bash
# Check whether the update is paused
docker service inspect <service> --format '{{.UpdateStatus.State}}'
# Fix the underlying issue, then resume by updating to a working image
docker service update <service> --image <correct-image>
```

### Cause 6: Health Check Failures

Tasks fail health checks repeatedly.

Symptoms:

```bash
task failed health check
```

Diagnosis:

```bash
# Check the health check configuration
docker service inspect <service> --format '{{json .Spec.TaskTemplate.ContainerSpec.Healthcheck}}'

# Check container health
docker inspect <container> --format '{{json .State.Health}}'
```

#### Fix 1: Improve the health check

```bash
docker service update <service> \
  --health-cmd "curl -f http://localhost:8080/health" \
  --health-interval 30s \
  --health-retries 3 \
  --health-start-period 60s
```

#### Fix 2: Disable the health check temporarily

```bash
docker service update <service> --no-healthcheck
```

#### Fix 3: Fix the health endpoint

Ensure the application's health endpoint actually returns an HTTP 200 within the check's timeout.
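A quick way to see what the health check sees is to fetch the endpoint's HTTP status from inside the task and translate it into the exit code Swarm expects. A minimal sketch (the port and `/health` path are assumptions):

```shell
#!/bin/sh
# Sketch: reproduce the health check by hand. The health command must
# exit 0 for healthy and non-zero for unhealthy; only HTTP 200 counts.
is_healthy() {
  # $1 is an HTTP status code, e.g. from:
  #   curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health
  [ "$1" = "200" ]
}

if is_healthy "200"; then echo "healthy"; fi
if ! is_healthy "503"; then echo "unhealthy: 503"; fi
```

Run the `curl` shown in the comment inside a task container (`docker exec`) to confirm the endpoint really returns 200 rather than a redirect or 503.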

### Cause 7: Node Connectivity Loss

The manager can't reach worker nodes.

Symptoms:

- Nodes show as "Down" or "Unknown"
- Tasks stuck in "Pending"

Diagnosis:

```bash
docker node ls
docker node inspect <node> --format '{{.Status.State}}'
```

#### Fix 1: Check network connectivity

```bash
# Ping the worker from the manager
ping <node-ip>

# Check the required swarm ports
nc -zv <node-ip> 2377   # cluster management (TCP)
nc -zv <node-ip> 7946   # node communication (TCP and UDP)
nc -zvu <node-ip> 4789  # overlay network traffic (UDP)
```

#### Fix 2: Rejoin the node to the swarm

On the affected node:

```bash
docker swarm leave
docker swarm join --token <join-token> <manager-ip>:2377
```

#### Fix 3: Remove and re-add the node

On a manager:

```bash
docker node rm <node-id>
# The node can then rejoin with a fresh join token
```

### Cause 8: Manager Quorum Lost

Too many managers are down and the cluster has lost quorum.

Symptoms:

- Swarm management commands hang or fail
- Errors such as "The swarm does not have a leader"
- No consensus on cluster state

Diagnosis:

```bash
docker node ls
# Count managers (-q skips the header line)
docker node ls --filter "role=manager" -q | wc -l
```

#### Fix 1: Wait for recovery

Raft needs a majority of managers online:

```bash
# With 3 managers, 2 must be alive
# With 5 managers, 3 must be alive
```
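The required majority for N managers is floor(N/2) + 1, which can be sketched as:

```shell
#!/bin/sh
# Raft quorum: a cluster of N managers stays writable only while
# floor(N/2) + 1 of them are reachable.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

echo "3 managers need $(quorum 3) alive"   # 2
echo "5 managers need $(quorum 5) alive"   # 3
echo "7 managers need $(quorum 7) alive"   # 4
```

This is also why even manager counts add no fault tolerance: 4 managers need 3 alive, tolerating the same single failure as 3 managers.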

#### Fix 2: Force a new cluster

On a surviving manager:

```bash
docker swarm init --force-new-cluster
```

Warning: this rebuilds the cluster around this single manager. Services are preserved, but the other managers must be removed and re-added.

#### Fix 3: Promote a worker to manager

Promotion itself requires a working quorum, so do this preventively or after recovery:

```bash
docker node promote <worker-node>
```

## Service Update Troubleshooting

### Stuck Updates

If a service update hangs:

```bash
# Check update status
docker service inspect <service> --format '{{json .UpdateStatus}}'

# Roll back a stuck update
docker service update --rollback <service>
```

Failed Update Actions

```bash # Set failure action docker service create \ --update-failure-action rollback \ --name myservice \ <image>

# Or on update docker service update <service> \ --update-failure-action rollback ```

### Monitor Rollout

```bash
# Watch service tasks
watch -n 5 'docker service ps <service>'
```

## Verification Steps

1. Verify the service is running:

   ```bash
   docker service ls
   # REPLICAS should match expected
   ```

2. Check tasks are healthy:

   ```bash
   docker service ps <service>
   # All tasks should show Running
   ```

3. Test the service endpoint:

   ```bash
   curl http://<node-ip>:<published-port>
   ```

4. Verify task distribution:

   ```bash
   docker service ps <service> --format "table {{.Name}}\t{{.Node}}"
   # Tasks should be spread across nodes
   ```

5. Check service logs:

   ```bash
   docker service logs --tail 20 <service>
   # Should show normal operation
   ```

6. Verify nodes are healthy:

   ```bash
   docker node ls
   # All nodes should be Ready and Active
   ```

Docker Swarm service failures combine container-level issues with orchestration complexity. Start with task logs and node status, then work through image availability, network connectivity, and resource constraints systematically. Use `--no-trunc` to see full error messages and leverage Swarm's rollback capabilities for failed updates.