# Docker Node Unavailable: How to Diagnose and Recover Swarm Nodes

Your Swarm node shows as unavailable or unreachable:

```bash
docker node ls
ID                  HOSTNAME    STATUS      AVAILABILITY   MANAGER STATUS
abc123              worker-1    Down        Active
def456              manager-1   Ready       Active         Leader
```

Or tasks are stuck pending because no nodes are available:

```bash
docker service ps myapp
ID                  NAME        NODE        DESIRED STATE   CURRENT STATE   ERROR
xyz789              myapp.1     worker-1    Running         Pending         "no suitable node"
```

Let me walk you through diagnosing and recovering unavailable Swarm nodes.

## Understanding Node States

Swarm nodes have two independent status fields:

- **Status**: `Ready` (connected) or `Down` (disconnected)
- **Availability**: `Active` (can run tasks), `Pause` (paused), or `Drain` (evacuating)
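The two fields combine into a health picture. As a minimal sketch, the loop below flags any node that is not `Ready`/`Active`, reading sample `docker node ls`-style columns (the hostnames and states in the here-doc are made up):

```bash
#!/bin/sh
# Flag any node that is not Ready/Active, given columns of
# HOSTNAME STATUS AVAILABILITY (the sample rows below are hypothetical).
while read -r host status avail; do
  if [ "$status" != "Ready" ] || [ "$avail" != "Active" ]; then
    echo "check $host: status=$status availability=$avail"
  fi
done <<'EOF'
worker-1 Down Active
worker-2 Ready Drain
manager-1 Ready Active
EOF
```

Against a live cluster you would feed the same loop with `docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}'` instead of the here-doc.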

## Step 1: Check Node Status

Assess the cluster state:

```bash
# List all nodes with details
docker node ls

# Check a specific node
docker node inspect worker-1 --pretty

# Check node's state
docker node inspect worker-1 --format '{{.Status.State}}'

# Check node availability
docker node inspect worker-1 --format '{{.Spec.Availability}}'
```

## Step 2: Diagnose Connectivity Issues

If a node shows Down, check connectivity:

```bash
# From the manager, test node connectivity
ping worker-1-hostname

# Test the Swarm management port
nc -zv worker-1-hostname 2377

# Check the Docker daemon on the node (SSH to node)
ssh worker-1 'systemctl status docker'

# Check if Swarm is initialized
ssh worker-1 'docker info | grep Swarm'
```

Common connectivity causes:

1. Docker daemon not running
2. Network partition or firewall
3. Host machine shutdown
4. Swarm join state corrupted
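If `nc` isn't installed on the manager, bash's built-in `/dev/tcp` can stand in for a quick TCP port probe. A sketch (the target host defaults to `127.0.0.1`; in practice you would point it at the worker's address):

```bash
#!/bin/bash
# Probe the Swarm TCP ports on a target host using bash's /dev/tcp,
# a stand-in when nc is not installed. HOST defaults to localhost.
HOST=${1:-127.0.0.1}
for port in 2377 7946; do          # 2377 = management, 7946 = node gossip
  if timeout 2 bash -c "exec 3<>/dev/tcp/$HOST/$port" 2>/dev/null; then
    echo "$HOST:$port open"
  else
    echo "$HOST:$port closed"
  fi
done
```

Note that `/dev/tcp` only covers TCP; the UDP ports (4789, and 7946/udp) still need `nc -uzv` or a packet capture to verify.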

## Step 3: Check Docker Daemon on Node

SSH to the problematic node:

```bash
# Check the Docker service
systemctl status docker

# Check Docker logs
journalctl -u docker.service -n 100

# Check Swarm status
docker info | grep -A 5 "Swarm"

# Check for Swarm-related errors
journalctl -u docker.service | grep -iE "swarm|raft|member"
```

If Docker daemon is down:

```bash
# Start Docker
systemctl start docker

# Check whether the node rejoins automatically
docker node ls | grep worker-1
```

## Step 4: Firewall and Network Checks

Verify required ports are open:

```bash
# Check if the required ports are listening
netstat -tulpn | grep 2377   # cluster management (TCP)
netstat -ulpn | grep 4789    # overlay network (UDP)
netstat -tulpn | grep 7946   # node communication (TCP/UDP)

# Test from the manager
nc -zv worker-1 2377
nc -uzv worker-1 4789
nc -zv worker-1 7946
```

Open ports if blocked:

```bash
# iptables
iptables -A INPUT -p tcp --dport 2377 -j ACCEPT
iptables -A INPUT -p tcp --dport 7946 -j ACCEPT
iptables -A INPUT -p udp --dport 7946 -j ACCEPT
iptables -A INPUT -p udp --dport 4789 -j ACCEPT

# firewalld
firewall-cmd --add-port=2377/tcp --permanent
firewall-cmd --add-port=7946/tcp --permanent
firewall-cmd --add-port=7946/udp --permanent
firewall-cmd --add-port=4789/udp --permanent
firewall-cmd --reload
```

## Step 5: Node Availability States

Nodes can be manually set to different availability states:

```bash
# Check current availability
docker node inspect worker-1 --format '{{.Spec.Availability}}'

# Set node to active (can receive tasks)
docker node update --availability active worker-1

# Pause node (no new tasks)
docker node update --availability pause worker-1

# Drain node (stop all tasks)
docker node update --availability drain worker-1
```

When draining a node:

```bash
# Drain moves tasks to other nodes
docker node update --availability drain worker-1

# Watch task migration
docker service ps myapp

# Tasks should show "Shutdown" on worker-1
# and "Running" on other nodes
```
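To confirm a drain has finished, count the tasks still `Running` on the drained node. A sketch over sample `docker service ps`-style columns (the task names and states in the here-doc are hypothetical):

```bash
#!/bin/sh
# Count tasks still Running on a drained node, from `docker service ps`-style
# columns NAME NODE CURRENT-STATE (the sample rows below are hypothetical).
remaining=$(awk '$2 == "worker-1" && $3 == "Running"' <<'EOF' | wc -l | tr -d ' '
myapp.1 worker-1 Shutdown
myapp.2 worker-2 Running
myapp.3 worker-1 Shutdown
EOF
)
echo "tasks still running on worker-1: $remaining"
```

Live, the same filter works on `docker service ps myapp --format '{{.Name}} {{.Node}} {{.CurrentState}}'`, since `awk`'s `$3` picks up the leading word ("Running", "Shutdown") of the current-state column.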

## Step 6: Recover a Down Node

If a node is Down and won't reconnect:

On the problematic node:

```bash
# Leave the Swarm if stuck
docker swarm leave --force

# Clean Swarm state
rm -rf /var/lib/docker/swarm

# Restart Docker
systemctl restart docker

# Rejoin the Swarm (get a fresh token from a manager)
docker swarm join --token TOKEN MANAGER_IP:2377
```

On manager, verify rejoin:

```bash
docker node ls | grep worker-1
# Should show "Ready" status
```

## Step 7: Remove Dead Nodes

If a node is permanently gone:

```bash
# Remove the node from the cluster
docker node rm worker-1

# A manager must be demoted before removal
docker node demote manager-2
docker node rm --force manager-2
```

Before removing, drain tasks:

```bash
# Drain first (if the node was running tasks)
docker node update --availability drain worker-1

# Wait for task migration
docker service ps myapp

# Then remove
docker node rm worker-1
```

## Step 8: Manager Node Recovery

Manager nodes require special handling:

```bash
# Check manager status
docker node ls --filter role=manager

# Manager states:
# - Leader: primary manager
# - Reachable: can reach the leader
# - Unreachable: cannot reach the leader
```

If manager is unreachable but host is accessible:

```bash
# SSH to the manager and restart Docker
ssh manager-2 'systemctl restart docker'

# Check whether it reconnects
docker node ls
```

If manager host is gone:

```bash
# Promote a worker to manager
docker node promote worker-2

# Or remove the failed manager (demote it first if Docker refuses)
docker node rm --force manager-2

# Add a new manager: print the manager join token,
# then run the join command on the new node
docker swarm join-token manager
```

## Step 9: Recover from Quorum Loss

If a majority of managers are down, the Swarm loses quorum:

```bash
# Check remaining managers
docker node ls --filter role=manager

# With quorum lost, the Swarm is non-functional
docker service ls  # won't work
```

Force recovery on remaining manager:

```bash
# On the surviving manager
docker swarm init --force-new-cluster

# This creates a new single-manager cluster
# with existing services preserved

# Add managers back
docker swarm join-token manager
```

## Step 10: Monitor Node Health

Set up monitoring for node availability:

```bash
# Watch node status changes
docker node ls --format "{{.Hostname}}: {{.Status}}"

# Check for pending tasks (a sign of node issues)
docker service ps myapp --filter desired-state=running

# Monitor task failures
docker service ps myapp --no-trunc | grep -i error
```

Alert on node status:

```bash
#!/bin/bash
# Alert when any node is not Ready
docker node ls --format "{{.Hostname}} {{.Status}}" | \
while read -r hostname status; do
  if [ "$status" != "Ready" ]; then
    echo "WARNING: Node $hostname is $status"
  fi
done
```

## Step 11: Handle Node Resource Constraints

Nodes may become unavailable due to resource exhaustion:

```bash
# Check node resources (SSH to node)
ssh worker-1 'docker info | grep -EA 10 "CPUs|Memory"'

# Check running containers
ssh worker-1 'docker ps'

# Check container stats
ssh worker-1 'docker stats --no-stream'
```

Clear stuck containers:

```bash
ssh worker-1 'docker container prune'
ssh worker-1 'docker system prune'
```
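To spot which containers are exhausting a node, you can filter the stats output by memory percentage. A sketch over sample `docker stats`-style name/memory columns (the container names and the 80% threshold are illustrative):

```bash
#!/bin/sh
# Flag containers above a memory threshold, from `docker stats`-style
# columns NAME MEM% (the sample rows below are hypothetical).
awk -v limit=80 '
  { pct = $2; sub(/%/, "", pct) }          # strip the % sign for comparison
  pct + 0 > limit { print "high memory: " $1 " at " $2 }
' <<'EOF'
web-1 42.5%
worker-3 91.2%
cache-1 12.0%
EOF
```

Live, the equivalent input comes from `docker stats --no-stream --format '{{.Name}} {{.MemPerc}}'`.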

## Node Status Decision Matrix

| Node Status | Availability | Action |
|-------------|--------------|--------|
| Ready | Active | Healthy, no action needed |
| Ready | Pause | Resume with `--availability active` |
| Ready | Drain | Normal during maintenance |
| Down | Active | Diagnose connectivity, restart Docker |
| Down | Drain | Node being evacuated, then remove |
| Unknown | Any | Network partition, check firewall |

## Prevention Best Practices

1. Monitor node health continuously
2. Maintain manager quorum (an odd number of managers: 3 or 5)
3. Back up Swarm state on managers
4. Use health checks on services to detect node issues early
5. Document node recovery procedures for the team
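Point 3 deserves a concrete shape: the Swarm state lives in `/var/lib/docker/swarm`, and the usual backup is to archive that directory while Docker is stopped. A dry-run sketch that only prints the commands it would run (the archive path and naming scheme are assumptions):

```bash
#!/bin/sh
# Dry-run sketch of a Swarm state backup on a manager. It echoes the
# commands instead of running them; BACKUP path/naming is an assumption.
SWARM_DIR=/var/lib/docker/swarm
BACKUP=/backup/swarm-$(date +%Y%m%d).tar.gz
echo "systemctl stop docker"
echo "tar -czf $BACKUP $SWARM_DIR"
echo "systemctl start docker"
```

Stopping Docker first matters because the raft store under `swarm/` can change mid-archive on a live manager; schedule the backup on a non-leader manager to avoid triggering a leader election.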

## Quick Reference

| Task | Command |
|------|---------|
| List nodes | `docker node ls` |
| Inspect node | `docker node inspect NODE` |
| Set availability | `docker node update --availability STATE NODE` |
| Drain node | `docker node update --availability drain NODE` |
| Remove node | `docker node rm NODE` |
| Promote to manager | `docker node promote NODE` |
| Demote to worker | `docker node demote NODE` |
| Get join token | `docker swarm join-token worker` |

Node unavailable issues are usually resolved by checking connectivity, restarting the Docker daemon, or rejoining the Swarm. For permanent failures, drain and remove the node, then add a replacement.