# Docker Node Unavailable: How to Diagnose and Recover Swarm Nodes
Your Swarm node shows as unavailable or unreachable:

```bash
docker node ls

ID       HOSTNAME    STATUS  AVAILABILITY  MANAGER STATUS
abc123   worker-1    Down    Active
def456   manager-1   Ready   Active        Leader
```

Or tasks are stuck pending because no nodes are available:

```bash
docker service ps myapp

ID       NAME      NODE      DESIRED STATE  CURRENT STATE  ERROR
xyz789   myapp.1   worker-1  Running        Pending        "no suitable node"
```

Let me walk you through diagnosing and recovering unavailable Swarm nodes.
## Understanding Node States
Swarm nodes have two status fields:
- Status: Ready (connected) or Down (disconnected)
- Availability: Active (can run tasks), Pause (paused), or Drain (evacuating)
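To see both fields for every node at a glance, you can combine them in a single `docker node ls` format string; a minimal sketch:

```bash
# Show status, availability, and manager role side by side for every node
docker node ls --format 'table {{.Hostname}}\t{{.Status}}\t{{.Availability}}\t{{.ManagerStatus}}'
```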
## Step 1: Check Node Status
Assess the cluster state:
```bash
# List all nodes with details
docker node ls

# Check a specific node
docker node inspect worker-1 --pretty

# Check the node's state (ready or down)
docker node inspect worker-1 --format '{{.Status.State}}'

# Check node availability
docker node inspect worker-1 --format '{{.Spec.Availability}}'
```
## Step 2: Diagnose Connectivity Issues
If a node shows Down, check connectivity:
```bash
# From the manager, test node connectivity
ping worker-1-hostname

# Test the Swarm management port
nc -zv worker-1-hostname 2377

# Check the Docker daemon on the node (SSH to the node)
ssh worker-1 'systemctl status docker'

# Check whether Swarm is initialized on the node
ssh worker-1 'docker info | grep Swarm'
```
Common connectivity causes (the sketch after this list checks them in order):
1. Docker daemon not running
2. Network partition or firewall blocking Swarm ports
3. Host machine shut down
4. Swarm join state corrupted
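Here is a minimal triage sketch that runs these checks from the manager. It assumes the node's hostname is worker-1 and that you have SSH access; adjust both for your environment.

```bash
#!/bin/bash
# Quick triage of a Down node, run from a manager.
# Assumes the node is reachable as "worker-1" over SSH.
NODE=worker-1

echo "--- Host reachable? ---"
ping -c 2 "$NODE" || echo "Host unreachable: check power and network"

echo "--- Swarm gossip port open on the node? ---"
nc -zv "$NODE" 7946 || echo "Port 7946 blocked: check the firewall"

echo "--- Docker daemon running? ---"
ssh "$NODE" 'systemctl is-active docker' || echo "Docker daemon not running"

echo "--- Swarm state on the node ---"
ssh "$NODE" 'docker info 2>/dev/null | grep -A 3 Swarm'
```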
## Step 3: Check Docker Daemon on Node
SSH to the problematic node:
```bash
# Check the Docker service
systemctl status docker

# Check Docker logs
journalctl -u docker.service -n 100

# Check Swarm status
docker info | grep -A 5 "Swarm"

# Check for Swarm-related errors
journalctl -u docker.service | grep -iE "swarm|raft|member"
```
If Docker daemon is down:
```bash
# Start Docker
systemctl start docker

# Check whether the node rejoins automatically (run on a manager)
docker node ls | grep worker-1
```
## Step 4: Firewall and Network Checks
Verify required ports are open:
```bash
# Check whether the Swarm ports are listening
netstat -tulpn | grep 2377
netstat -ulpn | grep 4789
netstat -tulpn | grep 7946

# Test from the manager
nc -zv worker-1 2377
nc -uzv worker-1 4789
nc -zv worker-1 7946
```
Open ports if blocked:
```bash
# iptables
iptables -A INPUT -p tcp --dport 2377 -j ACCEPT
iptables -A INPUT -p tcp --dport 7946 -j ACCEPT
iptables -A INPUT -p udp --dport 7946 -j ACCEPT
iptables -A INPUT -p udp --dport 4789 -j ACCEPT

# firewalld
firewall-cmd --add-port=2377/tcp --permanent
firewall-cmd --add-port=7946/tcp --permanent
firewall-cmd --add-port=7946/udp --permanent
firewall-cmd --add-port=4789/udp --permanent
firewall-cmd --reload
```
## Step 5: Node Availability States
Nodes can be manually set to different availability states:
```bash
# Check current availability
docker node inspect worker-1 --format '{{.Spec.Availability}}'

# Set node to active (can receive tasks)
docker node update --availability active worker-1

# Pause node (no new tasks, existing tasks keep running)
docker node update --availability pause worker-1

# Drain node (reschedule all its tasks onto other nodes)
docker node update --availability drain worker-1
```
When draining a node:
```bash
# Drain moves tasks to other nodes
docker node update --availability drain worker-1

# Watch task migration
docker service ps myapp

# Tasks should show "Shutdown" on worker-1
# and "Running" on other nodes
```
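If you want to block until the drain finishes, a small polling loop works. This is a sketch that assumes the node name worker-1 and relies on `docker node ps --format` reporting each task's current state:

```bash
# Drain the node, then wait until no task on it is still running
docker node update --availability drain worker-1

while docker node ps worker-1 --format '{{.CurrentState}}' | grep -q '^Running'; do
  echo "Waiting for tasks to migrate off worker-1..."
  sleep 5
done
echo "worker-1 is fully drained"
```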
## Step 6: Recover a Down Node
If the node is Down and won't reconnect:
On the problematic node:
```bash
# Leave the Swarm if stuck
docker swarm leave --force

# Clean the local Swarm state
rm -rf /var/lib/docker/swarm

# Restart Docker
systemctl restart docker

# Rejoin the Swarm (get a fresh token from a manager)
docker swarm join --token TOKEN MANAGER_IP:2377
```
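The TOKEN and MANAGER_IP placeholders come from a manager node. A short sketch for retrieving them:

```bash
# On any manager: print the complete "docker swarm join ..." command for workers
docker swarm join-token worker

# Or grab only the token, e.g. for scripting the rejoin
WORKER_TOKEN=$(docker swarm join-token worker -q)
echo "$WORKER_TOKEN"
```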
On manager, verify rejoin:
```bash
docker node ls | grep worker-1
# Should show "Ready" status
```

## Step 7: Remove Dead Nodes
If a node is permanently gone:
```bash
# Remove the node from the cluster
docker node rm worker-1

# Force remove if the node was a manager
docker node rm --force manager-2
```
Before removing, drain tasks:
```bash
# Drain first (if the node was running tasks)
docker node update --availability drain worker-1

# Wait for task migration
docker service ps myapp

# Then remove
docker node rm worker-1
```
## Step 8: Manager Node Recovery
Manager nodes require special handling:
```bash
# Check manager status
docker node ls --filter role=manager

# Manager states:
# - Leader: primary manager
# - Reachable: can reach the leader
# - Unreachable: cannot reach the leader
```
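You can also query each manager's reachability directly from its `ManagerStatus` field; a minimal sketch:

```bash
# Print leader flag and reachability for every manager node
for node in $(docker node ls --filter role=manager -q); do
  docker node inspect "$node" \
    --format '{{.Description.Hostname}}: leader={{.ManagerStatus.Leader}} reachability={{.ManagerStatus.Reachability}}'
done
```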
If manager is unreachable but host is accessible:
```bash
# SSH to the manager and restart Docker
ssh manager-2 'systemctl restart docker'

# Check whether it reconnects (from another manager)
docker node ls
```
If manager host is gone:
```bash
# Promote a worker to manager
docker node promote worker-2

# Or remove the failed manager
docker node rm --force manager-2

# Add a new manager
docker swarm join-token manager
# Use the printed join command on the new node
```
## Step 9: Recover from Quorum Loss
If too many managers are down, quorum is lost:
```bash
# Check remaining managers
docker node ls --filter role=manager

# If quorum is lost, the Swarm is non-functional
docker service ls   # Won't work
```
Force recovery on remaining manager:
```bash
# On the surviving manager
docker swarm init --force-new-cluster

# This creates a new single-manager cluster
# with existing services preserved

# Add managers back
docker swarm join-token manager
```
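Quorum needs a majority of managers: with N managers the cluster tolerates floor((N-1)/2) failures, which is why 3 or 5 managers are recommended. A small sketch to report this while the cluster is still healthy:

```bash
#!/bin/bash
# Report manager count and how many manager failures the Swarm can tolerate.
# Run on a healthy manager.
MANAGERS=$(docker node ls --filter role=manager -q | wc -l)
TOLERANCE=$(( (MANAGERS - 1) / 2 ))
echo "Managers: $MANAGERS, failures tolerated: $TOLERANCE"
if [ "$TOLERANCE" -lt 1 ]; then
  echo "WARNING: losing a single manager will cost you quorum"
fi
```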
## Step 10: Monitor Node Health
Set up monitoring for node availability:
```bash
# Watch node status changes
docker node ls --format "{{.Hostname}}: {{.Status}}"

# Check for pending tasks (a sign of node issues)
docker service ps myapp --filter desired-state=running

# Monitor task failures
docker service ps myapp --no-trunc | grep -i error
```
Alert on node status:
```bash
#!/bin/bash
# Script to check node status and warn on anything that is not Ready
docker node ls --format "{{.Hostname}} {{.Status}}" | \
while read hostname status; do
  if [ "$status" != "Ready" ]; then
    echo "WARNING: Node $hostname is $status"
  fi
done
```

## Step 11: Handle Node Resource Constraints
Nodes may become unavailable due to resource exhaustion:
```bash
# Check node resources (SSH to the node)
ssh worker-1 'docker info | grep -E "CPUs|Memory"'

# Check running containers
ssh worker-1 'docker ps'

# Check container stats
ssh worker-1 'docker stats --no-stream'
```
Clear stuck containers:
```bash
ssh worker-1 'docker container prune'
ssh worker-1 'docker system prune'
```

## Node Status Decision Matrix
| Node Status | Availability | Action |
|---|---|---|
| Ready | Active | Healthy, no action needed |
| Ready | Pause | Resume with --availability active |
| Ready | Drain | Normal during maintenance |
| Down | Active | Diagnose connectivity, restart Docker |
| Down | Drain | Node being evacuated, then remove |
| Unknown | Any | Network partition, check firewall |
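The matrix can double as a quick triage helper. The following is a minimal sketch that prints the suggested action for each node; it only covers the combinations listed above:

```bash
#!/bin/bash
# Print a suggested action for every node, based on the matrix above.
docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | \
while read hostname status availability; do
  case "$status/$availability" in
    Ready/Active) action="Healthy, no action needed" ;;
    Ready/Pause)  action="Resume with --availability active" ;;
    Ready/Drain)  action="Normal during maintenance" ;;
    Down/Active)  action="Diagnose connectivity, restart Docker" ;;
    Down/Drain)   action="Node being evacuated, then remove" ;;
    *)            action="Possible network partition, check firewall" ;;
  esac
  echo "$hostname ($status/$availability): $action"
done
```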
## Prevention Best Practices
1. Monitor node health continuously
2. Maintain manager quorum (odd number of managers, 3 or 5)
3. Back up Swarm state on managers (see the sketch after this list)
4. Use health checks on services to detect node issues early
5. Document node recovery procedures for the team
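For item 3, a minimal backup sketch that archives /var/lib/docker/swarm while the Docker engine on that manager is briefly stopped (the backup directory is an assumption; the remaining managers keep quorum during the pause):

```bash
#!/bin/bash
# Back up the Swarm state on a manager node.
BACKUP_DIR=/backup   # assumed destination, adjust for your environment

# Stop Docker briefly so the copy is consistent, then archive and restart
systemctl stop docker
tar -czf "$BACKUP_DIR/swarm-$(hostname)-$(date +%F).tar.gz" -C /var/lib/docker swarm
systemctl start docker
```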
## Quick Reference
| Task | Command |
|---|---|
| List nodes | docker node ls |
| Inspect node | docker node inspect NODE |
| Set availability | docker node update --availability STATE NODE |
| Drain node | docker node update --availability drain NODE |
| Remove node | docker node rm NODE |
| Promote to manager | docker node promote NODE |
| Demote to worker | docker node demote NODE |
| Get join token | docker swarm join-token worker |
Node unavailable issues are usually resolved by checking connectivity, restarting the Docker daemon, or rejoining the Swarm. For permanent failures, drain and remove the node, then add a replacement.