## Introduction
A NotReady node in Kubernetes is registered with the control plane but is not considered healthy enough to run pods: the scheduler will not place new pods on it, and its existing pods may eventually be evicted. The kubelet, the agent running on each node, either cannot communicate with the API server or is reporting unhealthy status. This is more severe than a node that is simply SchedulingDisabled (cordoned): a NotReady node indicates a broken or degraded worker.
Node health is foundational to cluster operations. When nodes go NotReady, pods may be evicted, workloads disrupted, and capacity reduced. The cause ranges from simple (kubelet stopped) to complex (network partition, certificate expiration, disk pressure).
## Symptoms
- `kubectl get nodes` shows `STATUS: NotReady` for one or more nodes
- Pods scheduled on the node show `Unknown` or `NodeLost` status
- New pods cannot be scheduled on the affected node
- Existing pods on the node may be terminated or rescheduled elsewhere
- `kubectl describe node` shows condition `Ready: False` with various reasons
## Common Causes
- **Kubelet stopped or crashed**: The node agent is not running or is failing to report
- **Network connectivity lost**: Node cannot reach API server due to firewall, routing, or DNS issues
- **Resource pressure**: Disk, memory, or PID exhaustion triggers node health failure
- **Certificate expiration**: Kubelet or client certificates have expired
- **Container runtime failure**: Docker, containerd, or CRI-O is down or unresponsive
- **Control plane unreachable**: API server is down, overloaded, or network-partitioned
- **CNI plugin failure**: Network plugin is broken, preventing pod networking
- **Clock skew**: Node time is too far out of sync with control plane
## Step-by-Step Fix
### 1. Check node status and conditions
```bash
kubectl get nodes
kubectl describe node <node-name>
```
Focus on the `Conditions` section:
| Condition | Status | Meaning |
|-----------|--------|---------|
| Ready | False | Node cannot run pods |
| MemoryPressure | True | Node is out of memory |
| DiskPressure | True | Node is out of disk space |
| PIDPressure | True | Too many processes |
| NetworkUnavailable | True | CNI/network is broken |
Also check the last heartbeat time—if it's old, the node stopped reporting.
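Staleness is easy to quantify from the timestamp. A minimal sketch, assuming GNU `date`; `heartbeat_age_minutes` and the sample timestamps are illustrative, with "now" passed in explicitly so the arithmetic is reproducible:

```shell
#!/bin/sh
# Sketch: measure how stale a node's lastHeartbeatTime is.
# Assumes GNU date, which parses the RFC 3339 timestamps kubectl prints.
heartbeat_age_minutes() {
  hb=$(date -d "$1" +%s)    # heartbeat timestamp -> epoch seconds
  now=$(date -d "$2" +%s)   # reference time -> epoch seconds
  echo $(( (now - hb) / 60 ))
}

heartbeat_age_minutes "2024-01-15T10:00:00Z" "2024-01-15T10:42:00Z"   # -> 42
```

With default settings the kubelet typically refreshes its status every 10 seconds and the control plane marks it NotReady after a 40-second grace period, so a heartbeat more than a minute or two old means the kubelet has stopped reporting entirely.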
### 2. Check kubelet status on the node
SSH into the affected node and check kubelet:
```bash
# Check kubelet service status
systemctl status kubelet

# Check kubelet logs
journalctl -u kubelet -n 100 --no-pager

# Restart kubelet if stopped
sudo systemctl restart kubelet
```
Common kubelet errors:
- `Unable to register node with API server`: Network or auth issue
- `Container runtime not ready`: Docker/containerd is down
- `Certificate expired`: TLS handshake failure
- `CNI plugin not initialized`: Network plugin missing
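These messages can be triaged mechanically when scanning `journalctl` output. A sketch; `classify_kubelet_error` is a hypothetical helper and the patterns are illustrative, since exact message text varies across Kubernetes versions:

```shell
#!/bin/sh
# Sketch: bucket a kubelet log line into a likely failure category.
classify_kubelet_error() {
  line=$(printf '%s' "$1" | tr 'A-Z' 'a-z')   # normalize case before matching
  case "$line" in
    *"unable to register node"*)       echo "network-or-auth" ;;
    *"container runtime"*"not ready"*) echo "runtime-down" ;;
    *"certificate has expired"*)       echo "cert-expired" ;;
    *"cni plugin not initialized"*)    echo "cni-missing" ;;
    *)                                 echo "unknown" ;;
  esac
}

classify_kubelet_error "Container runtime network not ready: NetworkReady=false"   # -> runtime-down
```

For example, piping `journalctl -u kubelet -n 100 --no-pager` through a loop that calls this function gives a quick frequency count of what is failing.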
### 3. Check container runtime
Verify the container runtime is running:
```bash
# For containerd
systemctl status containerd
crictl info

# For Docker (older clusters)
systemctl status docker

# For CRI-O
systemctl status crio
crio status
```
If the runtime is down, kubelet cannot manage containers:
```bash
# Restart containerd
sudo systemctl restart containerd

# Verify the runtime is responding
crictl info
```
### 4. Check network connectivity to API server
The node must reach the API server:
```bash
# Get API server endpoint
kubectl cluster-info

# From the node, test connectivity
curl -k https://<api-server-ip>:<port>/healthz
ping <api-server-ip>

# Check if kubeconfig is valid
cat /etc/kubernetes/kubelet.conf
kubectl --kubeconfig=/etc/kubernetes/kubelet.conf cluster-info
```
If the API server is unreachable:

- Check firewall rules (security groups, iptables, UFW)
- Verify routing tables and gateway
- Check if control plane nodes are healthy
- Look for network partition (split-brain scenarios)
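When the `curl` probe fails, its exit code already narrows the diagnosis. A sketch mapping the common codes (documented in curl's man page) to the checks above; `explain_curl_exit` is a hypothetical helper:

```shell
#!/bin/sh
# Sketch: translate curl exit codes from the /healthz probe into hints.
# Codes per curl's man page: 6 = DNS failure, 7 = failed to connect,
# 28 = timeout, 60 = TLS certificate problem.
explain_curl_exit() {
  case "$1" in
    0)  echo "API server reachable" ;;
    6)  echo "DNS resolution failed - check /etc/resolv.conf" ;;
    7)  echo "connection refused - API server down or firewall blocking" ;;
    28) echo "timeout - routing problem or network partition" ;;
    60) echo "TLS failure - check certificates and clock skew" ;;
    *)  echo "curl exit $1 - see EXIT CODES in 'man curl'" ;;
  esac
}

# Against a real cluster (illustrative endpoint):
#   curl -sk --max-time 5 "https://<api-server-ip>:6443/healthz"; explain_curl_exit $?
```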
### 5. Check for resource pressure
```bash
# Check disk usage
df -h
df -i   # inode usage

# Check memory
free -m
cat /proc/meminfo

# Check running processes
ps aux | wc -l
ulimit -u   # max user processes

# Check kubelet cgroup
systemctl status kubelet
```
If disk is full, clean up:
```bash
# Remove unused images
crictl rmi --prune

# Remove stopped containers
crictl rm $(crictl ps -a --state exited -q)

# Clean up old logs
sudo journalctl --vacuum-time=2d

# Clear containerd temp files
sudo rm -rf /var/lib/containerd/tmp/*
```
If memory is exhausted, consider:

- Killing runaway processes
- Reducing pod density on the node
- Adding swap (not recommended for production)
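The eviction decision can be approximated from `df` output. A sketch, assuming the kubelet's default hard-eviction threshold of roughly 90% nodefs usage (`nodefs.available < 10%`); `disk_pressure` is a hypothetical helper, so check your cluster's actual eviction configuration:

```shell
#!/bin/sh
# Sketch: flag disk pressure from the usage percentage df prints.
disk_pressure() {
  pct=${1%\%}                # strip the trailing % from df's "Use%" column
  if [ "$pct" -ge "$2" ]; then
    echo "PRESSURE"
  else
    echo "ok"
  fi
}

disk_pressure "93%" 90    # -> PRESSURE
disk_pressure "42%" 90    # -> ok
```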
### 6. Check certificate expiration
Expired certificates are a common cause of sudden node failures:
```bash
# Check kubelet certificate expiration
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates

# Check all Kubernetes certificates (kubeadm clusters)
sudo kubeadm certs check-expiration
```
If certificates are expired:
```bash
# Renew certificates (kubeadm clusters)
sudo kubeadm certs renew apiserver-kubelet-client
sudo kubeadm certs renew all

# Restart kubelet after renewal
sudo systemctl restart kubelet
```

Note that `kubeadm certs renew` manages control-plane certificates; the kubelet's own serving certificate is rotated by the kubelet itself (or regenerated by rejoining the node).
### 7. Check CNI plugin status
Network plugin failures set the `NetworkUnavailable` condition to `True`:
```bash
# Check CNI pods (usually in kube-system)
kubectl get pods -n kube-system -l k8s-app=flannel
kubectl get pods -n kube-system -l app=calico-node
kubectl get pods -n kube-system -l k8s-app=cilium

# Check CNI logs
kubectl logs -n kube-system <cni-pod-name>
```
If CNI pods are down on the node:
- Restart the CNI pod: `kubectl delete pod <cni-pod-name> -n kube-system`
- Check the CNI configuration on the node: `ls /etc/cni/net.d/`
- Verify CNI binaries exist: `ls -la /opt/cni/bin/`
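The binary check can be scripted as well. A sketch using a temp directory to stand in for `/opt/cni/bin`; which binaries you actually need depends on your CNI plugin, so the list here is illustrative:

```shell
#!/bin/sh
# Sketch: verify that expected CNI binaries exist and are executable.
check_cni_binaries() {
  dir=$1; shift
  missing=0
  for bin in "$@"; do
    [ -x "$dir/$bin" ] || { echo "missing: $bin"; missing=1; }
  done
  if [ "$missing" -eq 0 ]; then echo "all CNI binaries present"; fi
}

# Simulated plugin directory for illustration:
demo=$(mktemp -d)
touch "$demo/bridge" "$demo/loopback"
chmod +x "$demo/bridge" "$demo/loopback"

check_cni_binaries "$demo" bridge loopback portmap   # -> missing: portmap
```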
### 8. Check system clock skew
Time drift can break TLS handshakes:
```bash
# Check node time
date

# Check if NTP is running
systemctl status chronyd   # or: systemctl status ntpd

# Sync time
sudo chronyc -a makestep   # or: sudo ntpdate -u pool.ntp.org
```
### 9. Drain and replace the node (last resort)
If the node cannot be recovered:
```bash
# Cordon to prevent new scheduling
kubectl cordon <node-name>

# Evict all pods
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Remove node from cluster
kubectl delete node <node-name>

# Add a new node (cloud provider or kubeadm),
# then verify it joins as Ready
kubectl get nodes
```
## Debugging Commands Summary
```bash
# Check all node conditions
kubectl describe node <node-name> | grep -A 10 "Conditions"

# Follow kubelet logs
journalctl -u kubelet -f

# Check container runtime
crictl info   # or: docker info

# Check network connectivity
curl -k https://<api-server>:6443/healthz

# Check certificates
openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates

# Check resource pressure
df -h && free -m && ps aux | wc -l

# Check CNI status
kubectl get pods -n kube-system -l k8s-app=calico-node
```
## Prevention Checklist
- [ ] Set up certificate expiration monitoring and auto-renewal
- [ ] Configure log rotation to prevent disk exhaustion
- [ ] Set up node health monitoring and alerting
- [ ] Use PodDisruptionBudgets to handle node failures gracefully
- [ ] Run multiple replicas across different nodes/zones
- [ ] Enable cluster autoscaler for automatic node replacement
- [ ] Regularly test node failure scenarios in staging
- [ ] Document node recovery runbooks for on-call teams
## Related Issues
- [Fix Kubernetes Pod Stuck in Pending](/articles/fix-kubernetes-pod-stuck-pending)
- [Fix Kubernetes Pod CrashLoopBackOff](/articles/fix-kubernetes-pod-crashloopbackoff)
- [Fix Kubernetes ImagePullBackOff](/articles/fix-kubernetes-imagepullbackoff)
- [Fix Kubernetes Certificate Expired](/articles/fix-kubernetes-certificate-expired)