## What's Actually Happening
Your Azure Kubernetes Service (AKS) cluster nodes are showing as NotReady. Pods can't be scheduled on these nodes, existing pods may be evicted, and your workloads are not functioning properly. The nodes might have been working before but suddenly transitioned to NotReady state, or new nodes are failing to become ready after cluster operations.
When a node is NotReady, the kubelet on that node is not responding properly to the control plane. This could be due to kubelet crashes, resource exhaustion, networking issues, or problems with the Azure VM infrastructure underlying the node.
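Concretely, the kubelet posts node status to the control plane on a regular heartbeat; when heartbeats stop arriving for longer than the controller manager's node-monitor-grace-period (about 40 seconds by default), the Ready condition flips to Unknown. A quick way to surface only the unhealthy nodes is a small filter over `kubectl get nodes` output — a minimal sketch; `not_ready_nodes` is a hypothetical helper name:

```bash
# Sketch: print only nodes whose STATUS column is not exactly "Ready",
# reading `kubectl get nodes --no-headers` output from stdin.
not_ready_nodes() {
  awk '$2 != "Ready" {print $1, $2}'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

This also catches intermediate states like `Ready,SchedulingDisabled`, which is usually what you want when triaging.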
## The Error You'll See
AKS NotReady nodes manifest in various ways:
```bash
# Nodes showing NotReady status
$ kubectl get nodes
NAME                                STATUS     ROLES   AGE   VERSION
aks-agentpool-12345678-vmss000000   NotReady   agent   10d   v1.27.7
aks-agentpool-12345678-vmss000001   Ready      agent   10d   v1.27.7
aks-agentpool-12345678-vmss000002   NotReady   agent   10d   v1.27.7

# Node conditions showing problems
$ kubectl describe node aks-agentpool-12345678-vmss000000
Conditions:
  Type             Status    LastHeartbeatTime                 Reason              Message
  ----             ------    -----------------                 ------              -------
  MemoryPressure   Unknown   Tue, 09 Apr 2026 10:15:23 +0000   NodeStatusUnknown   Kubelet stopped posting status.
  DiskPressure     Unknown   Tue, 09 Apr 2026 10:15:23 +0000   NodeStatusUnknown   Kubelet stopped posting status.
  PIDPressure      Unknown   Tue, 09 Apr 2026 10:15:23 +0000   NodeStatusUnknown   Kubelet stopped posting status.
  Ready            Unknown   Tue, 09 Apr 2026 10:15:23 +0000   NodeStatusUnknown   Kubelet stopped posting status.

# New pods stuck in Pending with no schedulable node
$ kubectl get pods -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE
my-app-abc123-def456   0/1     Pending   0          5m    <none>   <none>   <none>

# Events showing node issues
$ kubectl get events --field-selector involvedObject.kind=Node
LAST SEEN   TYPE     REASON         OBJECT                                   MESSAGE
5m          Normal   NodeNotReady   node/aks-agentpool-12345678-vmss000000   Node aks-agentpool-12345678-vmss000000 status is now: NodeNotReady

# Azure CLI: note the agent pool can still report Succeeded even with NotReady nodes
$ az aks show -g myResourceGroup -n myAKSCluster --query "agentPoolProfiles[].{count:count, provisioningState:provisioningState}"
[
  {
    "count": 3,
    "provisioningState": "Succeeded"
  }
]

# Kubelet logs showing errors
$ journalctl -u kubelet -n 100
Apr 09 10:15:23 aks-agentpool-12345678-vmss000000 kubelet[2345]: E0409 10:15:23.456789 2345 kubelet.go:2345] "Failed to register with API server" err="Post \"https://myakscluster.hcp.eastus.azmk8s.io:443/api/v1/nodes\": dial tcp: lookup myakscluster.hcp.eastus.azmk8s.io: no such host"

# Resource exhaustion errors
Apr 09 10:15:24 kubelet[2345]: E0409 10:15:24.123456 2345 eviction_manager.go:234] "Eviction manager: attempting to reclaim ephemeral-storage"
```
Additional symptoms:

- New nodes not joining the cluster after scaling
- Nodes cycling between Ready and NotReady
- Pods being evicted from NotReady nodes
- Cannot SSH into nodes
- Node shows high CPU/memory usage before going NotReady
- Cluster autoscaler not working
- Deployment rollout stuck waiting for nodes
## Why This Happens
1. Kubelet Process Crashed: The kubelet process on the node crashed or stopped running. Without kubelet, the node can't communicate with the control plane.
2. Node Resource Exhaustion: The node ran out of memory, disk, or CPU. System processes, including kubelet, can't function properly.
3. Network Connectivity Issues: Network problems between the node and the AKS control plane - DNS resolution failures, firewall rules, or Azure network issues.
4. Azure VM Issues: The underlying Azure VM has problems - deallocated, failed health checks, or resource constraints at the Azure level.
5. Container Runtime Issues: The containerd (or Docker) runtime crashed or has internal errors that prevent kubelet from managing containers.
6. CNI Plugin Problems: Azure CNI or network plugin issues prevent pod networking from working, causing kubelet to report unhealthy status.
7. Cluster Autoscaler Issues: During scale-up, new nodes fail to bootstrap correctly or join the cluster.
8. Expired Certificates: Node TLS certificates expired, preventing secure communication with the control plane.
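The kubelet log usually points at one of these causes. The rough triage below is a sketch, not an exhaustive classifier — `classify_kubelet_logs` is a hypothetical helper and the grep patterns are illustrative:

```bash
# Sketch: map kubelet log lines (stdin) to the likely causes above.
classify_kubelet_logs() {
  log=$(cat)
  echo "$log" | grep -qiE 'x509|certificate' \
    && echo "check: node TLS certificates (cause 8)"
  echo "$log" | grep -qiE 'no such host|i/o timeout|connection refused' \
    && echo "check: DNS / control-plane connectivity (cause 3)"
  echo "$log" | grep -qiE 'eviction|disk pressure|memory pressure' \
    && echo "check: resource exhaustion (cause 2)"
  echo "$log" | grep -qiE 'containerd|runtime is down' \
    && echo "check: container runtime (cause 5)"
  true  # a non-matching grep should not fail the function
}

# Usage on the node:
#   journalctl -u kubelet -n 200 | classify_kubelet_logs
```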
## Step 1: Diagnose Node Status
Identify which nodes are NotReady and check their conditions.
```bash
# Check all nodes status
kubectl get nodes -o wide

# Get detailed node information
kubectl describe node aks-agentpool-12345678-vmss000000

# Check node conditions specifically
kubectl get nodes -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[?(@.type=="Ready")].status,REASON:.status.conditions[?(@.type=="Ready")].reason'

# Show all conditions for NotReady nodes
kubectl get nodes -o json | jq '.items[] | select(.status.conditions[] | select(.type=="Ready" and (.status=="Unknown" or .status=="False"))) | {name: .metadata.name, conditions: .status.conditions}'

# Check for nodes with specific issues
kubectl get nodes -o json | jq '.items[] | select(.status.conditions[] | select(.reason=="NodeStatusUnknown")) | .metadata.name'

# List pods on a NotReady node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=aks-agentpool-12345678-vmss000000

# Check node resource usage (if node is accessible)
kubectl top node

# Get node events
kubectl get events --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp'

# Check kubelet status on the node (requires access,
# e.g. via Azure Serial Console or SSH)
systemctl status kubelet

# Check kubelet logs
journalctl -u kubelet -n 100 --no-pager
```
## Step 2: Check Azure VM Status
Verify the underlying Azure VMs are healthy.
```bash
# Get AKS cluster info
az aks show -g myResourceGroup -n myAKSCluster -o table

# List VMSS instances
az vmss list-instances -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss -o table

# Check a specific VM instance
az vmss get-instance-view -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss --instance-id 0

# Check VM connection info
az vmss list-instance-connection-info -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss

# Check VM size and capacity
az vmss show -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss --query "sku"

# List any standalone VMs in the node resource group
az vm list -g MC_myResourceGroup_myAKSCluster_eastus -o table

# List resources in the node resource group
az resource list -g MC_myResourceGroup_myAKSCluster_eastus --query "[].{Name:name, Type:type}" -o table

# Check for deallocated instances
az vmss get-instance-view -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss --instance-id '*' --query "[?statuses[?code=='PowerState/deallocated']]" -o table

# Check VM instance provisioning states
az vmss list-instances -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss --query "[].{Name:name, ProvisioningState:provisioningState}" -o table

# Get recent Azure Activity Log entries for the node resource group
az monitor activity-log list --resource-group MC_myResourceGroup_myAKSCluster_eastus --max-events 50 -o table
```
## Step 3: Access Node for Direct Troubleshooting
Connect to the problematic node for direct investigation.
```bash
# Method 1: Use aks command invoke (no SSH required)
az aks command invoke -g myResourceGroup -n myAKSCluster --command "kubectl get nodes"

# Method 2: Use a debug pod on the node
kubectl debug node/aks-agentpool-12345678-vmss000000 -it --image=mcr.microsoft.com/dotnet/runtime-deps:6.0

# Inside the debug container, enter the host filesystem and check kubelet:
chroot /host
systemctl status kubelet
journalctl -u kubelet -n 50

# Method 3: SSH directly (if configured)
# First, get the node IP:
kubectl get node aks-agentpool-12345678-vmss000000 -o wide

# SSH to the node:
ssh -i ~/.ssh/id_rsa azureuser@10.240.0.4

# Method 4: Use Azure Bastion or Serial Console
# In Azure Portal:
# VMSS > Instances > Select instance > Connect > Bastion or Serial Console

# Once on the node, check services:
systemctl status kubelet
systemctl status containerd

# Check kubelet logs:
journalctl -u kubelet -f

# Check containerd:
journalctl -u containerd -n 50

# Check system resources:
df -h
free -m
top

# Check network connectivity to the API server:
curl -k https://myakscluster.hcp.eastus.azmk8s.io/healthz

# Check DNS resolution:
nslookup myakscluster.hcp.eastus.azmk8s.io
cat /etc/resolv.conf

# Check kubelet's kubeconfig (AKS path):
cat /var/lib/kubelet/kubeconfig

# Check node certificate validity (paths can vary by AKS version):
openssl x509 -in /etc/kubernetes/certs/client.crt -text -noout | grep -A 2 Validity
```
## Step 4: Check Kubelet and Container Runtime
Investigate kubelet and container runtime issues.
```bash
# On the node (via SSH or debug pod):

# Check kubelet status
systemctl status kubelet

# Check if kubelet is running
ps aux | grep kubelet

# Restart kubelet if stopped
sudo systemctl restart kubelet

# Check kubelet logs for errors
journalctl -u kubelet --since "10 minutes ago"

# Common kubelet errors:
# - Certificate errors
# - API server unreachable
# - CNI plugin failures

# Check containerd status
systemctl status containerd

# Restart containerd if needed
sudo systemctl restart containerd

# Check containerd logs
journalctl -u containerd -n 100

# Check container processes
crictl ps
crictl pods

# Check if container runtime is responsive
crictl info

# Check container logs
crictl logs <container-id>

# Check for zombie processes
ps aux | awk '$8 ~ /Z/ {print}'

# Check for disk pressure
df -h | grep -E "Filesystem|/$|/var"
du -sh /var/lib/containerd/*
du -sh /var/lib/kubelet/*

# Check memory
free -h
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable"

# Check kubelet configuration
cat /var/lib/kubelet/config.yaml

# Check kubelet version
kubelet --version
```
## Step 5: Check Network and DNS
Investigate network connectivity issues.
```bash
# On the node:

# Test API server connectivity
curl -k https://<api-server-address>/healthz

# From a machine with cluster credentials:
kubectl get --raw='/healthz'

# Check DNS resolution
nslookup myakscluster.hcp.eastus.azmk8s.io
dig myakscluster.hcp.eastus.azmk8s.io

# Check resolv.conf
cat /etc/resolv.conf

# Test DNS against Azure's virtual DNS IP
nslookup myakscluster.hcp.eastus.azmk8s.io 168.63.129.16

# Check network interfaces
ip addr
ip route

# Check iptables
sudo iptables -L -n -v | head -50

# Check if CNI is configured
ls -la /etc/cni/net.d/
cat /etc/cni/net.d/*.conflist

# Verify Azure CNI binaries
ls -la /opt/cni/bin/

# Check pod network namespaces
ip netns list

# Test outbound connectivity
curl -I https://mcr.microsoft.com
ping -c 3 8.8.8.8

# Check for network policies blocking traffic (from a machine with cluster credentials)
kubectl get networkpolicies --all-namespaces

# Check if node can reach Azure services
curl -I https://management.azure.com

# For a private cluster, test the API server's private IP:
curl -k https://<api-server-private-ip>/healthz
```
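Rather than eyeballing each command, the checks above can be wrapped in a tiny pass/fail reporter — a sketch; `probe` is a hypothetical helper and the probed endpoints mirror the ones used above:

```bash
# Sketch: run a command silently and report PASS/FAIL with a label.
probe() {
  desc=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
  fi
}

# Usage on the node (substitute your cluster's FQDN):
#   probe "API server /healthz"  curl -ks --max-time 5 https://<api-server-address>/healthz
#   probe "outbound HTTPS"       curl -Is --max-time 5 https://mcr.microsoft.com
#   probe "Azure virtual DNS"    nslookup mcr.microsoft.com 168.63.129.16
```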
## Step 6: Check Resource Pressure
Investigate resource exhaustion issues.
```bash
# On the node:

# Check memory
free -h
cat /proc/meminfo | head -20

# Check for OOM kills
dmesg | grep -i "out of memory"
journalctl -k | grep -i "oom"

# Check disk usage
df -h

# Check inode usage
df -i

# Check large directories
du -sh /var/lib/containerd/* 2>/dev/null
du -sh /var/lib/kubelet/* 2>/dev/null
du -sh /var/log/* 2>/dev/null

# Check for disk pressure
grep -i "disk.*pressure" /var/log/syslog
journalctl -u kubelet | grep -i "pressure"

# Check CPU load
top -bn1 | head -20
uptime

# Count zombie processes
ps aux | awk '$8 ~ /Z/ {print $2}' | wc -l

# Check running pods on the node
crictl pods

# Check container resource usage
crictl stats

# Clean up if the disk is full:
# Remove unused container images
crictl rmi --prune

# Remove stopped containers
crictl rm $(crictl ps -aq) 2>/dev/null

# Clean up old logs (use with care)
find /var/log -type f -name "*.log" -mtime +7 -delete
journalctl --vacuum-time=3d
```
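As a rule of thumb for when cleanup is urgent: by default kubelet starts evicting pods when `nodefs.available` drops below 10%, i.e. above roughly 90% usage. A minimal sketch of that check — the 90 is an assumption matching the kubelet default, and `needs_cleanup` is a hypothetical helper; adjust if your node pool overrides eviction thresholds:

```bash
# Sketch: flag a filesystem that is near kubelet's default eviction
# threshold (nodefs.available < 10%, i.e. more than 90% used).
needs_cleanup() {
  used_pct=$1   # integer percent used, no '%' sign
  [ "$used_pct" -ge 90 ]
}

# Usage on the node:
#   pct=$(df --output=pcent /var/lib/containerd | tail -1 | tr -d ' %')
#   needs_cleanup "$pct" && echo "disk pressure likely - run the cleanup commands above"
```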
## Step 7: Restart or Reimage Node
Try restarting or reimaging the problematic node.
```bash
# Method 1: Restart kubelet (from the node)
sudo systemctl restart kubelet

# Check if the node becomes Ready
kubectl get node aks-agentpool-12345678-vmss000000 -w

# Method 2: Reboot the VM
az vmss restart -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss --instance-ids 0

# Wait for the node to come back
kubectl get nodes -w

# Method 3: Reimage the VM (fresh OS)
az vmss reimage -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss --instance-ids 0

# Method 4: Replace the node. Note: deleting the Node object alone does
# not remove the underlying VM - also delete the VMSS instance
kubectl delete node aks-agentpool-12345678-vmss000000
az vmss delete-instances -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss --instance-ids 0

# Check that a replacement node appears
kubectl get nodes -w

# Method 5: Scale down and up
az aks scale -g myResourceGroup -n myAKSCluster --node-count 2
# Wait for scale down
az aks scale -g myResourceGroup -n myAKSCluster --node-count 3

# Method 6: Upgrade the node pool (replaces nodes; add --node-image-only
# to refresh node images without changing the Kubernetes version)
az aks nodepool upgrade -g myResourceGroup --cluster-name myAKSCluster -n agentpool --kubernetes-version 1.27.7

# Check node status after restart
kubectl get nodes
kubectl describe node aks-agentpool-12345678-vmss000000
```
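The methods above run from least to most disruptive, and it usually pays to escalate in that order. A sketch of the ladder — `next_remediation` is a hypothetical helper, and the printed commands mirror the ones above:

```bash
# Sketch: print the next remediation to try, least disruptive first.
next_remediation() {
  case "$1" in
    1) echo "restart kubelet on the node (sudo systemctl restart kubelet)" ;;
    2) echo "reboot the VM (az vmss restart --instance-ids <id>)" ;;
    3) echo "reimage the VM (az vmss reimage --instance-ids <id>)" ;;
    4) echo "replace the node (delete Node object and VMSS instance, or scale the pool)" ;;
    *) echo "escalate: investigate bootstrap and control-plane connectivity, or open an Azure support request" ;;
  esac
}

# Usage: next_remediation 2
```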
## Step 8: Check Node Bootstrap Issues
Investigate issues with new nodes not joining cluster.
```bash
# Check AKS cluster provisioning state
az aks show -g myResourceGroup -n myAKSCluster --query "provisioningState"

# Check node pool status
az aks nodepool list -g myResourceGroup --cluster-name myAKSCluster -o table

# Check VMSS provisioning state
az vmss show -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss --query "provisioningState"

# Check for failed VM extensions
az vmss extension list -g MC_myResourceGroup_myAKSCluster_eastus --vmss-name aks-agentpool-12345678-vmss -o table

# Check custom script extension logs (on the node via SSH)
ls -la /var/lib/waagent/custom-script/download/
cat /var/lib/waagent/custom-script/download/0/stdout
cat /var/lib/waagent/custom-script/download/0/stderr

# Check boot diagnostics for startup issues (if boot diagnostics is enabled)
az vm boot-diagnostics get-boot-log-uris -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss000000

# Check if the subnet has enough free IPs (each node and, with Azure CNI,
# each pod consumes an IP configuration)
az network vnet subnet show -g MC_myResourceGroup_myAKSCluster_eastus --vnet-name aks-vnet -n aks-subnet --query "ipConfigurations | length(@)"

# Check NSG rules
az network nsg show -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-nsg --query "securityRules"
```
## Step 9: Check Control Plane Connectivity
Verify nodes can communicate with the control plane.
```bash
# Get API server address
az aks show -g myResourceGroup -n myAKSCluster --query "fqdn" -o tsv

# Test from your local machine:
curl -k https://myakscluster.hcp.eastus.azmk8s.io/healthz

# Check whether this is a private cluster
az aks show -g myResourceGroup -n myAKSCluster --query "apiServerAccessProfile.enablePrivateCluster"

# If private, check the private endpoint in the node resource group
az network private-endpoint list -g MC_myResourceGroup_myAKSCluster_eastus -o table

# Check if API server authorized IP ranges are blocking you
az aks show -g myResourceGroup -n myAKSCluster --query "apiServerAccessProfile.authorizedIpRanges"

# Get your public IP and compare it against the authorized ranges
curl -s ifconfig.me

# Check control plane power state
az aks show -g myResourceGroup -n myAKSCluster --query "powerState"

# Check addon status
az aks show -g myResourceGroup -n myAKSCluster --query "addonProfiles"

# For Azure platform incidents, check Azure Service Health in the portal
```
## Step 10: Implement Monitoring and Prevention
Set up monitoring and alerts for node health.
```bash
# Enable Azure Monitor for containers (if not enabled)
az aks enable-addons -a monitoring -g myResourceGroup -n myAKSCluster

# Create a metric alert on sustained high node CPU, a common precursor to NotReady:
az monitor metrics alert create \
  -n "aks-node-high-cpu-alert" \
  -g myResourceGroup \
  --scopes /subscriptions/<sub-id>/resourcegroups/MC_myResourceGroup_myAKSCluster_eastus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-agentpool-12345678-vmss \
  --condition "avg Percentage CPU > 95" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action <action-group-id>

# Create a simple in-cluster NotReady check (requires a service account with
# permission to list nodes, and an image that includes jq)
cat > node-health-monitor.yaml << 'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-health-check
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: node-checker
          containers:
          - name: checker
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              NOT_READY=$(kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name')
              if [ -n "$NOT_READY" ]; then
                echo "WARNING: NotReady nodes detected: $NOT_READY"
                exit 1
              fi
          restartPolicy: OnFailure
EOF

kubectl apply -f node-health-monitor.yaml

# Set up node problem detector
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml

# Enable automatic upgrades (AKS node auto-repair is on by default)
az aks update -g myResourceGroup -n myAKSCluster --auto-upgrade-channel stable

# Configure surge capacity for node pool upgrades
az aks nodepool update -g myResourceGroup --cluster-name myAKSCluster -n agentpool --max-surge 10%
```
## Checklist for Fixing AKS NotReady Nodes
| Step | Action | Command | Status |
|---|---|---|---|
| 1 | Diagnose node status | kubectl describe node <node-name> | ☐ |
| 2 | Check Azure VM status | az vmss list-instances | ☐ |
| 3 | Access node for troubleshooting | kubectl debug node/<node> | ☐ |
| 4 | Check kubelet and container runtime | systemctl status kubelet | ☐ |
| 5 | Check network and DNS | curl -k https://api-server/healthz | ☐ |
| 6 | Check resource pressure | free -h, df -h | ☐ |
| 7 | Restart or reimage node | az vmss restart or kubectl delete node | ☐ |
| 8 | Check node bootstrap issues | Check VM extensions and logs | ☐ |
| 9 | Check control plane connectivity | Test API server from node | ☐ |
| 10 | Implement monitoring | Enable Azure Monitor for containers | ☐ |
## Verify the Fix
After fixing NotReady node issues:
```bash
# 1. All nodes show Ready status
kubectl get nodes
# All should show Ready

# 2. Node conditions are healthy
kubectl describe node aks-agentpool-12345678-vmss000000 | grep -A5 Conditions:
# Ready should be True

# 3. Pods can be scheduled
kubectl run test --image=nginx --restart=Never
kubectl get pods -w

# 4. No resource pressure
kubectl describe node aks-agentpool-12345678-vmss000000 | grep -E "MemoryPressure|DiskPressure"
# Should show False

# 5. Kubelet is running (on the node)
systemctl status kubelet
# Should show active (running)

# 6. Container runtime is working (on the node)
crictl ps
# Should list containers

# 7. Network connectivity works
curl -k https://<api-server>/healthz
# Should return ok

# 8. DNS resolution works (from inside a pod)
nslookup kubernetes.default
# Should resolve

# 9. Azure VM is healthy
az vmss list-instances -g MC_myResourceGroup_myAKSCluster_eastus -n aks-agentpool-12345678-vmss --query "[].provisioningState"
# Should show Succeeded

# 10. Monitoring shows healthy
# Check Azure Monitor for containers in the portal
```
## Related Issues
- [Fix Kubernetes Namespace Terminating](/articles/fix-kubernetes-namespace-terminating) - Namespace stuck
- [Fix Kubernetes Horizontal Pod Autoscaler Not Scaling](/articles/fix-kubernetes-horizontal-pod-autoscaler-not-scaling) - HPA issues
- [Fix Azure AKS Node Not Ready](/articles/fix-azure-aks-node-not-ready) - Similar node issues
- [Fix AWS EKS Node Not Ready](/articles/fix-aws-eks-node-not-ready) - AWS equivalent
- [Fix GCP GKE Node Not Ready](/articles/fix-gcp-gke-node-not-ready) - GCP equivalent
- [Fix Istio Sidecar Injection Not Working](/articles/fix-istio-sidecar-injection-not-working) - Service mesh issues
- [Fix Prometheus Remote Write Failing](/articles/fix-prometheus-remote-write-failing) - Monitoring issues