What's Actually Happening

An Azure Kubernetes Service (AKS) node reports `NotReady`, so the scheduler stops placing new pods on it. The node is still registered with the cluster, but the control plane has marked it unhealthy.

The Error You'll See

```bash
$ kubectl get nodes

NAME                                STATUS     ROLES   AGE   VERSION
aks-agentpool-12345678-vmss000000   NotReady   agent   5d    v1.28.0
```
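On a larger cluster it helps to list only the unhealthy nodes. The following is a minimal sketch, not an official tool: `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second).

```bash
# not_ready_nodes: read `kubectl get nodes --no-headers` output on stdin and
# print the name of every node whose STATUS does not start with "Ready".
# STATUS may carry a suffix such as "Ready,SchedulingDisabled", so only the
# part before the first comma is compared.
not_ready_nodes() {
  awk 'NF >= 2 { split($2, s, ","); if (s[1] != "Ready") print $1 }'
}
```

Usage: `kubectl get nodes --no-headers | not_ready_nodes`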

Node conditions:

```bash
$ kubectl describe node aks-agentpool-12345678-vmss000000

Conditions:
  Ready   Unknown   NodeStatusUnknown   Kubelet stopped posting node status.
```

Why This Happens

  1. Kubelet crash: the kubelet process on the node has stopped.
  2. Network partition: the node cannot reach the API server.
  3. VM extension failure: an Azure VM extension failed to provision.
  4. Resource exhaustion: the node is out of memory or disk space.
  5. Outbound connectivity: the node cannot reach Azure APIs.

Step 1: Diagnose Node Status

```bash
# Check node details:
kubectl describe node aks-agentpool-12345678-vmss000000

# Check node conditions:
kubectl get node aks-agentpool-12345678-vmss000000 -o jsonpath='{.status.conditions}' | jq

# Check events:
kubectl get events --field-selector involvedObject.name=aks-agentpool-12345678-vmss000000

# Check pods on node:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=aks-agentpool-12345678-vmss000000
```
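The raw conditions JSON can be hard to scan. As a rough sketch, the filter below flags any condition that is not in its healthy state. It assumes the standard convention that `Ready` should be `True` and every other condition (for example `MemoryPressure` or `DiskPressure`) should be `False`. The helper name `flag_conditions` is hypothetical.

```bash
# flag_conditions: read "Type=Status" lines on stdin and report conditions
# that are not in their healthy state.
# Healthy means Ready=True and <anything else>=False.
flag_conditions() {
  awk -F= '
    NF < 2                         { next }
    $1 == "Ready" && $2 != "True"  { print "PROBLEM: " $1 " is " $2; next }
    $1 != "Ready" && $2 != "False" { print "PROBLEM: " $1 " is " $2 }
  '
}
```

Usage: `kubectl get node aks-agentpool-12345678-vmss000000 -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | flag_conditions`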

Step 2: Access Node via SSH

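```bash
# Run a command on the node without SSH, using the VM scale set run-command feature
# (the scale set lives in the MC_ node resource group):
az vmss run-command invoke \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --name aks-agentpool-12345678-vmss \
  --instance-id 0 \
  --command-id RunShellScript \
  --scripts "systemctl status kubelet"

# SSH to node (requires network access to the node IP):
ssh -i ~/.ssh/id_rsa azureuser@<node-ip>

# Check kubelet:
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 100
```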
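The last 100 kubelet log lines are mostly informational. A small filter, sketched below, keeps only error-severity entries. It assumes the default `journalctl` short output format and the standard klog prefix, where error lines start with `E` followed by the month and day (for example `E0101`). The helper name `kubelet_errors` is hypothetical.

```bash
# kubelet_errors: read `journalctl -u kubelet` output on stdin and print only
# klog error lines (severity letter "E" followed by a four-digit MMDD date).
# `|| true` keeps the exit status zero when nothing matches.
kubelet_errors() {
  grep -E 'kubelet\[[0-9]+\]: E[0-9]{4} ' || true
}
```

Usage on the node: `sudo journalctl -u kubelet -n 500 --no-pager | kubelet_errors`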

Step 3: Fix Kubelet Issues

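```bash
# Restart kubelet:
sudo systemctl restart kubelet

# Check the container runtime, and restart it if it is not active:
sudo systemctl status containerd
sudo systemctl restart containerd

# Check disk space:
df -h

# Check API server connectivity from the node.
# kubernetes.default resolves only inside pods, so use the cluster's API server FQDN
# (az aks show --resource-group myResourceGroup --name myAKSCluster --query fqdn -o tsv):
curl -k https://<api-server-fqdn>/healthz
```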
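Reading `df -h` by eye is easy to get wrong on nodes with many mounts. The sketch below flags filesystems at or above a usage threshold. It defaults to 85, matching the checklist later in this article. It assumes the POSIX `df -P` column layout (capacity in column 5, mount point in column 6) and mount points without spaces. `disks_over` is a hypothetical helper name.

```bash
# disks_over [threshold]: read `df -P` output on stdin and print
# "<mount-point> <use%>" for every filesystem at or above the threshold
# (default 85). The header line is skipped.
disks_over() {
  awk -v t="${1:-85}" 'NR > 1 {
    u = $5
    sub(/%/, "", u)              # strip the percent sign before comparing
    if (u + 0 >= t + 0) print $6, $5
  }'
}
```

Usage on the node: `df -P | disks_over 85`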

Step 4: Check Azure VM Extension

```bash
# Check extension status:
az vmss extension list --resource-group MC_myResourceGroup_myAKSCluster_eastus --vmss-name aks-agentpool-12345678-vmss

# Reimage node if needed (this resets the node's OS disk):
az vmss reimage --resource-group MC_myResourceGroup_myAKSCluster_eastus --name aks-agentpool-12345678-vmss --instance-ids 0
```
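The extension list is verbose JSON. One way to narrow it down, sketched here under the assumption that each extension exposes a `provisioningState` field, is to query only the name and state and keep the entries that did not succeed. `failed_extensions` is a hypothetical helper name.

```bash
# failed_extensions: read tab-separated "<name><TAB><provisioningState>" lines
# on stdin and print every extension whose state is not "Succeeded".
failed_extensions() {
  awk -F'\t' 'NF >= 2 && $2 != "Succeeded" { print $1 " (" $2 ")" }'
}
```

Usage: `az vmss extension list --resource-group MC_myResourceGroup_myAKSCluster_eastus --vmss-name aks-agentpool-12345678-vmss --query "[].[name, provisioningState]" -o tsv | failed_extensions`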

Step 5: Fix Network Connectivity

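```bash
# Check outbound connectivity to Azure APIs from the node:
az vmss run-command invoke \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --name aks-agentpool-12345678-vmss --instance-id 0 \
  --command-id RunShellScript \
  --scripts "curl -I https://management.azure.com"

# Check DNS resolution of the API server FQDN from the node
# (kubernetes.default resolves only inside pods):
az vmss run-command invoke \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --name aks-agentpool-12345678-vmss --instance-id 0 \
  --command-id RunShellScript \
  --scripts "nslookup <api-server-fqdn>"
```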
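When scripting the connectivity check, the HTTP status code is easier to act on than the full header dump. The sketch below interprets the code that `curl -s -o /dev/null -w '%{http_code}' --max-time 10 <url>` prints. curl reports `000` when it never received an HTTP response. Any other code, even an error such as 401, shows that DNS, routing, and TLS worked. `classify_http_code` is a hypothetical helper name.

```bash
# classify_http_code <code>: interpret the status code printed by
#   curl -s -o /dev/null -w '%{http_code}' --max-time 10 <url>
# "000" (or an empty value) means no HTTP response was received at all.
classify_http_code() {
  case "$1" in
    000|"") echo "unreachable" ;;
    *)      echo "reachable (HTTP $1)" ;;
  esac
}
```

Usage on the node: `classify_http_code "$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 https://management.azure.com)"`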

Step 6: Drain and Recreate Node

```bash
# Cordon node:
kubectl cordon aks-agentpool-12345678-vmss000000

# Drain node (--force also removes pods not managed by a controller):
kubectl drain aks-agentpool-12345678-vmss000000 --ignore-daemonsets --delete-emptydir-data --force

# Delete node:
kubectl delete node aks-agentpool-12345678-vmss000000

# Delete VM instance:
az vmss delete-instances --resource-group MC_myResourceGroup_myAKSCluster_eastus --name aks-agentpool-12345678-vmss --instance-ids 0

# Watch for a replacement node (if none appears, scale the pool as in Step 7):
kubectl get nodes -w
```
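The commands above pair the node `aks-agentpool-12345678-vmss000000` with instance ID `0`. For other nodes, the sketch below derives the scale set name and instance ID from the node name. It rests on an assumption you should verify for your cluster: that the six characters after `vmss` are the instance ID encoded in base 36. You can confirm the mapping with `az vmss list-instances --resource-group MC_myResourceGroup_myAKSCluster_eastus --name aks-agentpool-12345678-vmss --query "[].{id:instanceId, host:osProfile.computerName}" -o table`. `node_to_instance` is a hypothetical helper name.

```bash
# node_to_instance <node-name>: print "<vmss-name> <instance-id>".
# ASSUMPTION: the last six characters of the node name are the VMSS instance
# ID in base 36 (digits 0-9, then letters a-z). Verify before deleting anything.
node_to_instance() {
  printf '%s\n' "$1" | awk '
    {
      name   = $0
      suffix = tolower(substr(name, length(name) - 5))  # last six characters
      vmss   = substr(name, 1, length(name) - 6)        # everything before them
      digits = "0123456789abcdefghijklmnopqrstuvwxyz"
      id = 0
      for (i = 1; i <= length(suffix); i++)
        id = id * 36 + index(digits, substr(suffix, i, 1)) - 1
      print vmss, id
    }'
}
```

Usage: `node_to_instance aks-agentpool-12345678-vmss000000` prints `aks-agentpool-12345678-vmss 0`.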

Step 7: Scale Node Pool

```bash
# Scale up:
az aks nodepool scale --resource-group myResourceGroup --cluster-name myAKSCluster --name agentpool --node-count 5
```

Step 8: Monitor Node Health

```bash
# Check node status:
kubectl get nodes -o wide

# Check resource usage:
kubectl top nodes

# Check events:
kubectl get events --field-selector involvedObject.kind=Node
```

AKS Node Troubleshooting Checklist

| Check | Command | Expected |
|---|---|---|
| Node status | `kubectl get nodes` | Ready |
| Kubelet | `systemctl status kubelet` | running |
| Disk space | `df -h` | < 85% used |
| Network | curl API | Connected |

Verify the Fix

```bash
# Check node Ready:
kubectl get nodes
# Output: STATUS Ready

# Verify pods running:
kubectl get pods -o wide
# Output: Pods scheduled
```
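Instead of re-running `kubectl get nodes` by hand, you can block until the node reports Ready. The wrapper below is a small sketch around `kubectl wait`. The function name `wait_node_ready` and the 300-second default timeout are illustrative choices, not values from AKS.

```bash
# wait_node_ready <node-name> [timeout]: block until the node's Ready
# condition is True, or fail when the timeout (default 300s) expires.
wait_node_ready() {
  kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-300s}"
}
```

Usage: `wait_node_ready aks-agentpool-12345678-vmss000000 600s`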

  • [Fix Azure AKS Authentication Failed](/articles/fix-azure-aks-cluster-authentication-failed)
  • [Fix Kubernetes Node Not Ready](/articles/fix-kubernetes-node-not-ready)