Your pods aren't being scheduled on certain nodes, and you suspect taints are blocking them. Taints and tolerations work together to repel pods from nodes unless pods explicitly tolerate the taint. This mechanism is useful for dedicating nodes to specific workloads, but misconfiguration can prevent pods from scheduling.

Understanding Taints and Tolerations

Taints are applied to nodes and consist of a key, an optional value, and an effect. Tolerations are applied to pods and allow them to schedule on nodes with matching taints. Without a matching toleration, a pod won't schedule onto (or won't remain on) a tainted node, depending on the taint effect.

Taint effects:

- NoSchedule: new pods won't schedule (existing pods are unaffected)
- NoExecute: new pods won't schedule and existing pods are evicted
- PreferNoSchedule: the scheduler tries to avoid the node, but it's not guaranteed
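The matching rule itself is simple enough to sketch in a few lines of Python. This is a simplified model of the Kubernetes API semantics, not the scheduler's actual code: an empty key with `operator: Exists` matches every taint, an empty effect matches any effect, and `Equal` additionally compares values.

```python
# Simplified model of toleration matching (not the actual scheduler code).
def tolerates(toleration: dict, taint: dict) -> bool:
    # Empty key with operator Exists matches all taints.
    if not toleration.get("key") and toleration.get("operator") == "Exists":
        return True
    if toleration.get("key") != taint["key"]:
        return False
    # An empty toleration effect matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value") == taint.get("value")

taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
print(tolerates({"key": "dedicated", "operator": "Equal",
                 "value": "gpu", "effect": "NoSchedule"}, taint))  # True
print(tolerates({"key": "dedicated", "operator": "Exists",
                 "effect": "NoSchedule"}, taint))                  # True
print(tolerates({"key": "dedicated", "operator": "Equal",
                 "value": "gpu", "effect": "NoExecute"}, taint))   # False: effect mismatch
print(tolerates({"operator": "Exists"}, taint))                    # True: matches all taints
```

The effect-mismatch case in the third call is the most common real-world surprise: key and value line up, but the toleration still doesn't match.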

Diagnosis Commands

Check node taints:

```bash
# List all nodes with taints
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'

# Check specific node taints
kubectl describe node node-name | grep -A 5 Taints

# Get detailed taint information
kubectl get node node-name -o jsonpath='{.spec.taints}'
```

Check pod tolerations:

```bash
# Check pod tolerations
kubectl get pod pod-name -n namespace -o yaml | grep -A 20 tolerations

# Get tolerations in JSON
kubectl get pod pod-name -n namespace -o jsonpath='{.spec.tolerations}'

# List pods with tolerations
kubectl get pods -n namespace -o custom-columns='NAME:.metadata.name,TOLERATIONS:.spec.tolerations'
```

Check scheduling events:

```bash
# Check events for taint-related scheduling failures
kubectl describe pod pending-pod -n namespace | grep -A 10 "Events:"
kubectl get events -n namespace | grep -i "taint\|node(s) had taint"
```
Common Solutions

Solution 1: Add Tolerations to Pods

Pods need tolerations to schedule on tainted nodes:

```bash
# Check node taint
kubectl describe node node-name | grep Taints
# Example: dedicated=gpu:NoSchedule
```

Add matching toleration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: app
    image: myimage
```

Multiple tolerations:

```yaml
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300  # Stay for 5 minutes when node not ready
```

Solution 2: Remove Node Taints

Remove taints that are blocking normal pods:

```bash
# Remove a specific taint (note the trailing hyphen)
kubectl taint nodes node-name key=value:effect-

# Example: Remove GPU taint
kubectl taint nodes node-name dedicated=gpu:NoSchedule-

# Remove all taints with a key
kubectl taint nodes node-name dedicated-
```

Check taint removal:

```bash
# Verify taint is removed
kubectl describe node node-name | grep Taints
```

Solution 3: Use Exists Operator

The Exists operator matches any taint with the given key, regardless of its value:

```yaml
tolerations:
- key: "dedicated"
  operator: "Exists"
  effect: "NoSchedule"
# This tolerates dedicated=anything:NoSchedule
```

Tolerate all taints (dangerous):

```yaml
tolerations:
- operator: "Exists"
# Matches all taints - pod can schedule anywhere
```

Solution 4: Fix Toleration Effect Mismatch

The toleration effect must match the taint effect exactly:

```yaml
# Node taint: dedicated=gpu:NoExecute

# Wrong toleration:
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"  # Doesn't match NoExecute

# Correct toleration:
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoExecute"
```

Tolerate multiple effects:

```yaml
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoExecute"
```

Solution 5: Handle Built-in Taints

Kubernetes adds built-in taints for node conditions:

```bash
# Common built-in taints:
# node.kubernetes.io/not-ready: NoExecute
# node.kubernetes.io/unreachable: NoExecute
# node.kubernetes.io/memory-pressure: NoSchedule
# node.kubernetes.io/disk-pressure: NoSchedule
# node.kubernetes.io/pid-pressure: NoSchedule
# node.kubernetes.io/network-unavailable: NoSchedule
# node.kubernetes.io/unschedulable: NoSchedule
```

Add tolerations for built-in taints:

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300  # Stay 5 minutes when node not ready
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/memory-pressure"
  operator: "Exists"
  effect: "NoSchedule"
```

DaemonSet pods should tolerate all built-in taints:

```yaml
# DaemonSet tolerations for system components
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
- key: "node.kubernetes.io/disk-pressure"
  operator: "Exists"
- key: "node.kubernetes.io/memory-pressure"
  operator: "Exists"
- key: "node.kubernetes.io/pid-pressure"
  operator: "Exists"
- key: "node.kubernetes.io/unschedulable"
  operator: "Exists"
# Omitting effect matches all effects for that key
```

Solution 6: Add Node Taints

Add taints to dedicate nodes:

```bash
# Add taint to node
kubectl taint nodes node-name key=value:effect

# Example: Dedicate node for GPU workloads
kubectl taint nodes gpu-node dedicated=gpu:NoSchedule

# Example: Reserve node for monitoring
kubectl taint nodes monitor-node purpose=monitoring:NoSchedule
```

Solution 7: Use TolerationSeconds

tolerationSeconds limits how long a pod may remain bound to a node after a matching NoExecute taint is added:

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300  # Pod evicted after 5 minutes
```

Omitting tolerationSeconds means the pod tolerates the taint forever:

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
# Pod stays indefinitely when node not ready
```
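The timing rule above can be sketched as simple arithmetic (a model of the documented behavior, not kubelet code): once a matching NoExecute taint appears, a pod with tolerationSeconds is evicted that many seconds later, and a pod without it stays indefinitely.

```python
from typing import Optional

def eviction_time(taint_added_at: float,
                  toleration_seconds: Optional[int]) -> Optional[float]:
    """Return the eviction timestamp, or None if the pod is never evicted."""
    if toleration_seconds is None:
        return None  # no tolerationSeconds: tolerated indefinitely
    return taint_added_at + toleration_seconds

print(eviction_time(1000.0, 300))   # 1300.0 -> evicted 5 minutes after taint added
print(eviction_time(1000.0, None))  # None   -> pod stays as long as it tolerates the taint
```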

Solution 8: Fix Control Plane Node Access

Control plane nodes have taints by default:

```bash
# Check control plane taints
kubectl describe node control-plane-node | grep Taints
# Usually: node-role.kubernetes.io/control-plane:NoSchedule
```

Add toleration for control plane:

```yaml
tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"
  effect: "NoSchedule"
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule"  # Legacy taint name
```

Remove control plane taint (if needed):

```bash
# Remove control plane taint to allow all pods
kubectl taint nodes control-plane-node node-role.kubernetes.io/control-plane:NoSchedule-
kubectl taint nodes control-plane-node node-role.kubernetes.io/master:NoSchedule-
```

Solution 9: Debug Scheduling Failure

Check why pod isn't scheduling:

```bash
# Describe pending pod
kubectl describe pod pending-pod -n namespace

# Look for taint-related messages
kubectl get events -n namespace | grep -i "taint\|node(s) had taint"

# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-master | grep -i taint
```

Solution 10: Check All Nodes' Taints

Comprehensive taint check:

```bash
# List all taints in cluster
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'

# Or without jq
for node in $(kubectl get nodes -o name); do
  echo "$node:"
  kubectl get "$node" -o jsonpath='{.spec.taints}'
  echo ""
done
```
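If neither jq nor custom-columns is convenient, the same summary can be produced from `kubectl get nodes -o json` output with a short Python script. This is a sketch: the node names and taints in the sample data are made up for illustration.

```python
import json  # used when feeding live `kubectl get nodes -o json` output

def taint_summary(nodes_json: dict) -> dict:
    """Map node name -> list of 'key=value:effect' strings."""
    out = {}
    for item in nodes_json.get("items", []):
        name = item["metadata"]["name"]
        taints = item.get("spec", {}).get("taints") or []
        out[name] = [f"{t['key']}={t.get('value', '')}:{t['effect']}" for t in taints]
    return out

# In a real cluster you would feed it live output, e.g.:
#   raw = subprocess.run(["kubectl", "get", "nodes", "-o", "json"],
#                        capture_output=True, text=True, check=True).stdout
#   print(taint_summary(json.loads(raw)))

# Sample data shaped like `kubectl get nodes -o json` (hypothetical nodes):
sample = {"items": [
    {"metadata": {"name": "gpu-node"},
     "spec": {"taints": [{"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}]}},
    {"metadata": {"name": "worker-1"}, "spec": {}},
]}
print(taint_summary(sample))
# {'gpu-node': ['dedicated=gpu:NoSchedule'], 'worker-1': []}
```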

Verification

After fixing taint/toleration issues:

```bash
# Verify node taints
kubectl describe node node-name | grep Taints

# Verify pod tolerations
kubectl get pod pod-name -n namespace -o yaml | grep -A 20 tolerations

# Check pod scheduled on expected node
kubectl get pod pod-name -n namespace -o wide

# Verify pod is running
kubectl get pods -n namespace
```

Taint and Toleration Quick Reference

Common Taint Patterns

| Taint | Purpose | Typical Toleration |
|---|---|---|
| node-role.kubernetes.io/control-plane:NoSchedule | Protect control plane | System components only |
| dedicated=group:NoSchedule | Dedicated node pool | Pods for that group |
| special-hardware=gpu:NoSchedule | GPU nodes | GPU workloads |
| node.kubernetes.io/not-ready:NoExecute | Node unhealthy | Critical pods with tolerationSeconds |
| node.kubernetes.io/memory-pressure:NoSchedule | Memory pressure | Critical pods only |

Toleration Operator Patterns

| Operator | Matches | Example |
|---|---|---|
| Equal | Exact key=value:effect | key=x, value=y, effect=NoSchedule |
| Exists | Any value with key+effect | key=x, effect=NoSchedule |
| Exists (no key) | All taints | operator: Exists |

Taint/Toleration Issues Summary

| Issue | Check | Solution |
|---|---|---|
| Pod can't schedule on node | kubectl describe pod | Add matching toleration |
| Effect mismatch | kubectl describe pod/node | Match effect exactly |
| Key mismatch | kubectl describe pod/node | Match key exactly |
| Control plane blocked | kubectl describe node | Add control-plane toleration |
| Built-in taint blocking | Check node conditions | Add built-in tolerations |
| Pod evicted (NoExecute) | kubectl get events | Check tolerationSeconds |
| All pods blocked on node | kubectl describe node | Remove node taint |
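The first three rows of the table can be combined into a single diagnostic: compare a pending pod's tolerations against a node's taints and report every taint that nothing tolerates. The sketch below uses the same simplified matching rule as the Kubernetes API (empty key plus Exists matches everything; empty effect matches any effect); the taints and tolerations in it are hypothetical, and real data would come from `kubectl get ... -o json`.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    # Simplified model of the API matching rule.
    if not toleration.get("key") and toleration.get("operator") == "Exists":
        return True
    if toleration.get("key") != taint["key"]:
        return False
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value") == taint.get("value")

def blocking_taints(node_taints: list, pod_tolerations: list) -> list:
    """Taints on the node that no pod toleration matches."""
    return [t for t in node_taints
            if not any(tolerates(tol, t) for tol in pod_tolerations)]

# Hypothetical example: GPU node; pod tolerates only the control-plane taint.
node_taints = [
    {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"},
    {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"},
]
pod_tolerations = [
    {"key": "node-role.kubernetes.io/control-plane", "operator": "Exists",
     "effect": "NoSchedule"},
]
for t in blocking_taints(node_taints, pod_tolerations):
    print(f"blocked by {t['key']}={t.get('value', '')}:{t['effect']}")
# blocked by dedicated=gpu:NoSchedule
```

The output points straight at the toleration to add (or the taint to remove).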

Prevention Best Practices

- Document all taints and their purpose.
- Add appropriate tolerations to pods that need specific nodes.
- Use tolerationSeconds for critical pods on unhealthy nodes.
- Be careful with operator: Exists without a key.
- Test taint changes before applying to production.
- Monitor pods pending due to taints.
- Keep control plane taints unless explicitly needed otherwise.

Taint and toleration issues are straightforward once you see the exact taint on the node: add a matching toleration to your pod, or remove the taint if it's blocking pods that should schedule there.