## Introduction
Cilium identity exhaustion occurs when the Cilium agent cannot allocate new security identities for endpoints, preventing pod scheduling or network connectivity. Cilium assigns a unique security identity (numeric ID) to each endpoint based on its security labels (namespace, service account, pod labels). When identity pools are exhausted, endpoints enter not-ready state, pods cannot be scheduled, or network policies stop enforcing. Common causes include IPAM pool depletion, identity cache limits, stale identities not being garbage collected, or kvstore synchronization failures in cluster mode.
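Conceptually, identity allocation is a map from a canonical label set to a numeric ID drawn from a bounded pool: endpoints with identical label sets share one identity, and a brand-new label set consumes a new ID. A toy sketch of that behavior (the class, names, and limits are illustrative, not Cilium's actual implementation):

```python
def label_key(labels: dict) -> str:
    """Canonical key: sorted key=value pairs, so label order never matters."""
    return ",".join(f"{k}={v}" for k, v in sorted(labels.items()))

class IdentityAllocator:
    """Toy model of a bounded identity pool (illustrative only)."""

    def __init__(self, max_identities: int):
        self.max = max_identities
        self.by_key = {}      # canonical label key -> numeric identity
        self.next_id = 256    # low IDs reserved (e.g. reserved:host)

    def allocate(self, labels: dict) -> int:
        key = label_key(labels)
        if key in self.by_key:
            return self.by_key[key]  # same label set: reuse, no new identity
        if len(self.by_key) >= self.max:
            # This is the "identity exhaustion" failure mode
            raise RuntimeError("identity pool exhausted")
        self.by_key[key] = self.next_id
        self.next_id += 1
        return self.by_key[key]

alloc = IdentityAllocator(max_identities=2)
a = alloc.allocate({"app": "web", "namespace": "prod"})
b = alloc.allocate({"namespace": "prod", "app": "web"})
assert a == b  # key order is irrelevant; one identity is shared
```

The model also shows why churning label values are dangerous: every new value mints a new key, so the pool drains even when the pod count stays flat.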
## Symptoms
- `cilium status` shows `Identity allocation: FAILED` or `KVstore: KVStore is locked`
- Pods stuck in `ContainerCreating` or `Pending` state
- Cilium agent logs show `Error allocating identity`, `identity cache full`, or `no available IPs`
- `cilium endpoint list` shows endpoints in `waiting-for-identity` state
- New pods cannot communicate even after reaching `Running` state
- Issue appears after a scaling event, node failure, or Cilium upgrade
- `cilium-health` shows connectivity failures between nodes
## Common Causes
- IPAM pool exhausted (no available IPs for new endpoints)
- Identity cache limit reached (`--identity-allocation-refresh` misconfigured)
- Stale identities not garbage collected (dead endpoints holding identities)
- Kvstore (etcd/Consul) unreachable or synchronization failed
- Cluster running with `--enable-identity-cache=false` (no caching)
- Identity leak from pods with frequently changing labels
- Node identity conflict after cluster merge or restore
## Step-by-Step Fix
### 1. Check Cilium cluster health status
Verify Cilium agent and datapath health:
```bash
# Check Cilium daemonset status
kubectl -n kube-system get ds cilium
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

# Check Cilium status on a node
kubectl -n kube-system exec -it cilium-<node-name> -- cilium status

# Expected output:
# KVStore:                 Ok   Etcd: 3/3 connected
# ContainerRuntime:        Ok
# Kubernetes API server connectivity: Ok
# Kubernetes:              Ok   1.28 (v1.28.0) [linux/amd64]
# Kubernetes controllers:  Ok
# NodeMonitor:             Ok   Monitoring disabled
# Cilium:                  Ok   1.14.0
# Identity allocation:     Ok   Success
# Hubble:                  Ok   1.14.0

# If identity allocation shows FAILED:
# Identity allocation:     FAILED   Error: identity cache full
```
### 2. Check IPAM pool availability
IP address exhaustion is the most common cause:
```bash
# Check IPAM status
kubectl -n kube-system exec -it cilium-<node-name> -- cilium status --verbose

# Look for the IPAM section:
# IPAM:   Ok   IPv4: 254/254 allocated, 0 free

# Check allocated IPs per node
kubectl -n kube-system exec -it cilium-<node-name> -- cilium ipam status

# Output shows:
# IPv4 Allocations:
#   10.0.1.1 (allocated: pod:kube-system/coredns-xxx)
#   10.0.1.2 (allocated: pod:default/myapp-xxx)
#   ...
# Total allocated: 254
# Total available: 0
```
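The 254/254 figure above is just the arithmetic of a per-node /24 pod CIDR: 256 addresses minus the network and broadcast addresses. A quick sanity check using only the standard library:

```python
import ipaddress

def usable_pod_ips(cidr: str, reserved: int = 2) -> int:
    """Usable pod IPs in a CIDR: total addresses minus the network and
    broadcast addresses (raise `reserved` if the CNI holds back more)."""
    return ipaddress.ip_network(cidr).num_addresses - reserved

print(usable_pod_ips("10.0.1.0/24"))  # 254
print(usable_pod_ips("10.0.0.0/16"))  # 65534
```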
Check IPAM configuration:
```bash
# Get Cilium config
kubectl -n kube-system get configmap cilium-config -o yaml

# Check IPAM mode:
# ipam: "kubernetes"    # Kubernetes native IPAM
# ipam: "cluster-pool"  # Cluster-wide IP pool
# ipam: "eni"           # AWS ENI mode
# ipam: "azure"         # Azure IPAM
# ipam: "crd"           # CiliumNode CRD

# For cluster-pool mode, check pool configuration
kubectl -n kube-system get ciliumnodes
kubectl get ciliumnetworkconfig
```
Free up IP addresses:
```bash
# Delete stale pods (Succeeded or Failed phase)
kubectl delete pods --field-selector=status.phase==Succeeded -A
kubectl delete pods --field-selector=status.phase==Failed -A

# Drain and remove unreachable nodes
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>

# Restart Cilium to reclaim orphaned IPs
kubectl -n kube-system rollout restart ds cilium
```
### 3. Check identity cache usage
Monitor identity allocation:
```bash
# Count total identities
kubectl -n kube-system exec -it cilium-<node-name> -- cilium identity list | wc -l

# List identities by security label
kubectl -n kube-system exec -it cilium-<node-name> -- cilium identity list -o json | \
  jq -r '.[].labels | sort | join(",")' | sort | uniq -c | sort -rn | head -20

# Check for identity leaks (many similar identities)
kubectl -n kube-system exec -it cilium-<node-name> -- cilium identity list -o json | \
  jq -r '.[] | select(.labels | length > 5) | .labels'

# Check identity cache statistics
kubectl -n kube-system exec -it cilium-<node-name> -- cilium status --verbose | grep -A10 "Identity"
```
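The jq one-liners above group identities by their canonical label set. The same grouping in Python, assuming the JSON is a list of objects with `id` and `labels` fields (the shape is assumed here; verify it against your Cilium version):

```python
import json
from collections import Counter

# Sample shaped like `cilium identity list -o json` output (assumed shape)
sample = json.loads("""
[
  {"id": 1001, "labels": ["k8s:app=web", "k8s:io.kubernetes.pod.namespace=prod"]},
  {"id": 1002, "labels": ["k8s:io.kubernetes.pod.namespace=prod", "k8s:app=web"]},
  {"id": 1003, "labels": ["k8s:app=db", "k8s:io.kubernetes.pod.namespace=prod"]}
]
""")

# Group by canonical (sorted) label set; a count above 1 for what should be
# a single workload is the signature of an identity leak.
counts = Counter(",".join(sorted(i["labels"])) for i in sample)
for labels, n in counts.most_common():
    print(n, labels)
```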
### 4. Garbage collect stale identities
Clean up unused identities:
```bash
# Trigger identity garbage collection
kubectl -n kube-system exec -it cilium-<node-name> -- cilium identity gc

# Check identities before and after
kubectl -n kube-system exec -it cilium-<node-name> -- cilium identity list | wc -l

# Force cleanup of a specific identity (use carefully)
kubectl -n kube-system exec -it cilium-<node-name> -- \
  cilium identity delete <identity-id>

# List identities with no associated endpoints
kubectl -n kube-system exec -it cilium-<node-name> -- \
  cilium identity list -o json | jq -r '
    .[] | select(.refCount == 0) | "ID: \(.id), Labels: \(.labels)"
  '
```
Automated stale identity cleanup:
```python
#!/usr/bin/env python3
"""Delete Cilium identities that no endpoint references (refCount == 0)."""
import subprocess
import json

def get_stale_identities(node):
    """Get identities with no active endpoints."""
    # No -it flags: this runs non-interactively, so no TTY is available
    cmd = f"kubectl -n kube-system exec {node} -- cilium identity list -o json"
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    identities = json.loads(result.stdout)
    return [i for i in identities if i.get('refCount', 0) == 0]

def cleanup_stale(node, identities):
    """Remove stale identities in bounded batches."""
    for identity in identities[:100]:  # Limit batch size
        cmd = (f"kubectl -n kube-system exec {node} -- "
               f"cilium identity delete {identity['id']}")
        subprocess.run(cmd, shell=True)
        print(f"Deleted identity {identity['id']}")

if __name__ == "__main__":
    node = "cilium-abc123"  # Cilium agent pod name
    stale = get_stale_identities(node)
    print(f"Found {len(stale)} stale identities")
    cleanup_stale(node, stale)
```
### 5. Check kvstore connectivity and synchronization
For cluster mode with etcd/Consul:
```bash
# Check kvstore status
kubectl -n kube-system exec -it cilium-<node-name> -- cilium kvstore status

# Expected:
# kvstore: Ok
# kvstore configuration:
#   - etcd: https://etcd-client:2379
#   - lockKeyPrefix: cilium-lock
#   - prefix: cilium-state

# Check kvstore connectivity
kubectl -n kube-system exec -it cilium-<node-name> -- \
  cilium-service get kvstore

# Check for kvstore locks
kubectl -n kube-system exec -it cilium-<node-name> -- \
  cilium kvstore lock list

# Release stale locks
kubectl -n kube-system exec -it cilium-<node-name> -- \
  cilium kvstore lock delete <lock-key>
```
Etcd health check:
```bash
# Check etcd cluster health
kubectl -n kube-system exec -it etcd-0 -- etcdctl endpoint health
kubectl -n kube-system exec -it etcd-0 -- etcdctl endpoint status --write-out=table

# Check etcd storage usage
kubectl -n kube-system exec -it etcd-0 -- etcdctl endpoint status --write-out=json | \
  jq '.[] | {endpoint: .Endpoint, dbSize: .Status.dbSize, dbSizeInUse: .Status.dbSizeInUse}'

# If etcd storage > 8GB, compact and defragment
# (query the current revision inside the pod, then compact to it)
REV=$(kubectl -n kube-system exec -it etcd-0 -- \
  etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
kubectl -n kube-system exec -it etcd-0 -- etcdctl compact $REV
kubectl -n kube-system exec -it etcd-0 -- etcdctl defrag
```
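The defrag decision can be automated: the gap between `dbSize` and `dbSizeInUse` is the space a defragmentation can reclaim. A sketch with illustrative thresholds (the JSON shape mirrors `etcdctl endpoint status --write-out=json`, but verify the field names against your etcdctl version):

```python
import json

def needs_defrag(status_json: str, waste_ratio: float = 0.5,
                 min_bytes: int = 100 * 1024 * 1024) -> bool:
    """Defragment when more than `waste_ratio` of the on-disk DB is unused
    and the DB is large enough to matter (thresholds are illustrative)."""
    status = json.loads(status_json)[0]["Status"]
    db_size, in_use = status["dbSize"], status["dbSizeInUse"]
    return db_size >= min_bytes and (db_size - in_use) / db_size > waste_ratio

# 2 GiB on disk but only 512 MiB live: 75% reclaimable, worth defragmenting
sample = '[{"Endpoint": "https://etcd-0:2379", "Status": {"dbSize": 2147483648, "dbSizeInUse": 536870912}}]'
print(needs_defrag(sample))  # True
```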
### 6. Check endpoint allocation and cleanup
List all endpoints and identify issues:
```bash
# List all endpoints on a node
kubectl -n kube-system exec -it cilium-<node-name> -- cilium endpoint list

# Check for endpoints in a bad state
kubectl -n kube-system exec -it cilium-<node-name> -- \
  cilium endpoint list -o json | jq -r '
    .[] | select(.state != "ready") |
    "ID: \(.id), State: \(.state), Name: \(.status.external.identifiers.containerName)"
  '

# Delete stale endpoints
kubectl -n kube-system exec -it cilium-<node-name> -- \
  cilium endpoint delete <endpoint-id>

# Garbage collect all stale endpoints
kubectl -n kube-system exec -it cilium-<node-name> -- cilium endpoint gc
```
### 7. Increase identity cache limits
If running into cache limits:
```bash
# Edit Cilium configuration
kubectl -n kube-system edit configmap cilium-config

# Add or modify:
# data:
#   # Refresh identity allocations and GC more frequently
#   identity-allocation-refresh: "5"
#   identity-gc-interval: "600"
#   # For etcd mode, increase the kvstore lease TTL
#   kvstore-lease-ttl: "120s"

# Restart Cilium to apply
kubectl -n kube-system rollout restart ds cilium
```
Helm chart configuration:
```yaml
# values.yaml
identityAllocation:
  mode: "kubernetes"   # or "kvstore"
  gcInterval: 600s     # Run GC every 10 minutes

kvstore:
  enabled: true
  etcd:
    enabled: true
    clusterSize: 3
    extraArgs:
      - --quota-backend-bytes=8589934592  # 8GB

ipam:
  mode: "cluster-pool"
  operator:
    clusterPoolIPv4PodCIDRList:
      - "10.0.0.0/16"  # Increase pool size
```
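For cluster-pool sizing, the pool CIDR and the per-node block size fix the ceiling on nodes and pods: a /16 pool carved into /24 per-node blocks supports at most 256 nodes with 254 pod IPs each. A small helper for that arithmetic (prefix lengths are parameters here, not Cilium defaults):

```python
def cluster_pool_capacity(pool_prefix: int, node_prefix: int = 24):
    """(max nodes, usable pod IPs per node) for an IPv4 cluster pool
    carved into fixed-size per-node CIDR blocks."""
    nodes = 2 ** (node_prefix - pool_prefix)
    pods_per_node = 2 ** (32 - node_prefix) - 2  # minus network/broadcast
    return nodes, pods_per_node

print(cluster_pool_capacity(16))      # (256, 254)
print(cluster_pool_capacity(12, 24))  # (4096, 254)
```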
### 8. Check for identity label explosion
Frequently changing labels create identity leaks:
```bash
# Check for pods with many unique label combinations
kubectl get pods -A -o json | jq -r '
  .items[] | .metadata.labels | to_entries |
  map("\(.key)=\(.value)") | sort | join(",")
' | sort | uniq -c | sort -rn | head -20

# Check Cilium identity distribution
kubectl -n kube-system exec -it cilium-<node-name> -- \
  cilium identity list -o json | jq -r '
    group_by(.labels | sort | join(",")) |
    map({count: length, labels: .[0].labels}) |
    sort_by(-.count) | .[0:20]
  '
```
Fix identity explosion:
```yaml # WRONG: Pod template with frequently changing labels spec: template: metadata: labels: app: myapp pod-template-hash: "{{ randAlphaNum 5 }}" # Changes every deploy! deploy-time: "{{ now }}" # Unique per deploy
# CORRECT: Stable labels for identity spec: template: metadata: labels: app: myapp version: v1.2.3 # Changes only on version bump ```
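To audit existing templates for churn-prone labels, a heuristic scan for machine-generated values can help (the patterns below are illustrative and will need tuning for your naming conventions):

```python
import re

# Label values that look machine-generated mint a fresh identity per rollout
GENERATED = [
    re.compile(r"^[0-9a-f]{10,}$"),          # long hex digest
    re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}"),  # ISO-8601 timestamp
    re.compile(r"^\d{10,}$"),                # unix epoch
]

def churn_prone(labels: dict) -> list:
    """Return label keys whose values match a generated-value pattern."""
    return [k for k, v in labels.items()
            if any(p.search(v) for p in GENERATED)]

bad = {"app": "myapp", "deploy-time": "2024-05-01T12:00:00Z", "build": "9f8e7d6c5b4a"}
print(churn_prone(bad))  # ['deploy-time', 'build']
```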
### 9. Monitor identity allocation metrics
Set up Prometheus monitoring:
```yaml
# Prometheus recording rules
groups:
  - name: cilium_identity
    rules:
      - record: cilium:identity_count:sum
        expr: sum(cilium_identity_count)
      - record: cilium:identity_allocation_rate:rate5m
        expr: rate(cilium_identity_allocation_count[5m])
      - record: cilium:ipam_available_ratio
        expr: cilium_ipam_available_ips / cilium_ipam_total_ips
```
Key metrics to alert on:

- `cilium_identity_count` above 80% of max: warning
- `cilium_ipam_available_ratio` below 0.1: critical
- `cilium_endpoint_state{state="not-ready"}` above 0: investigate
Grafana dashboard panels:

- Identity count over time
- IPAM pool utilization
- Endpoint state distribution
- Identity allocation rate
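The alert thresholds above compose into a single severity check. A sketch of that logic (the function and argument names are illustrative; feed it the live values of the metrics listed above):

```python
def alert_level(identity_count: int, identity_max: int,
                ipam_available: int, ipam_total: int) -> str:
    """Map current metric values to a severity using the thresholds above:
    IPAM below 10% free is critical, identities above 80% of max is a warning."""
    if ipam_total and ipam_available / ipam_total < 0.1:
        return "critical"
    if identity_max and identity_count / identity_max > 0.8:
        return "warning"
    return "ok"

print(alert_level(90000, 100000, 50, 254))  # warning
print(alert_level(50000, 100000, 20, 254))  # critical
```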
### 10. Enable identity GC and automated cleanup
Configure automatic garbage collection:
```bash
# Patch Cilium configmap
kubectl -n kube-system patch configmap cilium-config --type merge -p '{
  "data": {
    "identity-gc-interval": "300",
    "endpoint-gc-interval": "300",
    "enable-endpoint-gc": "true"
  }
}'

# Restart Cilium
kubectl -n kube-system rollout restart ds cilium
```
CronJob for proactive cleanup:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cilium-identity-cleanup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cilium
          containers:
            - name: cleanup
              image: cilium/cilium:latest
              command:
                - /bin/sh
                - -c
                - |
                  NODE=$(kubectl -n kube-system get pod -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
                  kubectl -n kube-system exec $NODE -- cilium identity gc
                  kubectl -n kube-system exec $NODE -- cilium endpoint gc
          restartPolicy: OnFailure
```
## Prevention
- Monitor IPAM pool utilization and expand before exhaustion
- Configure identity GC interval appropriate for workload churn
- Avoid pod labels that change frequently (timestamps, random strings)
- Use stable label values for security identity grouping
- Set up alerts for identity count and IPAM availability
- Regularly audit and remove stale nodes from cluster
- Test Cilium upgrade procedures in staging with identity stress testing
## Related Errors
- **No available IPs**: IPAM pool exhausted
- **KVStore not ready**: etcd/Consul connectivity failed
- **Endpoint not ready**: Identity allocation pending
- **Identity conflict**: Duplicate identity detected