Introduction

etcd is the backing store for all Kubernetes cluster data. When etcd becomes unhealthy, the API server cannot read or write cluster state, causing cascading failures across all control plane operations.

Symptoms

- `etcdctl endpoint health` shows unhealthy members
- API server is slow or unresponsive
- `kubectl` commands time out or return errors
- kube-scheduler and kube-controller-manager cannot update state
- Cluster events show: "etcdserver: request timed out"

Common Causes

- etcd disk I/O too slow (fsync latency > 10ms)
- etcd member crashed and cannot rejoin
- Network partition between etcd members
- etcd database size exceeding the backend quota (2 GB by default; 8 GB is the suggested maximum)
- Disk full on etcd nodes
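The "disk full" cause can be ruled in or out quickly from a shell on the etcd node. The following is a minimal sketch, assuming the data directory is `/var/lib/etcd` (kubeadm's default; adjust if yours differs). The helper names `disk_usage_pct` and `check_etcd_disk` and the 80% threshold are illustrative assumptions, not part of etcd or Kubernetes.

```bash
# Assumed default: kubeadm keeps etcd data in /var/lib/etcd.
ETCD_DATA_DIR="${ETCD_DATA_DIR:-/var/lib/etcd}"

# disk_usage_pct PATH: print the used-space percentage (integer, no % sign)
# of the filesystem holding PATH, parsed from POSIX `df -P` output.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# check_etcd_disk PATH: return 0 if usage is below 80%, 1 if at or above,
# 2 if usage could not be determined. The 80% threshold is illustrative.
check_etcd_disk() {
  local pct
  pct=$(disk_usage_pct "$1") || pct=""
  case "$pct" in
    ''|*[!0-9]*) echo "ERROR: cannot read disk usage for $1" >&2; return 2 ;;
  esac
  if [ "$pct" -ge 80 ]; then
    echo "WARNING: filesystem for $1 is ${pct}% full"
    return 1
  fi
  echo "OK: filesystem for $1 is ${pct}% full"
}
```

Run it as `check_etcd_disk "$ETCD_DATA_DIR"` on each etcd node; a non-zero exit status means the disk needs attention before any other fix is attempted.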

Step-by-Step Fix

1. **Check etcd health**:
   ```bash
   ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
     endpoint health
   ```

2. **Check etcd database size**:
   ```bash
   ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
     endpoint status --write-out=table
   ```
3. **Compact and defragment etcd**:
   ```bash
   # Connection flags shared by every command below
   FLAGS="--endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key=/etc/kubernetes/pki/etcd/healthcheck-client.key"

   # Compact: read the current revision, then discard history older than it
   REV=$(ETCDCTL_API=3 etcdctl $FLAGS endpoint status --write-out=json \
     | jq -r '.[0].Status.header.revision')
   ETCDCTL_API=3 etcdctl $FLAGS compact "$REV"

   # Defragment this member to return the freed space to the filesystem
   # (the member blocks while defragmenting, so run it one member at a time)
   ETCDCTL_API=3 etcdctl $FLAGS defrag
   ```

4. **Restore from snapshot (if member lost)**:
   ```bash
   ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd/snapshot.db \
     --data-dir=/var/lib/etcd-new --name=etcd-1 \
     --initial-cluster="etcd-1=https://10.0.1.10:2380" \
     --initial-advertise-peer-urls=https://10.0.1.10:2380
   ```
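Before defragmenting, it can help to estimate how much space the operation would actually reclaim. The sketch below assumes an etcd version (3.4 or later) whose `endpoint status` JSON output includes both `dbSize` and `dbSizeInUse`; `reclaimable_pct` is a hypothetical helper name, not an etcdctl command.

```bash
# reclaimable_pct DB_SIZE IN_USE: print the integer percentage of the
# on-disk database file that is free space defragmentation could release.
# DB_SIZE and IN_USE are byte counts (dbSize and dbSizeInUse).
reclaimable_pct() {
  local db_size=$1 in_use=$2
  if [ "$db_size" -le 0 ]; then
    echo 0
    return 0
  fi
  echo $(( (db_size - in_use) * 100 / db_size ))
}
```

The two inputs can be read from the JSON status output with `jq -r '.[0].Status.dbSize'` and `jq -r '.[0].Status.dbSizeInUse'`. For example, `reclaimable_pct 1000 250` prints `75`. A low percentage suggests defragmentation will not help much and that the data itself (or missing compaction) is the real cause of the size.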

Prevention

- Use SSDs for etcd data volumes (IOPS > 1000)
- Monitor etcd disk fsync duration (alert if > 10ms)
- Set up regular etcd snapshots (every 30 minutes)
- Keep etcd database size under 8 GB (compact regularly)
- Run etcd on dedicated nodes, not shared with workloads
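The regular-snapshot recommendation could be implemented along these lines. This is a sketch under stated assumptions: the backup directory `/var/backups/etcd`, the retention count of 48, and the function names `take_snapshot` and `prune_snapshots` are all hypothetical, and the certificate paths are the kubeadm defaults used in the steps above.

```bash
# Assumed locations and retention; adjust to your environment.
SNAP_DIR="${SNAP_DIR:-/var/backups/etcd}"
KEEP="${KEEP:-48}"   # 48 snapshots at 30-minute intervals covers 24 hours

# take_snapshot: save a snapshot with a UTC timestamp in its file name.
take_snapshot() {
  local out
  out="$SNAP_DIR/etcd-$(date -u +%Y%m%dT%H%M%SZ).db"
  mkdir -p "$SNAP_DIR" || return 1
  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    snapshot save "$out"
}

# prune_snapshots DIR KEEP: delete all but the KEEP newest etcd-*.db files.
# The timestamped names sort chronologically, so a reverse sort by name
# lists the newest files first and everything after line KEEP is removed.
prune_snapshots() {
  local dir=$1 keep=$2
  ls -1 "$dir"/etcd-*.db 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do rm -f -- "$f"; done
}
```

A wrapper script that calls `take_snapshot && prune_snapshots "$SNAP_DIR" "$KEEP"` can then be scheduled with a cron entry such as `*/30 * * * *`. Copying the snapshots off the etcd node is advisable, since a backup stored on the same disk is lost along with it.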