## Introduction

A red cluster status means at least one primary shard is unassigned, so some data is completely unavailable. This is more severe than yellow (missing replicas only) and requires immediate action. Common causes include permanent node loss, corrupted shard data, and allocation rules that prevent shard assignment.

## Symptoms

- `GET /_cluster/health` returns `"status": "red"`
- `GET /_cat/shards?v` lists primary shards (`prirep` of `p`) in state `UNASSIGNED`
- Search queries on affected indices return partial or no results
- `GET /_cluster/allocation/explain` reports that a primary cannot be allocated
- Unassigned shards are listed as `UNASSIGNED` with no `docs` or `store` values
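
To see how widespread the problem is, the unassigned primaries can be grouped by the reason Elasticsearch recorded for them. The following is a minimal sketch: `summarize_unassigned_primaries` is a hypothetical helper name, and it assumes the `_cat/shards` output was requested with the columns `index,shard,prirep,state,unassigned.reason` in that order.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count unassigned primaries per unassigned.reason.
# Expects rows on stdin in the column order:
#   index shard prirep state unassigned.reason
summarize_unassigned_primaries() {
  awk '$3 == "p" && $4 == "UNASSIGNED" { count[$5]++ }
       END { for (reason in count) print reason, count[reason] }' | sort
}

# Usage against a live cluster (assumes the localhost:9200 endpoint used in this guide):
#   curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
#     | summarize_unassigned_primaries
```

A large `NODE_LEFT` count usually points at lost nodes, while `ALLOCATION_FAILED` suggests the shard copy exists but could not be loaded.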

## Common Causes

- The node holding the primary shard data is permanently lost (disk failure, instance terminated)
- Every copy of the shard was lost together: the index had no replicas, or the nodes holding the primary and its replicas shared one failed host (a single point of failure)
- Index corruption prevents the shard from loading
- Allocation rules (`require`, `exclude`) prevent shard placement
- A snapshot restore was interrupted, leaving the index in an incomplete state
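
When an allocation rule is the cause, no data is lost and the fix is to remove the offending filter from the index settings. The sketch below builds the request body that resets one index-level filter to its default by setting it to `null`. The helper name `clear_allocation_filter_body` and the node attribute `box_type` are assumptions for illustration; substitute the attribute that your own settings show.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a settings body that clears one allocation filter.
#   $1: filter kind (require, include, or exclude)
#   $2: node attribute name used by the filter (for example box_type, an assumed name)
clear_allocation_filter_body() {
  local kind="$1" attr="$2"
  case "$kind" in
    require|include|exclude) ;;
    *) echo "unknown filter kind: $kind" >&2; return 1 ;;
  esac
  printf '{"index.routing.allocation.%s.%s": null}\n' "$kind" "$attr"
}

# Usage against a live cluster (assumes the index name used elsewhere in this guide):
#   # 1. Inspect the current filters
#   curl -s 'localhost:9200/logs-2026.04.01/_settings?flat_settings=true&pretty' | grep routing.allocation
#   # 2. Clear the one that blocks placement
#   curl -X PUT 'localhost:9200/logs-2026.04.01/_settings' -H 'Content-Type: application/json' \
#     -d "$(clear_allocation_filter_body require box_type)"
```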

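## Step-by-Step Fix

1. **Identify the unassigned primary shards**:
   ```bash
   # Default columns: index shard prirep state docs store ip node
   # Keep the header row plus rows where prirep is "p" and state is UNASSIGNED
   curl -s 'localhost:9200/_cat/shards?v' | awk 'NR == 1 || ($3 == "p" && $4 == "UNASSIGNED")'
   ```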

2. **Get a detailed allocation explanation**:
   ```bash
   # With no request body, this explains an arbitrary unassigned shard
   curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
   # Look for the specific reason the primary cannot be allocated
   ```

3. **If data is permanently lost, allocate empty primaries**:
   ```bash
   # WARNING: accept_data_loss discards whatever the shard previously held
   curl -X POST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
     "commands": [{
       "allocate_empty_primary": {
         "index": "logs-2026.04.01",
         "shard": 0,
         "node": "data-node-1",
         "accept_data_loss": true
       }
     }]
   }'
   ```

4. **To allocate empty primaries for every affected shard at once, loop over the Reroute API**:
   ```bash
   # Target node: the first node listed; replace with a specific data node if needed
   node=$(curl -s 'localhost:9200/_cat/nodes?h=name' | head -1)
   # Get all unassigned primaries (prirep is "p", state is UNASSIGNED)
   curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state' \
     | awk '$3 == "p" && $4 == "UNASSIGNED" {print $1, $2}' \
     | while read -r index shard; do
         curl -X POST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d "{
           \"commands\": [{
             \"allocate_empty_primary\": {
               \"index\": \"$index\",
               \"shard\": $shard,
               \"node\": \"$node\",
               \"accept_data_loss\": true
             }
           }]
         }"
       done
   ```

5. **Recover from a snapshot if the data is critical**:
   ```bash
   # List available snapshots
   curl -s 'localhost:9200/_snapshot/my_repo/_all?pretty'

   # Close the broken index
   curl -X POST 'localhost:9200/logs-2026.04.01/_close'

   # Restore from snapshot
   curl -X POST 'localhost:9200/_snapshot/my_repo/snapshot_20260401/_restore' -H 'Content-Type: application/json' -d '{
     "indices": "logs-2026.04.01",
     "ignore_unavailable": true
   }'
   ```
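
Before allocating an empty primary, it can help to ask the allocation explain API about that exact shard, because a call without a body explains an arbitrary unassigned shard, which may be a replica. The sketch below builds the request body for one primary; `explain_primary_body` is a hypothetical helper name, and the index and shard values are the examples used in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print an allocation-explain body for one primary shard.
#   $1: index name
#   $2: shard number (integer)
explain_primary_body() {
  local index="$1" shard="$2"
  printf '{"index": "%s", "shard": %d, "primary": true}\n' "$index" "$shard"
}

# Usage against a live cluster:
#   curl -s -X POST 'localhost:9200/_cluster/allocation/explain?pretty' \
#     -H 'Content-Type: application/json' \
#     -d "$(explain_primary_body logs-2026.04.01 0)"
```

If the explanation shows that a usable shard copy still exists on some node, an empty primary is probably not the right fix.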

## Prevention

- Always set `number_of_replicas` to at least 1 for production indices
- Use `cluster.routing.allocation.same_shard.host: true` to keep a primary and its replicas off the same host
- Monitor cluster health and alert on status changes to yellow or red
- Implement automated snapshots with SLM (Snapshot Lifecycle Management)
- Use at least 3 data nodes to tolerate the loss of a single node
- Set up shard allocation awareness to distribute replicas across failure zones
- Regularly test disaster recovery by simulating node failures
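
The health-monitoring item above can be wired into most alerting tools with a small check script. The following is a minimal sketch, not a complete monitor: `health_to_exit_code` is a hypothetical helper, and the exit-code mapping (0 green, 1 yellow, 2 red, 3 unknown) is an assumed convention borrowed from common monitoring plugins.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a cluster health status string to an exit code.
#   0 = green, 1 = yellow, 2 = red, 3 = unknown or unreachable
health_to_exit_code() {
  local status
  # _cat output may carry trailing whitespace or a newline; strip it
  status=$(printf '%s' "$1" | tr -d '[:space:]')
  case "$status" in
    green)  return 0 ;;
    yellow) return 1 ;;
    red)    return 2 ;;
    *)      return 3 ;;
  esac
}

# Usage against a live cluster:
#   health_to_exit_code "$(curl -s 'localhost:9200/_cat/health?h=status')"
#   echo "exit code: $?"
```

An empty response, such as when the cluster is unreachable, falls through to the unknown code, so an outage is not mistaken for a healthy cluster.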