# How to Fix Elasticsearch Red Index Status
A red cluster status in Elasticsearch is serious. It means at least one primary shard (and all of its replicas) is unassigned, making some of your data inaccessible. Unlike yellow status, which affects only replicas, red status requires immediate attention.
## Recognizing the Problem
When you check cluster health, you'll see:
```bash
curl -X GET "localhost:9200/_cluster/health?pretty"
```

```json
{
  "cluster_name" : "production-cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 30,
  "active_shards" : 50,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10
}
```

The critical indicators are `status: red` and the presence of unassigned shards. Some queries against affected indices will fail completely.
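For monitoring, the two critical fields can be pulled out of the health response with standard shell tools. A minimal sketch, using an abbreviated sample response inlined as a string (in practice you would pipe `curl -s localhost:9200/_cluster/health` instead):

```shell
# Abbreviated sample response, inlined for illustration;
# in practice: health=$(curl -s "localhost:9200/_cluster/health")
health='{"cluster_name":"production-cluster","status":"red","unassigned_shards":10}'

# Extract the status and the unassigned shard count
status=$(echo "$health" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')
unassigned=$(echo "$health" | sed -n 's/.*"unassigned_shards":\([0-9]*\).*/\1/p')

echo "status=$status unassigned=$unassigned"
if [ "$status" = "red" ]; then
  echo "ALERT: cluster is red"
fi
```

A check like this is easy to drop into a cron job or monitoring agent so a red transition pages someone immediately.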
## Identifying Affected Indices
Find which indices are causing the red status:
```bash
curl -X GET "localhost:9200/_cat/indices?v&health=red"
```

```
health status index          pri rep docs.count docs.deleted store.size pri.store.size
red    open   orders-2024-01   5   1          0            0       260b           260b
red    open   customer-data    3   1       5000            0     15.2mb         15.2mb
```

Get detailed information about unassigned shards:

```bash
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,unassigned.description&s=state" | grep UNASSIGNED
```

## Understanding Why Shards Are Unassigned
Use the cluster allocation explain API for root cause analysis:
```bash
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "orders-2024-01",
  "shard": 0,
  "primary": true
}
'
```

The response reveals the specific issue:
```json
{
  "index" : "orders-2024-01",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2024-01-15T14:30:00.000Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt"
}
```

Common reasons include:
- `NODE_LEFT`: The node hosting the shard left the cluster
- `ALLOCATION_FAILED`: Allocation attempts failed repeatedly
- `NEW_INDEX_RESTORED`: The index was restored from a snapshot but its shards couldn't be allocated
- `CLUSTER_RECOVERED`: Shards became unassigned during a full cluster restart and recovery
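When many shards are unassigned, it helps to see which reason dominates before picking a recovery strategy. A minimal sketch that aggregates `_cat/shards` output, using hypothetical sample rows inlined as a string (in practice the data would come from `curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason"`):

```shell
# Columns: index shard prirep state unassigned.reason (hypothetical sample rows)
shards='orders-2024-01 0 p UNASSIGNED NODE_LEFT
orders-2024-01 0 r UNASSIGNED NODE_LEFT
customer-data 1 p UNASSIGNED ALLOCATION_FAILED'

# Count unassigned shards per reason, sorted for stable output
summary=$(echo "$shards" | awk '$4 == "UNASSIGNED" {count[$5]++} END {for (r in count) print r, count[r]}' | sort)
echo "$summary"
```

A cluster dominated by `NODE_LEFT` points at strategy 1 below; `ALLOCATION_FAILED` with no valid copies points at strategies 2-4.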
## Recovery Strategy 1: Bring Missing Nodes Back
If a node dropped from the cluster, bringing it back often resolves the issue. Check which nodes are expected:
```bash
curl -X GET "localhost:9200/_cat/nodes?v"
```

If you're missing nodes, restart them:

```bash
# On the missing node
systemctl restart elasticsearch
```

After the node rejoins, verify:

```bash
curl -X GET "localhost:9200/_cluster/health?wait_for_status=yellow&timeout=60s"
```

## Recovery Strategy 2: Restore from Snapshot
If nodes cannot be recovered and shards are permanently lost, restore from a snapshot. First, check available snapshots:
```bash
curl -X GET "localhost:9200/_snapshot/my_backup/_all?pretty"
```

Restore the affected index:
```bash
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_20240115/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "orders-2024-01",
  "ignore_unavailable": true,
  "include_global_state": false
}
'
```

If the index still exists, you must close it (or delete it) before restoring over it:
```bash
curl -X POST "localhost:9200/orders-2024-01/_close"

curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_20240115/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "orders-2024-01",
  "ignore_unavailable": true,
  "include_global_state": false
}
'

curl -X POST "localhost:9200/orders-2024-01/_open"
```
## Recovery Strategy 3: Allocate Stale Primary
When no valid copy exists and you cannot restore from backup, you can allocate a stale primary. This is a last resort because it may result in data loss:
```bash
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "orders-2024-01",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}
'
```

The `accept_data_loss: true` flag is required, acknowledging that you understand data may be lost.
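Before accepting data loss, confirm which node still holds an on-disk copy of the shard. The shard stores API (`GET /<index>/_shard_stores`) lists every node with a copy and whether it is current or stale. A minimal sketch of scanning that output for node names, using an abbreviated, hypothetical response inlined as a string (a real response contains more fields and nesting; in practice it would come from `curl -s "localhost:9200/orders-2024-01/_shard_stores?status=all"`):

```shell
# Abbreviated, hypothetical shard-stores response used for illustration only
response='{"indices":{"orders-2024-01":{"shards":{"0":{"stores":[{"name":"node-1","allocation":"unused"}]}}}}}'

# Extract the node names that still hold a copy of the shard
nodes=$(echo "$response" | grep -o '"name":"[^"]*"' | sed 's/"name":"\(.*\)"/\1/')
echo "candidate nodes: $nodes"
```

The node you name in `allocate_stale_primary` must be one that actually holds shard data, or the reroute will fail.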
## Recovery Strategy 4: Delete Corrupted Indices
If the index data is not critical and cannot be recovered, you can delete it:
```bash
curl -X DELETE "localhost:9200/orders-2024-01"
```

This immediately returns the cluster to green status (assuming no other red indices). Use this approach only for non-critical data or when you have external backups.
## Handling Corrupted Translog
Sometimes a shard fails to recover because its translog is corrupted. Changing translog settings after the fact will not repair existing corruption. Instead, use the `elasticsearch-shard` CLI tool, which truncates the corrupted data at the cost of losing the operations it contained. Stop the node that holds the shard's data, then run the tool on that node:

```bash
# Run on the stopped node that holds the shard data
bin/elasticsearch-shard remove-corrupted-data --index orders-2024-01 --shard-id 0
```

After restarting the node, the tool's output includes the exact `_cluster/reroute` command (an `allocate_stale_primary` with `accept_data_loss: true`) needed to bring the shard back.
## Checking for Cluster Block
A red status sometimes triggers a cluster block that prevents writes:
```bash
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep block
```

If you see a write block, you can clear it:

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.blocks.read_only_allow_delete": false
  }
}
'
```

## Verification Steps
After recovery, verify the cluster is healthy:
```bash
# Check overall health
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty"

# Verify all shards are assigned
curl -X GET "localhost:9200/_cat/shards?v&s=state" | grep -v STARTED

# Check specific index health
curl -X GET "localhost:9200/_cat/indices/orders-2024-01?v"
```
Run a test query against the recovered index:
```bash
curl -X GET "localhost:9200/orders-2024-01/_search?size=1&pretty"
```

## Prevention Measures
To avoid red status in the future:
1. Maintain proper replica counts: Ensure at least one replica for production indices
2. Regular snapshots: Configure automated snapshots with a reliable schedule
3. Node monitoring: Alert on node departures immediately
4. Disk space management: Keep nodes below 85% disk usage
5. Multi-zone deployment: Distribute nodes across availability zones
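The disk-space item can be automated with a small check; the 85% figure mirrors Elasticsearch's default low disk watermark (`cluster.routing.allocation.disk.watermark.low`). A sketch, where `DATA_PATH` is an assumed placeholder you would point at the node's data directory:

```shell
# DATA_PATH is a placeholder assumption; point it at the Elasticsearch data directory
DATA_PATH=${DATA_PATH:-/}
threshold=85

# df -P gives POSIX-format output; field 5 of the second line is the usage percentage
usage=$(df -P "$DATA_PATH" | awk 'NR==2 {gsub(/%/, "", $5); print $5}')

if [ "$usage" -ge "$threshold" ]; then
  echo "WARNING: disk usage ${usage}% >= ${threshold}% on $DATA_PATH"
else
  echo "OK: disk usage ${usage}% on $DATA_PATH"
fi
```

Crossing the watermarks triggers allocation restrictions and, at the flood stage, the read-only block described above, so alerting before 85% keeps you ahead of both.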
Configure automated snapshots:
```bash
# Register a snapshot repository
curl -X PUT "localhost:9200/_snapshot/daily_backups" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/elasticsearch"
  }
}
'

# Create a snapshot lifecycle policy
curl -X PUT "localhost:9200/_slm/policy/daily-snapshots" -H 'Content-Type: application/json' -d'
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "daily_backups",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": true,
    "include_global_state": false
  }
}
'
```
Regular monitoring and proactive maintenance will help you avoid the stress of red status incidents.