# How to Fix Elasticsearch Red Index Status

A red cluster status in Elasticsearch is serious. It means at least one primary shard and its replicas are missing, making some of your data inaccessible. Unlike yellow status, this requires immediate attention.

## Recognizing the Problem

When you check cluster health, you'll see:

```bash
curl -X GET "localhost:9200/_cluster/health?pretty"
```

```json
{
  "cluster_name" : "production-cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 30,
  "active_shards" : 50,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10
}
```

The critical indicators are `"status" : "red"` and a non-zero `unassigned_shards` count. Requests that touch the missing shards will fail or return partial results.
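For quick triage, the health response can be checked from a script. A minimal sketch (the `health.json` path and `classify_health` name are illustrative), assuming the response was saved with `curl -s "localhost:9200/_cluster/health?pretty" > health.json`:

```shell
# Sketch: extract status and unassigned shard count from a saved health
# response using only sed, and exit non-zero when the cluster is red.
classify_health() {
  file="$1"
  status=$(sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p' "$file")
  unassigned=$(sed -n 's/.*"unassigned_shards" *: *\([0-9]*\).*/\1/p' "$file")
  echo "status=$status unassigned=$unassigned"
  [ "$status" != "red" ]   # non-zero exit status when the cluster is red
}
```

The non-zero exit status makes the check easy to wire into a cron job or alerting script.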

## Identifying Affected Indices

Find which indices are causing the red status:

```bash
curl -X GET "localhost:9200/_cat/indices?v&health=red"
```

```
health status index              pri rep docs.count docs.deleted store.size pri.store.size
red    open   orders-2024-01       5   1          0            0       260b           260b
red    open   customer-data        3   1       5000            0     15.2mb         15.2mb
```

Get detailed information about unassigned shards:

```bash
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,unassigned.details&s=state" | grep UNASSIGNED
```
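To see at a glance how many shards each index is missing, the shard listing can be summarized with awk. A sketch, assuming the output was saved to `shards.txt` using `h=index,shard,prirep,state` without `v` (so there is no header row):

```shell
# Sketch: count UNASSIGNED shards per index from saved _cat/shards output.
# Columns assumed: index shard prirep state
summarize_unassigned() {
  awk '$4 == "UNASSIGNED" { n[$1]++ } END { for (i in n) print i, n[i] }' "$1"
}
```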

## Understanding Why Shards Are Unassigned

Use the cluster allocation explain API for root cause analysis:

```bash
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "orders-2024-01",
  "shard": 0,
  "primary": true
}
'
```

The response reveals the specific issue:

```json
{
  "index" : "orders-2024-01",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2024-01-15T14:30:00.000Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt"
}
```

Common reasons include:

- `NODE_LEFT`: The node hosting the shard left the cluster
- `ALLOCATION_FAILED`: Allocation attempts failed repeatedly
- `NEW_INDEX_RESTORED`: The index was restored from a snapshot but its shards couldn't be allocated
- `CLUSTER_RECOVERED`: The shard came back unassigned after a full cluster restart
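Since each reason points at a different fix, the mapping can be encoded in a small helper. A sketch (the suggested actions mirror the recovery strategies below; `retry_failed=true` is the reroute parameter for retrying failed allocations):

```shell
# Sketch: map an unassigned.reason value to a suggested next step.
suggest_action() {
  case "$1" in
    NODE_LEFT)          echo "Bring the missing node back (Strategy 1)" ;;
    ALLOCATION_FAILED)  echo "Fix the underlying cause, then POST /_cluster/reroute?retry_failed=true" ;;
    NEW_INDEX_RESTORED) echo "Check snapshot repository access and restore again (Strategy 2)" ;;
    *)                  echo "Consult /_cluster/allocation/explain for details" ;;
  esac
}
```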

## Recovery Strategy 1: Bring Missing Nodes Back

If a node dropped from the cluster, bringing it back often resolves the issue. Check which nodes are expected:

```bash
curl -X GET "localhost:9200/_cat/nodes?v"
```

If you're missing nodes, restart them:

```bash
# On the missing node
systemctl restart elasticsearch
```

After the node rejoins, verify:

```bash
curl -X GET "localhost:9200/_cluster/health?wait_for_status=yellow&timeout=60s"
```
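Rather than re-running the health check by hand, a small polling loop can wait for the target status. A sketch in which the health command is injectable via `HEALTH_CMD` (an assumption made so the loop can be exercised offline); in production it would be the `curl` health check above:

```shell
# Sketch: poll a health command until the target status appears, with a retry cap.
# HEALTH_CMD and POLL_INTERVAL are injectable assumptions for testability.
wait_for_status() {
  target="$1"; tries="${2:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if ${HEALTH_CMD:-false} | grep -q "\"status\" *: *\"$target\""; then
      echo "reached $target"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "gave up after $tries attempts"
  return 1
}
```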

## Recovery Strategy 2: Restore from Snapshot

If nodes cannot be recovered and shards are permanently lost, restore from a snapshot. First, check available snapshots:

```bash
curl -X GET "localhost:9200/_snapshot/my_backup/_all?pretty"
```

Restore the affected index:

```bash
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_20240115/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "orders-2024-01",
  "ignore_unavailable": true,
  "include_global_state": false
}
'
```

If the index still exists, the restore will fail because an open index with the same name is already present; close it first:

```bash
# Close the corrupted index so the snapshot can be restored over it
curl -X POST "localhost:9200/orders-2024-01/_close"

curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_20240115/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "orders-2024-01",
  "ignore_unavailable": true,
  "include_global_state": false
}
'

# A restore normally reopens the index itself; this is a harmless safety check
curl -X POST "localhost:9200/orders-2024-01/_open"
```

## Recovery Strategy 3: Allocate Stale Primary

When no valid copy exists and you cannot restore from backup, you can allocate a stale primary. This is a last resort because it may result in data loss:

```bash
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "orders-2024-01",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}
'
```

The `accept_data_loss: true` flag is required; it acknowledges that any writes the stale copy missed are permanently lost.
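To avoid hand-editing the JSON each time, the reroute body can be generated from variables. A sketch (`build_stale_primary_body` is an illustrative helper; the node name should come from checking which nodes still hold a copy, e.g. `GET /orders-2024-01/_shard_stores?status=all`):

```shell
# Sketch: build the allocate_stale_primary body from index/shard/node variables
# instead of editing the JSON inline.
build_stale_primary_body() {
  printf '{"commands":[{"allocate_stale_primary":{"index":"%s","shard":%d,"node":"%s","accept_data_loss":true}}]}\n' \
    "$1" "$2" "$3"
}
```

It could then be used as `curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d "$(build_stale_primary_body orders-2024-01 0 node-1)"`.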

## Recovery Strategy 4: Delete Corrupted Indices

If the index data is not critical and cannot be recovered, you can delete it:

```bash
curl -X DELETE "localhost:9200/orders-2024-01"
```

This immediately returns the cluster to green status (assuming no other red indices). Use this approach only for non-critical data or when you have external backups.
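Because deletion is irreversible, a small guard that only allows indices matching an explicit allow-list can prevent accidents. A sketch that prints the delete command instead of running it (the name patterns are illustrative and should match your own non-critical indices):

```shell
# Sketch: refuse to emit a DELETE for anything outside an allow-list of
# non-critical index name patterns (patterns here are assumptions).
safe_delete_index() {
  index="$1"
  case "$index" in
    orders-*|logs-*)
      echo "curl -X DELETE \"localhost:9200/$index\""
      ;;
    *)
      echo "refusing to delete $index" >&2
      return 1
      ;;
  esac
}
```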

## Handling Corrupted Translog

Sometimes a shard fails to recover because its translog is corrupted. Changing translog settings will not repair existing corruption; instead, use the `elasticsearch-shard` tool to truncate the corrupted data. The truncated operations are lost, so treat this with the same caution as `allocate_stale_primary`:

```bash
# On the node that holds the shard data, with Elasticsearch stopped
bin/elasticsearch-shard remove-corrupted-data --index orders-2024-01 --shard-id 0

# Restart the node, then run the reroute command the tool prints
systemctl start elasticsearch
```

## Checking for Cluster Block

The disk pressure that often accompanies a red cluster can also trigger a block that prevents writes (for example, the flood-stage watermark sets `read_only_allow_delete`). Check for active blocks:

```bash
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep block
```

If you see a write block, you can clear it:

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.blocks.read_only_allow_delete": false
  }
}
'
```
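The grep above returns every block-related setting, including inactive ones. A slightly tighter sketch that flags only block settings currently set to `true`, run against a saved response (`settings.json` is an assumed filename):

```shell
# Sketch: list only active (true) cluster/index block settings from saved
# flat_settings output.
find_active_blocks() {
  grep -E '"(cluster|index)\.blocks\.[a-z_]+" *: *"?true"?' "$1"
}
```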

## Verification Steps

After recovery, verify the cluster is healthy:

```bash
# Check overall health
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty"

# Verify all shards are assigned
curl -X GET "localhost:9200/_cat/shards?v&s=state" | grep -v STARTED

# Check specific index health
curl -X GET "localhost:9200/_cat/indices/orders-2024-01?v"
```

Run a test query against the recovered index:

```bash
curl -X GET "localhost:9200/orders-2024-01/_search?size=1&pretty"
```

## Prevention Measures

To avoid red status in the future:

1. Maintain proper replica counts: Ensure at least one replica for production indices
2. Regular snapshots: Configure automated snapshots with a reliable schedule
3. Node monitoring: Alert on node departures immediately
4. Disk space management: Keep nodes below 85% disk usage
5. Multi-zone deployment: Distribute nodes across availability zones
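Point 4 can be scripted against `_cat/allocation` output. A sketch that flags nodes at or above a threshold, assuming the output was saved with `curl -s "localhost:9200/_cat/allocation?h=node,disk.percent" > alloc.txt`:

```shell
# Sketch: flag nodes whose disk usage meets or exceeds a percentage threshold.
# Columns assumed: node disk.percent
check_disk_usage() {
  threshold="${2:-85}"
  awk -v t="$threshold" '$2 != "" && $2+0 >= t { print $1, $2"%" }' "$1"
}
```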

Configure automated snapshots:

```bash
# Register a snapshot repository
curl -X PUT "localhost:9200/_snapshot/daily_backups" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/elasticsearch"
  }
}
'

# Create a snapshot lifecycle policy
curl -X PUT "localhost:9200/_slm/policy/daily-snapshots" -H 'Content-Type: application/json' -d'
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "daily_backups",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": true,
    "include_global_state": false
  }
}
'
```

Regular monitoring and proactive maintenance will help you avoid the stress of red status incidents.