# How to Fix Elasticsearch Red Index Status

A red cluster status in Elasticsearch is serious. It means at least one primary shard and its replicas are missing, making some of your data inaccessible. Unlike yellow status, this requires immediate attention.

## Recognizing the Problem

When you check cluster health, you'll see:

```bash
curl -X GET "localhost:9200/_cluster/health?pretty"
```

```json
{
  "cluster_name" : "production-cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 30,
  "active_shards" : 50,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10
}
```

The critical indicators are `"status" : "red"` and a non-zero `unassigned_shards` count. Requests that touch the missing shards will fail or return partial results.
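For quick triage, the health response can be checked from a script. A minimal sketch (the `health.json` path and `classify_health` name are illustrative), assuming the response was saved with `curl -s "localhost:9200/_cluster/health?pretty" > health.json`:

```shell
# Sketch: extract status and unassigned shard count from a saved health
# response using only sed, and exit non-zero when the cluster is red.
classify_health() {
  file="$1"
  status=$(sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p' "$file")
  unassigned=$(sed -n 's/.*"unassigned_shards" *: *\([0-9]*\).*/\1/p' "$file")
  echo "status=$status unassigned=$unassigned"
  [ "$status" != "red" ]   # non-zero exit status when the cluster is red
}
```

The non-zero exit status makes the check easy to wire into a cron job or alerting script.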

## Identifying Affected Indices

Find which indices are causing the red status:

```bash
curl -X GET "localhost:9200/_cat/indices?v&health=red"
```

```
health status index              pri rep docs.count docs.deleted store.size pri.store.size
red    open   orders-2024-01       5   1          0            0       260b           260b
red    open   customer-data        3   1       5000            0     15.2mb         15.2mb
```

Get detailed information about unassigned shards:

```bash
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,unassigned.details&s=state" | grep UNASSIGNED
```
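To see at a glance how many shards each index is missing, the shard listing can be summarized with awk. A sketch, assuming the output was saved to `shards.txt` using `h=index,shard,prirep,state` without `v` (so there is no header row):

```shell
# Sketch: count UNASSIGNED shards per index from saved _cat/shards output.
# Columns assumed: index shard prirep state
summarize_unassigned() {
  awk '$4 == "UNASSIGNED" { n[$1]++ } END { for (i in n) print i, n[i] }' "$1"
}
```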

## Understanding Why Shards Are Unassigned

Use the cluster allocation explain API for root cause analysis:

```bash
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "orders-2024-01",
  "shard": 0,
  "primary": true
}
'
```

The response reveals the specific issue:

```json
{
  "index" : "orders-2024-01",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2024-01-15T14:30:00.000Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt"
}
```

Common reasons include:

- `NODE_LEFT`: The node hosting the shard left the cluster
- `ALLOCATION_FAILED`: Allocation attempts failed repeatedly
- `NEW_INDEX_RESTORED`: The index was restored from a snapshot but its shards couldn't be allocated
- `CLUSTER_RECOVERED`: The shard came back unassigned after a full cluster restart
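Since each reason points at a different fix, the mapping can be encoded in a small helper. A sketch (the suggested actions mirror the recovery strategies below; `retry_failed=true` is the reroute parameter for retrying failed allocations):

```shell
# Sketch: map an unassigned.reason value to a suggested next step.
suggest_action() {
  case "$1" in
    NODE_LEFT)          echo "Bring the missing node back (Strategy 1)" ;;
    ALLOCATION_FAILED)  echo "Fix the underlying cause, then POST /_cluster/reroute?retry_failed=true" ;;
    NEW_INDEX_RESTORED) echo "Check snapshot repository access and restore again (Strategy 2)" ;;
    *)                  echo "Consult /_cluster/allocation/explain for details" ;;
  esac
}
```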

## Recovery Strategy 1: Bring Missing Nodes Back

If a node dropped from the cluster, bringing it back often resolves the issue. Check which nodes are expected:

```bash
curl -X GET "localhost:9200/_cat/nodes?v"
```

If you're missing nodes, restart them:

```bash
# On the missing node
systemctl restart elasticsearch
```

After the node rejoins, verify:

```bash
curl -X GET "localhost:9200/_cluster/health?wait_for_status=yellow&timeout=60s"
```
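Rather than re-running the health check by hand, a small polling loop can wait for the target status. A sketch in which the health command is injectable via `HEALTH_CMD` (an assumption made so the loop can be exercised offline); in production it would be the `curl` health check above:

```shell
# Sketch: poll a health command until the target status appears, with a retry cap.
# HEALTH_CMD and POLL_INTERVAL are injectable assumptions for testability.
wait_for_status() {
  target="$1"; tries="${2:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if ${HEALTH_CMD:-false} | grep -q "\"status\" *: *\"$target\""; then
      echo "reached $target"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "gave up after $tries attempts"
  return 1
}
```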

## Recovery Strategy 2: Restore from Snapshot

If nodes cannot be recovered and shards are permanently lost, restore from a snapshot. First, check available snapshots:

```bash
curl -X GET "localhost:9200/_snapshot/my_backup/_all?pretty"
```

Restore the affected index:

```bash
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_20240115/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "orders-2024-01",
  "ignore_unavailable": true,
  "include_global_state": false
}
'
```

If the index still exists, the restore will fail because an open index with the same name is already present; close it first:

```bash
# Close the corrupted index so the snapshot can be restored over it
curl -X POST "localhost:9200/orders-2024-01/_close"

curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_20240115/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "orders-2024-01",
  "ignore_unavailable": true,
  "include_global_state": false
}
'

# A restore normally reopens the index itself; this is a harmless safety check
curl -X POST "localhost:9200/orders-2024-01/_open"
```

## Recovery Strategy 3: Allocate Stale Primary

When no valid copy exists and you cannot restore from backup, you can allocate a stale primary. This is a last resort because it may result in data loss:

```bash
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "orders-2024-01",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}
'
```

The `accept_data_loss: true` flag is required; it acknowledges that any writes the stale copy missed are permanently lost.
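To avoid hand-editing the JSON each time, the reroute body can be generated from variables. A sketch (`build_stale_primary_body` is an illustrative helper; the node name should come from checking which nodes still hold a copy, e.g. `GET /orders-2024-01/_shard_stores?status=all`):

```shell
# Sketch: build the allocate_stale_primary body from index/shard/node variables
# instead of editing the JSON inline.
build_stale_primary_body() {
  printf '{"commands":[{"allocate_stale_primary":{"index":"%s","shard":%d,"node":"%s","accept_data_loss":true}}]}\n' \
    "$1" "$2" "$3"
}
```

It could then be used as `curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d "$(build_stale_primary_body orders-2024-01 0 node-1)"`.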

## Recovery Strategy 4: Delete Corrupted Indices

If the index data is not critical and cannot be recovered, you can delete it:

```bash
curl -X DELETE "localhost:9200/orders-2024-01"
```

This immediately returns the cluster to green status (assuming no other red indices). Use this approach only for non-critical data or when you have external backups.
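Because deletion is irreversible, a small guard that only allows indices matching an explicit allow-list can prevent accidents. A sketch that prints the delete command instead of running it (the name patterns are illustrative and should match your own non-critical indices):

```shell
# Sketch: refuse to emit a DELETE for anything outside an allow-list of
# non-critical index name patterns (patterns here are assumptions).
safe_delete_index() {
  index="$1"
  case "$index" in
    orders-*|logs-*)
      echo "curl -X DELETE \"localhost:9200/$index\""
      ;;
    *)
      echo "refusing to delete $index" >&2
      return 1
      ;;
  esac
}
```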

## Handling Corrupted Translog

Sometimes a shard fails to recover because its translog is corrupted. Changing translog settings will not repair existing corruption; instead, use the `elasticsearch-shard` tool to truncate the corrupted data. The truncated operations are lost, so treat this with the same caution as `allocate_stale_primary`:

```bash
# On the node that holds the shard data, with Elasticsearch stopped
bin/elasticsearch-shard remove-corrupted-data --index orders-2024-01 --shard-id 0

# Restart the node, then run the reroute command the tool prints
systemctl start elasticsearch
```

## Checking for Cluster Block

The disk pressure that often accompanies a red cluster can also trigger a block that prevents writes (for example, the flood-stage watermark sets `read_only_allow_delete`). Check for active blocks:

```bash
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep block
```

If you see a write block, you can clear it:

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.blocks.read_only_allow_delete": false
  }
}
'
```
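The grep above returns every block-related setting, including inactive ones. A slightly tighter sketch that flags only block settings currently set to `true`, run against a saved response (`settings.json` is an assumed filename):

```shell
# Sketch: list only active (true) cluster/index block settings from saved
# flat_settings output.
find_active_blocks() {
  grep -E '"(cluster|index)\.blocks\.[a-z_]+" *: *"?true"?' "$1"
}
```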

## Verification Steps

After recovery, verify the cluster is healthy:

```bash
# Check overall health
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty"

# Verify all shards are assigned
curl -X GET "localhost:9200/_cat/shards?v&s=state" | grep -v STARTED

# Check specific index health
curl -X GET "localhost:9200/_cat/indices/orders-2024-01?v"
```

Run a test query against the recovered index:

```bash
curl -X GET "localhost:9200/orders-2024-01/_search?size=1&pretty"
```

## Prevention Measures

To avoid red status in the future:

1. Maintain proper replica counts: Ensure at least one replica for production indices
2. Regular snapshots: Configure automated snapshots with a reliable schedule
3. Node monitoring: Alert on node departures immediately
4. Disk space management: Keep nodes below 85% disk usage
5. Multi-zone deployment: Distribute nodes across availability zones
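Point 4 can be scripted against `_cat/allocation` output. A sketch that flags nodes at or above a threshold, assuming the output was saved with `curl -s "localhost:9200/_cat/allocation?h=node,disk.percent" > alloc.txt`:

```shell
# Sketch: flag nodes whose disk usage meets or exceeds a percentage threshold.
# Columns assumed: node disk.percent
check_disk_usage() {
  threshold="${2:-85}"
  awk -v t="$threshold" '$2 != "" && $2+0 >= t { print $1, $2"%" }' "$1"
}
```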

Configure automated snapshots:

```bash
# Register a snapshot repository
curl -X PUT "localhost:9200/_snapshot/daily_backups" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/elasticsearch"
  }
}
'

# Create a snapshot lifecycle policy
curl -X PUT "localhost:9200/_slm/policy/daily-snapshots" -H 'Content-Type: application/json' -d'
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "daily_backups",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": true,
    "include_global_state": false
  }
}
'
```

Regular monitoring and proactive maintenance will help you avoid the stress of red status incidents.