# How to Fix Elasticsearch Yellow Cluster Status

You've checked your Elasticsearch cluster health and noticed it's showing yellow instead of green. While your data is still accessible, this status indicates something isn't quite right with your shard allocation.

## Understanding Yellow Status

A yellow cluster status means that all primary shards are assigned and functioning, but at least one replica shard is unassigned. This isn't a critical failure like red status, but it does mean you've lost your redundancy for some indices.

Here's what you'll typically see when running a health check:

```bash
curl -X GET "localhost:9200/_cluster/health?pretty"
```

The response shows the yellow status:

```json
{
  "cluster_name" : "production-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 15,
  "active_shards" : 15,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 15,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
```

Note that unassigned_shards (15) matches active_primary_shards (15): with a single data node, each primary's one replica has nowhere to go.

## Common Causes

The yellow status typically occurs for these reasons:

  1. Single-node cluster: replicas are configured (one per primary by default), but Elasticsearch never places a replica on the same node as its primary, so there is nowhere to put them
  2. Node failures: nodes left the cluster, leaving replicas without homes
  3. Disk space issues: nodes don't have enough free disk for replica allocation
  4. Allocation settings: shard allocation has been disabled or restricted
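
Two quick checks cover the first three causes: how many data nodes the cluster has (and how full their disks are), and how many replicas each index expects:

```bash
# How many data nodes are there, and how full are their disks?
curl -X GET "localhost:9200/_cat/nodes?v&h=name,node.role,disk.used_percent"

# How many replicas does each index want, and what is its health?
curl -X GET "localhost:9200/_cat/indices?v&h=index,rep,health"
```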

## Diagnosing the Issue

First, identify which indices have unassigned shards:

```bash
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state"
```

This will show you something like:

```bash
index              shard prirep state      unassigned.reason
logs-2024-01       0     r      UNASSIGNED ALLOCATION_FAILED
logs-2024-01       1     r      UNASSIGNED ALLOCATION_FAILED
products           0     r      UNASSIGNED NODE_LEFT
```

In this output, prirep marks each shard as primary (p) or replica (r), and unassigned.reason gives a coarse code such as NODE_LEFT or ALLOCATION_FAILED. For the full story behind a code, ask the allocation explain API:

```bash
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
```

This returns detailed diagnostics:

```json
{
  "index" : "logs-2024-01",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2024-01-15T10:30:00.000Z",
    "failed_attempts" : 5,
    "details" : "failed to create shard [...]",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes"
}
```
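
By default the API explains the first unassigned shard it finds. To ask about a specific shard from the _cat/shards listing, pass its coordinates in the request body:

```bash
# Explain allocation for one specific shard (primary: false means a replica)
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "logs-2024-01",
  "shard": 0,
  "primary": false
}
'
```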

## Solution 1: Single-Node Cluster

If you're running a single-node cluster for development or testing, the simplest solution is to reduce the replica count to zero:

```bash
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 0
  }
}
'
```
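
You can confirm the change took effect by filtering the settings output down to the replica count:

```bash
# Verify every index now reports 0 replicas
curl -X GET "localhost:9200/_settings/index.number_of_replicas?pretty"
```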

To apply this to future indices, update your index templates:

```bash
curl -X PUT "localhost:9200/_template/default_replicas" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["*"],
  "settings": {
    "number_of_replicas": 0
  }
}
'
```
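
Note that _template is the legacy template API, deprecated since Elasticsearch 7.8. On newer clusters, a sketch of the composable equivalent (the template name is arbitrary):

```bash
curl -X PUT "localhost:9200/_index_template/default_replicas" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["*"],
  "template": {
    "settings": {
      "number_of_replicas": 0
    }
  }
}
'
```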

## Solution 2: Fix Allocation Settings

Check if shard allocation has been disabled:

```bash
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty"
```

Look for cluster.routing.allocation.enable. If it's set to none (or primaries), re-enable full allocation:

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
'
```
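
Transient settings are reset on a full cluster restart (and are deprecated in recent Elasticsearch versions); to make the change survive restarts, put it in the persistent block instead:

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}
'
```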

## Solution 3: Address Disk Space

Check disk usage across your nodes:

```bash
curl -X GET "localhost:9200/_cat/allocation?v"
```

If nodes are above the low watermark (85% by default), Elasticsearch stops allocating new shards, including replicas, to them, and you'll need to free up space or add nodes. The default disk watermarks are:

  • cluster.routing.allocation.disk.watermark.low: 85% - no new shards are allocated to the node
  • cluster.routing.allocation.disk.watermark.high: 90% - Elasticsearch tries to relocate shards off the node
  • cluster.routing.allocation.disk.watermark.flood_stage: 95% - indices with a shard on the node get a read-only block

You can temporarily adjust these settings:

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}
'
```
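
These overrides are meant to be temporary. Once disk pressure is resolved, clear them so the defaults apply again (setting a transient value to null resets it):

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  }
}
'
```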

However, the better approach is to add more disk capacity or delete old indices.
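
For example, deleting an index you no longer need frees space immediately (logs-2023-12 below is a hypothetical name; substitute your own):

```bash
# Permanently delete an old index to reclaim disk space
curl -X DELETE "localhost:9200/logs-2023-12"
```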

## Solution 4: Reroute Stuck Shards

A shard that has failed allocation too many times (five attempts by default, matching the failed_attempts field in the explain output above) stays unassigned until you intervene. Once you've fixed the underlying problem, pass retry_failed=true to the reroute API to retry those allocations. You can also issue manual commands in the same call, but be careful: allocate_stale_primary applies only to lost primary shards (a red-status scenario, not yellow) and requires accept_data_loss because it can discard newer data held by other copies:

```bash
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "logs-2024-01",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}
'
```

For unassigned replicas, the case that actually produces yellow status, use allocate_replica instead. The node value must name a node that doesn't already hold a copy of that shard:

```bash
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_replica": {
        "index": "logs-2024-01",
        "shard": 0,
        "node": "node-2"
      }
    }
  ]
}
'
```

## Verifying the Fix

After applying your solution, verify cluster health:

```bash
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=30s&pretty"
```

Check the shard allocation status:

```bash
curl -X GET "localhost:9200/_cat/shards?v&s=state"
```

You should see all shards showing STARTED status and no UNASSIGNED entries.

## Prevention

To prevent yellow status from recurring:

  1. Monitor node count: Ensure you have enough nodes to accommodate your replica configuration
  2. Set up alerts: Configure alerts for yellow status changes
  3. Plan capacity: Monitor disk usage and add capacity before hitting watermarks
  4. Use ILM: Implement Index Lifecycle Management to handle index aging and cleanup

Set up a basic alert using Elasticsearch's Watcher feature or integrate with tools like Prometheus and Grafana for ongoing monitoring; even a simple script polling the health API, like the sketch below, is better than nothing.
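
A minimal sketch, assuming curl and jq are installed and the cluster listens on localhost:9200:

```bash
#!/usr/bin/env bash
# Warn when the cluster is not green; wire this into cron or your alerting tool.
STATUS=$(curl -s "localhost:9200/_cluster/health" | jq -r '.status')
if [ "$STATUS" != "green" ]; then
  echo "ALERT: Elasticsearch cluster status is ${STATUS:-unreachable}" >&2
  exit 1
fi
```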