# How to Fix Elasticsearch Yellow Cluster Status

You've checked your Elasticsearch cluster health and noticed it's showing yellow instead of green. While your data is still accessible, this status indicates something isn't quite right with your shard allocation.

## Understanding Yellow Status

A yellow cluster status means that all primary shards are assigned and functioning, but at least one replica shard is unassigned. This isn't a critical failure like red status, but it does mean you've lost your redundancy for some indices.

Here's what you'll typically see when running a health check:

```bash
curl -X GET "localhost:9200/_cluster/health?pretty"
```

The response shows the yellow status:

```json
{
  "cluster_name" : "production-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 15,
  "active_shards" : 15,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 15,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
```

Note that unassigned_shards (15) matches active_primary_shards (15): with a single data node, each primary's one replica has nowhere to go.

## Common Causes

The yellow status typically occurs for these reasons:

  1. Single-node cluster: replicas are configured (one per primary by default), but Elasticsearch never places a replica on the same node as its primary, so there is nowhere to put them
  2. Node failures: nodes left the cluster, leaving replicas without homes
  3. Disk space issues: nodes don't have enough free disk for replica allocation
  4. Allocation settings: shard allocation has been disabled or restricted
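
Two quick checks cover the first three causes: how many data nodes the cluster has (and how full their disks are), and how many replicas each index expects:

```bash
# How many data nodes are there, and how full are their disks?
curl -X GET "localhost:9200/_cat/nodes?v&h=name,node.role,disk.used_percent"

# How many replicas does each index want, and what is its health?
curl -X GET "localhost:9200/_cat/indices?v&h=index,rep,health"
```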

## Diagnosing the Issue

First, identify which indices have unassigned shards:

```bash
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state"
```

This will show you something like:

```bash
index              shard prirep state      unassigned.reason
logs-2024-01       0     r      UNASSIGNED ALLOCATION_FAILED
logs-2024-01       1     r      UNASSIGNED ALLOCATION_FAILED
products           0     r      UNASSIGNED NODE_LEFT
```

In this output, prirep marks each shard as primary (p) or replica (r), and unassigned.reason gives a coarse code such as NODE_LEFT or ALLOCATION_FAILED. For the full story behind a code, ask the allocation explain API:

```bash
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
```

This returns detailed diagnostics:

```json
{
  "index" : "logs-2024-01",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2024-01-15T10:30:00.000Z",
    "failed_attempts" : 5,
    "details" : "failed to create shard [...]",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes"
}
```
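
By default the API explains the first unassigned shard it finds. To ask about a specific shard from the _cat/shards listing, pass its coordinates in the request body:

```bash
# Explain allocation for one specific shard (primary: false means a replica)
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "logs-2024-01",
  "shard": 0,
  "primary": false
}
'
```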

## Solution 1: Single-Node Cluster

If you're running a single-node cluster for development or testing, the simplest solution is to reduce the replica count to zero:

```bash
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 0
  }
}
'
```
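
You can confirm the change took effect by filtering the settings output down to the replica count:

```bash
# Verify every index now reports 0 replicas
curl -X GET "localhost:9200/_settings/index.number_of_replicas?pretty"
```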

To apply this to future indices, update your index templates:

```bash
curl -X PUT "localhost:9200/_template/default_replicas" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["*"],
  "settings": {
    "number_of_replicas": 0
  }
}
'
```
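
Note that _template is the legacy template API, deprecated since Elasticsearch 7.8. On newer clusters, a sketch of the composable equivalent (the template name is arbitrary):

```bash
curl -X PUT "localhost:9200/_index_template/default_replicas" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["*"],
  "template": {
    "settings": {
      "number_of_replicas": 0
    }
  }
}
'
```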

## Solution 2: Fix Allocation Settings

Check if shard allocation has been disabled:

```bash
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty"
```

Look for cluster.routing.allocation.enable. If it's set to none (or primaries), re-enable full allocation:

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
'
```
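
Transient settings are reset on a full cluster restart (and are deprecated in recent Elasticsearch versions); to make the change survive restarts, put it in the persistent block instead:

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}
'
```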

## Solution 3: Address Disk Space

Check disk usage across your nodes:

```bash
curl -X GET "localhost:9200/_cat/allocation?v"
```

If nodes are above the low watermark (85% by default), Elasticsearch stops allocating new shards, including replicas, to them, and you'll need to free up space or add nodes. The default disk watermarks are:

  • cluster.routing.allocation.disk.watermark.low: 85% - no new shards are allocated to the node
  • cluster.routing.allocation.disk.watermark.high: 90% - Elasticsearch tries to relocate shards off the node
  • cluster.routing.allocation.disk.watermark.flood_stage: 95% - indices with a shard on the node get a read-only block

You can temporarily adjust these settings:

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}
'
```
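
These overrides are meant to be temporary. Once disk pressure is resolved, clear them so the defaults apply again (setting a transient value to null resets it):

```bash
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  }
}
'
```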

However, the better approach is to add more disk capacity or delete old indices.
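
For example, deleting an index you no longer need frees space immediately (logs-2023-12 below is a hypothetical name; substitute your own):

```bash
# Permanently delete an old index to reclaim disk space
curl -X DELETE "localhost:9200/logs-2023-12"
```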

## Solution 4: Reroute Stuck Shards

A shard that has failed allocation too many times (five attempts by default, matching the failed_attempts field in the explain output above) stays unassigned until you intervene. Once you've fixed the underlying problem, pass retry_failed=true to the reroute API to retry those allocations. You can also issue manual commands in the same call, but be careful: allocate_stale_primary applies only to lost primary shards (a red-status scenario, not yellow) and requires accept_data_loss because it can discard newer data held by other copies:

```bash
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "logs-2024-01",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}
'
```

For unassigned replicas, the case that actually produces yellow status, use allocate_replica instead. The node value must name a node that doesn't already hold a copy of that shard:

```bash
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_replica": {
        "index": "logs-2024-01",
        "shard": 0,
        "node": "node-2"
      }
    }
  ]
}
'
```

## Verifying the Fix

After applying your solution, verify cluster health:

```bash
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=30s&pretty"
```

Check the shard allocation status:

```bash
curl -X GET "localhost:9200/_cat/shards?v&s=state"
```

You should see all shards showing STARTED status and no UNASSIGNED entries.

## Prevention

To prevent yellow status from recurring:

  1. Monitor node count: Ensure you have enough nodes to accommodate your replica configuration
  2. Set up alerts: Configure alerts for yellow status changes
  3. Plan capacity: Monitor disk usage and add capacity before hitting watermarks
  4. Use ILM: Implement Index Lifecycle Management to handle index aging and cleanup

Set up a basic alert using Elasticsearch's Watcher feature or integrate with tools like Prometheus and Grafana for ongoing monitoring; even a simple script polling the health API, like the sketch below, is better than nothing.
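
A minimal sketch, assuming curl and jq are installed and the cluster listens on localhost:9200:

```bash
#!/usr/bin/env bash
# Warn when the cluster is not green; wire this into cron or your alerting tool.
STATUS=$(curl -s "localhost:9200/_cluster/health" | jq -r '.status')
if [ "$STATUS" != "green" ]; then
  echo "ALERT: Elasticsearch cluster status is ${STATUS:-unreachable}" >&2
  exit 1
fi
```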