What's Actually Happening
A red cluster status means at least one primary shard is unassigned. The data in those shards is unavailable: searches against the affected indices fail or return partial results, and writes routed to the missing shards are rejected.
The Error You'll See
Cluster health red:
```bash
$ curl -XGET 'http://localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "my-cluster",
  "status" : "red",
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 45,
  "active_shards" : 90,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 0
}
```
Unassigned primary shards:
```bash
$ curl -XGET 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'
index   shard prirep state      unassigned.reason
myindex 0     p      UNASSIGNED NODE_LEFT
myindex 1     p      UNASSIGNED ALLOCATION_FAILED
```
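When many shards are unassigned, it helps to group the primaries by reason before deciding which fix to try first. The sketch below assumes the column layout shown above (`index,shard,prirep,state,unassigned.reason`); the function name `summarize_unassigned_primaries` and the sample rows are hypothetical.

```bash
# Count unassigned PRIMARY shards per unassigned.reason.
# Input (stdin): text from _cat/shards?v&h=index,shard,prirep,state,unassigned.reason
summarize_unassigned_primaries() {
  awk '$3 == "p" && $4 == "UNASSIGNED" { count[$5]++ }
       END { for (r in count) print r, count[r] }' | sort
}

# Hypothetical sample in the same layout as the output above.
sample_shards='index   shard prirep state      unassigned.reason
myindex 0     p      UNASSIGNED NODE_LEFT
myindex 1     p      UNASSIGNED ALLOCATION_FAILED
myindex 2     p      STARTED
myindex 0     r      UNASSIGNED NODE_LEFT'

printf '%s\n' "$sample_shards" | summarize_unassigned_primaries
```

Against a live cluster, the same function can be fed from `curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'`. Replica rows are deliberately ignored, since only unassigned primaries turn the cluster red.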
Why This Happens
1. Node failure - The node hosting the primary shard left the cluster
2. Disk space - A node exceeded a disk watermark
3. Allocation failure - Shard allocation failed repeatedly
4. Configuration error - Shard allocation settings prevent assignment
5. Network partition - Nodes cannot communicate
6. Corrupted shard - Shard data is corrupted
7. JVM OOM - A node crashed after running out of memory
8. Version mismatch - Nodes run incompatible OpenSearch versions
Step 1: Diagnose Cluster Status
```bash
# Get cluster health:
curl -XGET 'http://localhost:9200/_cluster/health?pretty'

# Get detailed cluster state (can be very large):
curl -XGET 'http://localhost:9200/_cluster/state?pretty'

# Check unassigned shards:
curl -XGET 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

# Get allocation explanation:
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'

# Check node stats:
curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,disk.used_percent'

# Check disk space:
curl -XGET 'http://localhost:9200/_cat/allocation?v'
```
Step 2: Identify Unassigned Shard Reason
```bash
# Get a detailed explanation for one unassigned shard
# (requests with a JSON body need the Content-Type header):
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' -d '{
  "index": "myindex",
  "shard": 0,
  "primary": true
}'

# Common reasons:
# - NODE_LEFT: Node left cluster
# - ALLOCATION_FAILED: Allocation failed multiple times
# - CLUSTER_RECOVERED: Cluster recovery in progress
# - REINITIALIZED: Shard reinitialized
# - DANGLING_INDEX_IMPORTED: Dangling index

# Check allocation settings:
curl -XGET 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty'

# Check for disk watermark overrides:
curl -XGET 'http://localhost:9200/_cluster/settings?pretty' | grep -A 5 watermark
```
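The reason codes listed above can be turned into a rough triage helper. The mapping below is only an assumption about how this guide's steps line up with each code, not an official OpenSearch mapping, and `next_step_for_reason` is a hypothetical name.

```bash
# Suggest which step of this guide to read for a given unassigned.reason.
# The mapping is an editorial assumption, not something OpenSearch reports.
next_step_for_reason() {
  case "$1" in
    NODE_LEFT)         echo "Step 4: Handle Node Failure" ;;
    ALLOCATION_FAILED) echo "Step 5: Fix Allocation Failures" ;;
    CLUSTER_RECOVERED) echo "Wait for recovery, then repeat Step 1" ;;
    *)                 echo "Step 2: run allocation explain for this shard" ;;
  esac
}

for reason in NODE_LEFT ALLOCATION_FAILED REINITIALIZED; do
  echo "$reason -> $(next_step_for_reason "$reason")"
done
```

Any code without an obvious fix falls through to the allocation explain call, which is usually the most specific source of information.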
Step 3: Fix Disk Space Issues
```bash
# Check disk usage per node:
curl -XGET 'http://localhost:9200/_cat/allocation?v'

# If disk usage exceeds a watermark (defaults: low 85%, high 90%, flood stage 95%):
# Option 1: Free disk space
# Option 2: Raise the watermarks temporarily (set them back to null afterwards):
curl -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}'

# Retry failed allocations after freeing space:
curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

# Delete unnecessary indices (irreversible; double-check the pattern first):
curl -XDELETE 'http://localhost:9200/old-index-*'

# Or close indices (this releases heap, but the data stays on disk):
curl -XPOST 'http://localhost:9200/old-index/_close'
```
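To see at a glance which nodes have crossed a disk threshold, the allocation output can be filtered with `awk`. This is a minimal sketch that assumes the output was requested with only two columns (`_cat/allocation?v&h=node,disk.percent`); `nodes_over_watermark` and the sample data are hypothetical.

```bash
# Print nodes whose disk usage is at or above a threshold (percent).
# Input (stdin): text from _cat/allocation?v&h=node,disk.percent
nodes_over_watermark() {
  threshold="$1"
  awk -v t="$threshold" \
    'NR > 1 && $2 ~ /^[0-9]+$/ && $2 + 0 >= t { print $1, $2 "%" }'
}

# Hypothetical sample; the UNASSIGNED row has no disk figure and is skipped.
sample_allocation='node   disk.percent
node-1 72
node-2 91
node-3 85
UNASSIGNED'

printf '%s\n' "$sample_allocation" | nodes_over_watermark 85
```

Passing the threshold as an argument makes it easy to check against whichever watermark is currently configured rather than a hard-coded value.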
Step 4: Handle Node Failure
```bash
# Check cluster nodes:
curl -XGET 'http://localhost:9200/_cat/nodes?v'

# If a node left, check pending tasks:
curl -XGET 'http://localhost:9200/_cluster/pending_tasks?pretty'

# If the node will not return:
# Once index.unassigned.node_left.delayed_timeout (default 1m) expires,
# replicas are reallocated from surviving copies. A primary with no
# surviving in-sync copy stays unassigned until you intervene.

# Force allocation of a stale copy held on a remaining node:
curl -XPOST 'http://localhost:9200/_cluster/reroute' \
  -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_stale_primary": {
      "index": "myindex",
      "shard": 0,
      "node": "node-1",
      "accept_data_loss": true
    }
  }]
}'

# WARNING: accept_data_loss may lose data
# Only use if no in-sync copy is available

# Check which copies of the shards are assigned:
curl -XGET 'http://localhost:9200/_cat/shards/myindex?v' | grep -v UNASSIGNED
```
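Because the reroute command above can discard data, it can be worth generating its JSON body through a small helper that refuses to run without an explicit confirmation flag. The sketch below only builds the request body shown above; `stale_primary_body` and its `--accept-data-loss` flag are hypothetical names, not part of any OpenSearch tooling.

```bash
# Build the JSON body for an allocate_stale_primary reroute command.
# Usage: stale_primary_body INDEX SHARD NODE --accept-data-loss
stale_primary_body() {
  if [ "${4:-}" != "--accept-data-loss" ]; then
    echo "refusing: pass --accept-data-loss to confirm" >&2
    return 1
  fi
  printf '{"commands":[{"allocate_stale_primary":{"index":"%s","shard":%d,"node":"%s","accept_data_loss":true}}]}\n' \
    "$1" "$2" "$3"
}

# Print the body for shard 0 of myindex on node-1.
stale_primary_body myindex 0 node-1 --accept-data-loss
```

The output can then be passed to `curl -XPOST .../_cluster/reroute -H 'Content-Type: application/json' -d "$body"`, which keeps the destructive step visible in shell history.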
Step 5: Fix Allocation Failures
```bash
# Check allocation settings:
curl -XGET 'http://localhost:9200/_cluster/settings?pretty'

# Enable allocation if disabled:
curl -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}'

# Reset allocation filters:
curl -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '{
  "persistent": {
    "cluster.routing.allocation.exclude._name": null,
    "cluster.routing.allocation.require._name": null
  }
}'

# Retry failed allocations:
curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

# Watch shards that are still recovering or moving:
curl -XGET 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state' | grep -E "INITIALIZING|RELOCATING"
```
Step 6: Handle Corrupted Shards
```bash
# Check for corrupted shards in logs:
grep -i "corrupt" /var/log/opensearch/my-cluster.log

# List on-disk shard copies and any store exceptions:
curl -XGET 'http://localhost:9200/_shard_stores?pretty'

# Replace the corrupted shard with an empty one (all data in that shard is lost):
curl -XPOST 'http://localhost:9200/_cluster/reroute' \
  -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_empty_primary": {
      "index": "myindex",
      "shard": 0,
      "node": "node-1",
      "accept_data_loss": true
    }
  }]
}'

# Better: restore from snapshot if available
```
Step 7: Resolve Network Partition
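```bash
# Check transport-layer statistics between nodes:
curl -XGET 'http://localhost:9200/_nodes/stats/transport?pretty'

# Check that each node responds over HTTP:
for node in node1 node2 node3; do
  curl -XGET "http://$node:9200/_cluster/health?pretty"
done

# Check firewall rules (9200 = HTTP, 9300 = transport):
sudo iptables -L -n | grep 9200
sudo iptables -L -n | grep 9300

# Quorum and split-brain protection:
# A majority of master-eligible nodes, (nodes/2) + 1, must be reachable.
# OpenSearch maintains this voting quorum itself; the legacy Zen setting
# discovery.zen.minimum_master_nodes is deprecated and no longer controls it,
# so do not set it. Inspect the current voting configuration instead:
curl -XGET 'http://localhost:9200/_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config&pretty'

# Restart nodes to rejoin:
systemctl restart opensearch
```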
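The `(nodes/2) + 1` majority rule mentioned above uses integer division, which is why an even node count gives no extra fault tolerance over the next lower odd count. A small worked example, with `quorum` as a hypothetical helper name:

```bash
# Majority of N master-eligible nodes, using integer division.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

for n in 3 4 5; do
  echo "$n master-eligible nodes -> quorum $(quorum "$n")"
done
```

With 3 nodes the quorum is 2 (one node may fail), with 4 it is 3 (still only one may fail), and with 5 it is 3 (two may fail).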
Step 8: Restore from Snapshot
```bash
# Register snapshot repository (the location must be listed under path.repo on every node):
curl -XPUT 'http://localhost:9200/_snapshot/my_backup' \
  -H 'Content-Type: application/json' -d '{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup"
  }
}'

# List snapshots:
curl -XGET 'http://localhost:9200/_snapshot/my_backup/_all?pretty'

# Restore snapshot (an existing index of the same name must be closed or deleted first):
curl -XPOST 'http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore' \
  -H 'Content-Type: application/json' -d '{
  "indices": "myindex",
  "ignore_unavailable": true,
  "include_global_state": false
}'

# Restore to new index:
curl -XPOST 'http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore' \
  -H 'Content-Type: application/json' -d '{
  "indices": "myindex",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_$1"
}'

# Check restore status:
curl -XGET 'http://localhost:9200/_cat/recovery?v'
```
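The `rename_pattern` / `rename_replacement` pair is a regular-expression substitution applied to each restored index name. To preview the resulting name before restoring, the same substitution can be approximated locally with `sed`. This is only an approximation: the restore API uses `$1` for the first capture group, while `sed` writes it as `\1`. The helper name `preview_restored_name` is hypothetical.

```bash
# Approximate rename_pattern "(.+)" with rename_replacement "restored_$1".
preview_restored_name() {
  printf '%s\n' "$1" | sed -E 's/(.+)/restored_\1/'
}

preview_restored_name myindex
```

Here `myindex` becomes `restored_myindex`, so the restore does not collide with the existing (red) index of the original name.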
Step 9: Prevent Future Issues
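```bash
# Configure replication (note: the "*" pattern applies to every newly created index):
curl -XPUT 'http://localhost:9200/_template/default_replicas' \
  -H 'Content-Type: application/json' -d '{
  "index_patterns": ["*"],
  "settings": {
    "number_of_replicas": 1
  }
}'

# Set up snapshot automation:
curl -XPUT 'http://localhost:9200/_snapshot/automated_backup' \
  -H 'Content-Type: application/json' -d '{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/automated"
  }
}'

# Create a nightly snapshot policy at 01:30 UTC.
# OpenSearch does not provide the Elasticsearch _slm API; this uses
# Snapshot Management (_plugins/_sm), available in OpenSearch 2.1 and later:
curl -XPOST 'http://localhost:9200/_plugins/_sm/policies/nightly-snapshots' \
  -H 'Content-Type: application/json' -d '{
  "description": "Nightly snapshots",
  "creation": {
    "schedule": { "cron": { "expression": "30 1 * * *", "timezone": "UTC" } }
  },
  "snapshot_config": {
    "repository": "automated_backup",
    "indices": "*"
  }
}'

# Monitor cluster health (waits up to 30s for at least yellow):
curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=30s'
```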
Step 10: Production Cluster Health Monitoring
```bash
# Monitoring script (requires curl and jq):
cat << 'EOF' > /usr/local/bin/opensearch-health.sh
#!/bin/bash

HOST="localhost:9200"

echo "=== OpenSearch Cluster Health ==="

# Cluster status
STATUS=$(curl -s "http://$HOST/_cluster/health" | jq -r '.status')
echo "Status: $STATUS"

if [ "$STATUS" = "red" ]; then
  echo "ALERT: Cluster is RED!"

  echo -e "\nUnassigned shards:"
  curl -s "http://$HOST/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

  echo -e "\nAllocation explain:"
  curl -s "http://$HOST/_cluster/allocation/explain" | jq '{unassigned_info, allocate_explanation}'

  echo -e "\nNode disk usage:"
  curl -s "http://$HOST/_cat/allocation?v"
fi

echo -e "\nNode stats:"
curl -s "http://$HOST/_cat/nodes?v&h=name,heap.percent,ram.percent,disk.used_percent,load_1m"

echo -e "\nRed indices (first 10 lines):"
curl -s "http://$HOST/_cat/indices?v&health=red" | head -10
EOF

chmod +x /usr/local/bin/opensearch-health.sh

# Node metrics as JSON. _nodes/stats has no Prometheus output format;
# exposing metrics to Prometheus requires a separate exporter:
curl -XGET 'http://localhost:9200/_nodes/stats/fs,os,jvm,process?pretty'

# Key values to monitor:
# - cluster status                (_cluster/health: status)
# - number of unassigned shards   (_cluster/health: unassigned_shards)
# - disk used percent per node    (_cat/allocation: disk.percent)
# - JVM heap used percent         (_nodes/stats: jvm.mem.heap_used_percent)
```
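For the health script to drive alerting, its result usually needs to be an exit code rather than text. A minimal sketch, assuming the common monitoring-plugin convention of 0 = OK, 1 = warning, 2 = critical, 3 = unknown; `status_to_exit_code` is a hypothetical helper, not part of OpenSearch.

```bash
# Map a cluster status string to a monitoring-style exit code.
status_to_exit_code() {
  case "$1" in
    green)  return 0 ;;
    yellow) return 1 ;;
    red)    return 2 ;;
    *)      return 3 ;;   # empty or unexpected value, e.g. cluster unreachable
  esac
}

# Example: report the code that a red cluster would produce.
status_to_exit_code red || echo "exit code: $?"
```

In the script above, this could be called as the last command with the `$STATUS` variable, so that cron wrappers or monitoring agents can react to the script's exit status.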
OpenSearch Cluster Red Status Checklist
| Check | Command | Expected |
|---|---|---|
| Cluster health | _cluster/health | green/yellow |
| Unassigned shards | _cat/shards | None UNASSIGNED |
| Disk usage | _cat/allocation | < 85% |
| Node status | _cat/nodes | All nodes present |
| Allocation enabled | _cluster/settings | "all" |
| Replicas | index settings | >= 1 |
Verify the Fix
```bash
# After fixing cluster:

# 1. Check cluster health
curl -XGET 'http://localhost:9200/_cluster/health?pretty'
# Output: "status" : "green" or "yellow"

# 2. Verify no unassigned primary shards
curl -XGET 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state' | grep UNASSIGNED
# Output: Empty (on a yellow cluster, only replica rows marked "r" remain)

# 3. Check for indices that are not green
curl -XGET 'http://localhost:9200/_cat/indices?v' | grep -v green
# Output: Only the header row, plus any yellow indices; no red indices

# 4. Test search operations
curl -XGET 'http://localhost:9200/myindex/_search?size=1'
# Output: Valid search results

# Compare before/after:
# Before: status: red, 5 unassigned shards
# After: status: green, 0 unassigned shards
```
Related Issues
- [Fix Elasticsearch Cluster Yellow Status](/articles/fix-elasticsearch-cluster-yellow-status)
- [Fix Elasticsearch Shard Allocation Failed](/articles/fix-elasticsearch-shard-allocation-failed)
- [Fix Elasticsearch Index Corrupted](/articles/fix-elasticsearch-index-corrupted)