Introduction

Elasticsearch node frequently dropping from cluster when heartbeat timeout or network unstable. This guide provides step-by-step diagnosis and resolution.

Symptoms

Typical error output:

bash
WARN: Node leaving cluster
Node "es-node-3" disconnected: heartbeat timeout 60s exceeded
Cluster state not synchronized

Common Causes

  1. 1.Network latency exceeding ping timeout
  2. 2.GC pauses causing heartbeat delays
  3. 3.Node overload reducing responsiveness
  4. 4.Discovery configuration incorrect

Step-by-Step Fix

Step 1: Check Current State

bash
curl -XGET 'localhost:9200/_cluster/health?pretty'
curl -XGET 'localhost:9200/_cat/nodes?v'
curl -XGET 'localhost:9200/_nodes/stats/process?pretty'

Step 2: Identify Root Cause

bash
curl -XGET 'localhost:9200/_cluster/health?pretty'
curl -XGET 'localhost:9200/_nodes/stats?pretty'

Step 3: Apply Primary Fix

```bash # Increase ping timeout PUT _cluster/settings { "persistent": { "discovery.zen.fd.ping_timeout": "90s", "discovery.zen.fd.ping_retries": 5 } }

# Increase node fault detection timeout PUT _cluster/settings { "persistent": { "cluster.fault_detection.leader_check.timeout": "30s" } } ```

Step 4: Apply Alternative Fix

```bash # Alternative fix: Check node stats GET _nodes/stats?pretty

# Update specific index settings PUT my-index/_settings { "index": { "refresh_interval": "30s" } }

# Verify the fix GET _cat/indices?v&index=my-index ```

Step 5: Verify the Fix

bash
curl -XGET 'localhost:9200/_cluster/health?pretty'
curl -XGET 'localhost:9200/_cat/nodes?v'
# All nodes should be present

Common Pitfalls

  • Not waiting for cluster state propagation after settings change
  • Using text field for aggregations instead of keyword
  • Setting circuit breaker limits too low for production workload
  • Ignoring disk watermark warnings until cluster blocks

Best Practices

  • Monitor cluster health regularly with _cluster/health API
  • Use keyword fields for aggregations to avoid fielddata
  • Set circuit breaker limits based on heap size
  • Configure ILM policies for automated index management
  • Elasticsearch Cluster Red Status
  • Elasticsearch Index Not Found
  • Elasticsearch Query Timeout
  • Elasticsearch Node High CPU