Introduction
Elasticsearch nodes frequently drop out of the cluster when heartbeat (fault-detection) checks time out or the network is unstable. This guide walks through diagnosis and resolution step by step.
Symptoms
Typical error output:
WARN: Node leaving cluster
Node "es-node-3" disconnected: heartbeat timeout 60s exceeded
Cluster state not synchronized
Common Causes
- Network latency exceeding the ping timeout
- GC pauses causing heartbeat delays
- Node overload reducing responsiveness
- Incorrect discovery configuration
Step-by-Step Fix
Step 1: Check Current State
curl -XGET 'localhost:9200/_cluster/health?pretty'
curl -XGET 'localhost:9200/_cat/nodes?v'
curl -XGET 'localhost:9200/_nodes/stats/process?pretty'
Step 2: Identify Root Cause
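A useful first step toward the root cause is knowing exactly which node dropped. The following is a minimal sketch, assuming the node list was fetched with `curl -XGET 'localhost:9200/_cat/nodes?v&h=name'` so that the output is a `name` header followed by one node name per line. The helper `missing_nodes` and the names `es-node-1` and `es-node-2` are hypothetical; only `es-node-3` appears in the error output above.

```python
from __future__ import annotations


def missing_nodes(cat_nodes_output: str, expected: set[str]) -> set[str]:
    """Return the expected node names that are absent from `_cat/nodes?v&h=name` output."""
    lines = [ln.strip() for ln in cat_nodes_output.splitlines() if ln.strip()]
    # With ?v the first non-empty line is the column header ("name"); skip it.
    present = set(lines[1:])
    return expected - present


# Hypothetical output for a three-node cluster that has lost es-node-3.
sample = "name\nes-node-1\nes-node-2\n"
print(missing_nodes(sample, {"es-node-1", "es-node-2", "es-node-3"}))  # {'es-node-3'}
```

Once the missing node is known, the stats requests in this step can be read with that node in mind.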
curl -XGET 'localhost:9200/_cluster/health?pretty'
curl -XGET 'localhost:9200/_nodes/stats?pretty'
Step 3: Apply Primary Fix
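```bash
# Increase the ping timeout (legacy Zen discovery, Elasticsearch 6.x and earlier)
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "discovery.zen.fd.ping_timeout": "90s",
    "discovery.zen.fd.ping_retries": 5
  }
}'

# Increase the node fault detection timeout (Elasticsearch 7.x and later)
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.fault_detection.leader_check.timeout": "30s"
  }
}'
# If a setting is rejected as not dynamically updateable,
# set it in elasticsearch.yml on each node and restart.
```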
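Raising timeouts also lengthens the time a genuinely dead node stays in the cluster. As a rough worked example, the sketch below assumes a node is removed only after every retry has waited out the full timeout, and ignores the interval between checks; `detection_window_seconds` is a hypothetical helper, not an Elasticsearch API.

```python
def detection_window_seconds(timeout_s: int, retries: int) -> int:
    """Approximate worst-case time before an unresponsive node is removed."""
    return timeout_s * retries


# Values from Step 3: ping_timeout 90s with ping_retries 5.
print(detection_window_seconds(90, 5))  # 450

# leader_check.timeout 30s, assuming the retry count is left at 3.
print(detection_window_seconds(30, 3))  # 90
```

A window of several minutes may be acceptable on a flaky network, but it is worth weighing against how long the cluster can tolerate routing requests to a node that is actually down.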
Step 4: Apply Alternative Fix
```bash
# Alternative fix: check node stats
curl -XGET 'localhost:9200/_nodes/stats?pretty'

# Update specific index settings
curl -XPUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '
{
  "index": {
    "refresh_interval": "30s"
  }
}'

# Verify the fix
curl -XGET 'localhost:9200/_cat/indices/my-index?v'
```
Step 5: Verify the Fix
curl -XGET 'localhost:9200/_cluster/health?pretty'
curl -XGET 'localhost:9200/_cat/nodes?v'
# All nodes should be present
Common Pitfalls
- Not waiting for cluster state propagation after a settings change
- Using text field for aggregations instead of keyword
- Setting circuit breaker limits too low for production workload
- Ignoring disk watermark warnings until cluster blocks
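For the first pitfall, the cluster health API can block until the expected number of nodes has joined, using its `wait_for_nodes` and `timeout` parameters. The following sketch only builds the request URL; the helper name and the base URL are assumptions for illustration.

```python
from urllib.parse import urlencode


def health_wait_url(base: str, expected_nodes: int, timeout: str = "30s") -> str:
    """Build a _cluster/health URL that waits until exactly `expected_nodes` nodes are present."""
    query = urlencode({"wait_for_nodes": str(expected_nodes), "timeout": timeout})
    return f"{base}/_cluster/health?{query}"


print(health_wait_url("http://localhost:9200", 3))
# http://localhost:9200/_cluster/health?wait_for_nodes=3&timeout=30s
```

The response contains a `timed_out` field, which is `true` when the condition was not met within the timeout.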
Best Practices
- Monitor cluster health regularly with the _cluster/health API
- Use keyword fields for aggregations to avoid fielddata
- Set circuit breaker limits based on heap size
- Configure ILM policies for automated index management
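Regular health monitoring can be as simple as checking a few fields of the `_cluster/health` response. The sketch below assumes the response has already been parsed from JSON; `health_warnings` and the sample values are hypothetical, while `status`, `number_of_nodes`, and `unassigned_shards` are fields of the health response.

```python
from __future__ import annotations


def health_warnings(health: dict, expected_nodes: int) -> list[str]:
    """Return human-readable warnings for a parsed _cluster/health response."""
    warnings = []
    if health["status"] != "green":
        warnings.append(f"status is {health['status']}")
    if health["number_of_nodes"] < expected_nodes:
        missing = expected_nodes - health["number_of_nodes"]
        warnings.append(f"{missing} node(s) missing")
    if health.get("unassigned_shards", 0) > 0:
        warnings.append(f"{health['unassigned_shards']} unassigned shard(s)")
    return warnings


# Hypothetical response from a three-node cluster that has lost one node.
degraded = {"status": "yellow", "number_of_nodes": 2, "unassigned_shards": 4}
for warning in health_warnings(degraded, expected_nodes=3):
    print(warning)
```

Running a check like this on a schedule makes a dropped node visible soon after it leaves, rather than only when queries start failing.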
Related Issues
- Elasticsearch Cluster Red Status
- Elasticsearch Index Not Found
- Elasticsearch Query Timeout
- Elasticsearch Node High CPU