## Introduction

Elasticsearch Index Lifecycle Management (ILM) automates index rollover, shrink, force merge, and deletion. When ILM gets stuck on rollover (typically in the `check-rollover-ready` step), new data continues to accumulate in the write index, causing it to grow beyond its optimal size and degrading indexing and search performance.

## Symptoms

- `GET /_ilm/status` shows ILM running, yet indices remain stuck on the rollover step
- The write index grows far beyond the `max_primary_shard_size` or `max_age` thresholds
- `GET /my_alias/_ilm/explain` shows `"step": "check-rollover-ready"` with error details
- `GET /_cat/indices/my-index-*?v` shows one index much larger than the others
- No new rolled-over indices are created despite the rollover conditions being met
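
The explain output can be filtered to list only the stuck indices. A minimal sketch assuming `jq` is installed; the inline sample response is illustrative, and in practice you would pipe `curl -s localhost:9200/my-index-*/_ilm/explain` into the filter:

```bash
# Sample _ilm/explain response (illustrative); in practice, replace the echo
# with: curl -s "localhost:9200/my-index-*/_ilm/explain"
explain_response='{
  "indices": {
    "my-index-000001": { "step": "check-rollover-ready", "managed": true },
    "my-index-000002": { "step": "complete", "managed": true }
  }
}'

# Print only the indices stuck in the check-rollover-ready step
echo "$explain_response" | jq -r '.indices
  | to_entries[]
  | select(.value.step == "check-rollover-ready")
  | .key'
```

This prints `my-index-000001` for the sample above, making it easy to wire into a cron-based health check.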

## Common Causes

- The rollover alias is not set as the write index
- The index does not yet meet any rollover condition (size, age, doc count)
- The ILM policy conditions are misconfigured
- A cluster or index read-only block caused by the disk watermark
- A previous ILM step error that was never cleared
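
The disk-watermark cause can be confirmed from the index settings: when the flood-stage watermark is exceeded, Elasticsearch sets `index.blocks.read_only_allow_delete` on affected indices. A sketch assuming `jq`; the inline sample response is illustrative:

```bash
# Sample flat settings response (illustrative); in practice, replace the echo
# with: curl -s "localhost:9200/my-index-000001/_settings?flat_settings=true"
settings='{
  "my-index-000001": {
    "settings": { "index.blocks.read_only_allow_delete": "true" }
  }
}'

# Show whether the read-only block is present
echo "$settings" | jq -r '.[].settings["index.blocks.read_only_allow_delete"] // "not set"'

# Once disk space has been freed, clear the block so ILM can proceed:
# curl -X PUT localhost:9200/my-index-000001/_settings \
#   -H 'Content-Type: application/json' \
#   -d '{ "index.blocks.read_only_allow_delete": null }'
```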

## Step-by-Step Fix

1. **Check ILM status for stuck indices:**

   ```bash
   curl -s localhost:9200/my-index-000001/_ilm/explain?pretty
   # Look for:
   #   "step": "check-rollover-ready"
   #   "step_info": { "reason": "rollover condition not met" }
   ```

2. **Verify the alias is set as the write index:**

   ```bash
   curl -s localhost:9200/_alias/my_alias?pretty
   # The write index should have "is_write_index": true

   # If not set, fix it:
   curl -X POST localhost:9200/_aliases -H 'Content-Type: application/json' -d '{
     "actions": [
       { "add": { "index": "my-index-000001", "alias": "my_alias", "is_write_index": true } }
     ]
   }'
   ```

3. **Manually trigger rollover if conditions are met:**

   ```bash
   curl -X POST localhost:9200/my_alias/_rollover?pretty
   # Or with specific conditions
   curl -X POST localhost:9200/my_alias/_rollover?pretty -H 'Content-Type: application/json' -d '{
     "conditions": {
       "max_age": "7d",
       "max_primary_shard_size": "50gb"
     }
   }'
   ```

4. **Clear ILM errors and retry:**

   ```bash
   # The retry API targets the index directly; it takes no request body
   curl -X POST localhost:9200/my-index-000001/_ilm/retry?pretty
   ```

5. **Fix the ILM policy if conditions are wrong:**

   ```bash
   curl -X PUT localhost:9200/_ilm/policy/my_policy?pretty -H 'Content-Type: application/json' -d '{
     "policy": {
       "phases": {
         "hot": {
           "actions": {
             "rollover": {
               "max_primary_shard_size": "50gb",
               "max_age": "7d"
             }
           }
         },
         "warm": {
           "min_age": "7d",
           "actions": {
             "shrink": { "number_of_shards": 1 },
             "forcemerge": { "max_num_segments": 1 }
           }
         },
         "delete": {
           "min_age": "30d",
           "actions": { "delete": {} }
         }
       }
     }
   }'
   ```
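
After manually triggering a rollover, the API response reports whether the rollover actually happened and which index is now current. A sketch assuming `jq`, with an illustrative sample response:

```bash
# Sample _rollover response (illustrative); in practice, pipe the output of
# the POST /my_alias/_rollover call here instead of the echo.
rollover_response='{
  "acknowledged": true,
  "old_index": "my-index-000001",
  "new_index": "my-index-000002",
  "rolled_over": true
}'

echo "$rollover_response" | jq -r 'if .rolled_over
  then "rolled over: \(.old_index) -> \(.new_index)"
  else "rollover did not happen" end'
```

For the sample above this prints `rolled over: my-index-000001 -> my-index-000002`; a `false` value means the conditions were still not met and the explain output should be rechecked.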

## Prevention

- Always set `is_write_index: true` on the initial index when creating an ILM-managed alias
- Monitor ILM step status with automated health checks
- Use `GET /_ilm/status` and `GET /{index}/_ilm/explain` in monitoring dashboards
- Set rollover conditions based on shard-size targets (roughly 30-50 GB per shard)
- Test ILM policies in staging with realistic data volumes
- Clear disk-watermark read-only blocks promptly so ILM can proceed
- Document each ILM policy and its expected rollover schedule
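
The first prevention point, bootstrapping the initial index with the write alias, can be scripted. A minimal sketch; the index and alias names match the examples above, and the payload is sanity-checked locally with `jq` before use:

```bash
# Create the first index in the series with the rollover alias and
# is_write_index set, so ILM can roll it over later.
bootstrap_body='{
  "aliases": {
    "my_alias": { "is_write_index": true }
  }
}'

# Against a live cluster:
#   curl -X PUT localhost:9200/my-index-000001 \
#     -H 'Content-Type: application/json' -d "$bootstrap_body"

# Local sanity check that the write-index flag is present in the payload
echo "$bootstrap_body" | jq '.aliases.my_alias.is_write_index'
```

The `jq` check prints `true` when the flag is set correctly; shipping this payload without `is_write_index` is exactly the misconfiguration that leaves ILM stuck at rollover.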