Introduction

A failing health check can take a healthy-looking app out of service because the platform decides it is unsafe to route traffic there. The app may even load manually in some cases, but the probe path, response code, timing, or dependency state still fails the infrastructure rule. The fix is to align the health-check contract with the app's real readiness behavior.

Symptoms

  • Load balancers, CDNs, or container platforms mark instances unhealthy
  • The service flaps in and out of rotation even though some requests still work
  • Deployments fail because readiness checks never pass
  • Logs show repeated probe requests followed by deregistration or restarts
  • The issue started after route changes, auth changes, startup changes, or dependency outages

Common Causes

  • The health-check path returns the wrong status code, body, or headers
  • The app is not ready before the platform starts probing it
  • A dependency failure causes the health endpoint to fail too aggressively
  • Authentication, redirects, or host rules block the probe path
  • Timeout thresholds are too short for the app's real startup or recovery behavior

Step-by-Step Fix

  1. Identify the exact platform running the health check and the path, method, host, expected code, and timeout it uses.
  2. Reproduce the same probe manually so you can compare real app behavior with what the load balancer expects.
  3. Check whether the endpoint is failing because the route changed, redirects now occur, or authentication was added by mistake.
  4. Review startup timing and readiness logic to see whether the platform begins probing before the app can actually serve requests.
  5. Inspect app logs and dependency health during probe failures because databases, caches, or migrations may delay readiness.
  6. Decide whether the health endpoint should test full dependency readiness or only basic process liveness, then align the behavior intentionally.
  7. Adjust timeout, interval, or grace-period settings only after confirming the app is otherwise healthy and the probe contract is correct.
  8. Retest through the platform so instances stay in rotation consistently instead of flapping.
  9. Document the health-check contract with the service so route changes and security layers do not accidentally break it later.