Introduction

When a Prometheus target shows as "DOWN", its metrics are not being collected, which creates blind spots in monitoring. These gaps can mask real issues and leave alerts unable to fire.

Symptoms

- Prometheus UI shows target status "DOWN" with a "Last Error" message
- Error: "context deadline exceeded" (scrape timeout)
- Error: "connection refused"
- Error: "server returned HTTP status 401 Unauthorized"
- Metrics gaps in Grafana dashboards

Common Causes

- Target application crashed or restarted
- Network policy blocking traffic from Prometheus to the target
- TLS certificate expired on the target's metrics endpoint
- Scrape timeout too short for a slow metrics endpoint
- Service discovery returning stale targets
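
To connect the symptoms to the causes above, the following is a minimal sketch of a triage helper. The function name `classify_scrape_error` and the mapping itself are illustrative assumptions, not part of Prometheus; the mapping is a rough heuristic, and a single error string can have more than one underlying cause.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Prometheus): map a "Last Error" string
# from the Targets page to the most likely cause. Heuristic only.
classify_scrape_error() {
  local err="$1"
  case "$err" in
    *"context deadline exceeded"*)
      echo "timeout: endpoint slower than scrape_timeout, or traffic silently dropped" ;;
    *"connection refused"*)
      echo "refused: target process down or not listening on that port" ;;
    *"401"*)
      echo "auth: credentials missing or wrong in the scrape configuration" ;;
    *x509*|*certificate*)
      echo "tls: certificate expired or not trusted" ;;
    *)
      echo "unknown: inspect the target and the Prometheus logs" ;;
  esac
}
```

For example, `classify_scrape_error "connection refused"` prints a line starting with `refused:`, which points at the target process rather than at the scrape timeout.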

Step-by-Step Fix

1. **Check target status in Prometheus UI**: Navigate to Status > Targets and look at the "Last Error" column.

2. **Test metrics endpoint manually**:

   ```bash
   curl -v http://<target-ip>:<port>/metrics
   # For TLS targets (-k skips certificate verification, so drop it
   # when you want an expired certificate to show up as an error)
   curl -vk https://<target-ip>:<port>/metrics
   ```

3. **Check Prometheus scrape configuration**:

   ```yaml
   scrape_configs:
     - job_name: 'my-app'
       scrape_interval: 15s
       scrape_timeout: 10s
       static_configs:
         - targets: ['my-app:8080']
   ```

4. **Check service discovery**: Navigate to Status > Service Discovery in Prometheus UI to see discovered targets.
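
When the error is "context deadline exceeded", it helps to compare how long the endpoint actually takes to respond against the configured scrape_timeout. The sketch below is one way to do that; the function names `measure_scrape_seconds` and `fits_timeout` are made up for illustration, and the only external feature relied on is curl's `time_total` write-out variable.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for comparing scrape duration with scrape_timeout.

# Print how long one request to a metrics endpoint takes, in seconds.
# The 60-second cap is an arbitrary upper bound chosen for this sketch.
measure_scrape_seconds() {
  local url="$1"
  curl -s -o /dev/null --max-time 60 -w '%{time_total}' "$url"
}

# Succeed (exit 0) when the duration is strictly below the timeout.
# Both arguments are plain numbers of seconds, e.g. 9.5 and 10.
fits_timeout() {
  local duration="$1" timeout="$2"
  awk -v d="$duration" -v t="$timeout" 'BEGIN { exit !((d + 0) < (t + 0)) }'
}
```

A possible usage, run from the Prometheus host so the network path matches the real scrape: `fits_timeout "$(measure_scrape_seconds http://my-app:8080/metrics)" 10 || echo "endpoint is slower than scrape_timeout"`.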

Prevention

- Set scrape_timeout to 2x the expected scrape duration
- Use multiple Prometheus instances for redundancy
- Monitor Prometheus target health with meta-monitoring
- Implement health checks for metrics endpoints
- Use the Pushgateway for batch jobs that cannot be scraped
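
The 2x rule in the first item can be sketched as a small calculation. The helper below, `suggest_scrape_timeout`, is a hypothetical name; it doubles the expected duration, rounds up to whole seconds, and caps the result at the scrape interval, since Prometheus does not accept a scrape_timeout greater than scrape_interval. It assumes both inputs are given in seconds and that the interval is a whole number.

```bash
#!/usr/bin/env bash
# Hypothetical helper: suggest a scrape_timeout value from the expected
# scrape duration (seconds, may be fractional) and the scrape interval
# (whole seconds). Applies the 2x rule, then caps at the interval.
suggest_scrape_timeout() {
  local expected="$1" interval="$2"
  awk -v e="$expected" -v i="$interval" 'BEGIN {
    t = 2 * e                 # 2x the expected scrape duration
    c = int(t)
    if (c < t) c++            # round up to whole seconds
    if (c > i) c = i          # never exceed scrape_interval
    printf "%ds\n", c
  }'
}
```

With the example configuration above (15s interval), an endpoint that normally answers in 4 seconds gives `suggest_scrape_timeout 4 15`, which prints `8s`. If the cap kicks in, the endpoint is too slow for the interval, and lengthening scrape_interval or trimming the exported metrics is probably the better fix.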