You've noticed that some of your Prometheus targets are showing as "down" in the targets UI, and you're getting alerts about scrape failures. This is a critical issue because it means Prometheus cannot collect metrics from your services, leaving you blind to their performance and health.

Understanding the Problem

When Prometheus cannot scrape a target, it typically shows up in the /targets endpoint with a state of "DOWN" and an error message explaining why the scrape failed. Common error messages include:

```
Get "http://10.0.0.5:9090/metrics": dial tcp 10.0.0.5:9090: connect: connection refused

server returned HTTP status 401 Unauthorized

context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

Each of these errors points to a different root cause, and the fix depends entirely on what's actually happening.
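As a quick triage aid, that error-to-cause mapping can be sketched as a small shell function. This is a hypothetical helper, not a Prometheus feature; you could feed it the `lastError` strings returned by the targets API:

```shell
#!/bin/sh
# Hypothetical helper (not part of Prometheus): map a target's lastError
# string to a likely root cause, matching the error patterns shown above.
classify_scrape_error() {
  case "$1" in
    *"connection refused"*)        echo "network: target not listening or firewall blocking" ;;
    *"401"*|*"403"*)               echo "auth: missing or wrong credentials" ;;
    *x509*)                        echo "tls: certificate not trusted by Prometheus" ;;
    *"context deadline exceeded"*) echo "timeout: target too slow or scrape_timeout too low" ;;
    *)                             echo "other: inspect the target's own logs" ;;
  esac
}

# Example:
classify_scrape_error 'dial tcp 10.0.0.5:9090: connect: connection refused'
# → network: target not listening or firewall blocking
```

The rest of this guide walks through each of these cause categories in turn.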

Initial Diagnosis

Start by checking the current state of your targets through the Prometheus UI:

```bash
# If you have access to the Prometheus UI, navigate to:
# http://your-prometheus:9090/targets

# Or use the API to get target status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'
```

This gives you a quick overview of all targets and their error states. For a specific target investigation, check the Prometheus logs:

```bash
# Check Prometheus logs for scrape errors (-E enables alternation with |)
kubectl logs -l app=prometheus -n monitoring | grep -iE "scrape|error|failed"

# Or if running directly
journalctl -u prometheus -f | grep -iE "scrape|error"
```

Common Cause 1: Network Connectivity Issues

The most frequent cause is simply that Prometheus cannot reach the target. This could be due to firewalls, network segmentation, or the target not running.

Error pattern: `Get "http://10.0.0.5:9090/metrics": dial tcp 10.0.0.5:9090: connect: connection refused`

Diagnosis:

```bash
# Test basic connectivity from the Prometheus server
curl -v http://target-host:port/metrics

# Check if the port is open
nc -zv target-host 9090

# For Kubernetes environments, test from inside the Prometheus pod
kubectl exec -it prometheus-server-0 -n monitoring -- curl http://target-service:9090/metrics

# Check DNS resolution
nslookup target-host
dig target-host

# Verify the target is actually running
curl http://target-host:9090/-/healthy
```

Solution:

If the target isn't running, start it. If there's a firewall issue, you'll need to allow the traffic:

```bash
# For firewalld
firewall-cmd --add-port=9090/tcp --permanent
firewall-cmd --reload

# For iptables
iptables -A INPUT -p tcp --dport 9090 -j ACCEPT
service iptables save

# For Kubernetes NetworkPolicy, ensure Prometheus can reach the target
kubectl get networkpolicy -n target-namespace
```

Common Cause 2: Authentication and TLS Issues

Many production systems require authentication or use HTTPS. If Prometheus isn't configured with the right credentials, you'll see 401 or 403 errors.

Error pattern: `server returned HTTP status 401 Unauthorized`

Error pattern: `x509: certificate signed by unknown authority`

Diagnosis:

```bash
# Test with the expected authentication method
curl -u username:password http://target-host:9090/metrics
curl -H "Authorization: Bearer your-token" http://target-host:9090/metrics

# Test the TLS certificate
openssl s_client -connect target-host:9090 -showcerts

# Check certificate validity
echo | openssl s_client -servername target-host -connect target-host:9090 2>/dev/null | openssl x509 -noout -dates
```

Solution:

Update your Prometheus scrape configuration with proper authentication:

```yaml
scrape_configs:
  - job_name: 'secured-target'
    basic_auth:
      username: admin
      password: yourpassword
    static_configs:
      - targets: ['target-host:9090']

  # For TLS with custom CA
  - job_name: 'tls-target'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
      cert_file: /etc/prometheus/certs/client.crt
      key_file: /etc/prometheus/certs/client.key
      insecure_skip_verify: false
    static_configs:
      - targets: ['target-host:9090']
```

After updating the configuration, reload Prometheus:

```bash
# Send SIGHUP to reload the config
kill -HUP $(pidof prometheus)

# Or via the API (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```

Common Cause 3: Timeout Issues

Some targets are slow to respond, especially if they have many metrics or are under heavy load. Default scrape timeouts might be too aggressive.

Error pattern: `context deadline exceeded (Client.Timeout exceeded while awaiting headers)`

Diagnosis:

```bash
# Time how long the target takes to respond
time curl http://target-host:9090/metrics

# Check the number of metrics exposed
curl -s http://target-host:9090/metrics | wc -l

# Check the current scrape duration
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="slow-job") | .lastScrapeDuration'
```

Solution:

Increase the scrape timeout, keeping it at or below the scrape interval; Prometheus will refuse to load a configuration where `scrape_timeout` exceeds `scrape_interval`:

```yaml
scrape_configs:
  - job_name: 'slow-target'
    scrape_interval: 60s
    scrape_timeout: 50s
    static_configs:
      - targets: ['target-host:9090']
```

Common Cause 4: Incorrect Service Discovery

When using service discovery (Kubernetes, Consul, etc.), Prometheus might not be finding the targets correctly.

Error pattern: `no targets found`

Diagnosis:

```bash
# Check what Prometheus has discovered
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets | length'

# Check targets that were discovered but dropped by relabeling
curl -s 'http://localhost:9090/api/v1/targets?state=dropped' | jq '.data.droppedTargets | length'

# Check the running Prometheus config for SD issues
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml'

# Look for discovery errors in logs
kubectl logs -l app=prometheus -n monitoring | grep -iE "discovery|sd"
```

Solution:

Verify your service discovery configuration:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - production
            - staging
    relabel_configs:
      # Only scrape pods with the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

Ensure pods have the required annotations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
```
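For the `prometheus.io/port` annotation to take effect, the scrape config also needs a relabel rule that rewrites the target address. A commonly used companion rule (assuming the annotation scheme above) looks like:

```yaml
# Rewrite the scrape address to use the port from the pod annotation.
# The regex splits the concatenated "host[:port];annotation-port" value.
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2
  target_label: __address__
```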

Common Cause 5: Resource Limits on the Target

Sometimes the target itself is overwhelmed and cannot respond to scrape requests.

Diagnosis:

```bash
# Check target resource usage via its own metrics
curl -s http://target-host:9090/metrics | grep -E "go_memstats|process_"

# Check if the target is throttled
kubectl top pods -n target-namespace

# Check for OOMKilled or restart events
kubectl describe pod target-pod -n target-namespace | grep -A5 "Events:"
```

Solution:

Increase resource limits for the target, or optimize the metrics exposition:

```yaml
resources:
  limits:
    cpu: "500m"
    memory: "512Mi"
  requests:
    cpu: "250m"
    memory: "256Mi"
```

Verification

After applying your fix, verify that targets are healthy:

```bash
# Check target status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="up") | .labels.job'

# Check for recent scrape errors in logs
kubectl logs -l app=prometheus -n monitoring --since=5m | grep -iE "error|failed" | head -20

# Query a metric from the target to confirm data collection
curl -s 'http://localhost:9090/api/v1/query?query=up{job="your-job"}' | jq '.data.result'
```

Prevention

To avoid future target down issues, implement these practices:

  • Set up alerting rules for target health:

    ```yaml
    groups:
      - name: prometheus_targets
        rules:
          - alert: TargetDown
            expr: up == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Prometheus target {{ $labels.job }} is down"
              description: "{{ $labels.instance }} has been down for more than 5 minutes."
    ```
  • Use proper service discovery instead of static targets where possible
  • Implement health checks in your applications
  • Set appropriate timeouts based on target response times
  • Monitor Prometheus itself with a second Prometheus instance
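When setting timeouts, observed scrape durations can ground the numbers. For example, this PromQL query uses `scrape_duration_seconds`, which Prometheus records for every target, to show the slowest scrape per job over the past hour (window is an illustrative choice):

```promql
max by (job) (max_over_time(scrape_duration_seconds[1h]))
```

A `scrape_timeout` comfortably above that per-job maximum, but still below the scrape interval, avoids both spurious timeouts and overlapping scrapes.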

The key to resolving target down issues is systematically checking connectivity, authentication, timeouts, and service discovery configuration. Start with the error message in the targets UI, as it usually points directly to the root cause.