## Introduction
Prometheus and Grafana monitoring errors occur when metric collection fails, alerts don't fire or notify, dashboards show no data, or the monitoring stack becomes unavailable. Prometheus uses a pull-based model to scrape metrics from targets, stores them in a time-series database, and evaluates alerting rules. Grafana connects to Prometheus (and other data sources) to visualize metrics and manage dashboards. Alertmanager handles alert routing, deduplication, silencing, and notification. Common causes include scrape targets unavailable or misconfigured, service discovery failures, metric name/label changes breaking queries, PromQL syntax errors in queries or alerts, storage space exhaustion, federation issues, Grafana datasource connection failures, alertmanager routing misconfiguration, notification channel authentication failures, and resource exhaustion (memory, file descriptors). The fix requires understanding the monitoring architecture, debugging tools, and recovery procedures. This guide provides production-proven troubleshooting for Prometheus, Grafana, and Alertmanager issues.
## Symptoms
- Prometheus target shows `DOWN` status: `Error scraping target: connection refused`
- Grafana dashboard shows `No data` or `Panel error`
- `Data source unavailable` error in Grafana
- Alerts not firing despite threshold exceeded
- Alert notifications not delivered (email, Slack, PagerDuty): `Error sending alert: dial tcp: connect: connection refused`
- Prometheus error log: `Error opening TSDB`
- `WAL corruption detected` in Prometheus logs
- Grafana plugins failing to load
- `Too many series` warning (cardinality explosion)
- High memory usage causing OOM kills
- Query timeout: `query timed out`
## Common Causes
- Scrape target service crashed or changed port
- Firewall blocking Prometheus scrape traffic (port 9090, 9100, etc.)
- Service discovery (Kubernetes, Consul, EC2) not finding targets
- Metric relabeling dropping all metrics
- Invalid PromQL in recording rules or alerts
- Alertmanager configuration syntax error
- Notification channel credentials expired/invalid
- Disk space exhausted for Prometheus TSDB
- WAL (Write-Ahead Log) corruption after crash
- Grafana datasource URL incorrect or unreachable
- SSL/TLS certificate issues for HTTPS datasources
- High cardinality from unbounded labels (user IDs, IP addresses)
- Retention period too long for available disk space
- Federation configuration errors
## Step-by-Step Fix
### 1. Diagnose Prometheus issues
Check Prometheus target status:
```bash
# Check target status via UI
# http://prometheus:9090/targets

# Check via API
curl http://prometheus:9090/api/v1/targets

# Output shows:
# {
#   "status": "success",
#   "data": {
#     "activeTargets": [
#       {
#         "discoveredLabels": {...},
#         "labels": {"job": "node", "instance": "server1:9100"},
#         "scrapeUrl": "http://server1:9100/metrics",
#         "health": "up",   # or "down"
#         "lastError": "context deadline exceeded",
#         "lastScrape": "2026-04-01T12:00:00Z",
#         "scrapeDuration": 0.5
#       }
#     ],
#     "droppedTargets": [...]
#   }
# }

# Check specific target health
curl 'http://prometheus:9090/api/v1/targets?state=down'

# Check Prometheus itself (quote the URL so the shell
# doesn't mangle the braces)
curl 'http://prometheus:9090/api/v1/query?query=up{job="prometheus"}'
```
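When scripting health checks, the targets API response above can be filtered programmatically instead of eyeballing the UI. A minimal sketch, assuming the JSON shape shown above (the `down_targets` helper and the sample payload are illustrative, not part of any library):

```python
import json

def down_targets(targets_json: str):
    """Return (job, instance, lastError) for every target in a
    /api/v1/targets response body whose health is not "up"."""
    data = json.loads(targets_json)
    return [
        (t["labels"].get("job"), t["labels"].get("instance"), t.get("lastError", ""))
        for t in data["data"]["activeTargets"]
        if t.get("health") != "up"
    ]

# Sample response shaped like the API output above
sample = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node", "instance": "server1:9100"},
         "health": "down", "lastError": "context deadline exceeded"},
        {"labels": {"job": "prometheus", "instance": "localhost:9090"},
         "health": "up", "lastError": ""},
    ]},
})
print(down_targets(sample))
# → [('node', 'server1:9100', 'context deadline exceeded')]
```

The same filter can be expressed with `jq '.data.activeTargets[] | select(.health != "up")'` if you prefer staying in the shell.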
Check Prometheus logs:
```bash
# View Prometheus logs
kubectl logs -l app=prometheus -n monitoring
# Or
docker logs prometheus

# Common error patterns:

# Scrape failures
# level=error msg="Error scraping target"
#   err="Get \"http://target:9100/metrics\": dial tcp: connection refused"

# TSDB issues
# level=error msg="Error opening TSDB"
#   err="mkdir /prometheus/data: no space left on device"

# WAL corruption
# level=error msg="WAL corruption detected"
#   err="invalid checksum"

# Configuration errors
# level=error msg="Error loading config"
#   err="yaml: unmarshal errors: line 42: field invalid_field not found"
```
Test scrape targets manually:
```bash
# Test target metrics endpoint
curl http://target-server:9100/metrics

# Should return metrics in Prometheus format:
# # HELP node_cpu_seconds_total Total seconds CPUs have spent in each mode.
# # TYPE node_cpu_seconds_total counter
# node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78

# If connection refused:
# - Check if exporter is running
systemctl status node_exporter
# - Check firewall
ufw status | grep 9100
# - Check exporter listening
ss -tlnp | grep 9100

# If timeout:
# - Check network connectivity
ping target-server
# - Check for slow exporter
curl --connect-timeout 10 http://target:9100/metrics
```
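If you need to sanity-check an exporter's output beyond "did curl return something", the text exposition format is simple enough to parse. A naive sketch (it handles the common case shown above but not label values containing commas or escaped quotes; `parse_exposition` is a hypothetical helper, not a library function):

```python
def parse_exposition(text: str):
    """Parse simple Prometheus text-format lines into
    (metric_name, labels_dict, value) tuples; skips # comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, value = line.rsplit(" ", 1)
        if "{" in name_part:
            name, raw = name_part.split("{", 1)
            raw = raw.rstrip("}")
            labels = dict(
                (k, v.strip('"'))
                for k, v in (pair.split("=", 1) for pair in raw.split(","))
            )
        else:
            name, labels = name_part, {}
        samples.append((name, labels, float(value)))
    return samples

metrics = """\
# HELP node_cpu_seconds_total Total seconds CPUs have spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
"""
print(parse_exposition(metrics))
# → [('node_cpu_seconds_total', {'cpu': '0', 'mode': 'idle'}, 123456.78)]
```

For production tooling, prefer an official client library's parser over a hand-rolled one.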
### 2. Fix scrape target issues
Update scrape configuration:
```yaml
# prometheus.yml

# Basic scrape config
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'
          - 'server3:9100'
        labels:
          env: 'production'
          team: 'infrastructure'

    # Scrape settings
    scrape_interval: 15s    # Global default: 1m
    scrape_timeout: 10s     # Default: 10s
    metrics_path: /metrics  # Default

    # TLS for HTTPS targets
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      insecure_skip_verify: false

    # Basic auth
    basic_auth:
      username: prometheus
      password: secret

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

```bash
# Validate config before reload
promtool check config prometheus.yml

# After config change, reload Prometheus without restarting
# (requires Prometheus to be started with --web.enable-lifecycle):
curl -X POST http://prometheus:9090/-/reload
# Or send SIGHUP
kill -HUP $(pgrep prometheus)
```
Fix service discovery issues:
```yaml
# EC2 service discovery
- job_name: 'ec2-instances'
  ec2_sd_configs:
    - region: us-east-1
      access_key: ACCESS_KEY
      secret_key: SECRET_KEY
      port: 9100
      filters:
        - name: tag:Environment
          values:
            - production
        - name: instance-state-name
          values:
            - running
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Name]
      target_label: instance_name

# Consul service discovery
- job_name: 'consul-services'
  consul_sd_configs:
    - server: consul:8500
      services:
        - api
        - web
  relabel_configs:
    - source_labels: [__meta_consul_service]
      target_label: service
```

```bash
# Debug service discovery: check which targets were
# discovered but then dropped by relabeling
curl http://prometheus:9090/api/v1/targets | jq '.data.droppedTargets'
```
Fix metric relabeling:
```yaml
# relabel_configs run BEFORE scraping
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['server1:9100']
    relabel_configs:
      # Drop targets matching pattern
      - source_labels: [__address__]
        regex: '.*:9100'
        action: drop   # Careful: drops ALL port-9100 targets!

      # Keep only specific targets
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: 'web|api'
        action: keep

      # Copy a discovery label to a target label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
        replacement: '${1}'

      # Add static label
      - target_label: monitored_by
        replacement: 'prometheus'

      # Bucket a sensitive, high-cardinality label
      # (the action is "hashmod"; there is no "hash" action)
      - source_labels: [user_id]
        target_label: user_id_bucket
        action: hashmod
        modulus: 1000

      # Drop high-cardinality label
      - regex: 'pod_template_hash'
        action: labeldrop

    # metric_relabel_configs run AFTER scraping
    metric_relabel_configs:
      # Drop specific metrics
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop

      # Keep only specific metrics
      - source_labels: [__name__]
        regex: 'http_requests_total|node_.*'
        action: keep

      # Rename metric (match the metric to rename, or every
      # metric's name will be rewritten)
      - source_labels: [__name__]
        regex: 'old_metric'
        target_label: __name__
        replacement: 'renamed_metric'
```
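Relabeling rules are easy to get wrong, and a dropped target gives no feedback beyond its absence. A way to build intuition is to simulate the core actions on a label set before deploying. This sketch implements only a small subset of the semantics (`keep`, `drop`, `replace`, `labeldrop`) and is an approximation, not Prometheus's actual implementation:

```python
import re

def apply_relabel(labels, configs):
    """Apply a minimal subset of relabel actions to a label set.
    Returns the resulting labels, or None if the target is dropped."""
    labels = dict(labels)
    for cfg in configs:
        action = cfg.get("action", "replace")
        if action == "labeldrop":
            pat = re.compile(cfg["regex"])
            labels = {k: v for k, v in labels.items() if not pat.fullmatch(k)}
            continue
        # Joined source label values, matched against the full regex
        value = ";".join(labels.get(s, "") for s in cfg.get("source_labels", []))
        match = re.fullmatch(cfg.get("regex", "(.*)"), value)
        if action == "keep" and not match:
            return None
        if action == "drop" and match:
            return None
        if action == "replace" and match:
            repl = cfg.get("replacement", "${1}")
            if "${" in repl:
                # Translate Prometheus-style ${1} to Python's \g<1>
                labels[cfg["target_label"]] = match.expand(
                    repl.replace("${", "\\g<").replace("}", ">"))
            else:
                labels[cfg["target_label"]] = repl
    return labels

# Rules mirroring the config above
cfgs = [
    {"source_labels": ["__meta_kubernetes_pod_label_app"],
     "regex": "web|api", "action": "keep"},
    {"source_labels": ["__meta_kubernetes_namespace"],
     "target_label": "namespace"},
    {"target_label": "monitored_by", "replacement": "prometheus"},
]
print(apply_relabel(
    {"__meta_kubernetes_pod_label_app": "web",
     "__meta_kubernetes_namespace": "prod"}, cfgs))
print(apply_relabel({"__meta_kubernetes_pod_label_app": "db"}, cfgs))
# → None (dropped by the keep rule)
```

For real configs, `promtool` plus the `droppedTargets` API shown earlier remain the authoritative checks.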
### 3. Fix Grafana datasource issues
Test datasource connection:
```bash
# List datasources via API
curl -H "Authorization: Bearer API_KEY" \
  http://grafana:3000/api/datasources

# Get specific datasource
curl -H "Authorization: Bearer API_KEY" \
  http://grafana:3000/api/datasources/1

# Test connectivity by querying Prometheus through the
# datasource proxy
curl -H "Authorization: Bearer API_KEY" \
  'http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up'

# Or via UI: Settings > Data Sources > [Select DS] > Save & Test
```
Configure Prometheus datasource:
```yaml
# Grafana datasource provisioning
# /etc/grafana/provisioning/datasources/prometheus.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy   # or direct
    url: http://prometheus:9090
    isDefault: true
    editable: false

    # Authentication
    basicAuth: true
    basicAuthUser: prometheus

    jsonData:
      # TLS
      tlsAuth: true
      tlsAuthWithCACert: true
      tlsSkipVerify: false

      # Timeout settings
      timeout: 30
      httpMethod: POST

      # Alertmanager datasource linked to this Prometheus
      alertmanagerUid: alertmanager

    # Secrets and cert material belong in secureJsonData
    secureJsonData:
      basicAuthPassword: secret
      tlsCACert: |
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
```
Debug Grafana dashboard issues:
```bash
# Check Grafana logs
kubectl logs -l app=grafana -n monitoring
# Or
docker logs grafana

# Common errors (note: Grafana's log format really does
# spell the level "eror"):

# Datasource unavailable
# lvl=eror msg="Data source unavailable"
#   error="Post \"http://prometheus:9090/api/v1/query\": dial tcp: connection refused"

# Query error
# lvl=eror msg="Panel data error"
#   error="invalid expression: unknown function: invalid_func()"

# Plugin error
# lvl=eror msg="Plugin error"
#   error="Plugin requested was not found"

# Check dashboard JSON for errors: download the dashboard
# JSON and inspect the template variables
jq '.templating.list[] | select(.query | type == "string")' dashboard.json
# In current schemas the query should be an object, not a bare
# string (a common issue after exporting from older versions)
```
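The same template-variable check can be done in a script when auditing many dashboards at once. A small sketch; `stringly_template_queries` is an illustrative helper, and the dashboard JSON shape assumed here is the one discussed above:

```python
import json

def stringly_template_queries(dashboard_json: str):
    """Return the names of template variables whose 'query' is a
    plain string; newer Grafana schemas expect an object."""
    dash = json.loads(dashboard_json)
    return [
        var.get("name")
        for var in dash.get("templating", {}).get("list", [])
        if isinstance(var.get("query"), str)
    ]

sample = json.dumps({"templating": {"list": [
    {"name": "instance", "query": "label_values(up, instance)"},
    {"name": "job", "query": {"query": "label_values(up, job)"}},
]}})
print(stringly_template_queries(sample))
# → ['instance']
```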
### 4. Fix Alertmanager issues
Check Alertmanager status:
```bash
# Check Alertmanager UI
# http://alertmanager:9093

# Check status via API (use v2; the v1 API was removed in
# Alertmanager 0.27)
curl http://alertmanager:9093/api/v2/status

# Check silences
curl http://alertmanager:9093/api/v2/silences

# Check pending/firing alerts
curl http://alertmanager:9093/api/v2/alerts

# Check receivers
curl http://alertmanager:9093/api/v2/receivers
```
Validate Alertmanager configuration:
```yaml
# alertmanager.yml

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  smtp_require_tls: true

  # Slack
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

  # PagerDuty
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Inhibition rules
inhibit_rules:
  # If a critical alert fires, suppress the warning for the same
  # alertname/instance
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']

# Routing
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      continue: true

    - matchers:
        - team="infrastructure"
      receiver: 'slack-infra'

    - matchers:
        - alertname="Watchdog"
      receiver: 'null'   # Silence Watchdog alerts

# Receivers
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'PAGERDUTY_SERVICE_KEY'
        severity: critical

  - name: 'slack-infra'
    slack_configs:
      - channel: '#alerts-infra'
        send_resolved: true
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'

  - name: 'null'   # Empty receiver for silencing
```

```bash
# Validate config
amtool check-config /etc/alertmanager/alertmanager.yml
```
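When a routing tree misbehaves, it helps to reason through the first-match-wins plus `continue` semantics by hand. This sketch models routing against the config above, using simple equality matchers (a `match`-style dict rather than the full matcher syntax); `resolve_receivers` is an illustrative approximation, not Alertmanager code, and `amtool config routes test` is the authoritative tool:

```python
def resolve_receivers(route, labels):
    """Return the receivers an alert with these labels is routed to.
    Children are tried in order; the first match wins unless it sets
    continue, and an unmatched alert falls back to this node's receiver."""
    def matches(node):
        return all(labels.get(k) == v for k, v in node.get("match", {}).items())
    any_match = False
    out = []
    for child in route.get("routes", []):
        if matches(child):
            any_match = True
            out.extend(resolve_receivers(child, labels))
            if not child.get("continue", False):
                return out
    if not any_match:
        out.append(route["receiver"])
    return out

# Tree mirroring the routing config above
tree = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"},
         "receiver": "pagerduty-critical", "continue": True},
        {"match": {"team": "infrastructure"},
         "receiver": "slack-infra"},
    ],
}
print(resolve_receivers(tree, {"severity": "critical", "team": "infrastructure"}))
# → ['pagerduty-critical', 'slack-infra']
print(resolve_receivers(tree, {"severity": "warning"}))
# → ['default-receiver']
```

Note how `continue: true` lets the critical alert page PagerDuty and still reach the team Slack route.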
Test notifications:
```bash
# Send a test alert with amtool
amtool alert add alertname=TestAlert severity=warning instance=test \
  --alertmanager.url=http://alertmanager:9093

# Send a test alert via the API
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "instance": "test"
    },
    "annotations": {
      "summary": "Test alert",
      "description": "This is a test notification"
    },
    "generatorURL": "http://test"
  }]'

# Check whether the notification was delivered:
# - Slack: check the channel
# - Email: check inbox/spam
# - PagerDuty: check incidents
# - Also check Alertmanager logs for notification errors
```
### 5. Fix storage issues
Check Prometheus storage:
```bash
# Check TSDB status
curl http://prometheus:9090/api/v1/status/tsdb

# Output:
# {
#   "headChunks": 12345,
#   "headSeries": 67890,
#   "chunksOnDisk": 123456,
#   "blocks": [
#     {
#       "ulid": "01ABC...",
#       "minTime": 1711900000000,
#       "maxTime": 1711986400000,
#       "numSamples": 1234567,
#       "numSeries": 5000
#     }
#   ]
# }

# Check disk usage
du -sh /prometheus/data/*

# List blocks
ls -la /prometheus/data/01ABC*/

# Check WAL size
du -sh /prometheus/data/wal/
```
Fix disk space issues:
```bash
# Retention is set via command-line flags, not in prometheus.yml:
prometheus \
  --storage.tsdb.path=/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB   # whichever limit is hit first

# Storage size depends on:
# - Retention time
# - Number of series
# - Sample rate
# - Compression

# Estimate required storage:
# storage_needed = series_count * samples_per_second * bytes_per_sample * retention_seconds
# Example: 100K series * 1 sample/15s * 2 bytes * 15 days
#        = 100000 * (1/15) * 2 * (15 * 24 * 3600) bytes ≈ 17 GB

# Reduce storage by:
# 1. Shortening retention
# 2. Reducing scrape frequency for non-critical metrics
# 3. Dropping high-cardinality metrics with relabeling
# 4. Using recording rules to aggregate, then dropping raw data

# Note: Prometheus compacts blocks automatically; promtool has no
# supported manual compaction command.

# Delete old blocks (use carefully! only blocks older than retention,
# and only with Prometheus stopped)
rm -rf /prometheus/data/01ABC*
```
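The sizing formula above is easy to turn into a small calculator for capacity planning (this uses the same rough 2 bytes/sample figure; real compression varies with data shape):

```python
def tsdb_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2):
    """Rough TSDB sizing: ingest rate (samples/sec) times
    bytes per sample times retention in seconds."""
    samples_per_sec = series / scrape_interval_s
    return samples_per_sec * bytes_per_sample * retention_days * 24 * 3600

# The worked example above: 100K series, 15s interval, 15d retention
gb = tsdb_bytes(100_000, 15, 15) / 1e9
print(f"~{gb:.1f} GB")
# → ~17.3 GB
```

Leave generous headroom on top of the estimate: compaction and WAL replay need temporary disk space beyond the steady-state blocks.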
Fix WAL corruption:
```bash
# Stop Prometheus
systemctl stop prometheus

# Backup data
cp -r /prometheus/data /prometheus/data.backup

# Prometheus attempts to repair a corrupted WAL on startup,
# so try starting it first and watch the logs
systemctl start prometheus
journalctl -u prometheus -f

# If startup repair fails, remove the WAL (loses data not yet
# compacted into blocks, typically up to the last ~2 hours)
systemctl stop prometheus
rm -rf /prometheus/data/wal
rm -rf /prometheus/data/chunks_head

# Start Prometheus (will serve data from the persisted blocks)
systemctl start prometheus

# Check if running
curl http://prometheus:9090/api/v1/status/tsdb
```
### 6. Fix high cardinality issues
Identify high cardinality metrics:
```bash
# Series count by metric name (flag anything over 10k series)
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=count by (__name__) ({__name__=~".+"})' \
  | jq '.data.result[] | select(.value[1] | tonumber > 10000)'

# Top 10 metrics by series count
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))'

# Check specific metric cardinality
curl -G 'http://prometheus:9090/api/v1/series' \
  --data-urlencode 'match[]=http_requests_total' \
  | jq '.data | length'

# Find labels with high cardinality
curl 'http://prometheus:9090/api/v1/label/user_id/values' \
  | jq '.data | length'
```
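To track cardinality over time or wire it into a report, the instant-query response from the first command above can be ranked in a script. A sketch assuming the standard query-API JSON shape (`top_series_counts` is an illustrative helper):

```python
import json

def top_series_counts(query_result_json, n=3):
    """From a `count by (__name__)` instant-query response, return
    the n metric names with the most series, descending."""
    data = json.loads(query_result_json)["data"]["result"]
    counts = [(r["metric"]["__name__"], int(r["value"][1])) for r in data]
    return sorted(counts, key=lambda kv: -kv[1])[:n]

# Sample response shaped like the query API output
sample = json.dumps({"data": {"result": [
    {"metric": {"__name__": "http_requests_total"}, "value": [0, "12000"]},
    {"metric": {"__name__": "up"}, "value": [0, "150"]},
    {"metric": {"__name__": "go_gc_duration_seconds"}, "value": [0, "900"]},
]}})
print(top_series_counts(sample, 2))
# → [('http_requests_total', 12000), ('go_gc_duration_seconds', 900)]
```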
Reduce cardinality:
```yaml
# Reduce cardinality with metric relabeling
scrape_configs:
  - job_name: 'api'
    metric_relabel_configs:
      # Drop the user_id label (unbounded cardinality)
      - regex: 'user_id'
        action: labeldrop

      # Hash IP addresses into a bounded number of buckets
      - source_labels: [client_ip]
        target_label: client_ip_bucket
        action: hashmod
        modulus: 1000

      # Or drop the IP label entirely
      - regex: 'client_ip'
        action: labeldrop

      # Drop raw histogram buckets; use a recording rule for an
      # aggregated view instead
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'
        action: drop

      # Drop entire high-cardinality metrics
      - source_labels: [__name__]
        regex: 'go_gc_heap_allocs_by_size_bytes.*'
        action: drop
```
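To predict which bucket a given label value lands in, you can reproduce the hashing locally. Prometheus's `hashmod` takes the MD5 of the joined source label values and reduces the low 8 bytes modulo `modulus`; this sketch mirrors that (verify against your Prometheus version's relabel implementation if you need exact bucket parity):

```python
import hashlib

def hashmod(value: str, modulus: int) -> int:
    """Bucket a label value: MD5 of the value, low 8 bytes
    interpreted as a big-endian unsigned int, mod the modulus."""
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[8:], "big") % modulus

bucket = hashmod("10.0.0.1", 1000)
print(f"10.0.0.1 -> bucket {bucket}")
```

The key property is that the output space is bounded by `modulus` (here, at most 1000 series instead of one per IP) while remaining deterministic per value.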
### 7. Monitor the monitoring stack
Self-monitoring Prometheus:
```yaml
# Prometheus monitors itself
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

# Key self-monitoring metrics:
# prometheus_tsdb_head_samples_appended_total
# prometheus_tsdb_head_series
# prometheus_tsdb_wal_writes_failed_total
# prometheus_target_scrape_pool_exceeded_target_limit
# prometheus_rule_evaluation_failures_total
# prometheus_notifications_errors_total

# Alert on self-monitoring (in a separate rules file)
groups:
  - name: prometheus-self
    rules:
      - alert: PrometheusTargetScrapeFail
        expr: up{job="prometheus"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape itself"

      - alert: PrometheusRuleEvaluationFail
        expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Prometheus rule evaluation failing"
```
Grafana dashboard for monitoring stack:
```yaml
# Import standard dashboards by ID:
#   Prometheus:    3662 (Prometheus 2.0 Stats)
#   Alertmanager:  9573 (Alertmanager)
#   Node Exporter: 1860 (Node Exporter Full)

# Create a custom dashboard with:
# - Prometheus target UP/DOWN status
# - Scrape duration histogram
# - Series count over time
# - WAL writes per second
# - Query latency
# - Alert evaluation latency
# - Notification success/failure rate
```
## Prevention
- Set up alerts for monitoring stack health (meta-monitoring)
- Implement proper retention policies based on disk capacity
- Monitor series cardinality and set limits
- Use recording rules to pre-aggregate high-cardinality data
- Test alertmanager notifications regularly
- Document runbooks for common monitoring issues
- Backup Prometheus data and Grafana dashboards
- Use federation or Thanos for multi-cluster setups
- Implement proper TLS for all monitoring traffic
- Regular capacity planning reviews
## Related Errors
- **404 Not Found**: Metric, dashboard, or datasource doesn't exist
- **503 Service Unavailable**: Monitoring service temporarily unavailable
- **context deadline exceeded**: Query timeout
- **no data points**: Query returned no results
- **template execution error**: Alert/dashboard template error