## Introduction
Prometheus and Grafana monitoring errors occur when metric collection fails, alerts don't fire or notify, dashboards show no data, or the monitoring stack becomes unavailable. Prometheus uses a pull-based model to scrape metrics from targets, stores them in a time-series database, and evaluates alerting rules. Grafana connects to Prometheus (and other data sources) to visualize metrics and manage dashboards. Alertmanager handles alert routing, deduplication, silencing, and notification. Common causes include scrape targets unavailable or misconfigured, service discovery failures, metric name/label changes breaking queries, PromQL syntax errors in queries or alerts, storage space exhaustion, federation issues, Grafana datasource connection failures, alertmanager routing misconfiguration, notification channel authentication failures, and resource exhaustion (memory, file descriptors). The fix requires understanding the monitoring architecture, debugging tools, and recovery procedures. This guide provides production-proven troubleshooting for Prometheus, Grafana, and Alertmanager issues.
## Symptoms
- Prometheus target shows `DOWN` status: `Error scraping target: connection refused`
- Grafana dashboard shows `No data` or `Panel error`
- `Data source unavailable` error in Grafana
- Alerts not firing despite threshold exceeded
- Alert notifications not delivered (email, Slack, PagerDuty): `Error sending alert: dial tcp: connect: connection refused`
- Prometheus error log: `Error opening TSDB`
- `WAL corruption detected` in Prometheus logs
- Grafana plugins failing to load
- `Too many series` warning (cardinality explosion)
- High memory usage causing OOM kills
- Query timeout: `query timed out`
## Common Causes
- Scrape target service crashed or changed port
- Firewall blocking Prometheus scrape traffic (port 9090, 9100, etc.)
- Service discovery (Kubernetes, Consul, EC2) not finding targets
- Metric relabeling dropping all metrics
- Invalid PromQL in recording rules or alerts
- Alertmanager configuration syntax error
- Notification channel credentials expired/invalid
- Disk space exhausted for Prometheus TSDB
- WAL (Write-Ahead Log) corruption after crash
- Grafana datasource URL incorrect or unreachable
- SSL/TLS certificate issues for HTTPS datasources
- High cardinality from unbounded labels (user IDs, IP addresses)
- Retention period too long for available disk space
- Federation configuration errors
## Step-by-Step Fix
### 1. Diagnose Prometheus issues
Check Prometheus target status:
```bash
# Check target status via UI
# http://prometheus:9090/targets

# Check via API
curl http://prometheus:9090/api/v1/targets

# Output shows:
# {
#   "status": "success",
#   "data": {
#     "activeTargets": [
#       {
#         "discoveredLabels": {...},
#         "labels": {"job": "node", "instance": "server1:9100"},
#         "scrapeUrl": "http://server1:9100/metrics",
#         "health": "up",   # or "down"
#         "lastError": "context deadline exceeded",
#         "lastScrape": "2026-04-01T12:00:00Z",
#         "scrapeDuration": 0.5
#       }
#     ],
#     "droppedTargets": [...]
#   }
# }

# Check specific target health
curl 'http://prometheus:9090/api/v1/targets?state=down'

# Check Prometheus itself (quote the URL so the shell
# doesn't mangle the braces)
curl 'http://prometheus:9090/api/v1/query?query=up{job="prometheus"}'
```
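When scripting health checks, the targets API response above can be filtered programmatically instead of eyeballing the UI. A minimal sketch, assuming the JSON shape shown above (the `down_targets` helper and the sample payload are illustrative, not part of any library):

```python
import json

def down_targets(targets_json: str):
    """Return (job, instance, lastError) for every target in a
    /api/v1/targets response body whose health is not "up"."""
    data = json.loads(targets_json)
    return [
        (t["labels"].get("job"), t["labels"].get("instance"), t.get("lastError", ""))
        for t in data["data"]["activeTargets"]
        if t.get("health") != "up"
    ]

# Sample response shaped like the API output above
sample = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node", "instance": "server1:9100"},
         "health": "down", "lastError": "context deadline exceeded"},
        {"labels": {"job": "prometheus", "instance": "localhost:9090"},
         "health": "up", "lastError": ""},
    ]},
})
print(down_targets(sample))
# → [('node', 'server1:9100', 'context deadline exceeded')]
```

The same filter can be expressed with `jq '.data.activeTargets[] | select(.health != "up")'` if you prefer staying in the shell.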
Check Prometheus logs:
```bash
# View Prometheus logs
kubectl logs -l app=prometheus -n monitoring
# Or
docker logs prometheus

# Common error patterns:

# Scrape failures
# level=error msg="Error scraping target"
#   err="Get \"http://target:9100/metrics\": dial tcp: connection refused"

# TSDB issues
# level=error msg="Error opening TSDB"
#   err="mkdir /prometheus/data: no space left on device"

# WAL corruption
# level=error msg="WAL corruption detected"
#   err="invalid checksum"

# Configuration errors
# level=error msg="Error loading config"
#   err="yaml: unmarshal errors: line 42: field invalid_field not found"
```
Test scrape targets manually:
```bash
# Test target metrics endpoint
curl http://target-server:9100/metrics

# Should return metrics in Prometheus format:
# # HELP node_cpu_seconds_total Total seconds CPUs have spent in each mode.
# # TYPE node_cpu_seconds_total counter
# node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78

# If connection refused:
# - Check if exporter is running
systemctl status node_exporter
# - Check firewall
ufw status | grep 9100
# - Check exporter listening
ss -tlnp | grep 9100

# If timeout:
# - Check network connectivity
ping target-server
# - Check for slow exporter
curl --connect-timeout 10 http://target:9100/metrics
```
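If you need to sanity-check an exporter's output beyond "did curl return something", the text exposition format is simple enough to parse. A naive sketch (it handles the common case shown above but not label values containing commas or escaped quotes; `parse_exposition` is a hypothetical helper, not a library function):

```python
def parse_exposition(text: str):
    """Parse simple Prometheus text-format lines into
    (metric_name, labels_dict, value) tuples; skips # comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, value = line.rsplit(" ", 1)
        if "{" in name_part:
            name, raw = name_part.split("{", 1)
            raw = raw.rstrip("}")
            labels = dict(
                (k, v.strip('"'))
                for k, v in (pair.split("=", 1) for pair in raw.split(","))
            )
        else:
            name, labels = name_part, {}
        samples.append((name, labels, float(value)))
    return samples

metrics = """\
# HELP node_cpu_seconds_total Total seconds CPUs have spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
"""
print(parse_exposition(metrics))
# → [('node_cpu_seconds_total', {'cpu': '0', 'mode': 'idle'}, 123456.78)]
```

For production tooling, prefer an official client library's parser over a hand-rolled one.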
### 2. Fix scrape target issues
Update scrape configuration:
```yaml
# prometheus.yml

# Basic scrape config
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'
          - 'server3:9100'
        labels:
          env: 'production'
          team: 'infrastructure'

    # Scrape settings
    scrape_interval: 15s    # Global default: 1m
    scrape_timeout: 10s     # Default: 10s
    metrics_path: /metrics  # Default

    # TLS for HTTPS targets
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      insecure_skip_verify: false

    # Basic auth
    basic_auth:
      username: prometheus
      password: secret

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

```bash
# Validate config before reload
promtool check config prometheus.yml

# After config change, reload Prometheus without restarting
# (requires Prometheus to be started with --web.enable-lifecycle):
curl -X POST http://prometheus:9090/-/reload
# Or send SIGHUP
kill -HUP $(pgrep prometheus)
```
Fix service discovery issues:
```yaml
# EC2 service discovery
- job_name: 'ec2-instances'
  ec2_sd_configs:
    - region: us-east-1
      access_key: ACCESS_KEY
      secret_key: SECRET_KEY
      port: 9100
      filters:
        - name: tag:Environment
          values:
            - production
        - name: instance-state-name
          values:
            - running
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Name]
      target_label: instance_name

# Consul service discovery
- job_name: 'consul-services'
  consul_sd_configs:
    - server: consul:8500
      services:
        - api
        - web
  relabel_configs:
    - source_labels: [__meta_consul_service]
      target_label: service
```

```bash
# Debug service discovery: check which targets were
# discovered but then dropped by relabeling
curl http://prometheus:9090/api/v1/targets | jq '.data.droppedTargets'
```
Fix metric relabeling:
```yaml
# relabel_configs run BEFORE scraping
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['server1:9100']
    relabel_configs:
      # Drop targets matching pattern
      - source_labels: [__address__]
        regex: '.*:9100'
        action: drop   # Careful: drops ALL port-9100 targets!

      # Keep only specific targets
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: 'web|api'
        action: keep

      # Copy a discovery label to a target label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
        replacement: '${1}'

      # Add static label
      - target_label: monitored_by
        replacement: 'prometheus'

      # Bucket a sensitive, high-cardinality label
      # (the action is "hashmod"; there is no "hash" action)
      - source_labels: [user_id]
        target_label: user_id_bucket
        action: hashmod
        modulus: 1000

      # Drop high-cardinality label
      - regex: 'pod_template_hash'
        action: labeldrop

    # metric_relabel_configs run AFTER scraping
    metric_relabel_configs:
      # Drop specific metrics
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop

      # Keep only specific metrics
      - source_labels: [__name__]
        regex: 'http_requests_total|node_.*'
        action: keep

      # Rename metric (match the metric to rename, or every
      # metric's name will be rewritten)
      - source_labels: [__name__]
        regex: 'old_metric'
        target_label: __name__
        replacement: 'renamed_metric'
```
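Relabeling rules are easy to get wrong, and a dropped target gives no feedback beyond its absence. A way to build intuition is to simulate the core actions on a label set before deploying. This sketch implements only a small subset of the semantics (`keep`, `drop`, `replace`, `labeldrop`) and is an approximation, not Prometheus's actual implementation:

```python
import re

def apply_relabel(labels, configs):
    """Apply a minimal subset of relabel actions to a label set.
    Returns the resulting labels, or None if the target is dropped."""
    labels = dict(labels)
    for cfg in configs:
        action = cfg.get("action", "replace")
        if action == "labeldrop":
            pat = re.compile(cfg["regex"])
            labels = {k: v for k, v in labels.items() if not pat.fullmatch(k)}
            continue
        # Joined source label values, matched against the full regex
        value = ";".join(labels.get(s, "") for s in cfg.get("source_labels", []))
        match = re.fullmatch(cfg.get("regex", "(.*)"), value)
        if action == "keep" and not match:
            return None
        if action == "drop" and match:
            return None
        if action == "replace" and match:
            repl = cfg.get("replacement", "${1}")
            if "${" in repl:
                # Translate Prometheus-style ${1} to Python's \g<1>
                labels[cfg["target_label"]] = match.expand(
                    repl.replace("${", "\\g<").replace("}", ">"))
            else:
                labels[cfg["target_label"]] = repl
    return labels

# Rules mirroring the config above
cfgs = [
    {"source_labels": ["__meta_kubernetes_pod_label_app"],
     "regex": "web|api", "action": "keep"},
    {"source_labels": ["__meta_kubernetes_namespace"],
     "target_label": "namespace"},
    {"target_label": "monitored_by", "replacement": "prometheus"},
]
print(apply_relabel(
    {"__meta_kubernetes_pod_label_app": "web",
     "__meta_kubernetes_namespace": "prod"}, cfgs))
print(apply_relabel({"__meta_kubernetes_pod_label_app": "db"}, cfgs))
# → None (dropped by the keep rule)
```

For real configs, `promtool` plus the `droppedTargets` API shown earlier remain the authoritative checks.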
### 3. Fix Grafana datasource issues
Test datasource connection:
```bash
# List datasources via API
curl -H "Authorization: Bearer API_KEY" \
  http://grafana:3000/api/datasources

# Get specific datasource
curl -H "Authorization: Bearer API_KEY" \
  http://grafana:3000/api/datasources/1

# Test connectivity by querying Prometheus through the
# datasource proxy
curl -H "Authorization: Bearer API_KEY" \
  'http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up'

# Or via UI: Settings > Data Sources > [Select DS] > Save & Test
```
Configure Prometheus datasource:
```yaml
# Grafana datasource provisioning
# /etc/grafana/provisioning/datasources/prometheus.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy   # or direct
    url: http://prometheus:9090
    isDefault: true
    editable: false

    # Authentication
    basicAuth: true
    basicAuthUser: prometheus

    jsonData:
      # TLS
      tlsAuth: true
      tlsAuthWithCACert: true
      tlsSkipVerify: false

      # Timeout settings
      timeout: 30
      httpMethod: POST

      # Alertmanager datasource linked to this Prometheus
      alertmanagerUid: alertmanager

    # Secrets and cert material belong in secureJsonData
    secureJsonData:
      basicAuthPassword: secret
      tlsCACert: |
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
```
Debug Grafana dashboard issues:
```bash
# Check Grafana logs
kubectl logs -l app=grafana -n monitoring
# Or
docker logs grafana

# Common errors (note: Grafana's log format really does
# spell the level "eror"):

# Datasource unavailable
# lvl=eror msg="Data source unavailable"
#   error="Post \"http://prometheus:9090/api/v1/query\": dial tcp: connection refused"

# Query error
# lvl=eror msg="Panel data error"
#   error="invalid expression: unknown function: invalid_func()"

# Plugin error
# lvl=eror msg="Plugin error"
#   error="Plugin requested was not found"

# Check dashboard JSON for errors: download the dashboard
# JSON and inspect the template variables
jq '.templating.list[] | select(.query | type == "string")' dashboard.json
# In current schemas the query should be an object, not a bare
# string (a common issue after exporting from older versions)
```
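The same template-variable check can be done in a script when auditing many dashboards at once. A small sketch; `stringly_template_queries` is an illustrative helper, and the dashboard JSON shape assumed here is the one discussed above:

```python
import json

def stringly_template_queries(dashboard_json: str):
    """Return the names of template variables whose 'query' is a
    plain string; newer Grafana schemas expect an object."""
    dash = json.loads(dashboard_json)
    return [
        var.get("name")
        for var in dash.get("templating", {}).get("list", [])
        if isinstance(var.get("query"), str)
    ]

sample = json.dumps({"templating": {"list": [
    {"name": "instance", "query": "label_values(up, instance)"},
    {"name": "job", "query": {"query": "label_values(up, job)"}},
]}})
print(stringly_template_queries(sample))
# → ['instance']
```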
### 4. Fix Alertmanager issues
Check Alertmanager status:
```bash
# Check Alertmanager UI
# http://alertmanager:9093

# Check status via API (use v2; the v1 API was removed in
# Alertmanager 0.27)
curl http://alertmanager:9093/api/v2/status

# Check silences
curl http://alertmanager:9093/api/v2/silences

# Check pending/firing alerts
curl http://alertmanager:9093/api/v2/alerts

# Check receivers
curl http://alertmanager:9093/api/v2/receivers
```
Validate Alertmanager configuration:
```yaml
# alertmanager.yml

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  smtp_require_tls: true

  # Slack
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

  # PagerDuty
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Inhibition rules
inhibit_rules:
  # If a critical alert fires, suppress the warning for the same
  # alertname/instance
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']

# Routing
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      continue: true

    - matchers:
        - team="infrastructure"
      receiver: 'slack-infra'

    - matchers:
        - alertname="Watchdog"
      receiver: 'null'   # Silence Watchdog alerts

# Receivers
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'PAGERDUTY_SERVICE_KEY'
        severity: critical

  - name: 'slack-infra'
    slack_configs:
      - channel: '#alerts-infra'
        send_resolved: true
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'

  - name: 'null'   # Empty receiver for silencing
```

```bash
# Validate config
amtool check-config /etc/alertmanager/alertmanager.yml
```
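When a routing tree misbehaves, it helps to reason through the first-match-wins plus `continue` semantics by hand. This sketch models routing against the config above, using simple equality matchers (a `match`-style dict rather than the full matcher syntax); `resolve_receivers` is an illustrative approximation, not Alertmanager code, and `amtool config routes test` is the authoritative tool:

```python
def resolve_receivers(route, labels):
    """Return the receivers an alert with these labels is routed to.
    Children are tried in order; the first match wins unless it sets
    continue, and an unmatched alert falls back to this node's receiver."""
    def matches(node):
        return all(labels.get(k) == v for k, v in node.get("match", {}).items())
    any_match = False
    out = []
    for child in route.get("routes", []):
        if matches(child):
            any_match = True
            out.extend(resolve_receivers(child, labels))
            if not child.get("continue", False):
                return out
    if not any_match:
        out.append(route["receiver"])
    return out

# Tree mirroring the routing config above
tree = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"},
         "receiver": "pagerduty-critical", "continue": True},
        {"match": {"team": "infrastructure"},
         "receiver": "slack-infra"},
    ],
}
print(resolve_receivers(tree, {"severity": "critical", "team": "infrastructure"}))
# → ['pagerduty-critical', 'slack-infra']
print(resolve_receivers(tree, {"severity": "warning"}))
# → ['default-receiver']
```

Note how `continue: true` lets the critical alert page PagerDuty and still reach the team Slack route.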
Test notifications:
```bash
# Send a test alert with amtool
amtool alert add alertname=TestAlert severity=warning instance=test \
  --alertmanager.url=http://alertmanager:9093

# Send a test alert via the API
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "instance": "test"
    },
    "annotations": {
      "summary": "Test alert",
      "description": "This is a test notification"
    },
    "generatorURL": "http://test"
  }]'

# Check whether the notification was delivered:
# - Slack: check the channel
# - Email: check inbox/spam
# - PagerDuty: check incidents
# - Also check Alertmanager logs for notification errors
```
### 5. Fix storage issues
Check Prometheus storage:
```bash
# Check TSDB status
curl http://prometheus:9090/api/v1/status/tsdb

# Output:
# {
#   "headChunks": 12345,
#   "headSeries": 67890,
#   "chunksOnDisk": 123456,
#   "blocks": [
#     {
#       "ulid": "01ABC...",
#       "minTime": 1711900000000,
#       "maxTime": 1711986400000,
#       "numSamples": 1234567,
#       "numSeries": 5000
#     }
#   ]
# }

# Check disk usage
du -sh /prometheus/data/*

# List blocks
ls -la /prometheus/data/01ABC*/

# Check WAL size
du -sh /prometheus/data/wal/
```
Fix disk space issues:
```bash
# Retention is set via command-line flags, not in prometheus.yml:
prometheus \
  --storage.tsdb.path=/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB   # whichever limit is hit first

# Storage size depends on:
# - Retention time
# - Number of series
# - Sample rate
# - Compression

# Estimate required storage:
# storage_needed = series_count * samples_per_second * bytes_per_sample * retention_seconds
# Example: 100K series * 1 sample/15s * 2 bytes * 15 days
#        = 100000 * (1/15) * 2 * (15 * 24 * 3600) bytes ≈ 17 GB

# Reduce storage by:
# 1. Shortening retention
# 2. Reducing scrape frequency for non-critical metrics
# 3. Dropping high-cardinality metrics with relabeling
# 4. Using recording rules to aggregate, then dropping raw data

# Note: Prometheus compacts blocks automatically; promtool has no
# supported manual compaction command.

# Delete old blocks (use carefully! only blocks older than retention,
# and only with Prometheus stopped)
rm -rf /prometheus/data/01ABC*
```
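The sizing formula above is easy to turn into a small calculator for capacity planning (this uses the same rough 2 bytes/sample figure; real compression varies with data shape):

```python
def tsdb_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2):
    """Rough TSDB sizing: ingest rate (samples/sec) times
    bytes per sample times retention in seconds."""
    samples_per_sec = series / scrape_interval_s
    return samples_per_sec * bytes_per_sample * retention_days * 24 * 3600

# The worked example above: 100K series, 15s interval, 15d retention
gb = tsdb_bytes(100_000, 15, 15) / 1e9
print(f"~{gb:.1f} GB")
# → ~17.3 GB
```

Leave generous headroom on top of the estimate: compaction and WAL replay need temporary disk space beyond the steady-state blocks.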
Fix WAL corruption:
```bash
# Stop Prometheus
systemctl stop prometheus

# Backup data
cp -r /prometheus/data /prometheus/data.backup

# Prometheus attempts to repair a corrupted WAL on startup,
# so try starting it first and watch the logs
systemctl start prometheus
journalctl -u prometheus -f

# If startup repair fails, remove the WAL (loses data not yet
# compacted into blocks, typically up to the last ~2 hours)
systemctl stop prometheus
rm -rf /prometheus/data/wal
rm -rf /prometheus/data/chunks_head

# Start Prometheus (will serve data from the persisted blocks)
systemctl start prometheus

# Check if running
curl http://prometheus:9090/api/v1/status/tsdb
```
### 6. Fix high cardinality issues
Identify high cardinality metrics:
```bash
# Series count by metric name (flag anything over 10k series)
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=count by (__name__) ({__name__=~".+"})' \
  | jq '.data.result[] | select(.value[1] | tonumber > 10000)'

# Top 10 metrics by series count
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))'

# Check specific metric cardinality
curl -G 'http://prometheus:9090/api/v1/series' \
  --data-urlencode 'match[]=http_requests_total' \
  | jq '.data | length'

# Find labels with high cardinality
curl 'http://prometheus:9090/api/v1/label/user_id/values' \
  | jq '.data | length'
```
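To track cardinality over time or wire it into a report, the instant-query response from the first command above can be ranked in a script. A sketch assuming the standard query-API JSON shape (`top_series_counts` is an illustrative helper):

```python
import json

def top_series_counts(query_result_json, n=3):
    """From a `count by (__name__)` instant-query response, return
    the n metric names with the most series, descending."""
    data = json.loads(query_result_json)["data"]["result"]
    counts = [(r["metric"]["__name__"], int(r["value"][1])) for r in data]
    return sorted(counts, key=lambda kv: -kv[1])[:n]

# Sample response shaped like the query API output
sample = json.dumps({"data": {"result": [
    {"metric": {"__name__": "http_requests_total"}, "value": [0, "12000"]},
    {"metric": {"__name__": "up"}, "value": [0, "150"]},
    {"metric": {"__name__": "go_gc_duration_seconds"}, "value": [0, "900"]},
]}})
print(top_series_counts(sample, 2))
# → [('http_requests_total', 12000), ('go_gc_duration_seconds', 900)]
```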
Reduce cardinality:
```yaml
# Reduce cardinality with metric relabeling
scrape_configs:
  - job_name: 'api'
    metric_relabel_configs:
      # Drop the user_id label (unbounded cardinality)
      - regex: 'user_id'
        action: labeldrop

      # Hash IP addresses into a bounded number of buckets
      - source_labels: [client_ip]
        target_label: client_ip_bucket
        action: hashmod
        modulus: 1000

      # Or drop the IP label entirely
      - regex: 'client_ip'
        action: labeldrop

      # Drop raw histogram buckets; use a recording rule for an
      # aggregated view instead
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'
        action: drop

      # Drop entire high-cardinality metrics
      - source_labels: [__name__]
        regex: 'go_gc_heap_allocs_by_size_bytes.*'
        action: drop
```
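To predict which bucket a given label value lands in, you can reproduce the hashing locally. Prometheus's `hashmod` takes the MD5 of the joined source label values and reduces the low 8 bytes modulo `modulus`; this sketch mirrors that (verify against your Prometheus version's relabel implementation if you need exact bucket parity):

```python
import hashlib

def hashmod(value: str, modulus: int) -> int:
    """Bucket a label value: MD5 of the value, low 8 bytes
    interpreted as a big-endian unsigned int, mod the modulus."""
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[8:], "big") % modulus

bucket = hashmod("10.0.0.1", 1000)
print(f"10.0.0.1 -> bucket {bucket}")
```

The key property is that the output space is bounded by `modulus` (here, at most 1000 series instead of one per IP) while remaining deterministic per value.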
### 7. Monitor the monitoring stack
Self-monitoring Prometheus:
```yaml
# Prometheus monitors itself
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

# Key self-monitoring metrics:
# prometheus_tsdb_head_samples_appended_total
# prometheus_tsdb_head_series
# prometheus_tsdb_wal_writes_failed_total
# prometheus_target_scrape_pool_exceeded_target_limit
# prometheus_rule_evaluation_failures_total
# prometheus_notifications_errors_total

# Alert on self-monitoring (in a separate rules file)
groups:
  - name: prometheus-self
    rules:
      - alert: PrometheusTargetScrapeFail
        expr: up{job="prometheus"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape itself"

      - alert: PrometheusRuleEvaluationFail
        expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Prometheus rule evaluation failing"
```
Grafana dashboard for monitoring stack:
```yaml
# Import standard dashboards by ID:
#   Prometheus:    3662 (Prometheus 2.0 Stats)
#   Alertmanager:  9573 (Alertmanager)
#   Node Exporter: 1860 (Node Exporter Full)

# Create a custom dashboard with:
# - Prometheus target UP/DOWN status
# - Scrape duration histogram
# - Series count over time
# - WAL writes per second
# - Query latency
# - Alert evaluation latency
# - Notification success/failure rate
```
## Prevention
- Set up alerts for monitoring stack health (meta-monitoring)
- Implement proper retention policies based on disk capacity
- Monitor series cardinality and set limits
- Use recording rules to pre-aggregate high-cardinality data
- Test alertmanager notifications regularly
- Document runbooks for common monitoring issues
- Backup Prometheus data and Grafana dashboards
- Use federation or Thanos for multi-cluster setups
- Implement proper TLS for all monitoring traffic
- Regular capacity planning reviews
## Related Errors
- **404 Not Found**: Metric, dashboard, or datasource doesn't exist
- **503 Service Unavailable**: Monitoring service temporarily unavailable
- **context deadline exceeded**: Query timeout
- **no data points**: Query returned no results
- **template execution error**: Alert/dashboard template error