Your Grafana dashboards are showing "Datasource connection failed" or "No data" errors, and you're losing visibility into your systems. These errors usually stem from network problems, authentication failures, or misconfiguration. Let's walk through a systematic approach to diagnosing and fixing them.

Understanding the Error

Grafana datasource errors typically appear in several ways:

In the UI:

```
Datasource connection failed
Error querying datasource: bad gateway
Failed to call resource
```

In Grafana logs:

```
logger=tsdb.prometheus t=2024-01-15T10:23:45.123Z level=error msg="Failed to query datasource" err="Post \"http://prometheus:9090/api/v1/query\": dial tcp: lookup prometheus: no such host"
logger=sqlstore t=2024-01-15T10:23:45.123Z level=error msg="Failed to connect to database" err="dial tcp 10.0.0.5:3306: connect: connection refused"
```
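As a rough triage aid, the error strings above can be mapped to the causes covered in the rest of this guide. The sketch below is not part of Grafana: the function name `classify_error` and the category labels are hypothetical, and the patterns only cover the messages quoted in this article.

```bash
# Hypothetical helper: map a Grafana error message to a likely cause.
# The patterns cover only the messages quoted in this article.
classify_error() {
  case "$1" in
    *"no such host"*)               echo "dns" ;;
    *"connection refused"*)         echo "service-down-or-port-closed" ;;
    *"x509:"*)                      echo "tls" ;;
    *"Unauthorized"*|*"Forbidden"*) echo "auth" ;;
    *"context deadline exceeded"*)  echo "timeout" ;;
    *"bad gateway"*)                echo "upstream-unreachable" ;;
    *)                              echo "unknown" ;;
  esac
}

# Example: classify the most recent error line in the Grafana log.
# classify_error "$(grep 'level=error' /var/log/grafana/grafana.log | tail -n 1)"
```

The result is only a hint about which section to read first; the underlying cause still needs to be confirmed with the checks in that section.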

Initial Diagnosis

Start by checking the datasource configuration and testing the connection:

```bash
# Get current datasource configuration via API
curl -s http://admin:password@localhost:3000/api/datasources | jq '.[] | {id: .id, name: .name, type: .type, url: .url}'

# Test a specific datasource (replace 1 with the datasource id)
curl -s http://admin:password@localhost:3000/api/datasources/1/health | jq '.'
```

Check Grafana logs for connection errors:

```bash
# For systemd installations (-E enables the | alternation)
journalctl -u grafana-server -f | grep -iE "datasource|connection|error"

# For Docker/Kubernetes
kubectl logs -l app=grafana -n monitoring -f | grep -iE "datasource|connection"

# Check the main Grafana log file
tail -f /var/log/grafana/grafana.log | grep -iE "datasource|error"
```

Common Cause 1: Network Connectivity

The most common cause is that Grafana cannot reach the datasource host.

Diagnosis:

```bash
# Test connectivity from the Grafana server to the datasource
curl -v "http://prometheus-server:9090/api/v1/query?query=up"

# For Kubernetes environments, test from inside the Grafana pod
kubectl exec -it grafana-0 -n monitoring -- curl -s http://prometheus:9090/-/healthy

# Check DNS resolution
nslookup prometheus-server
dig prometheus-server +short

# Check if the port is open
nc -zv prometheus-server 9090

# Test with the exact URL from the datasource config
curl -v http://prometheus-server.monitoring.svc.cluster.local:9090/-/healthy
```

Solution:

Fix the network path or update the datasource URL:

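```bash
# For Kubernetes, point the datasource URL at the service DNS name.
# The datasource API has no PATCH endpoint; PUT replaces the whole definition,
# so send the full object (fetch it first with GET /api/datasources/1).
curl -X PUT http://admin:password@localhost:3000/api/datasources/1 \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "access": "proxy",
    "url": "http://prometheus-server.monitoring.svc.cluster.local:9090"
  }'
```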

For firewall issues:

```bash
# Check firewall rules
iptables -L -n | grep 9090

# For firewalld
firewall-cmd --list-all

# Allow traffic if needed
firewall-cmd --add-port=9090/tcp --permanent
firewall-cmd --reload
```

Common Cause 2: Authentication Failures

Many datasources require authentication, and incorrect credentials will cause connection failures.

Error patterns:

```
Error 401: Unauthorized
Error 403: Forbidden
```

Diagnosis:

```bash
# Test datasource with authentication
curl -u username:password "http://datasource-host:9090/api/v1/query?query=up"

# Test with basic auth header
curl -H "Authorization: Basic $(echo -n 'user:password' | base64)" \
  "http://datasource-host:9090/api/v1/query?query=up"

# Test with bearer token
curl -H "Authorization: Bearer your-token" \
  "http://datasource-host:9090/api/v1/query?query=up"

# For databases, test the connection directly
mysql -h mysql-host -u grafana -p -e "SELECT 1"
psql -h postgres-host -U grafana -d grafana -c "SELECT 1"
```

Solution:

Update datasource configuration with correct credentials:

```bash
# Update via API
curl -X PUT http://admin:password@localhost:3000/api/datasources/1 \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://prometheus:9090",
    "access": "proxy",
    "basicAuth": true,
    "basicAuthUser": "admin",
    "secureJsonData": {
      "basicAuthPassword": "newpassword"
    }
  }'
```

For database datasources:

```bash
curl -X PUT http://admin:password@localhost:3000/api/datasources/1 \
  -H "Content-Type: application/json" \
  -d '{
    "name": "PostgreSQL",
    "type": "postgres",
    "access": "proxy",
    "url": "postgres:5432",
    "database": "grafana",
    "user": "grafana",
    "secureJsonData": {
      "password": "securepassword"
    },
    "jsonData": {
      "sslmode": "disable"
    }
  }'
```

Common Cause 3: TLS/SSL Certificate Issues

When datasources use HTTPS, certificate problems can prevent connections.

Error patterns:

```
x509: certificate signed by unknown authority
x509: certificate has expired or is not yet valid
```

Diagnosis:

```bash
# Check certificate validity
echo | openssl s_client -connect datasource-host:443 -servername datasource-host 2>/dev/null | openssl x509 -noout -dates

# Check certificate chain
openssl s_client -connect datasource-host:443 -servername datasource-host -showcerts

# Test connection skipping TLS verification
curl -k https://datasource-host/metrics
```

Solution:

Option 1: Add custom CA certificate to Grafana:

```bash
# Grafana uses the operating system trust store, so add the CA there.
# Debian/Ubuntu:
cp /path/to/ca.crt /usr/local/share/ca-certificates/datasource-ca.crt
update-ca-certificates

# RHEL/CentOS/Fedora:
cp /path/to/ca.crt /etc/pki/ca-trust/source/anchors/datasource-ca.crt
update-ca-trust extract

# Restart Grafana so it picks up the updated trust store
systemctl restart grafana-server
```

Option 2: Configure datasource to skip TLS verification (not recommended for production):

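```bash
# PUT replaces the whole definition, so include the existing fields as well
curl -X PUT http://admin:password@localhost:3000/api/datasources/1 \
  -H "Content-Type: application/json" \
  -d '{
    "name": "SecurePrometheus",
    "type": "prometheus",
    "url": "https://prometheus:9090",
    "access": "proxy",
    "jsonData": {
      "tlsSkipVerify": true
    }
  }'
```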

Option 3: Add custom CA to datasource:

```bash
# tlsAuthWithCACert enables the custom CA; tlsAuth is only needed for client certificates
curl -X PUT http://admin:password@localhost:3000/api/datasources/1 \
  -H "Content-Type: application/json" \
  -d '{
    "name": "SecurePrometheus",
    "type": "prometheus",
    "url": "https://prometheus:9090",
    "access": "proxy",
    "jsonData": {
      "tlsAuthWithCACert": true
    },
    "secureJsonData": {
      "tlsCACert": "-----BEGIN CERTIFICATE-----\nMIID...\n-----END CERTIFICATE-----"
    }
  }'
```

Common Cause 4: Datasource Service Issues

Sometimes the datasource itself is not running or is unhealthy.

Diagnosis:

```bash
# Check if Prometheus is running
curl http://prometheus:9090/-/healthy
curl http://prometheus:9090/-/ready

# Check Prometheus status
systemctl status prometheus

# For Kubernetes
kubectl get pods -l app=prometheus -n monitoring
kubectl logs -l app=prometheus -n monitoring --tail=50

# Check database connectivity
kubectl exec -it postgres-0 -- pg_isready

# Check Elasticsearch health
curl http://elasticsearch:9200/_cluster/health
```

Solution:

Fix the datasource service:

```bash
# Restart Prometheus if it's down
kubectl rollout restart deployment/prometheus-server -n monitoring

# Check for resource constraints
kubectl describe pod prometheus-server-0 -n monitoring

# Check events
kubectl get events -n monitoring --sort-by='.lastTimestamp'
```

Common Cause 5: Proxy and Access Mode Issues

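Grafana has two access modes: server (proxy) and browser (direct). Choosing the wrong one for your network topology causes failures, because the request originates from a different machine in each mode.

- Server (proxy) mode: the Grafana server makes the request to the datasource, so the datasource must be reachable from the Grafana host.
- Browser (direct) mode: the user's browser makes the request directly, so the datasource must be reachable from every user's machine. This mode is deprecated in recent Grafana releases and may not be available for your datasource type.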

Diagnosis:

```bash
# Check current access mode
curl -s http://admin:password@localhost:3000/api/datasources | jq '.[] | {name: .name, access: .access}'

# For browser/direct mode, test from your local machine
curl "http://datasource-host:9090/api/v1/query?query=up"
```

Solution:

Update access mode based on your network topology:

```bash
# Set to server/proxy mode (most common).
# PUT replaces the whole definition, so include name, type, and url as well.
curl -X PUT http://admin:password@localhost:3000/api/datasources/1 \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://prometheus:9090",
    "access": "proxy"
  }'

# Or browser/direct mode (requires the datasource to be reachable from the user's browser):
# send the same request with "access": "direct"
```

Common Cause 6: Timeout Issues

Large queries or slow datasources can cause timeouts.

Error pattern:

```
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

Solution:

Increase timeout settings:

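```bash
# PUT replaces the whole definition, so include the existing fields as well.
# timeout is the HTTP request timeout in seconds.
curl -X PUT http://admin:password@localhost:3000/api/datasources/1 \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://prometheus:9090",
    "access": "proxy",
    "jsonData": {
      "timeout": 60,
      "httpMethod": "POST"
    }
  }'
```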

Or in grafana.ini:

```ini
[dataproxy]
timeout = 60
dialTimeout = 30
```

Common Cause 7: Resource Limits

Grafana or the datasource might be resource-constrained.

Diagnosis:

```bash
# Check Grafana usage statistics (requires admin credentials)
curl -s http://admin:password@localhost:3000/api/admin/stats | jq '.'

# For Kubernetes
kubectl top pods -n monitoring
kubectl describe pod grafana-0 -n monitoring

# Check Grafana configuration (brackets must be escaped to match the section header)
grep -A 10 '^\[database\]' /etc/grafana/grafana.ini
```

Solution:

```yaml
# Increase resources for Grafana
resources:
  limits:
    cpu: "2"
    memory: "2Gi"
  requests:
    cpu: "500m"
    memory: "512Mi"
```

Verification

After making changes, verify the datasource is working:

```bash
# Test datasource health
curl -s http://admin:password@localhost:3000/api/datasources/1/health | jq '.'

# Run a test query
curl -s "http://admin:password@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up" | jq '.'

# Check in Grafana UI
# Navigate to Configuration > Data Sources > [Your Datasource] > Test
```

Prevention

Set up monitoring for datasource health:

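```yaml
# Add to Prometheus alerting rules
groups:
  - name: grafana_health
    rules:
      - alert: GrafanaDatasourceDown
        # The metric is a counter, so alert on recent growth rather than its lifetime total.
        # Label names vary by Grafana version; check them against Grafana's /metrics output.
        expr: increase(grafana_datasource_request_total{status="error"}[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Grafana datasource {{ $labels.datasource }} is failing"
```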

Regular health checks:

```bash
#!/bin/bash
# Add to a cron job or monitoring system
GRAFANA_URL="http://admin:password@localhost:3000"

DATASOURCES=$(curl -s "$GRAFANA_URL/api/datasources" | jq -r '.[].id')
for ID in $DATASOURCES; do
  HEALTH=$(curl -s "$GRAFANA_URL/api/datasources/$ID/health" | jq -r '.status')
  if [ "$HEALTH" != "OK" ]; then
    echo "Datasource $ID is unhealthy: $HEALTH"
    # Send alert
  fi
done
```

The key to resolving datasource connection issues is to test connectivity at each layer: network, authentication, TLS, and service health. Start with the simplest tests and work your way through the stack.