Introduction

Grafana dashboard loading failures occur when dashboards fail to render, showing errors like "Panel loading failed," "Datasource error," or infinite loading spinners. This prevents operators from viewing metrics, logs, and traces critical for system monitoring. Common causes include datasource connectivity issues (Prometheus, InfluxDB, Elasticsearch unreachable), query timeouts on large time ranges or high-cardinality data, panel render errors from malformed queries, Grafana server memory exhaustion, database backend issues (SQLite lock contention, PostgreSQL connection limits), plugin compatibility problems after upgrades, browser caching of stale dashboard JSON, authentication/token expiration, CORS configuration blocking datasource requests, and proxy configuration issues when Grafana is behind NGINX or load balancers. The fix requires diagnosing whether the issue is client-side (browser, caching), server-side (Grafana process, database), or datasource-side (query performance, connectivity). This guide provides production-proven troubleshooting for Grafana dashboard failures across Docker, Kubernetes, and bare-metal deployments.
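A useful first triage step is mapping the HTTP status code you see (in the browser's Network tab or in Grafana logs) to the layer that most often produces it. The sketch below encodes that heuristic; `classify_status` is a hypothetical helper written for this guide, not a Grafana tool, and the mappings are common patterns rather than guarantees.

```shell
#!/bin/sh
# Rough triage heuristic: map an HTTP status code seen in the browser's
# Network tab (or in Grafana logs) to the layer that most often causes it.
# classify_status is a hypothetical helper, not part of Grafana.
classify_status() {
  case "$1" in
    401|403) echo "client-side: expired session or missing auth token" ;;
    500)     echo "server-side: Grafana process or database error" ;;
    502)     echo "proxy-side: reverse proxy cannot reach Grafana" ;;
    503)     echo "server-side: Grafana overloaded or too many concurrent queries" ;;
    504)     echo "datasource-side: query exceeded a proxy or query timeout" ;;
    *)       echo "unclear: check Grafana server logs" ;;
  esac
}

classify_status 504
```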

Symptoms

  • Dashboard shows "Panel loading failed" error
  • Infinite loading spinner on dashboard or panel
  • "Datasource error: Unable to connect" message
  • "Query timeout" or "Request timeout" errors
  • Dashboard JSON fails to load (blank dashboard)
  • Grafana UI loads but panels show errors
  • "Too many concurrent queries" error
  • Browser console shows 500/502/504 errors
  • Grafana logs show "database is locked" (SQLite)
  • Panels render partially or show stale data

Common Causes

  • Datasource server unreachable (network, firewall, downtime)
  • Query complexity causing timeout (large time range, no aggregation)
  • High-cardinality data overwhelming browser rendering
  • Grafana server memory limit exceeded (OOM)
  • SQLite database locked during concurrent writes
  • PostgreSQL connection pool exhausted
  • Plugin version incompatible with Grafana version
  • Browser cache holding stale dashboard definition
  • Authentication session expired
  • Reverse proxy timeout too short for complex queries
  • TLS/SSL certificate issues for HTTPS datasources
  • Resource limits in Kubernetes (CPU/memory requests)

Step-by-Step Fix

### 1. Diagnose dashboard loading failure

Check Grafana server logs:

```bash
# Docker deployment
docker logs grafana 2>&1 | tail -100

# Kubernetes deployment
kubectl logs -l app=grafana -n monitoring --tail=100

# Systemd deployment
journalctl -u grafana-server -n 100

# Look for errors:
# - "database is locked" (SQLite contention)
# - "context deadline exceeded" (query timeout)
# - "connection refused" (datasource unreachable)
# - "too many open files" (file descriptor limit)
# - "out of memory" (memory exhaustion)

# Enable debug logging
# Edit /etc/grafana/grafana.ini
[log]
mode = console file
level = debug

# Or raise the level only for specific loggers
[log]
filters = sqlstore:debug datasources:debug query:debug

# Restart Grafana
systemctl restart grafana-server
```

Check browser console errors:

```javascript
// Open browser DevTools (F12) > Console
// Look for errors like:

// Datasource connection error
Error: Datasource unreachable: http://prometheus:9090

// Query timeout
Error: Query timeout after 30000ms

// Panel render error
Error: Failed to render panel: TypeError: Cannot read property 'map' of undefined

// Network tab shows failed requests:
// GET  /api/dashboards/...         500 Internal Server Error
// GET  /api/datasources/proxy/...  504 Gateway Timeout
// POST /api/tsdb/query             503 Service Unavailable

// Check network timing
// Look for requests taking > 30 seconds (default timeout)
```

Test datasource connectivity:

```bash
# From the Grafana server, test datasource connectivity

# Prometheus
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq

# InfluxDB
curl -s -G http://influxdb:8086/query \
  --data-urlencode "q=SHOW DATABASES" | jq

# Elasticsearch
curl -s http://elasticsearch:9200/_cluster/health | jq

# If a connection fails:
# - Check network connectivity
# - Verify firewall rules
# - Check datasource service status
# - Verify credentials

# From the Grafana UI:
# Configuration > Data Sources > Select datasource > "Save & Test"
# Shows connection status and latency
```
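When Grafana talks to several datasources, the per-service curl checks above can be wrapped in a small loop. A minimal sketch, assuming the hostnames and health paths shown (substitute your own endpoints); `check_ds` is a helper name invented here:

```shell
#!/bin/sh
# Probe a list of datasource health endpoints and report reachability.
# The URLs below are placeholders for your own Prometheus/InfluxDB/
# Elasticsearch hosts; check_ds is a hypothetical helper.
check_ds() {
  if curl -sf -o /dev/null --max-time 5 "$1"; then
    echo "OK   $1"
  else
    echo "FAIL $1"
  fi
}

for url in \
  "http://prometheus:9090/-/healthy" \
  "http://influxdb:8086/ping" \
  "http://elasticsearch:9200/_cluster/health"; do
  check_ds "$url"
done
```

Run it from the Grafana host (or inside the Grafana container) so the probes cross the same network path Grafana uses.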

### 2. Fix datasource timeout issues

Configure datasource timeout:

```bash
# Grafana datasource timeout configuration
# Edit datasource JSON or use the UI
# Note: PUT /api/datasources/:id replaces the datasource definition;
# include the full object (name, type, url, ...) in practice

# Via API - update the Prometheus datasource
curl -X PUT http://localhost:3000/api/datasources/1 \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "jsonData": {
      "timeInterval": "15s",
      "queryTimeout": "60s",
      "httpMethod": "POST"
    }
  }'

# Via UI:
# Configuration > Data Sources > Prometheus
# - Query timeout: 60s (default 30s)
# - Time interval: 15s (auto-interval)
# - HTTP method: POST (better for large queries)

# For InfluxDB
curl -X PUT http://localhost:3000/api/datasources/2 \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "jsonData": {
      "queryTimeout": "120s",
      "httpMode": "POST"
    }
  }'

# For Elasticsearch
curl -X PUT http://localhost:3000/api/datasources/3 \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "jsonData": {
      "logMessageField": "message",
      "logLevelField": "level",
      "timeout": "60s",
      "maxConcurrentShardRequests": 5
    }
  }'
```

Optimize datasource queries:

```javascript
// Prometheus query optimization

// BAD: high-cardinality query without aggregation
rate(http_requests_total[5m])

// GOOD: aggregate by relevant labels
sum(rate(http_requests_total[5m])) by (service, status_code)

// BAD: fixed rate window over a huge time range
rate(http_requests_total[5m])  // over 30 days = millions of points

// GOOD: use Grafana's interval variable
rate(http_requests_total[$__rate_interval])

// Use recording rules for expensive queries.
// In the Prometheus rules file:
//   groups:
//     - name: recording_rules
//       rules:
//         - record: service:http_requests:rate5m
//           expr: sum(rate(http_requests_total[5m])) by (service)

// Then in the Grafana query:
service:http_requests:rate5m

// InfluxDB query optimization

// BAD: select all fields without a filter
SELECT * FROM metrics

// GOOD: specific fields with a time filter
SELECT mean(value) FROM cpu
  WHERE host =~ /^$host$/ AND $timeFilter
  GROUP BY time($__interval), host

// Elasticsearch query optimization

// BAD: match-all query without a time filter
{ "query": { "match_all": {} } }

// GOOD: time-filtered with aggregation
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "$__from", "lte": "$__to" } } },
        { "term": { "service": "$service" } }
      ]
    }
  },
  "aggs": {
    "by_status": { "terms": { "field": "status" } }
  },
  "size": 0
}
```

### 3. Fix database backend issues

SQLite database locked:

```bash
# SQLite is Grafana's default database backend.
# "database is locked" errors indicate concurrent write contention.

# Check the current journal mode
sqlite3 /var/lib/grafana/grafana.db "PRAGMA journal_mode;"

# Enable WAL mode for better read/write concurrency.
# journal_mode=WAL persists in the database file itself.
sqlite3 /var/lib/grafana/grafana.db "PRAGMA journal_mode=WAL;"

# Note: PRAGMA busy_timeout only applies to the connection that sets it,
# so setting it from the sqlite3 CLI does not affect Grafana's connections.

# Grafana can also enable WAL itself
# Edit /etc/grafana/grafana.ini
[database]
type = sqlite3
path = grafana.db
wal = true

# Vacuum the database to reclaim space (stop Grafana first)
sqlite3 /var/lib/grafana/grafana.db "VACUUM;"

# Check database integrity
sqlite3 /var/lib/grafana/grafana.db "PRAGMA integrity_check;"

# If corruption is detected, restore the database file from a recent backup
```
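Before VACUUM, integrity checks, or any repair, take a copy of the database file. A minimal sketch; `backup_db` is a helper name invented here, and the paths in the usage note are the common defaults, not guaranteed for your install:

```shell
#!/bin/sh
# Copy the Grafana SQLite database (plus WAL/SHM sidecars) before maintenance.
# backup_db is a hypothetical helper written for this guide.
backup_db() {
  db="$1"
  dir="$2"
  mkdir -p "$dir" || return 1
  stamp=$(date +%Y%m%d-%H%M%S)
  backup="$dir/grafana-$stamp.db"
  cp "$db" "$backup" || return 1
  # Copy WAL/SHM files if present, so the snapshot is complete
  for ext in -wal -shm; do
    if [ -f "$db$ext" ]; then
      cp "$db$ext" "$backup$ext" || return 1
    fi
  done
  echo "$backup"
}

# Usage (stop Grafana first so the copy is consistent):
#   systemctl stop grafana-server
#   backup_db /var/lib/grafana/grafana.db /var/backups/grafana
#   systemctl start grafana-server
```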

Migrate to PostgreSQL (recommended for production):

```bash
# PostgreSQL configuration for Grafana

# Note: switching the backend does not copy existing data; Grafana
# initializes an empty schema in the new database. Export dashboards
# first, or use a migration tool.

# Create database and user (as the postgres superuser)
sudo -u postgres createuser grafana
sudo -u postgres createdb -O grafana grafana
sudo -u postgres psql -c "ALTER USER grafana WITH PASSWORD 'secure_password';"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE grafana TO grafana;"

# Configure Grafana to use PostgreSQL
# Edit /etc/grafana/grafana.ini
[database]
type = postgres
host = localhost:5432
name = grafana
user = grafana
password = secure_password
ssl_mode = disable
max_open_conn = 100
max_idle_conn = 100
conn_max_lifetime = 14400

# PostgreSQL tuning for Grafana
# Edit postgresql.conf
max_connections = 200
shared_buffers = 256MB
work_mem = 8MB
maintenance_work_mem = 64MB

# Restart services
systemctl restart postgresql
systemctl restart grafana-server

# Verify: startup logs should show the postgres connection
journalctl -u grafana-server | grep -i "database"
```

Fix PostgreSQL connection limits:

```bash
# Check current connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'grafana';"

# Check max connections
psql -c "SHOW max_connections;"

# If near the limit, increase it or add connection pooling

# Increase max connections (requires restart)
psql -c "ALTER SYSTEM SET max_connections = 200;"
systemctl restart postgresql

# Or use PgBouncer for connection pooling
# /etc/pgbouncer/pgbouncer.ini
[databases]
grafana = host=localhost port=5432 dbname=grafana

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25

# Update Grafana config
# /etc/grafana/grafana.ini
[database]
host = localhost:6432  # PgBouncer port
```
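To catch pool exhaustion before it causes load failures, compare Grafana's connection count against `max_connections` on a schedule. A sketch with hard-coded numbers standing in for live psql output; `check_conn_usage` and the 80% threshold are illustrative choices, not Grafana or PostgreSQL defaults:

```shell
#!/bin/sh
# Warn when Grafana's share of PostgreSQL connections approaches the limit.
# check_conn_usage is a hypothetical helper; in production, feed it from psql:
#   used=$(psql -tAc "SELECT count(*) FROM pg_stat_activity WHERE datname='grafana';")
#   max=$(psql -tAc "SHOW max_connections;")
check_conn_usage() {
  used="$1"
  max="$2"
  pct=$((used * 100 / max))
  if [ "$pct" -ge 80 ]; then
    echo "WARN: ${pct}% of $max connections in use"
  else
    echo "OK: ${pct}% of $max connections in use"
  fi
}

check_conn_usage 170 200   # prints: WARN: 85% of 200 connections in use
```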

### 4. Fix memory and resource issues

Grafana memory limits:

```bash
# Check Grafana memory usage
ps aux | grep grafana
# Or
systemctl status grafana-server

# Docker memory usage
docker stats grafana

# Kubernetes resource usage
kubectl top pods -n monitoring -l app=grafana

# If memory is exhausted:

# Increase the memory limit (Docker)
docker update --memory=2g grafana

# Increase the memory limit (Kubernetes)
kubectl patch deployment grafana -n monitoring --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "2Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "512Mi"}
]'

# Grafana configuration that trims overhead
# Edit /etc/grafana/grafana.ini
[analytics]
reporting_enabled = false  # Disable usage reporting

[unified_alerting]
execute_alerts = false  # Only if you are not using alerting

# Limit concurrent image rendering
[rendering]
concurrent_render_request_limit = 5
```

Browser memory issues:

```javascript
// Large dashboards can exhaust browser memory

// Optimization strategies:

// 1. Reduce the time range
//    Instead of "Last 30 days", use "Last 6 hours"

// 2. Reduce the panel count
//    Split a large dashboard into multiple smaller dashboards

// 3. Use downsampled data
//    Panel edit > Query: aggregate with mean/max/min instead of raw data

// 4. Limit data points
//    Panel edit > Query options > Max data points: 1000

// 5. Lower or disable auto-refresh
//    Dashboard settings > Time options > Auto refresh

// 6. Use lazy loading
//    Split panels into rows and only expand the sections you need

// Check browser memory usage
// Chrome: Shift+Esc (Task Manager)
// Firefox: about:performance
```

### 5. Fix plugin and caching issues

Plugin compatibility:

```bash
# List installed plugins
grafana-cli plugins ls

# Update all outdated plugins
grafana-cli plugins update-all

# Update a specific plugin
grafana-cli plugins update grafana-clock-panel

# Remove a problematic plugin
grafana-cli plugins remove <plugin-id>

# Install a specific (compatible) version
grafana-cli plugins install <plugin-id> <version>

# Check plugin compatibility:
# https://grafana.com/grafana/plugins/<plugin-id>/

# Disable a plugin without uninstalling it (temporary fix)
# Edit /etc/grafana/grafana.ini
[plugins]
disable_plugins = <plugin-id>

# Unsigned plugins must be explicitly allowed by ID (comma-separated)
allow_loading_unsigned_plugins = <plugin-id>

# Per-plugin settings live in [plugin.<plugin-id>] sections

# Restart Grafana after plugin or config changes
systemctl restart grafana-server
```

Clear browser and server cache:

```bash
# Clear browser cache
# Ctrl+Shift+Delete (Chrome/Firefox)
# Or hard refresh: Ctrl+F5

# Clear Grafana's in-memory caches by restarting the service
systemctl restart grafana-server

# Reinstall a plugin with a corrupted cache
rm -rf /var/lib/grafana/plugins/<plugin-id>
grafana-cli plugins install <plugin-id>

# Query caching is a Grafana Enterprise feature.
# Disable it to rule out stale cached query results
# Edit /etc/grafana/grafana.ini
[caching]
enabled = false

# Or keep it enabled and tune the TTL per datasource
# in the datasource's Cache tab in the UI
```

### 6. Fix reverse proxy timeout

NGINX configuration for Grafana:

```nginx
# /etc/nginx/sites-available/grafana

server {
    listen 80;
    server_name grafana.example.com;

    location / {
        proxy_pass http://localhost:3000;

        # Increase timeouts for Grafana queries
        proxy_connect_timeout 60s;
        proxy_send_timeout 120s;
        proxy_read_timeout 300s;  # 5 minutes for complex queries

        # Headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support (for live updates)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Buffer settings
        proxy_buffering off;  # For streaming responses
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
    }
}

# Test the configuration
nginx -t

# Reload NGINX
systemctl reload nginx
```
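A recurring misconfiguration is a proxy timeout shorter than the datasource query timeout, which turns slow-but-valid queries into 504s. The sketch below checks that ordering; `check_timeouts` is a helper invented for this guide, and the plain-seconds inputs stand in for values you would parse from your nginx config and datasource settings:

```shell
#!/bin/sh
# Sanity check: the proxy read timeout should exceed the datasource
# query timeout, or slow queries surface as 504 Gateway Timeout.
# check_timeouts is a hypothetical helper; arguments are seconds.
check_timeouts() {
  proxy_read="$1"
  query_timeout="$2"
  if [ "$proxy_read" -gt "$query_timeout" ]; then
    echo "OK: proxy_read_timeout (${proxy_read}s) > query timeout (${query_timeout}s)"
  else
    echo "BAD: proxy_read_timeout (${proxy_read}s) <= query timeout (${query_timeout}s)"
  fi
}

check_timeouts 300 60
check_timeouts 30 60
```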

HAProxy configuration:

```haproxy
# /etc/haproxy/haproxy.cfg

defaults
    mode http
    # Increase timeouts (connect/server apply to the backend side)
    timeout connect 10s
    timeout client 300s
    timeout server 300s

frontend grafana
    bind *:80
    default_backend grafana_servers

backend grafana_servers
    balance roundrobin

    # Health check
    option httpchk GET /api/health
    http-check expect status 200

    server grafana1 localhost:3000 check inter 5s fall 3 rise 2
```

### 7. Monitor Grafana health

Grafana health endpoint:

```bash
# Check Grafana health
curl http://localhost:3000/api/health

# Output:
# {
#   "commit": "abc123",
#   "database": "ok",
#   "version": "10.0.0"
# }

# If "database" shows "error":
# - Check database connectivity
# - Check disk space
# - Check database locks

# Check an individual datasource's health (by UID)
curl http://localhost:3000/api/datasources/uid/<uid>/health \
  -H "Authorization: Bearer <API_KEY>"

# Check Grafana's own Prometheus metrics
curl http://localhost:3000/metrics

# Prometheus metrics for Grafana itself:
# - grafana_http_request_duration_seconds
# - grafana_active_render_calls
# - grafana_datasource_request_total
```
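The health check can be scripted for cron or CI. In the sketch below, `health_json` is a canned sample standing in for a live `curl -s http://localhost:3000/api/health` response, and the sed-based extraction is a dependency-free stand-in for `jq -r .database`:

```shell
#!/bin/sh
# Fail fast when /api/health reports a database problem.
# health_json is a canned sample; in practice capture it with:
#   health_json=$(curl -s http://localhost:3000/api/health)
health_json='{"commit":"abc123","database":"ok","version":"10.0.0"}'

# Extract the "database" field without jq
db_status=$(printf '%s' "$health_json" | sed -n 's/.*"database":"\([^"]*\)".*/\1/p')

if [ "$db_status" = "ok" ]; then
  echo "healthy"
else
  echo "unhealthy: database=$db_status"
fi
```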

Set up monitoring alerts:

```yaml
# Prometheus alerting rules for Grafana

groups:
  - name: grafana-alerts
    rules:
      - alert: GrafanaDashboardLoadFailed
        expr: rate(grafana_dashboard_load_failed_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Grafana dashboard load failures"
          description: "{{ $value }} dashboard load failures per second"

      - alert: GrafanaDatasourceError
        expr: rate(grafana_datasource_request_error_total[5m]) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Grafana datasource errors"
          description: "High rate of datasource errors on {{ $labels.datasource }}"

      - alert: GrafanaHighQueryLatency
        expr: histogram_quantile(0.99, rate(grafana_http_request_duration_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Grafana query latency high"
          description: "p99 query latency is {{ $value }} seconds"
```

Prevention

  • Use PostgreSQL instead of SQLite for production deployments
  • Configure appropriate query timeouts based on datasource
  • Optimize PromQL/queries to reduce cardinality and time range
  • Set memory limits appropriate for dashboard complexity
  • Keep plugins updated and compatible with Grafana version
  • Configure reverse proxy timeouts longer than query timeouts
  • Implement dashboard load monitoring with alerting
  • Use recording rules for expensive Prometheus queries
  • Document dashboard optimization best practices
  • Test dashboard performance before adding to production

Common Error Messages

  • **Prometheus query timeout**: Query exceeded evaluation timeout
  • **InfluxDB connection refused**: Database unreachable
  • **Elasticsearch too many buckets**: Aggregation limit exceeded
  • **Database connection pool exhausted**: PostgreSQL max connections reached
  • **502 Bad Gateway**: Reverse proxy cannot reach Grafana