Your Datadog dashboard shows gaps in metrics, hosts are marked as offline, or you're getting alerts that the Datadog agent has stopped reporting. This means you're losing visibility into your infrastructure. Let's diagnose and fix the agent reporting issues.
Understanding the Problem
Datadog agent failures typically manifest as:
- Missing metrics in dashboards
- Hosts showing "Offline" in infrastructure list
- No recent data in logs or traces
- Agent status check failures
Error patterns in agent logs:
```
Failed to send payloads: API key is invalid
Connection refused: https://api.datadoghq.com
Error posting payload: HTTP 403 Forbidden
Agent is inactive or disabled
```
Initial Diagnosis
Check the Datadog agent status and logs:
```bash
# Check agent status
sudo datadog-agent status

# For older versions
sudo /etc/init.d/datadog-agent status

# Check agent logs (-E is needed for "|" to act as alternation)
sudo tail -n 100 /var/log/datadog/agent.log | grep -iE "error|fail|warn"

# Check specific component logs (Agent v5; Agent v6+ writes to agent.log)
sudo tail -n 50 /var/log/datadog/collector.log
sudo tail -n 50 /var/log/datadog/forwarder.log

# Check agent info (Agent v5; on v6+ use "status" above)
sudo service datadog-agent info

# For Windows, check via PowerShell:
# Get-Service datadogagent
```
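When the log contains many errors, it can help to bucket them by likely cause before reading further. The sketch below is a hypothetical helper (the function name and category labels are not part of the Datadog CLI) that matches a log line against the example error patterns listed earlier. Real log wording varies between agent versions, so treat the patterns as a starting point.

```bash
# Hypothetical helper: map one agent log line to the likely cause
# discussed in this guide. Patterns come from the example messages above.
classify_agent_error() {
  # $1: a single log line
  if printf '%s\n' "$1" | grep -qiE "api key|403"; then
    echo "api_key"
  elif printf '%s\n' "$1" | grep -qiE "connection refused|failed to connect|timeout"; then
    echo "network"
  elif printf '%s\n' "$1" | grep -qiE "queue is full|dropping payloads"; then
    echo "forwarder_queue"
  elif printf '%s\n' "$1" | grep -qiE "agent is (inactive|disabled|not running)"; then
    echo "agent_stopped"
  else
    echo "unknown"
  fi
}
```

To summarize a log, pipe recent lines through the function and count the labels, for example `sudo tail -n 200 /var/log/datadog/agent.log | while IFS= read -r l; do classify_agent_error "$l"; done | sort | uniq -c`.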
Common Cause 1: Invalid or Missing API Key
The agent cannot authenticate with Datadog due to API key issues.
Error pattern:
```
API key is invalid or missing
HTTP 403: Forbidden - invalid API key
```
Diagnosis:
```bash
# Check current API key configuration
sudo grep api_key /etc/datadog-agent/datadog.yaml

# Confirm the agent sees a configured key
sudo datadog-agent status | grep -i "api"

# Test the API key manually (the timestamp must be recent, so use the current time)
curl -X POST "https://api.datadoghq.com/api/v1/series" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: your-api-key" \
  -d "{\"series\":[{\"metric\":\"test.metric\",\"points\":[[$(date +%s),1]]}]}"

# Check for an empty or placeholder key
sudo grep "api_key:" /etc/datadog-agent/datadog.yaml
```
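The placeholder check can be automated without printing the key to the terminal. The following is a minimal sketch, assuming API keys use the usual 32-character hexadecimal format; the function name is hypothetical. It only checks the shape of the value and does not prove that Datadog accepts the key.

```bash
# Hypothetical helper: classify the api_key entry of a datadog.yaml-style
# file as missing, malformed, or well_formed without echoing the secret.
# Assumption: a real API key is 32 hexadecimal characters.
check_api_key_format() {
  # $1: path to the config file
  key=$(sed -n 's/^api_key:[[:space:]]*//p' "$1" | head -n 1 | tr -d "\"'[:space:]")
  if [ -z "$key" ]; then
    echo "missing"
  elif printf '%s' "$key" | grep -qE '^[0-9a-fA-F]{32}$'; then
    echo "well_formed"
  else
    echo "malformed"
  fi
}
```

For example, `sudo bash -c "$(declare -f check_api_key_format); check_api_key_format /etc/datadog-agent/datadog.yaml"` runs it against the live config. A `malformed` result usually means a leftover placeholder or a truncated paste.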
Solution:
Update API key configuration:
```bash
# Get the correct API key from the Datadog UI:
# Organization Settings > API Keys

# Update the configuration file
sudo vi /etc/datadog-agent/datadog.yaml
```
```yaml
# /etc/datadog-agent/datadog.yaml
# Set the correct API key
api_key: your_valid_api_key_here

# Application key (optional, only needed by some features)
app_key: your_application_key_here
```
```bash
# Restart agent
sudo systemctl restart datadog-agent
# Or for older versions
sudo /etc/init.d/datadog-agent restart
```
Verify the fix:
```bash
# Check agent status
sudo datadog-agent status

# Run the agent's built-in connectivity diagnostics
sudo datadog-agent diagnose
```
Common Cause 2: Network Connectivity Issues
Agent cannot reach Datadog endpoints due to network problems.
Error pattern:
```
Failed to connect to api.datadoghq.com
Network timeout: unable to reach intake
```
Diagnosis:
```bash
# Test connectivity to Datadog endpoints (US1 hostnames shown)
curl -v https://api.datadoghq.com/api/v1/validate
curl -v https://trace.agent.datadoghq.com
curl -v https://agent-http-intake.logs.datadoghq.com

# Check DNS resolution
nslookup api.datadoghq.com
dig api.datadoghq.com

# Test through the proxy, if one is configured
curl -x http://proxy:port https://api.datadoghq.com/api/v1/validate

# Check firewall rules
sudo iptables -L -n | grep -E "443|8443"
sudo firewall-cmd --list-all

# Test from inside the container if running in Docker
docker exec datadog-agent curl https://api.datadoghq.com/api/v1/validate
```
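The exit code of each `curl` test already indicates which network layer failed. As a rough sketch, the hypothetical helper below maps curl's documented exit statuses to the categories used in this section; the function name and labels are illustrative only.

```bash
# Hypothetical helper: translate a curl exit status into the network
# layer that most likely failed.
explain_curl_exit() {
  case "$1" in
    0)     echo "ok" ;;
    5)     echo "proxy_dns" ;;   # could not resolve the proxy
    6)     echo "dns" ;;         # could not resolve the host
    7)     echo "connect" ;;     # connection refused or blocked
    28)    echo "timeout" ;;     # operation timed out
    35|60) echo "tls" ;;         # TLS handshake or certificate problem
    *)     echo "other" ;;
  esac
}
```

A typical use is `curl -s -o /dev/null https://api.datadoghq.com/api/v1/validate; explain_curl_exit $?`. A `dns` result points at resolver configuration, `connect` or `timeout` at firewall rules, and `tls` often at an intercepting proxy.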
Solution:
Fix network connectivity:
```bash
# Allow outbound HTTPS to Datadog endpoints
# Required ports and hosts:
#   api.datadoghq.com:443 (US)
#   api.datadoghq.eu:443 (EU)

# For iptables (the hostname is resolved once, when the rule is added)
sudo iptables -A OUTPUT -d api.datadoghq.com -p tcp --dport 443 -j ACCEPT

# For firewalld (rich rules take an IP address or CIDR, not a hostname;
# replace DATADOG_CIDR with a range from https://ip-ranges.datadoghq.com)
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" destination address="DATADOG_CIDR" port protocol="tcp" port="443" accept'
sudo firewall-cmd --reload
```
```yaml
# If a proxy is required
# /etc/datadog-agent/datadog.yaml
proxy:
  https: http://proxy-server:port
  http: http://proxy-server:port
  no_proxy:
    - localhost
    - 127.0.0.1
```
```bash
# For container environments, ensure the network mode allows outbound traffic
docker run --network host -e DD_API_KEY=your-api-key datadog/agent:latest
```
Common Cause 3: Datadog Site Configuration Mismatch
The agent is configured for the wrong Datadog site (for example, US instead of EU).
Error pattern:
```
Invalid API key for site configuration
```
Diagnosis:
```bash
# Check current site configuration
sudo grep site /etc/datadog-agent/datadog.yaml

# Check agent status for site
sudo datadog-agent status | grep -i "site"

# Verify the API key matches the site
# A key is valid only on the site where it was created; other sites return 403
curl -X GET "https://api.datadoghq.com/api/v1/validate" -H "DD-API-KEY: your-key"
curl -X GET "https://api.datadoghq.eu/api/v1/validate" -H "DD-API-KEY: your-key"
```
Solution:
Configure correct site:
```yaml
# /etc/datadog-agent/datadog.yaml
# Set exactly one site/api_key pair.

# For US site (default)
site: datadoghq.com
api_key: us-api-key

# For EU site
# site: datadoghq.eu
# api_key: eu-api-key

# For US3 site
# site: us3.datadoghq.com
# api_key: us3-api-key

# For US5 site
# site: us5.datadoghq.com
# api_key: us5-api-key
```
```bash
# Restart agent after change
sudo systemctl restart datadog-agent
```
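If it is unclear which site a key belongs to, the validation request can be repeated against each site. The sketch below assumes the API host is `api.` followed by the `site` value, as in the `api.datadoghq.com` and `api.datadoghq.eu` URLs above. Both function names are hypothetical, and the site list is limited to the four sites mentioned in this section.

```bash
# Hypothetical helper: build the key-validation URL for a "site" value.
# Assumption: the API host is "api.<site>".
dd_validate_url() {
  site="${1:-datadoghq.com}"
  echo "https://api.${site}/api/v1/validate"
}

# Hypothetical helper: print the HTTP status each site returns for a key.
# 200 means the key is valid there; 403 means it is not.
dd_probe_sites() {
  # $1: API key to test
  for s in datadoghq.com datadoghq.eu us3.datadoghq.com us5.datadoghq.com; do
    code=$(curl -s -o /dev/null -w '%{http_code}' -H "DD-API-KEY: $1" "$(dd_validate_url "$s")")
    echo "$s $code"
  done
}
```

Running `dd_probe_sites your-key` prints one line per site. The site that answers 200 is the value to put in `site:`.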
Common Cause 4: Agent Process Not Running
The agent daemon itself is stopped or crashed.
Error pattern:
```
Agent is not running
```
Diagnosis:
```bash
# Check agent process
ps aux | grep datadog-agent
systemctl status datadog-agent

# Check for crash indicators
sudo journalctl -u datadog-agent --since "1 hour ago" | grep -iE "crash|exit|fatal"

# Check agent logs for startup errors
sudo head -n 100 /var/log/datadog/agent.log

# For Windows (PowerShell):
# Get-Process -Name "Datadog Agent" -ErrorAction SilentlyContinue
# Get-Service datadogagent
```
Solution:
Restart and fix agent process:
```bash
# Start the agent and enable it at boot
sudo systemctl start datadog-agent
sudo systemctl enable datadog-agent

# If the agent fails to start, check the loaded configuration
sudo datadog-agent configcheck

# Run a single check to surface configuration errors
# (the check command requires a check name, e.g. cpu)
sudo datadog-agent check cpu

# Look for specific startup errors
sudo grep -iE "error|fatal" /var/log/datadog/agent.log | head -n 20

# For Docker, restart container
docker restart datadog-agent

# Check container logs
docker logs datadog-agent --tail 100
```
Common Cause 5: Agent Forwarder Queue Overflow
Forwarder queue is full, preventing metric submission.
Error pattern:
```
Forwarder queue is full, dropping payloads
```
Diagnosis:
```bash
# Check forwarder status
sudo datadog-agent status | grep -A 20 "Forwarder"

# Check queue metrics
sudo datadog-agent status | grep -iE "queue|payload"

# Check forwarder logs (Agent v5; on v6+ search agent.log instead)
sudo tail -n 50 /var/log/datadog/forwarder.log | grep -iE "queue|drop"

# Inspect forwarder stats as JSON (the layout varies by agent version,
# so search the output rather than hard-coding a jq path)
sudo datadog-agent status --json | jq . | grep -i -A 5 "forwarder"
```
Solution:
Clear queue and optimize forwarder:
```yaml
# /etc/datadog-agent/datadog.yaml
# Increase the forwarder retry queue size
# (newer Agent 7 releases size the queue in bytes via
# forwarder_retry_queue_payloads_max_size instead)
forwarder_timeout: 20
forwarder_retry_queue_max_size: 100
```
```yaml
# Reduce collection frequency temporarily, per check
# /etc/datadog-agent/conf.d/<check>.d/conf.yaml
instances:
  - min_collection_interval: 30  # Default is 15 seconds
```
```bash
# Restart agent (this also clears the in-memory queue, for immediate relief)
sudo systemctl restart datadog-agent
```
Common Cause 6: Disabled Checks or Integrations
Individual checks are disabled or failing.
Error pattern:
```
Check disabled: cpu
Integration check failed
```
Diagnosis:
```bash
# List all checks
sudo datadog-agent status | grep -i "check"

# Check specific integration status
sudo datadog-agent check nginx
sudo datadog-agent check postgres

# List enabled/disabled checks
sudo datadog-agent configcheck | grep -E "enabled|disabled"

# Check for check errors (Agent v5 logs these to collector.log instead)
sudo grep -iE "check.*error|check.*fail" /var/log/datadog/agent.log
```
Solution:
Enable and fix checks:
```yaml
# Enable specific check
# /etc/datadog-agent/conf.d/cpu.d/conf.yaml
init_config:

instances:
  - collect_cpu_time: true
```
```yaml
# For integration checks, ensure proper configuration
# /etc/datadog-agent/conf.d/nginx.d/conf.yaml
init_config:

instances:
  - nginx_status_url: http://localhost/nginx_status
```
```bash
# Run check manually to verify
sudo datadog-agent check nginx -l debug

# Restart agent
sudo systemctl restart datadog-agent
```
Common Cause 7: Resource Constraints
Agent is starved for CPU or memory.
Error pattern:
```
Agent process killed by OOM
```
Diagnosis:
```bash
# Check agent resource usage
ps aux | grep datadog-agent
top -p $(pgrep -d',' -f datadog-agent)

# Check memory usage (cgroup v1 path; on cgroup v2 read
# /sys/fs/cgroup/system.slice/datadog-agent.service/memory.current)
sudo cat /sys/fs/cgroup/memory/system.slice/datadog-agent.service/memory.usage_in_bytes

# Look for OOM events (the kernel logs the short process name, which may
# not contain "datadog", so review all matches)
sudo dmesg | grep -iE "oom|killed process"

# Check agent memory metrics
sudo datadog-agent status | grep -i "memory"
```
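To pull just the affected PIDs out of the kernel log, the OOM lines can be parsed directly. This is a minimal sketch, assuming the kernel uses the common `Killed process <pid> (<name>)` wording and that the agent's short process name is `agent`. Both are assumptions to verify on your hosts, and the function name is hypothetical.

```bash
# Hypothetical helper: read kernel log text on stdin and print the PID of
# each process killed by the OOM killer whose short name equals $1.
# Assumption: lines contain "Killed process <pid> (<name>)".
oom_killed_pids() {
  sed -n "s/.*[Kk]illed process \([0-9][0-9]*\) ($1).*/\1/p"
}
```

For example, `sudo dmesg | oom_killed_pids agent` prints one PID per kill. An empty result with a non-empty `dmesg | grep -i oom` suggests the victim was some other process.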
Solution:
Adjust resource limits:
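```bash
# Increase memory limit for agent with a systemd override
sudo systemctl edit datadog-agent
```
```
# Add in the editor that opens
# (on cgroup v2 systems, MemoryMax= replaces the deprecated MemoryLimit=)
[Service]
MemoryLimit=512M
CPUQuota=50%
```
```yaml
# Or in agent config (confirm this key against the config reference for
# your agent version before relying on it)
# /etc/datadog-agent/datadog.yaml
process_config:
  memory:
    enabled: true
    limit: 512MB
```
```bash
# Reduce number of checks if memory constrained:
# disable unnecessary integrations (renaming keeps the change reversible)
sudo mv /etc/datadog-agent/conf.d/unneeded.d/conf.yaml /etc/datadog-agent/conf.d/unneeded.d/conf.yaml.disabled

# Restart agent
sudo systemctl restart datadog-agent
```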
Common Cause 8: Host Name or Tag Issues
Duplicate hostnames or missing tags cause data aggregation problems.
Error pattern:
```
Hostname already exists in Datadog
```
Diagnosis:
```bash
# Check hostname configuration
sudo grep hostname /etc/datadog-agent/datadog.yaml

# Check detected hostname
sudo datadog-agent status | grep -i "hostname"

# Check tags configuration
sudo grep tags /etc/datadog-agent/datadog.yaml

# Verify in Datadog UI
# Infrastructure List - check for duplicate hosts
```
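Duplicates can also be spotted from the command line once the reported hostnames are gathered in one place. The sketch below is a hypothetical helper that reads one hostname per line and prints every name that occurs more than once. How the list is collected (SSH loop, inventory export) is left open.

```bash
# Hypothetical helper: read one hostname per line on stdin and print each
# name that appears more than once. Blank lines are ignored.
duplicate_hostnames() {
  grep -v '^[[:space:]]*$' | sort | uniq -d
}
```

One possible use, assuming a hypothetical `hosts.txt` inventory file: `for h in $(cat hosts.txt); do ssh "$h" 'sudo datadog-agent hostname'; done | duplicate_hostnames`. Any output names a hostname that more than one agent is reporting under.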
Solution:
Set unique hostname and tags:
```yaml
# /etc/datadog-agent/datadog.yaml
# Set explicit hostname
hostname: unique-server-name

# Or have the agent use the fully qualified domain name
hostname_fqdn: true

# Add proper tags
tags:
  - env:production
  - region:us-east
  - service:web-api
```
```bash
# Restart agent
sudo systemctl restart datadog-agent
```
Verification
After fixing, verify agent is reporting:
```bash
# Check agent status
sudo datadog-agent status

# Verify metric submission
sudo datadog-agent check cpu
sudo datadog-agent check disk

# Send a test gauge through DogStatsD (assumes the default UDP port 8125)
echo -n "test.metric:1|g" > /dev/udp/127.0.0.1/8125

# Check Datadog UI for host
# Infrastructure > Infrastructure List
# Should show host as Online

# Verify live metrics
# Metrics > Explorer > search for system.cpu.user

# Collect comprehensive diagnostics (a flare uploads them to Datadog support)
sudo datadog-agent flare
```
Prevention
Monitor the Datadog agent itself:
```yaml
# Create Datadog monitor for agent status
# Monitor type: Metric
# Metric: datadog.agent.running
# Alert when: value < 1 for 5 minutes

# Or create process monitor
# Monitor type: Process
# Process name: datadog-agent
# Alert when: process not running
```
Set up agent health dashboard:
```yaml
# Create dashboard tracking agent health
# Metrics to include:
# - datadog.agent.running
# - datadog.agent.metrics_collected
# - datadog.agent.forwarder.queue_length
# - datadog.agent.http.latency
```
Regular agent health check script:
```bash
#!/bin/bash
# API_KEY must be set in the environment before running this script.

# Capture agent status (stdout and stderr)
STATUS=$(sudo datadog-agent status 2>&1)

# Check the captured output for errors
# (grep needs the text on stdin; "$STATUS" is not a file name)
if grep -q "Error" <<< "$STATUS"; then
  echo "Agent errors detected"
  echo "$STATUS" | mail -s "Datadog Agent Errors" admin@domain.com
fi

# Verify connectivity and key validity
if ! curl -s https://api.datadoghq.com/api/v1/validate -H "DD-API-KEY: $API_KEY" | grep -q "valid"; then
  echo "API connectivity failed"
fi
```
Agent reporting issues usually stem from authentication, connectivity, or configuration problems. Start with an agent status check, verify the API key and network connectivity, then investigate specific check failures.