Nagios is showing "CRITICAL - Service check timed out" for various services, or you're getting intermittent timeout alerts that turn out to be false positives. Check timeouts can indicate real problems, but often they stem from configuration issues or resource constraints.
Understanding Check Timeouts
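Nagios service checks have configurable timeout limits. When a check plugin runs longer than the timeout, Nagios kills the process and reports a timeout error. Timeouts apply at two levels: the global service_check_timeout and host_check_timeout directives in nagios.cfg (60 and 30 seconds in the stock sample configuration), and each plugin's own -t option, which defaults to 10 seconds for the standard plugins.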
Error patterns:
```
CRITICAL - Service check timed out after 10.00 seconds
Connection timed out: 110
Host check timed out
Plugin timed out while executing system call
```
Initial Diagnosis
Start by checking the Nagios logs and identifying which checks are timing out:
```bash
# Check Nagios logs for timeout errors ("timeout" alone misses "timed out")
grep -iE "timeout|timed out" /var/log/nagios/nagios.log | tail -50

# Check for specific service timeouts
grep "CRITICAL - Service check timed out" /var/log/nagios/nagios.log | tail -20

# Follow the Nagios log live
tail -f /var/log/nagios/nagios.log

# Check the global timeout settings
grep -E "service_check_timeout|host_check_timeout" /etc/nagios/nagios.cfg

# Find plugin timeouts (-t) passed in command and service definitions
grep -nE -- "-t [0-9]+" /etc/nagios/objects/*.cfg
```
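The grep output lists individual timeouts; grouping them by host and service shows which checks fail most often. The following is a minimal sketch, assuming the usual SERVICE ALERT log layout (host;service;state;type;attempt;output). The function name count_timeouts_by_service is a hypothetical helper, not part of Nagios.

```shell
# Tally timed-out service checks per host/service from a Nagios log.
# Assumes lines of the form:
#   [epoch] SERVICE ALERT: host;service;STATE;TYPE;attempt;plugin output
count_timeouts_by_service() (
    log_file="${1:-/var/log/nagios/nagios.log}"
    awk -F';' '
        /SERVICE ALERT:/ && tolower($0) ~ /timed out/ {
            host = $1
            sub(/^.*SERVICE ALERT: /, "", host)   # drop timestamp and prefix
            print host "/" $2
        }
    ' "$log_file" | sort | uniq -c | sort -rn
)
```

Calling count_timeouts_by_service with no argument reads /var/log/nagios/nagios.log; the checks at the top of the output are the ones to investigate first.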
Common Cause 1: Insufficient Timeout Setting
The check legitimately needs more time than configured.
Error pattern:
```
CRITICAL - Service check timed out after 30.00 seconds
```
Diagnosis:
```bash
# Find the service definition
grep -A 10 "service_description.*Web Server" /etc/nagios/objects/services.cfg

# Check the global timeout that applies to the check
grep "service_check_timeout" /etc/nagios/nagios.cfg

# Run the check manually to measure execution time
time /usr/lib/nagios/plugins/check_http -H target-host -w 5 -c 10

# Check command definition
grep -A 5 "define command.*check_http" /etc/nagios/objects/commands.cfg
```
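A single timed run can be misleading, because one fast response says little about the slow ones that trigger the timeout. The sketch below runs a command several times and reports the slowest run in whole seconds. The helper name max_check_seconds is hypothetical, and whole-second resolution from date +%s is assumed to be good enough for sizing a timeout.

```shell
# Run a command N times and print the slowest wall-clock run in seconds.
# Usage: max_check_seconds RUNS COMMAND [ARGS...]
max_check_seconds() (
    runs="$1"
    shift
    max=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s)
        # Plugins exit non-zero on WARNING/CRITICAL; that is not an error here.
        "$@" >/dev/null 2>&1 || true
        end=$(date +%s)
        elapsed=$((end - start))
        if [ "$elapsed" -gt "$max" ]; then
            max=$elapsed
        fi
        i=$((i + 1))
    done
    echo "$max"
)
```

For example, max_check_seconds 10 /usr/lib/nagios/plugins/check_http -H target-host -w 5 -c 10 gives a worst case over ten runs, which is a better basis for the timeout than one measurement.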
Solution:
Increase timeout for the service:
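```bash
# Edit Nagios main configuration
# /etc/nagios/nagios.cfg
service_check_timeout=60
host_check_timeout=30

# Or set the timeout per check through the plugin's -t option.
# Nagios Core object definitions have no per-service check_timeout
# directive, so the per-check limit lives in the command line.
# /etc/nagios/objects/services.cfg
define service {
    use                     generic-service
    host_name               web-server
    service_description     HTTP
    check_command           check_http!5!10    ; -w 5 -c 10; -t comes from the check_http command definition
}
```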
Update command definition with timeout parameter:
```bash
# /etc/nagios/objects/commands.cfg
define command {
    command_name    check_http
    command_line    $USER1$/check_http -H $HOSTADDRESS$ -t 60 -w $ARG1$ -c $ARG2$
}
```
Restart Nagios:
```bash
# Verify configuration
nagios -v /etc/nagios/nagios.cfg

# Restart if valid
systemctl restart nagios
```
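One convention worth considering, which is an assumption here rather than something Nagios enforces, is to keep the plugin's -t value slightly below service_check_timeout. The plugin then exits with its own error message instead of being killed by Nagios. The sketch below reads the directive from a config file and compares it; cfg_value and plugin_timeout_fits are hypothetical helper names.

```shell
# Print the value of a key=value directive from a Nagios main config file.
# $1 = directive name, $2 = config file
cfg_value() (
    grep -E "^[[:space:]]*$1[[:space:]]*=" "$2" \
        | tail -n 1 | cut -d= -f2 | tr -d '[:space:]'
)

# Succeed only if the plugin timeout is strictly below service_check_timeout.
# $1 = plugin -t value in seconds, $2 = config file
plugin_timeout_fits() (
    limit=$(cfg_value service_check_timeout "$2")
    [ -n "$limit" ] && [ "$1" -lt "$limit" ]
)
```

With service_check_timeout=60, plugin_timeout_fits 55 /etc/nagios/nagios.cfg succeeds, while a plugin timeout of 60 fails the comparison.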
Common Cause 2: Network Latency Issues
Network delays cause checks to exceed timeout limits.
Error pattern:
```
Connection timed out after 10 seconds
```
Diagnosis:
```bash
# Test network connectivity
ping -c 5 target-host

# Measure actual response time
time curl -s http://target-host/

# Check for network issues
traceroute target-host
mtr target-host

# Test specific check with verbose output
/usr/lib/nagios/plugins/check_http -H target-host -v

# Check packet loss
ping -c 100 target-host | grep "packet loss"
```
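Packet loss and average round-trip time are the two numbers that matter most when sizing a timeout for a slow link. The following sketch pulls both out of ping output on stdin. It assumes the Linux iputils summary format ("N% packet loss" and "rtt min/avg/max/mdev = ..."); other ping implementations print different summaries, and ping_summary is a hypothetical helper name.

```shell
# Summarise ping output read from stdin as "loss=<pct> avg_ms=<avg>".
# Assumes the iputils summary lines, e.g.:
#   5 packets transmitted, 5 received, 0% packet loss, time 4005ms
#   rtt min/avg/max/mdev = 0.045/0.055/0.069/0.008 ms
ping_summary() {
    awk '
        /packet loss/ {
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i
        }
        /min\/avg\/max/ {
            split($0, halves, " = ")     # right half holds the numbers
            split(halves[2], rtt, "/")   # rtt[2] is the average
            avg = rtt[2]
        }
        END { printf "loss=%s avg_ms=%s\n", loss, avg }
    '
}
```

Typical use is ping -c 20 target-host | ping_summary, run at different times of day to see whether latency or loss lines up with the timeout alerts.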
Solution:
If network is legitimately slow:
```bash
# Increase timeout for slow networks
# /etc/nagios/objects/commands.cfg
define command {
    command_name    check_http_slow
    command_line    $USER1$/check_http -H $HOSTADDRESS$ -t 30 -w $ARG1$ -c $ARG2$
}

# Use this command for remote hosts
# (the 30-second limit comes from -t in the command line above;
# Nagios Core has no per-service check_timeout directive)
# /etc/nagios/objects/services.cfg
define service {
    use                     generic-service
    host_name               remote-site-server
    service_description     HTTP
    check_command           check_http_slow!10!20    ; -w 10 -c 20
}
```
Address underlying network issues:
```bash
# Check for routing problems
ip route show

# Check interface error and drop counters
# (/proc/net/dev labels the column "errs", so grepping for "error" finds nothing)
ip -s link show

# Monitor network during peak times
# Consider using Nagios distributed monitoring for remote sites
```
Common Cause 3: Plugin Performance Issues
The check plugin itself is slow or inefficient.
Error pattern:
```
Plugin timed out while executing
```
Diagnosis:
```bash
# Identify slow plugin
time /usr/lib/nagios/plugins/check_disk -w 10% -c 5%

# Check disk I/O during plugin execution
iostat -x 1 10

# Check plugin resource usage
/usr/lib/nagios/plugins/check_disk -w 10% -c 5% &
pid=$!
ps -p $pid -o pid,ppid,%cpu,%mem,etime,cmd
wait $pid

# For database checks, measure query time
time /usr/lib/nagios/plugins/check_mysql -H localhost -u nagios -p password
```
Solution:
Optimize plugin execution:
```bash
# Use more efficient check parameters
# Instead of checking all filesystems, check specific ones
/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p /var -p /home

# For database checks, use simpler queries
define command {
    command_name    check_mysql_query
    command_line    $USER1$/check_mysql_query -H $HOSTADDRESS$ -q "SELECT 1" -w 1 -c 2
}

# Use cached results where appropriate
# Some plugins support caching
```
Consider parallelizing checks:
```bash
# Use check_multi for parallel execution
define command {
    command_name    check_multi_http
    command_line    $USER1$/check_multi -f /etc/nagios/multi/http.cfg -t 60
}
```
Common Cause 4: Resource Constraints on Nagios Server
The Nagios server itself is overloaded.
Error pattern: Multiple services timing out simultaneously.
Diagnosis:
```bash
# Check Nagios process count (the [n] keeps grep from counting itself)
ps aux | grep -c '[n]agios'

# Check system load
top -bn1 | head -5

# Check Nagios process limits
grep -E "max_concurrent_checks|max_service_check_spread" /etc/nagios/nagios.cfg

# Monitor during checks
watch -n 1 "ps aux | grep -c '[n]agios'"

# Check memory usage
free -h

# Check I/O wait (the "wa" value on the CPU summary line)
top -bn1 | grep -i "cpu(s)"
```
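The load average from top is easier to judge relative to the number of CPUs. As a rough rule of thumb, and an assumption rather than a Nagios-specific threshold, a sustained ratio well above 1 suggests checks are queueing for CPU. The sketch below computes that ratio; load_per_cpu is a hypothetical helper name.

```shell
# Print load average divided by CPU count, to two decimal places.
# $1 = load average (for example the 1-minute value), $2 = number of CPUs
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}
```

On Linux this can be fed from the system directly with load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)".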
Solution:
Reduce check load or increase capacity:
```bash
# Cap concurrent checks if the server is overloaded
# (the default of 0 means no limit)
# /etc/nagios/nagios.cfg
max_concurrent_checks=50

# Increase time between checks
# /etc/nagios/objects/templates.cfg
define service {
    name                generic-service
    check_interval      10      ; Increase from 5 minutes
    retry_interval      2
}

# Disable unnecessary checks temporarily
# Comment out in services.cfg

# Or use check scheduling to spread load
# /etc/nagios/nagios.cfg
max_service_check_spread=30
service_inter_check_delay_method=s
```
Add resources to Nagios server:
```bash
# Check current resource limits
ulimit -a

# Increase limits if needed
# /etc/security/limits.conf
nagios soft nproc 1000
nagios hard nproc 2000
nagios soft nofile 8192

# Restart Nagios
systemctl restart nagios
```
Common Cause 5: Host Check Timeout
A host check (reachability test) that times out makes Nagios assume the host is down, which masks the real state of that host's services.
Error pattern:
```
Host check timed out, assuming host is down
```
Diagnosis:
```bash
# Check host check command
grep -E "check_host_alive|host_check_command" /etc/nagios/objects/*.cfg

# Test host check manually
time /usr/lib/nagios/plugins/check_ping -H target-host -w 100,10% -c 200,20%

# Check host timeout setting
grep "host_check_timeout" /etc/nagios/nagios.cfg
```
Solution:
Adjust host check settings:
```bash
# Increase host check timeout
# /etc/nagios/nagios.cfg
host_check_timeout=30

# Use faster host check command
# /etc/nagios/objects/commands.cfg
define command {
    command_name    check_host_alive_fast
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 200,20% -c 500,50% -p 1 -t 5
}

# Update host definitions
define host {
    use             generic-host
    host_name       remote-server
    address         10.0.0.5
    check_command   check_host_alive_fast
}
```
Common Cause 6: Slow Plugin Dependencies
Check plugins depend on slow external services.
Error pattern:
```
DNS resolution timed out
SNMP query timed out
```
Diagnosis:
```bash
# For DNS-based checks, test DNS resolution time
time nslookup target-host dns-server

# For SNMP checks, test directly
time snmpwalk -v2c -c community target-host system

# For NRPE checks, test connectivity
time /usr/lib/nagios/plugins/check_nrpe -H remote-host -c check_disk
```
Solution:
Fix dependency issues:
```bash
# For DNS issues, use faster DNS or bypass
# Use IP addresses instead of hostnames
/usr/lib/nagios/plugins/check_http -H 10.0.0.5

# For SNMP issues, reduce timeout or queries
# (-t is the timeout and -e the retry count; -r is check_snmp's regex option)
define command {
    command_name    check_snmp_fast
    command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ -t 5 -e 2
}

# For NRPE, increase NRPE timeout on remote host
# /etc/nagios/nrpe.cfg on remote host
command_timeout=60
```
Common Cause 7: Thread/Process Deadlocks
Nagios internal scheduling issues cause timeouts.
Error pattern:
```
Check execution blocked
```
Diagnosis:
```bash
# Check Nagios internal stats (the tool is nagiostats; -m is its MRTG mode and needs -d)
nagiostats -c /etc/nagios/nagios.cfg

# Check event execution (-E is needed for the | alternation)
grep -iE "execution|schedule" /var/log/nagios/nagios.log

# Monitor check scheduling
watch -n 5 'tail -20 /var/log/nagios/nagios.log | grep "Scheduled"'
```
Solution:
Restart and adjust scheduling:
```bash
# Restart Nagios to clear stuck checks
systemctl restart nagios

# Adjust scheduling parameters
# /etc/nagios/nagios.cfg
use_large_installation_tuning_mode=1
enable_flap_detection=0
```
Verification
After fixes, verify checks complete:
```bash
# Monitor Nagios logs (-E is needed for the | alternation)
tail -f /var/log/nagios/nagios.log | grep -iE "timeout|timed out|completed"

# Check specific service status via CLI
/usr/lib/nagios/plugins/check_http -H target-host

# Verify configuration
nagios -v /etc/nagios/nagios.cfg

# Check Nagios status page for green services
# Navigate to Nagios web interface

# Run check timing comparison
echo "Before: $(grep -ciE 'timeout|timed out' /var/log/nagios/nagios.log)"
sleep 300
echo "After: $(grep -ciE 'timeout|timed out' /var/log/nagios/nagios.log)"
```
Prevention
Monitor Nagios itself:
```bash
# Add self-monitoring checks
define service {
    use                     local-service
    host_name               nagios-server
    service_description     Nagios Process
    check_command           check_nagios!10!/var/log/nagios/nagios.log!/var/spool/nagios/status.dat
}

# Assumes a check_nagios_latency command is defined in commands.cfg
define service {
    use                     local-service
    host_name               nagios-server
    service_description     Nagios Latency
    check_command           check_nagios_latency!5!10
}
```
Create timeout trend tracking:
```bash
#!/bin/bash
# timeout-monitor.sh - run via cron
# Count log lines that mention a timeout ("timeout" alone misses "timed out")
TIMEOUT_COUNT=$(grep -ciE "timeout|timed out" /var/log/nagios/nagios.log)
echo "$(date): $TIMEOUT_COUNT timeouts" >> /var/log/nagios_timeout_trend.log

# Alert if timeouts spike
if [ "$TIMEOUT_COUNT" -gt 100 ]; then
    mail -s "Nagios timeout spike" admin@domain.com <<< "$TIMEOUT_COUNT timeouts logged"
fi
```
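A count over the whole log file only grows until the log rotates, so a fixed threshold eventually alerts on every run. One way around that is to count only recent entries. The sketch below assumes each log line starts with a bracketed epoch timestamp such as [1700000000], which is the usual nagios.log format; timeouts_since is a hypothetical helper name.

```shell
# Count timeout lines at or after a cutoff time.
# $1 = cutoff in epoch seconds, $2 = log file
# Assumes each line begins with "[<epoch seconds>]".
timeouts_since() {
    awk -v cutoff="$1" '
        tolower($0) ~ /timed out|timeout/ {
            ts = substr($1, 2, length($1) - 2)   # strip the surrounding [ ]
            if (ts + 0 >= cutoff + 0) n++
        }
        END { print n + 0 }
    ' "$2"
}
```

In a cron script, timeouts_since "$(( $(date +%s) - 3600 ))" /var/log/nagios/nagios.log would give the count for the last hour, which can then be compared against the alert threshold.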
Check timeouts usually stem from insufficient timeout settings, network delays, or server overload. Measure actual check execution times, adjust timeouts appropriately, and ensure your Nagios server has adequate resources.