Your application keeps crashing with no error message. Services restart unexpectedly. A long-running job terminates mid-execution. When processes die without clear errors, tracking down the cause requires systematic investigation.
## Understanding the Problem
Processes can be killed for many reasons: the OOM killer, segmentation faults, signal handling, resource limits, watchdogs, or explicit termination. Each leaves different traces.
### Typical Symptoms

```
Process 12345 terminated unexpectedly
Job 1 'python script.py' terminated by signal 9 (Killed)
Segmentation fault (core dumped)
Trace/breakpoint trap (core dumped)
Killed
```

You might notice:
- Services restarting frequently
- Long-running jobs dying mid-execution
- No useful error messages in application logs
- Process disappears from ps output without explanation
## Diagnosing the Issue

### Step 1: Check Kernel Logs for OOM Killer
The most common cause of unexpected kills is the OOM killer:
```bash
# Check for OOM killer activity
dmesg | grep -i "killed process"
dmesg | grep -i "out of memory"
dmesg | grep -i "oom"

# Check recent kernel messages
journalctl -k --since "1 hour ago" | grep -i -E "(killed|oom|memory)"

# Alternative log locations
grep -i "killed process" /var/log/syslog
grep -i "out of memory" /var/log/messages
```
OOM killer output looks like:

```
Out of memory: Killed process 1842 (java) total-vm:8388608kB, anon-rss:4194304kB
```
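These log lines follow a fixed format, so the interesting numbers can be pulled out with standard text tools. A small sketch, using the example line above as sample input (the `sed` patterns assume that dmesg format):

```shell
# Extract PID, process name, and resident memory from an OOM killer line.
# The sample line is the dmesg example shown above.
line="Out of memory: Killed process 1842 (java) total-vm:8388608kB, anon-rss:4194304kB"

pid=$(echo "$line" | sed -n 's/.*Killed process \([0-9]*\).*/\1/p')
name=$(echo "$line" | sed -n 's/.*(\([a-zA-Z0-9_.-]*\)).*/\1/p')
rss_kb=$(echo "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')

echo "PID=$pid name=$name resident=$((rss_kb / 1024))MB"
# -> PID=1842 name=java resident=4096MB
```

The same patterns work when fed from `dmesg | grep -i "killed process"` instead of a fixed string.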
### Step 2: Check for Segmentation Faults
```bash
# Look for segfaults in kernel logs
dmesg | grep -i segfault

# Example output:
# python[12345]: segfault at 0 ip 00007f8c4a2b3f91 sp 00007ffc3a8e9a80 error 4 in libpython3.8.so

# Check core dump settings
cat /proc/sys/kernel/core_pattern

# Find core dumps
find /var/crash -type f -mtime -1 2>/dev/null
ls -la /var/lib/systemd/coredump/
```
### Step 3: Check Process Exit Codes and Signals
```bash
# If running in shell, check last exit code
echo $?
# Exit codes of 128+N mean the process was killed by signal N
# (e.g. 137 = 128 + 9, killed by SIGKILL)

# Common signal numbers:
# 1  - SIGHUP  (hangup, terminal closed)
# 2  - SIGINT  (interrupt, Ctrl+C)
# 9  - SIGKILL (forced kill, cannot be caught)
# 11 - SIGSEGV (segmentation fault)
# 15 - SIGTERM (normal termination request)

# Check systemd service exit codes
systemctl status service-name

# Show service logs
journalctl -u service-name -n 100
```
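The 128+N convention is easy to verify: kill a child process with a known signal and decode its exit status. A minimal sketch:

```shell
# Start a child that kills itself with SIGKILL, then decode the status.
status=0
sh -c 'kill -KILL $$' || status=$?
echo "exit status: $status"
# -> exit status: 137

if [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    echo "terminated by signal $sig"
fi
# -> terminated by signal 9
```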
### Step 4: Check Resource Limits
```bash
# Check limits for a running process
cat /proc/$(pidof process-name)/limits

# Check shell limits
ulimit -a

# Key limits to check:
ulimit -c   # Core file size
ulimit -d   # Data segment size
ulimit -f   # File size
ulimit -n   # Open files
ulimit -s   # Stack size
ulimit -t   # CPU time
ulimit -v   # Virtual memory
```
### Step 5: Check for Watchdogs and Supervisors
```bash
# Check if systemd watchdog is configured
systemctl show service-name | grep Watchdog

# Check for process supervisors
ps aux | grep -E "(supervisord|monit|god|runit|s6)"

# Check cron for process monitoring
crontab -l
grep -r "process" /etc/cron.*
```
## Solutions by Cause

### Solution 1: OOM Killer Prevention
If the OOM killer is terminating your process:
```bash
# Check OOM score of your process
cat /proc/$(pidof your-app)/oom_score
cat /proc/$(pidof your-app)/oom_score_adj

# Protect the process (lower score = less likely killed)
# Range: -1000 (never kill) to 1000 (likely kill); writing requires root
echo -500 > /proc/$(pidof your-app)/oom_score_adj

# For systemd services, add to service file:
# [Service]
# OOMScoreAdjust=-500

# Or completely protect from OOM:
echo -1000 > /proc/$(pidof your-app)/oom_score_adj
```
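To see which processes the kernel would pick first, the scores in `/proc` can be ranked directly. A rough sketch (Linux-specific paths; the output varies by system):

```shell
# List the ten processes with the highest oom_score (most likely victims).
top10=$(
    for p in /proc/[0-9]*; do
        [ -r "$p/oom_score" ] || continue
        printf '%s pid=%s %s\n' "$(cat "$p/oom_score" 2>/dev/null)" \
            "${p#/proc/}" "$(tr '\0' ' ' < "$p/cmdline" 2>/dev/null | cut -c1-60)"
    done | sort -rn | head -10
)
echo "$top10"
```

Kernel threads show an empty command line; processes that exit mid-scan are silently skipped.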
Adjust system memory settings:
```bash
# Reduce swappiness
sysctl -w vm.swappiness=10

# Configure overcommit
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80

# Make persistent
cat >> /etc/sysctl.conf << EOF
vm.swappiness = 10
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
EOF
```
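After applying the settings, it is worth reading the live values back from `/proc/sys`, which is what `sysctl -n` does under the hood:

```shell
# Confirm the current VM settings directly from /proc.
for key in vm/swappiness vm/overcommit_memory vm/overcommit_ratio; do
    printf '%-20s %s\n' "${key#vm/}" "$(cat /proc/sys/$key)"
done
```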
### Solution 2: Fix Segmentation Faults
Segfaults usually indicate bugs in the application:
```bash
# Enable core dumps for debugging
ulimit -c unlimited

# Install debug symbols for the application
apt-get install package-dbgsym   # Debian/Ubuntu
debuginfo-install package        # RHEL/CentOS

# Run with gdb to catch the crash
gdb --args ./your-application
# Then type 'run' in gdb, and 'bt' after crash

# Analyze existing core dump
gdb /path/to/binary /path/to/core
# Then: bt full
```
Common segfault causes:
- Null pointer dereference
- Buffer overflow
- Stack overflow
- Use after free
- Accessing invalid memory
### Solution 3: Adjust Resource Limits
```bash
# For the current shell session
ulimit -n 65535       # Increase open files
ulimit -u 4096        # Increase user processes
ulimit -v unlimited   # Remove virtual memory limit

# For systemd services, edit the service file:
# [Service]
# LimitNOFILE=65535
# LimitNPROC=4096
# LimitSIGPENDING=4096

# Create override for existing service
systemctl edit service-name

# Add:
# [Service]
# LimitNOFILE=65535
# LimitNPROC=4096

# Apply
systemctl daemon-reload
systemctl restart service-name
```
For PAM limits (all users):
```bash
# Edit /etc/security/limits.conf
echo '* soft nofile 65535' >> /etc/security/limits.conf
echo '* hard nofile 65535' >> /etc/security/limits.conf
echo '* soft nproc 4096' >> /etc/security/limits.conf
echo '* hard nproc 4096' >> /etc/security/limits.conf

# User must log out and back in for changes to take effect
```
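Note that `ulimit` changes apply only to the current shell and its children, so a throwaway subshell is a safe place to check that a new value sticks without touching your session. A quick sketch:

```shell
# Lowering a soft limit is always permitted, so test in a subshell;
# the parent shell's limit is unaffected.
current=$(ulimit -n)
lowered=$( (ulimit -n 512; ulimit -n) )
echo "current soft limit: $current, subshell value: $lowered"
```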
### Solution 4: Handle Signals Properly
If your application doesn't handle signals gracefully:
```bash
# Test if application handles SIGTERM
kill -TERM $(pidof your-app)

# Common signals and their meanings:
# SIGHUP  (1)  - Reload config, reopen log files
# SIGINT  (2)  - Interrupt (Ctrl+C)
# SIGTERM (15) - Normal termination
# SIGKILL (9)  - Force kill (cannot be caught)

# For applications that die from SIGHUP when the terminal closes
nohup ./your-app &

# Or use screen/tmux
tmux new-session -d -s app './your-app'
tmux attach -t app
```
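In shell scripts themselves, graceful termination comes down to installing a `trap`. A self-contained sketch (the `/tmp` paths are illustrative) that starts a worker, terminates it, and shows the cleanup running:

```shell
# Write a small worker that traps SIGTERM and cleans up before exiting.
cat > /tmp/graceful-demo.sh << 'EOF'
#!/bin/sh
trap 'echo "caught SIGTERM, cleaning up" > /tmp/graceful-demo.log; exit 0' TERM
while :; do sleep 1; done
EOF
chmod +x /tmp/graceful-demo.sh

/tmp/graceful-demo.sh &   # start the worker in the background
worker=$!
sleep 1                   # give it time to install the trap
kill -TERM "$worker"
wait "$worker"
status=$?
echo "worker exit status: $status"
cat /tmp/graceful-demo.log
```

Because the trap calls `exit 0`, the worker reports a clean exit instead of dying with status 143 (128 + 15).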
### Solution 5: Fix Systemd Watchdog
If systemd watchdog is killing your process:
```bash
# Check watchdog settings
systemctl show your-service | grep Watchdog

# Disable or extend watchdog timeout
systemctl edit your-service

# Add (unit files only allow comments on their own lines):
# [Service]
# Disable the watchdog:
# WatchdogSec=0
# ...or extend the timeout to 5 minutes instead:
# WatchdogSec=300

# Restart service
systemctl daemon-reload
systemctl restart your-service
```
### Solution 6: Fix Timeout Issues
Services killed by timeout:
```bash
# Check service timeouts
systemctl show your-service -p TimeoutStartUSec
systemctl show your-service -p TimeoutStopUSec

# Extend timeouts
systemctl edit your-service

# Add:
# [Service]
# TimeoutStartSec=300   (5 minutes to start)
# TimeoutStopSec=60     (1 minute to stop)

# Apply
systemctl daemon-reload
systemctl restart your-service
```
## Monitoring and Alerting
Set up monitoring to catch issues early:
```bash
# Monitor OOM killer events; remember the last event seen so the same
# kill is not re-alerted every minute
cat > /usr/local/bin/oom-monitor.sh << 'EOF'
#!/bin/bash
last_seen=""
while true; do
    latest=$(dmesg | grep -i "killed process" | tail -1)
    if [ -n "$latest" ] && [ "$latest" != "$last_seen" ]; then
        logger "OOM Killer Alert: $latest"
        # Send alert (adjust for your notification system)
        echo "OOM event detected: $latest" | mail -s "OOM Alert" admin@example.com
        last_seen="$latest"
    fi
    sleep 60
done
EOF
chmod +x /usr/local/bin/oom-monitor.sh

# Run as a systemd service or in background
```
## Verification
After implementing fixes, verify process stability:
```bash
# Monitor process memory
watch -n 1 "ps aux | grep your-app"

# Track process over time
pidstat -p $(pidof your-app) 1 10

# Monitor with detailed stats
pidstat -p $(pidof your-app) -r -u -d 1

# Check for recent kills
dmesg | grep -i killed | tail -10

# Verify limits
cat /proc/$(pidof your-app)/limits

# Check OOM score
cat /proc/$(pidof your-app)/oom_score_adj
```
## Prevention Best Practices
- Monitor memory usage trends with tools like Prometheus, Grafana, or Nagios
- Set up swap space to provide buffer before OOM
- Use cgroups or containers to limit and isolate resource usage
- Implement proper signal handling in applications
- Configure appropriate timeouts for long-running operations
- Use process supervisors (systemd, supervisord) for automatic restarts
- Enable core dumps for debugging unexpected crashes
- Test applications under memory pressure to identify failure modes