Your application keeps crashing with no error message. Services restart unexpectedly. A long-running job terminates mid-execution. When processes die without clear errors, tracking down the cause requires systematic investigation.

## Understanding the Problem

Processes can be killed for many reasons: the OOM killer, segmentation faults, unhandled signals, resource limits, watchdogs, or explicit termination. Each leaves different traces.

### Typical Symptoms

```bash
Process 12345 terminated unexpectedly
Job 1 'python script.py' terminated by signal 9 (Killed)
Segmentation fault (core dumped)
Trace/breakpoint trap (core dumped)
Killed
```

You might notice:

- Services restarting frequently
- Long-running jobs dying mid-execution
- No useful error messages in application logs
- Process disappears from `ps` output without explanation

## Diagnosing the Issue

### Step 1: Check Kernel Logs for OOM Killer

The most common cause of unexpected kills is the OOM killer:

```bash
# Check for OOM killer activity
dmesg | grep -i "killed process"
dmesg | grep -i "out of memory"
dmesg | grep -i "oom"

# Check recent kernel messages
journalctl -k --since "1 hour ago" | grep -i -E "(killed|oom|memory)"

# Alternative log locations
grep -i "killed process" /var/log/syslog
grep -i "out of memory" /var/log/messages
```

OOM killer output looks like:

```
Out of memory: Killed process 1842 (java) total-vm:8388608kB, anon-rss:4194304kB
```

### Step 2: Check for Segmentation Faults

```bash
# Look for segfaults in kernel logs
dmesg | grep -i segfault

# Example output:
# python[12345]: segfault at 0 ip 00007f8c4a2b3f91 sp 00007ffc3a8e9a80 error 4 in libpython3.8.so

# Check core dump settings
cat /proc/sys/kernel/core_pattern

# Find core dumps
find /var/crash -type f -mtime -1 2>/dev/null
ls -la /var/lib/systemd/coredump/
```

### Step 3: Check Process Exit Codes and Signals

```bash
# If running in a shell, check the last exit code
echo $?

# An exit code above 128 means the process died from a signal:
# exit code = 128 + signal number (e.g. 137 = killed by SIGKILL)
# Common signal numbers:
# 1  - SIGHUP  (hangup, terminal closed)
# 2  - SIGINT  (interrupt, Ctrl+C)
# 9  - SIGKILL (forced kill, cannot be caught)
# 11 - SIGSEGV (segmentation fault)
# 15 - SIGTERM (normal termination request)

# Check systemd service exit codes
systemctl status service-name

# Show service logs
journalctl -u service-name -n 100
```
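The 128 + N convention is easy to verify directly in bash. This sketch kills a disposable `sleep` with SIGKILL and reads back the status:

```bash
# Start a disposable background process, kill it with SIGKILL,
# then inspect the exit status the shell reports
sleep 60 &
pid=$!
kill -9 "$pid"
wait "$pid"
status=$?
echo "exit status: $status"   # 137 = 128 + 9 (SIGKILL)

# bash's kill -l maps an exit status back to the signal name
kill -l "$status"             # prints KILL
```

The same trick works on any status a service reports: `kill -l 139` names SIGSEGV, `kill -l 143` names SIGTERM.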

### Step 4: Check Resource Limits

```bash
# Check limits for a running process
cat /proc/$(pidof process-name)/limits

# Check shell limits
ulimit -a

# Key limits to check:
ulimit -c   # Core file size
ulimit -d   # Data segment size
ulimit -f   # File size
ulimit -n   # Open files
ulimit -s   # Stack size
ulimit -t   # CPU time
ulimit -v   # Virtual memory
```

### Step 5: Check for Watchdogs and Supervisors

```bash
# Check if a systemd watchdog is configured
systemctl show service-name | grep Watchdog

# Check for process supervisors
ps aux | grep -E "(supervisord|monit|god|runit|s6)"

# Check cron for process monitoring
crontab -l
grep -r "process" /etc/cron.*
```

## Solutions by Cause

### Solution 1: OOM Killer Prevention

If the OOM killer is terminating your process:

```bash
# Check the OOM score of your process
cat /proc/$(pidof your-app)/oom_score
cat /proc/$(pidof your-app)/oom_score_adj

# Protect the process (lower score = less likely to be killed)
# Range: -1000 (never kill) to 1000 (likely kill)
echo -500 > /proc/$(pidof your-app)/oom_score_adj

# For systemd services, add to the service file:
# [Service]
# OOMScoreAdjust=-500

# Or exempt the process from the OOM killer entirely:
echo -1000 > /proc/$(pidof your-app)/oom_score_adj
```

Adjust system memory settings (note: `vm.overcommit_memory=2` disables overcommit entirely, so allocations beyond the commit limit fail immediately instead of triggering the OOM killer later; test under realistic load before enabling it in production):

```bash
# Reduce swappiness
sysctl -w vm.swappiness=10

# Configure overcommit
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80

# Make persistent
cat >> /etc/sysctl.conf << EOF
vm.swappiness = 10
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
EOF
```

### Solution 2: Fix Segmentation Faults

Segfaults usually indicate bugs in the application:

```bash
# Enable core dumps for debugging
ulimit -c unlimited

# Install debug symbols for the application
apt-get install package-dbgsym    # Debian/Ubuntu
debuginfo-install package         # RHEL/CentOS

# Run under gdb to catch the crash
gdb --args ./your-application
# Then type 'run' in gdb, and 'bt' after the crash

# Analyze an existing core dump
gdb /path/to/binary /path/to/core
# Then: bt full
```

Common segfault causes:

- Null pointer dereference
- Buffer overflow
- Stack overflow
- Use after free
- Accessing invalid memory

### Solution 3: Adjust Resource Limits

```bash
# For the current shell session
ulimit -n 65535       # Increase open files
ulimit -u 4096        # Increase user processes
ulimit -v unlimited   # Remove virtual memory limit

# For systemd services, edit the service file:
# [Service]
# LimitNOFILE=65535
# LimitNPROC=4096
# LimitSIGPENDING=4096

# Create an override for an existing service
systemctl edit service-name

# Add:
# [Service]
# LimitNOFILE=65535
# LimitNPROC=4096

# Apply
systemctl daemon-reload
systemctl restart service-name
```

For PAM limits (all users):

```bash
# Edit /etc/security/limits.conf
echo '* soft nofile 65535' >> /etc/security/limits.conf
echo '* hard nofile 65535' >> /etc/security/limits.conf
echo '* soft nproc 4096' >> /etc/security/limits.conf
echo '* hard nproc 4096' >> /etc/security/limits.conf

# Users must log out and back in for changes to take effect
```

### Solution 4: Handle Signals Properly

If your application doesn't handle signals gracefully:

```bash
# Test whether the application handles SIGTERM
kill -TERM $(pidof your-app)

# Common signals and their meanings:
# SIGHUP  (1)  - Reload config, reopen log files
# SIGINT  (2)  - Interrupt (Ctrl+C)
# SIGTERM (15) - Normal termination
# SIGKILL (9)  - Force kill (cannot be caught)

# For applications that exit on SIGHUP when the terminal closes
nohup ./your-app &

# Or use screen/tmux
tmux new-session -d -s app './your-app'
tmux attach -t app
```
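As a sketch of what graceful handling looks like, a shell wrapper can trap SIGTERM itself, run teardown, and exit cleanly. The `cleanup` function here is an illustrative placeholder for real teardown logic (flushing logs, closing connections, removing pidfiles):

```bash
# Minimal sketch of graceful SIGTERM handling in a shell wrapper
cat > /tmp/graceful-app.sh << 'EOF'
#!/bin/bash
cleanup() {
    echo "caught SIGTERM, cleaning up"
    exit 143            # conventional status for SIGTERM: 128 + 15
}
trap cleanup TERM

while true; do
    sleep 1 &
    wait $!             # wait on a child so the trap fires promptly
done
EOF
chmod +x /tmp/graceful-app.sh

# Exercise it: start, send SIGTERM, confirm a clean exit
/tmp/graceful-app.sh &
pid=$!
sleep 1
kill -TERM "$pid"
wait "$pid"
status=$?
echo "wrapper exited with status $status"
```

Waiting on a child inside the loop matters: bash defers traps while a foreground external command runs, but `wait` is interruptible, so the handler runs as soon as the signal arrives.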

### Solution 5: Fix Systemd Watchdog

If the systemd watchdog is killing your process:

```bash
# Check watchdog settings
systemctl show your-service | grep Watchdog

# Disable or extend the watchdog timeout
systemctl edit your-service

# Add:
# [Service]
# WatchdogSec=0      # Disable the watchdog
# Or extend the timeout:
# WatchdogSec=300    # 5 minutes

# Restart the service
systemctl daemon-reload
systemctl restart your-service
```
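If the watchdog should stay enabled, the service itself has to ping systemd within `WatchdogSec`. A sketch of a shell wrapper doing so with `systemd-notify`, assuming the unit sets `Type=notify` and `NotifyAccess=main`; `do_one_unit_of_work` is a hypothetical placeholder for the application's work step:

```bash
#!/bin/bash
# Requires in the unit file: Type=notify, NotifyAccess=main, WatchdogSec=30
systemd-notify --ready            # signal that startup is complete
while true; do
    do_one_unit_of_work           # hypothetical application work step
    systemd-notify WATCHDOG=1     # ping well within WatchdogSec
    sleep 10
done
```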

### Solution 6: Fix Timeout Issues

For services killed because a start or stop operation exceeds the systemd timeout:

```bash
# Check service timeouts
systemctl show your-service -p TimeoutStartUSec
systemctl show your-service -p TimeoutStopUSec

# Extend the timeouts
systemctl edit your-service

# Add:
# [Service]
# TimeoutStartSec=300   # 5 minutes to start
# TimeoutStopSec=60     # 1 minute to stop

# Apply
systemctl daemon-reload
systemctl restart your-service
```

## Monitoring and Alerting

Set up monitoring to catch issues early:

```bash
# Monitor OOM killer events
cat > /usr/local/bin/oom-monitor.sh << 'EOF'
#!/bin/bash
while true; do
    if dmesg | grep -qi "killed process"; then
        logger "OOM Killer Alert: $(dmesg | grep -i 'killed process' | tail -1)"
        # Send alert (adjust for your notification system)
        echo "OOM event detected" | mail -s "OOM Alert" admin@example.com
    fi
    sleep 60
done
EOF
chmod +x /usr/local/bin/oom-monitor.sh

# Run as a systemd service or in the background
```

Note the `-i` flag: the kernel logs "Out of memory: Killed process" with a capital K, so a case-sensitive grep would silently miss every event.
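Rather than backgrounding the script, systemd can supervise it. A minimal unit file sketch (the path matches the script above; the unit name is illustrative):

```ini
# /etc/systemd/system/oom-monitor.service (sketch)
[Unit]
Description=Alert on OOM killer events

[Service]
ExecStart=/usr/local/bin/oom-monitor.sh
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl daemon-reload && systemctl enable --now oom-monitor`.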

## Verification

After implementing fixes, verify process stability:

```bash
# Monitor process memory
watch -n 1 "ps aux | grep your-app"

# Track the process over time
pidstat -p $(pidof your-app) 1 10

# Monitor with detailed stats
pidstat -p $(pidof your-app) -r -u -d 1

# Check for recent kills
dmesg | grep -i killed | tail -10

# Verify limits
cat /proc/$(pidof your-app)/limits

# Check the OOM score
cat /proc/$(pidof your-app)/oom_score_adj
```

## Prevention Best Practices

- Monitor memory usage trends with tools like Prometheus, Grafana, or Nagios
- Set up swap space to provide a buffer before OOM
- Use cgroups or containers to limit and isolate resource usage
- Implement proper signal handling in applications
- Configure appropriate timeouts for long-running operations
- Use process supervisors (systemd, supervisord) for automatic restarts
- Enable core dumps for debugging unexpected crashes
- Test applications under memory pressure to identify failure modes
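Several of these practices (cgroup isolation, supervisor restarts, core dumps) map directly onto systemd unit settings. A drop-in sketch with illustrative values:

```ini
# systemctl edit your-service, then add:
[Service]
MemoryMax=2G          # hard cgroup v2 memory cap for this service
MemoryHigh=1536M      # throttle before the hard cap is reached
Restart=on-failure    # restart automatically after crashes
RestartSec=5
LimitCORE=infinity    # permit core dumps for post-mortem debugging
```

With `MemoryMax` set, an over-consuming service is OOM-killed within its own cgroup instead of destabilizing the whole host.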