The Problem

You're seeing Prometheus crash with memory-related errors, or the OOM killer is terminating the process. The logs might show:

```bash
level=error ts=2026-04-04T09:15:32.789Z caller=db.go:892 msg="out of memory"
level=fatal ts=2026-04-04T09:15:32.790Z caller=main.go:345 err="runtime error: out of memory"
```

Or from the kernel:

```bash
Apr  4 09:15:32 monitoring kernel: Out of memory: Kill process 1842 (prometheus) score 890 or sacrifice child
Apr  4 09:15:32 monitoring kernel: Killed process 1842 (prometheus) total-vm:8388608kB, anon-rss:7864320kB
```
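The kernel line records how much memory the process held when it was killed (`anon-rss`). A small sketch that extracts that figure; the log line is inlined here for illustration, but in practice you would pipe in `journalctl -k` or `dmesg` output:

```shell
# Extract the resident set size (in MiB) from an OOM-killer log line.
log='Apr  4 09:15:32 monitoring kernel: Killed process 1842 (prometheus) total-vm:8388608kB, anon-rss:7864320kB'
echo "$log" | awk '/Killed process.*prometheus/ {
  for (i = 1; i <= NF; i++)
    if ($i ~ /^anon-rss:/) {
      sub(/anon-rss:/, "", $i); sub(/kB$/, "", $i)
      print $i / 1024 " MiB resident at kill time"
    }
}'
```

For the sample line above this reports 7680 MiB, i.e. the process was near an 8 GiB cap when the kernel stepped in.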

This happens when Prometheus needs more memory than is available, typically because of high-cardinality queries, excessive series churn, or a head block that has grown beyond what the host can hold.

Diagnosis

Check Current Memory Usage

```promql
# Process memory usage
process_resident_memory_bytes{job="prometheus"}

# Go memory stats
go_memstats_heap_inuse_bytes{job="prometheus"}
go_memstats_heap_alloc_bytes{job="prometheus"}

# Compare against the configured limit, e.g.:
#   systemctl show prometheus --property=MemoryMax
```

Identify High Cardinality Series

```promql
# Top 10 metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))

# Targets contributing the most series
topk(10, count by (job, instance)({__name__=~".+"}))

# Note: both queries are expensive; run them sparingly on a
# server that is already under memory pressure
```
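Cardinality queries are themselves expensive on a memory-starved server. The TSDB stats endpoint (`/api/v1/status/tsdb`) returns the same top-cardinality information precomputed. A sketch that parses a canned response fragment so it runs offline; the live call is shown as a comment, and the sample metric name and count are made-up values:

```shell
# On a live server:
#   curl -s http://localhost:9090/api/v1/status/tsdb
# Canned response fragment (illustrative values only):
response='{"data":{"seriesCountByMetricName":[{"name":"http_requests_total","value":120000}]}}'
echo "$response" | grep -o '"name":"[^"]*","value":[0-9]*'
```

Because the endpoint serves precomputed stats, it answers quickly even when `topk` queries over `{__name__=~".+"}` would not.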

Check Head Block Memory

```promql
# Head series count
prometheus_tsdb_head_series{job="prometheus"}

# Chunks currently held in the head
prometheus_tsdb_head_chunks{job="prometheus"}

# Chunk creation rate (a high rate points to series churn)
rate(prometheus_tsdb_head_chunks_created_total{job="prometheus"}[5m])
```
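Head memory scales roughly linearly with active series, which makes `prometheus_tsdb_head_series` useful for capacity math. A back-of-the-envelope sketch; the ~8 KiB per-series figure is an assumption that varies with label sizes and churn, so measure on your own workload:

```shell
# Rough head-memory estimate: active series times a ballpark
# per-series cost (assumed ~8 KiB, not a Prometheus guarantee).
head_series=2000000
bytes_per_series=8192
echo "$(( head_series * bytes_per_series / 1024 / 1024 / 1024 )) GiB"
```

If the estimate lands near or above the host's memory limit, reducing cardinality will do more than any flag tuning.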

Solutions

1. Increase Memory Limit

Memory limits are set where the process is launched, not in prometheus.yml, so keep the base configuration minimal:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
```

Update the systemd service or Docker configuration:

```ini
# Systemd - /etc/systemd/system/prometheus.service
[Service]
MemoryMax=8G
MemoryHigh=7G
```

```bash
# Docker
docker run -d \
  --name prometheus \
  --memory="8g" \
  --memory-swap="8g" \
  prom/prometheus:latest
```
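On systemd hosts, the cleanest way to apply these limits is a drop-in file rather than editing the unit in place. A minimal sketch; the scratch directory stands in for `/etc/systemd/system/prometheus.service.d/`, and the unit name `prometheus.service` is an assumption:

```shell
# Sketch: generate a systemd drop-in capping Prometheus memory.
# Writes to a scratch dir here; on a real host, write the same file to
# /etc/systemd/system/prometheus.service.d/ and then run:
#   systemctl daemon-reload && systemctl restart prometheus
dropin_dir=$(mktemp -d)
cat > "$dropin_dir/memory.conf" <<'EOF'
[Service]
MemoryMax=8G
MemoryHigh=7G
EOF
cat "$dropin_dir/memory.conf"
```

MemoryHigh sits below MemoryMax on purpose: the kernel starts reclaiming at the soft limit, before the hard limit triggers a kill.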

2. Reduce Retention

Retention flags cap what the TSDB keeps on disk. They do not shrink the head block directly (the head holds roughly the last two to three hours of samples regardless), but tighter retention reduces the memory needed for compaction and for queries spanning long ranges:

```bash
prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB
```
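When picking `retention.size`, leave headroom for the WAL and for blocks being compacted. A sketch of one common rule of thumb; the ~80% figure is an assumption, not an official recommendation:

```shell
# Assume a 64 GB data volume; cap size-based retention at ~80% of it
# so the WAL and in-progress compactions have headroom.
disk_gb=64
echo "--storage.tsdb.retention.size=$(( disk_gb * 80 / 100 ))GB"
```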

3. Limit Series and Samples

Cap per-query and per-scrape resource usage so a single bad query or target cannot exhaust memory:

```bash
prometheus \
  --storage.tsdb.max-block-duration=2h \
  --storage.tsdb.wal-segment-size=50MB \
  --query.max-samples=50000000
```

Add sample limits in configuration:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'kubernetes-pods'
    # Limit samples and labels per scrape
    sample_limit: 5000
    label_limit: 30
    label_name_length_limit: 200
    label_value_length_limit: 200
```
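Before deploying limit changes, validate the file. `promtool check config` is the standard validator; the sketch below writes the config to a temp file and merely greps it so it runs anywhere, with the promtool call left as a comment:

```shell
# Write a scrape config with per-target limits to a temp file.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-pods'
    sample_limit: 5000
    label_limit: 30
EOF
# On a host with promtool installed:
#   promtool check config "$cfg"
grep -E 'sample_limit|label_limit' "$cfg"
```

Targets that exceed `sample_limit` have the whole scrape dropped, so roll the limit out gradually and watch for `scrape_samples_scraped` near the threshold.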

4. Optimize Recording Rules

Replace expensive queries with recording rules:

```yaml
# recording_rules.yml
groups:
  - name: memory_optimization
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: instance:memory_usage:percentage
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

5. Tune the Go Runtime

Prometheus is a Go program, so the runtime's garbage-collection knobs apply. Lowering GOGC makes the collector run more often, trading CPU for a smaller heap:

```bash
export GOGC=50
prometheus --config.file=prometheus.yml
```
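GOGC alone is a blunt tool. Assuming a Prometheus binary built with Go 1.19 or newer, GOMEMLIMIT gives the runtime a soft memory ceiling and pairs well with a cgroup hard limit: set it a little below MemoryMax so garbage collection kicks in before the OOM killer does. The values below are illustrative:

```shell
# Soft runtime ceiling just under an assumed 8G cgroup cap;
# GOGC=50 additionally makes the collector run more often.
export GOGC=50
export GOMEMLIMIT=7GiB
env | grep -E '^(GOGC|GOMEMLIMIT)=' | sort
```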

Verification

After applying changes, verify memory stability:

```promql
# Resident memory should stay below ~80% of the configured limit
# (example assumes an 8 GiB MemoryMax)
process_resident_memory_bytes{job="prometheus"} < 0.8 * 8 * 1024 * 1024 * 1024

# Head series should be stable
delta(prometheus_tsdb_head_series[1h]) < 10000
```

And confirm there are no new OOM events in the logs:

```bash
journalctl -u prometheus --since "1 hour ago" | grep -i "out of memory"
```

Prevention

Set up alerting for memory pressure:

```yaml
# alert_rules.yml
groups:
  - name: prometheus_memory
    rules:
      - alert: PrometheusMemoryHigh
        expr: process_resident_memory_bytes{job="prometheus"} > 6 * 1024 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus memory usage high"
          description: "Memory usage is {{ $value | humanize1024 }}B"
      - alert: PrometheusMemoryCritical
        expr: process_resident_memory_bytes{job="prometheus"} > 7 * 1024 * 1024 * 1024
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus approaching memory limit"
```

Regular monitoring of cardinality growth:

```promql
# Alert on cardinality growth
delta(prometheus_tsdb_head_series[1h]) > 50000
```