The Problem
You're seeing Prometheus crash with memory-related errors, or the OOM killer is terminating the process. The logs might show:
```
level=error ts=2026-04-04T09:15:32.789Z caller=db.go:892 msg="out of memory"
level=fatal ts=2026-04-04T09:15:32.790Z caller=main.go:345 err="runtime error: out of memory"
```

Or from the kernel:
```
Apr  4 09:15:32 monitoring kernel: Out of memory: Kill process 1842 (prometheus) score 890 or sacrifice child
Apr  4 09:15:32 monitoring kernel: Killed process 1842 (prometheus) total-vm:8388608kB, anon-rss:7864320kB
```

This happens when Prometheus needs more memory than is available, typically because of high-cardinality queries, heavy series churn, or a head block holding too many in-memory series.
Diagnosis
Check Current Memory Usage
```promql
# Process memory usage
process_resident_memory_bytes{job="prometheus"}

# Go memory stats
go_memstats_heap_inuse_bytes{job="prometheus"}
go_memstats_heap_alloc_bytes{job="prometheus"}
```

Note that Prometheus does not export its own memory limit as a metric; the effective limit comes from systemd, the container runtime, or the host, and has to be checked there.
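If Prometheus runs under systemd or in a container, a quick way to read the effective limit is from the unit properties or the cgroup files; a sketch, assuming a `prometheus.service` unit and cgroup v2 (the cgroup path is an assumption, adjust to your layout):

```shell
# Effective limit as systemd sees it ("infinity" means no limit)
systemctl show prometheus --property=MemoryMax

# Or read the cgroup v2 file directly ("max" means no limit)
cat /sys/fs/cgroup/system.slice/prometheus.service/memory.max
```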
Identify High Cardinality Series
```promql
# Top 10 metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))

# Targets exposing the most series
topk(10, count by (job, instance)({__name__=~".+"}))
```
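The same cardinality breakdown is also available from the TSDB stats endpoint without running expensive matchers over the whole head; a sketch, assuming Prometheus listens on `localhost:9090` and `jq` is installed:

```shell
# Top series counts by metric name, straight from the TSDB head
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'

# Label/value pairs that contribute the most series
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByLabelValuePair'
```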
Check Head Block Memory
```promql
# Head series count
prometheus_tsdb_head_series{job="prometheus"}

# Head chunks count
prometheus_tsdb_head_chunks{job="prometheus"}

# Series churn: rate at which new series are created
rate(prometheus_tsdb_head_series_created_total{job="prometheus"}[5m])
```
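For data already compacted to disk, `promtool` (shipped with Prometheus) can analyze block cardinality offline; a sketch, assuming the default data directory `/var/lib/prometheus/data`:

```shell
# Analyze the most recent block: top metrics, labels, and churn contributors
promtool tsdb analyze /var/lib/prometheus/data
```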
Solutions
1. Increase Memory Limit
Edit your Prometheus startup configuration:
```yaml
# prometheus.yml - not directly related, but keep the global config minimal
global:
  scrape_interval: 15s
  evaluation_interval: 15s
```

Update the systemd service or Docker configuration:
```ini
# Systemd - /etc/systemd/system/prometheus.service
[Service]
MemoryMax=8G
MemoryHigh=7G
```

```bash
# Docker
docker run -d \
  --name prometheus \
  --memory="8g" \
  --memory-swap="8g" \
  prom/prometheus:latest
```
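If Prometheus runs on Kubernetes instead, the equivalent limit goes in the pod spec; a minimal sketch (container name, image tag, and sizes are assumptions to adapt):

```yaml
# Fragment of a Deployment/StatefulSet pod spec
containers:
  - name: prometheus
    image: prom/prometheus:latest
    resources:
      requests:
        memory: "6Gi"
      limits:
        memory: "8Gi"
```

Keeping the request close to the limit avoids scheduling Prometheus onto a node that cannot actually sustain its working set.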
2. Reduce Retention

Lower how much data is kept on disk. Note that these flags bound disk usage and query range, not the in-memory head block, which always holds roughly the most recent two to three hours:

```bash
prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB
```

3. Limit Series and Samples
Enforce hard limits to prevent runaway cardinality:
```bash
prometheus \
  --storage.tsdb.max-block-duration=2h \
  --storage.tsdb.wal-segment-size=50MB \
  --query.max-samples=50000000
```

Add sample limits in the scrape configuration:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s

# Limit samples and labels per scrape
scrape_configs:
  - job_name: 'kubernetes-pods'
    sample_limit: 5000
    label_limit: 30
    label_name_length_limit: 200
    label_value_length_limit: 200
```
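When a single label drives the cardinality, dropping it at scrape time is often more effective than raising limits; a sketch, where `request_id` stands in for a hypothetical high-cardinality label in your workload:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    sample_limit: 5000
    metric_relabel_configs:
      # Drop a hypothetical high-cardinality label before ingestion
      - action: labeldrop
        regex: request_id
```

Series that differ only in the dropped label collapse into one, so make sure the label carries no information your dashboards depend on.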
4. Optimize Recording Rules
Replace expensive queries with recording rules:
```yaml
# recording_rules.yml
groups:
  - name: memory_optimization
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: instance:memory_usage:percentage
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```
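Before reloading, validate the rule file with `promtool`, which ships alongside Prometheus (the file path is an assumption):

```shell
promtool check rules /etc/prometheus/recording_rules.yml
```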
5. Tune Go Garbage Collection

Prometheus does not need a manual memory ballast; instead, tune the Go runtime. Lowering GOGC makes garbage collection run more often, trading CPU for a smaller heap, and on Go 1.19+ a soft heap limit can be set via GOMEMLIMIT:

```bash
# Run GC more aggressively (default GOGC=100)
export GOGC=50
# Soft heap limit: GC works harder before the OOM killer gets involved
export GOMEMLIMIT=7GiB
prometheus --config.file=prometheus.yml
```

Under systemd, set these persistently with `Environment=GOGC=50` in the `[Service]` section.

Verification
After applying changes, verify memory stability:
```promql
# Memory should stay below 80% of the limit (8 GiB in the examples above)
process_resident_memory_bytes{job="prometheus"} < 0.8 * 8 * 1024 * 1024 * 1024

# Head series should be stable
delta(prometheus_tsdb_head_series[1h]) < 10000
```

And confirm there are no new OOM events in the logs:

```bash
journalctl -u prometheus --since "1 hour ago" | grep -i "out of memory"
```
Prevention
Set up alerting for memory pressure:
```yaml
# alert_rules.yml
groups:
  - name: prometheus_memory
    rules:
      - alert: PrometheusMemoryHigh
        expr: process_resident_memory_bytes{job="prometheus"} > 6 * 1024 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus memory usage high"
          description: "Memory usage is {{ $value | humanize1024 }}B"

      - alert: PrometheusMemoryCritical
        expr: process_resident_memory_bytes{job="prometheus"} > 7 * 1024 * 1024 * 1024
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus approaching memory limit"
```
Also monitor cardinality growth regularly:

```promql
# Alert on cardinality growth
delta(prometheus_tsdb_head_series[1h]) > 50000
```