Introduction

The G1 (Garbage-First) collector aims to meet pause time targets by dividing the heap into regions and collecting the ones with the most garbage first. When the heap is undersized, allocation rate is high, or large objects fill regions quickly, G1 must perform full GCs or mixed collections that exceed the target pause time. This causes request latency spikes, timeout errors, and SLA violations.

Symptoms

  • P99 latency spikes correlating with GC pauses
  • Mixed collection pauses, e.g. Pause Young (Mixed) (G1 Evacuation Pause) 450ms, exceeding the target
  • Full GC events: Pause Full (Allocation Failure) 1200ms
  • Application timeouts during GC pauses
  • G1 humongous allocation filling regions with large objects

```
[2024-01-15T10:30:00.123+0000] GC(45) Pause Young (Normal) (G1 Evacuation Pause) 1024M->856M(2048M) 380.5ms
# Target was 200ms, actual was 380ms - SLA violation!

[2024-01-15T10:30:15.456+0000] GC(46) Pause Full (Allocation Failure) 1800M->1200M(2048M) 1450.2ms
# Full GC: stop-the-world for 1.45 seconds!
```

Common Causes

  • MaxGCPauseMillis target too aggressive for the heap size
  • Heap too small causing frequent collections
  • Humongous objects (≥ half the region size) fragmenting the heap
  • Promotion failure causing full GC
  • Allocation rate exceeding GC throughput capacity

Step-by-Step Fix

  1. Analyze GC logs:

```bash
# Enable detailed GC logging
java -Xlog:gc*:file=gc.log:time,uptime,level,tags \
     -jar app.jar

# Analyze with GCViewer or GCeasy. Look for:
# - Pause times exceeding MaxGCPauseMillis
# - Full GC frequency
# - Heap occupancy before/after GC

# Quick analysis with grep
grep "Pause" gc.log | awk '{print $NF}' | sort -n | tail -10
```
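The same check can be scripted outside the shell. A minimal Java sketch (the class name `GcPauseScan` and its helper are illustrative, not part of any library) that pulls pause durations out of unified-logging lines like the ones above and lists those exceeding the target, worst first:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseScan {
    // Matches the trailing "<millis>ms" on unified-logging pause lines,
    // e.g. "... Pause Young (Normal) 1024M->856M(2048M) 380.5ms"
    private static final Pattern PAUSE_MS =
            Pattern.compile("Pause.*\\s(\\d+(?:\\.\\d+)?)ms");

    // Returns pause durations above thresholdMs, sorted worst first
    public static List<Double> pausesOver(List<String> logLines, double thresholdMs) {
        List<Double> offenders = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = PAUSE_MS.matcher(line);
            if (m.find()) {
                double ms = Double.parseDouble(m.group(1));
                if (ms > thresholdMs) offenders.add(ms);
            }
        }
        offenders.sort(Comparator.reverseOrder());
        return offenders;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "[2024-01-15T10:30:00.123+0000] GC(45) Pause Young (Normal) 1024M->856M(2048M) 380.5ms",
            "[2024-01-15T10:30:15.456+0000] GC(46) Pause Full 1800M->1200M(2048M) 1450.2ms",
            "[2024-01-15T10:30:20.000+0000] GC(47) Pause Young (Normal) 900M->400M(2048M) 95.0ms");
        System.out.println(pausesOver(lines, 200.0)); // → [1450.2, 380.5]
    }
}
```

This is essentially the grep/awk pipeline with the threshold comparison built in, which makes it easy to drop into an alerting job.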

  2. Tune G1 collector settings:

```bash
# Set a realistic pause target (200-500ms for most apps).
# Fixed heap (-Xms = -Xmx) avoids resize pauses; region size is
# auto-detected by default; a lower InitiatingHeapOccupancyPercent
# starts mixed GCs earlier.
java -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=300 \
     -Xms4g -Xmx4g \
     -XX:G1HeapRegionSize=16m \
     -XX:InitiatingHeapOccupancyPercent=45 \
     -jar app.jar
```
  3. Reduce humongous allocations:

```bash
# G1 region size should be larger than your biggest objects.
# Default region size: heap_size / 2048 (min 1MB, max 32MB)
# Humongous threshold: region_size / 2

# If you have many 10MB objects, set the region size to 32MB
java -XX:+UseG1GC \
     -XX:G1HeapRegionSize=32m \
     -Xmx8g \
     -jar app.jar

# Now objects under 16MB are no longer humongous
```
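The half-region rule is easy to sanity-check in code. A small sketch (the class and helper names are illustrative) applying G1's threshold, under which an allocation of at least half a region is humongous:

```java
public class HumongousCheck {
    // G1 treats an allocation as humongous when it occupies
    // at least half of a heap region.
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes >= regionBytes / 2;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // With 16MB regions, a 10MB buffer is humongous...
        System.out.println(isHumongous(10 * mb, 16 * mb)); // true
        // ...but with -XX:G1HeapRegionSize=32m it is an ordinary allocation.
        System.out.println(isHumongous(10 * mb, 32 * mb)); // false
    }
}
```

Note that the threshold is inclusive: with 32MB regions, a 16MB object is still humongous, which is why the comment above says "under 16MB".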

  4. Increase heap to reduce GC frequency:

```bash
# Larger heap = fewer collections = fewer pause opportunities
# But: a larger heap can mean longer individual pauses
# Find the sweet spot with load testing

# Double the heap; AlwaysPreTouch touches all pages at startup
java -Xms8g -Xmx8g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=300 \
     -XX:+AlwaysPreTouch \
     -jar app.jar
```

  5. Consider ZGC for sub-millisecond pauses:

```bash
# ZGC (production-ready since Java 15): pauses under 1ms
# regardless of heap size.
# ZGenerational enables generational ZGC on Java 21+.
java -XX:+UseZGC \
     -Xmx16g \
     -XX:+ZGenerational \
     -jar app.jar

# Best suited to latency-critical applications with large heaps
```

Prevention

  • Monitor GC metrics with JMX: java.lang:type=GarbageCollector
  • Set up Grafana dashboards tracking GC pause times and frequency
  • Alert when P99 pause time exceeds 80% of MaxGCPauseMillis
  • Load test with production-like data volumes to size heap correctly
  • Use -Xlog:gc* (the Java 9+ replacement for -XX:+PrintGCDetails) and analyze logs after every deployment
  • Consider -XX:+UseStringDeduplication for string-heavy applications
  • In Kubernetes, set resource requests/limits accounting for heap: memory = Xmx + Metaspace + 25% overhead
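For the JMX bullet above, the same collector MBeans exposed remotely under `java.lang:type=GarbageCollector` are reachable in-process via `ManagementFactory`. A minimal sketch printing per-collector collection counts and cumulative pause time, which you can poll periodically and export to Grafana:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcMetrics {
    public static void main(String[] args) {
        // On G1 this typically lists "G1 Young Generation" and
        // "G1 Old Generation"; counts and times are cumulative
        // since JVM start, so export them as monotonic counters.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Computing a rate from successive samples of `getCollectionTime()` gives the "GC time per minute" signal that pairs well with the 80%-of-MaxGCPauseMillis alert suggested above.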