Fluentd stopped collecting logs and you're seeing buffer overflow warnings in the logs. The buffer filled up because downstream services couldn't accept data fast enough, and now you risk losing critical log data. Let's diagnose and fix this systematically.

Understanding Buffer Overflow

Fluentd uses buffers to store events temporarily before forwarding them to outputs. When a buffer fills beyond its limits, Fluentd raises an error to the input, blocks incoming data, or drops events, depending on the configured overflow action.
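As a minimal sketch (the values below are illustrative, not recommendations), all of these limits and the overflow behavior live in the `<buffer>` section of an output:

```xml
<match **>
  @type forward
  <buffer>
    @type file                       # persist chunks to disk
    path /var/log/td-agent/buffer/example
    chunk_limit_size 8MB             # max size of a single chunk
    total_limit_size 512MB           # max size of the whole buffer
    overflow_action throw_exception  # what to do when the buffer is full
  </buffer>
</match>
```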

Error patterns:

```
buffer space is too small
emit transaction error: buffer overflow
buffer chunk has been purged because it exceeded the overflow threshold
failed to flush buffer: retry timeout exceeded
```

Initial Diagnosis

Check Fluentd status and buffer metrics:

```bash
# Check Fluentd service status
systemctl status td-agent
# Or for fluentd directly
systemctl status fluentd

# Check Fluentd logs for buffer issues
tail -n 100 /var/log/td-agent/td-agent.log | grep -iE "buffer|overflow|emit"

# Check buffer file system usage
ls -lah /var/log/td-agent/buffer/
df -h /var/log/td-agent/buffer/

# Check current buffer configuration
grep -A 30 "buffer" /etc/td-agent/td-agent.conf

# Use the monitor_agent API to list output plugins and their buffer state
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | select(.plugin_category=="output")'

# Check plugin-specific metrics
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | {id: .plugin_id, type: .config["@type"], queued_bytes: .buffer_total_queued_size}'
```

Common Cause 1: Buffer Size Too Small

The buffer limit is inadequate for your log volume.

Error pattern:

```
buffer space is too small, buffer_size=8388608 buffer_usage=8388608
```

Diagnosis:

```bash
# Check current buffer size settings (v0.12 and v1 parameter names)
grep -E "buffer_chunk_limit|buffer_queue_limit|chunk_limit_size|queue_limit_length|total_limit_size" /etc/td-agent/td-agent.conf

# Monitor buffer usage
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | {id: .plugin_id, buffer_queue_length, buffer_total_queued_size, retry_count}'

# Calculate current log throughput:
# check how many emit calls each plugin has received
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | .emit_count'

# Check disk space for buffer storage
df -h /var/log/td-agent/buffer/
```

Solution:

Increase buffer limits in your Fluentd configuration:

```xml
# /etc/td-agent/td-agent.conf
<match **>
  @type forward
  # Buffer configuration
  <buffer>
    @type file
    path /var/log/td-agent/buffer/output_forward

    # Increase chunk size (default is 8MB)
    chunk_limit_size 16MB

    # Increase queue length (default is 256)
    queue_limit_length 512

    # Or use a total size limit instead
    total_limit_size 1GB

    # Chunk settings for better flushing
    chunk_full_threshold 0.95
    flush_interval 5s
    flush_thread_count 4

    # Retry settings
    retry_type exponential_backoff
    retry_wait 1s
    retry_max_interval 60s
    retry_timeout 24h
    retry_forever false
  </buffer>
</match>
```
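When choosing `total_limit_size`, a useful rule of thumb is that the buffer must absorb your ingest rate for the longest downstream outage you want to survive. A back-of-envelope sizing sketch (the rate, outage, and headroom figures are assumptions; replace them with your own measurements):

```shell
# Rough buffer sizing: total_limit_size >= ingest rate * worst-case outage.
RATE_MB_PER_S=2      # assumed average ingest rate (MB/s)
OUTAGE_S=900         # assumed worst-case downstream outage (15 minutes)
HEADROOM=2           # safety factor for bursts and retries
NEEDED_MB=$((RATE_MB_PER_S * OUTAGE_S * HEADROOM))
echo "total_limit_size should be at least ${NEEDED_MB}MB"
```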

After updating configuration:

```bash
# Restart Fluentd
systemctl restart td-agent

# Verify the buffer configuration took effect
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | select(.plugin_category=="output") | .config'
```

Common Cause 2: Output Destination Down or Slow

The destination service cannot accept data, causing buffer to accumulate.

Error patterns:

```
failed to flush the buffer: connection refused
retry timeout exceeded: output Elasticsearch is unreachable
```

Diagnosis:

```bash
# Check output destination connectivity
curl -v http://elasticsearch:9200/_cluster/health

# Test the output endpoint (the Bulk API requires newline-delimited JSON)
curl -s -H 'Content-Type: application/x-ndjson' -X POST \
  http://elasticsearch:9200/test/_bulk \
  --data-binary $'{"index":{}}\n{"test":"data"}\n'

# Check Fluentd retry status
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | {id: .plugin_id, retry_count}'

# Monitor flush attempts
tail -f /var/log/td-agent/td-agent.log | grep -iE "flush|retry|error"

# Check output plugin health
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | select(.config["@type"]=="elasticsearch" or .config["@type"]=="forward")'
```
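The connectivity checks above can be wrapped in a small helper that answers the only question that matters here: can a flush thread open a connection at all? A hedged sketch (host and port are placeholders; `/dev/tcp` requires bash):

```shell
# Probe an output endpoint with a timeout; the result tells you whether
# Fluentd's flush threads can even establish a TCP connection.
probe_output() {   # args: host port
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "reachable: $1:$2"
  else
    echo "unreachable: $1:$2"
  fi
}

probe_output elasticsearch 9200
```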

Solution:

Fix the output destination and configure better retry logic:

```xml
<match **>
  @type elasticsearch
  host elasticsearch
  port 9200

  <buffer>
    @type file
    path /var/log/td-agent/buffer/output_es

    # Buffer size
    chunk_limit_size 16MB
    queue_limit_length 256

    # Flush settings - more aggressive
    flush_interval 3s
    flush_thread_count 8

    # Retry settings - longer timeout
    retry_type exponential_backoff
    retry_wait 2s
    retry_max_interval 120s
    retry_timeout 72h

    # Overflow action - block instead of raising an error
    overflow_action block
  </buffer>
</match>
```
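The `exponential_backoff` settings above mean each failed flush roughly doubles the wait, capped at `retry_max_interval`. A quick sketch of how the schedule grows with `retry_wait 2s` and `retry_max_interval 120s` (ignoring the randomization Fluentd applies):

```shell
# Model the retry schedule: double the wait each attempt, cap at the maximum.
wait=2
max=120
for attempt in 1 2 3 4 5 6 7 8; do
  echo "attempt ${attempt}: wait ${wait}s"
  wait=$((wait * 2))
  [ "$wait" -gt "$max" ] && wait=$max
done
```

This is why a long `retry_timeout` matters: once the cap is reached, retries arrive only every couple of minutes, and a short timeout would give up and discard the chunk.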

If the destination is truly unavailable:

```bash
# Clear stale buffer files (WARNING: buffered data is lost)
rm -rf /var/log/td-agent/buffer/output_es*

# Restart with a clean buffer
systemctl restart td-agent
```

Common Cause 3: Overflow Action Configuration

The overflow action setting determines behavior when buffer fills completely.

Error pattern:

```
emit transaction error: buffer overflow, emit_records dropped
```

Diagnosis:

```bash
# Check current overflow action
grep -E "overflow_action" /etc/td-agent/td-agent.conf

# Check whether events are being dropped
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | {id: .plugin_id, emit_records, emit_count, emit_size}'

# Look for dropped event messages
grep -iE "dropped|purged" /var/log/td-agent/td-agent.log | tail -20
```

Solution:

Configure appropriate overflow action:

```xml
<buffer>
  # Overflow action options:
  # - throw_exception: Raise an error back to the input (the default)
  # - block: Block incoming data until the buffer has space (prevents data loss)
  # - drop_oldest_chunk: Drop the oldest buffered chunk to make room for new data

  overflow_action block

  # For scenarios where blocking is unacceptable:
  # overflow_action drop_oldest_chunk
</buffer>
```

For critical systems where blocking causes cascading failures:

```xml
<buffer>
  overflow_action drop_oldest_chunk

  # Flush chunks slightly before they are completely full
  chunk_full_threshold 0.98
</buffer>
```
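To make the `drop_oldest_chunk` trade-off concrete, here is a toy model of a bounded queue (chunk names are made up): the newest data survives, the oldest is sacrificed.

```shell
# Toy model of drop_oldest_chunk: a queue bounded at 3 chunks.
limit=3
queue=""
count=0
for chunk in a b c d e; do
  queue="${queue:+$queue }$chunk"   # append the new chunk
  count=$((count + 1))
  if [ "$count" -gt "$limit" ]; then
    queue="${queue#* }"             # buffer full: drop the oldest chunk
    count=$((count - 1))
  fi
done
echo "buffered: $queue"             # → buffered: c d e
```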

Common Cause 4: Slow Flush Rate

Flush settings are too conservative, causing buffer accumulation.

Error pattern:

```
buffer queue is growing: queue_length=500
```

Diagnosis:

```bash
# Check flush settings
grep -E "flush_interval|flush_thread_count|flush_mode" /etc/td-agent/td-agent.conf

# Monitor buffer growth
watch -n 5 'curl -s http://localhost:24220/api/plugins.json | jq ".plugins[] | {id: .plugin_id, buffer_queue_length, buffer_total_queued_size}"'

# Compare intake vs flush activity
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | {id: .plugin_id, emit_records, write_count}'
```

Solution:

Optimize flush configuration:

```xml
<buffer>
  # Increase flush threads for parallel flushing
  flush_thread_count 8

  # Reduce flush interval for more frequent writes
  flush_interval 1s

  # Flush a chunk once it reaches 90% of chunk_limit_size
  flush_mode interval
  chunk_full_threshold 0.9

  # Time-based chunking to prevent stale data
  # (requires `time` in the buffer chunk keys, i.e. <buffer time>)
  # timekey 60
  # timekey_wait 30s
</buffer>
```
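A quick sanity check for these settings: sustained flush capacity (threads × chunk size × flushes per thread) must exceed the intake rate, or the queue grows without bound. A back-of-envelope sketch with assumed figures (measure your own via the monitor_agent API):

```shell
# Flush capacity vs intake, per minute. All figures here are assumptions.
THREADS=8                      # flush_thread_count
CHUNK_MB=8                     # typical flushed chunk size
FLUSHES_PER_THREAD_PER_MIN=12  # assumed, depends on destination latency
CAPACITY=$((THREADS * CHUNK_MB * FLUSHES_PER_THREAD_PER_MIN))
INTAKE_MB_PER_MIN=300          # assumed intake
echo "flush capacity ${CAPACITY}MB/min vs intake ${INTAKE_MB_PER_MIN}MB/min"
```

If capacity comes out below intake, no amount of buffer tuning will help; you need more flush throughput or less data.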

Common Cause 5: Disk Space Exhaustion

Buffer storage directory runs out of disk space.

Error pattern:

```
failed to write buffer chunk: No space left on device
```

Diagnosis:

```bash
# Check disk space
df -h /var/log/td-agent/buffer/

# Check buffer directory size
du -sh /var/log/td-agent/buffer/

# Check buffer file count
ls -la /var/log/td-agent/buffer/ | wc -l

# Monitor disk usage during operation
watch -n 5 'df -h /var/log/td-agent/buffer/'
```

Solution:

Free disk space and relocate buffer:

```bash
# Delete old buffer chunks only if the destination is healthy
# (WARNING: buffered data is lost)
rm -f /var/log/td-agent/buffer/*/buffer.*

# Or move the buffer to larger storage
mkdir -p /mnt/larger-storage/fluentd-buffer
chown td-agent:td-agent /mnt/larger-storage/fluentd-buffer

# Then update the configuration:
```

```xml
<buffer>
  @type file
  path /mnt/larger-storage/fluentd-buffer/output

  # Limit buffer size to prevent disk exhaustion
  total_limit_size 2GB
</buffer>
```
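To catch this before it recurs, a tiny watchdog sketch (the 80% threshold and the `df` invocation are assumptions; wire it into cron or your monitoring agent):

```shell
# Warn when disk usage crosses a threshold. Kept as a pure function so the
# decision logic is testable; feed it the used-percent reported by df.
buffer_disk_status() {   # args: used_percent threshold
  if [ "$1" -ge "$2" ]; then
    echo "WARNING: buffer volume at $1% capacity"
  else
    echo "OK: buffer volume at $1% capacity"
  fi
}

# In practice:
#   used=$(df --output=pcent /var/log/td-agent/buffer/ | tail -1 | tr -dc '0-9')
#   buffer_disk_status "$used" 80
buffer_disk_status 85 80   # → WARNING: buffer volume at 85% capacity
```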

Common Cause 6: Buffer File Corruption

Buffer files can become corrupted, preventing proper flushing.

Error patterns:

```
failed to read buffer chunk: invalid metadata format
buffer chunk is corrupted: checksum mismatch
```

Diagnosis:

```bash
# Look for corruption errors
grep -iE "corrupt|invalid|checksum" /var/log/td-agent/td-agent.log

# Check buffer file integrity
ls -la /var/log/td-agent/buffer/

# Inspect buffer metadata files
hexdump -C /var/log/td-agent/buffer/*.meta | head -20
```

Solution:

Remove corrupted buffer files:

```bash
# Stop Fluentd
systemctl stop td-agent

# List and inspect buffer files
ls -lah /var/log/td-agent/buffer/

# Remove corrupted chunks (WARNING: buffered data is lost)
rm -f /var/log/td-agent/buffer/buffer.*.log
rm -f /var/log/td-agent/buffer/buffer.*.log.meta

# Start Fluentd fresh
systemctl start td-agent
```
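Rather than deleting everything, you can first single out suspect chunks: zero-byte or very old buffer files are the usual corruption candidates. A hedged helper (the file naming assumes the default file-buffer layout):

```shell
# List buffer chunk files that are empty or older than one day.
suspect_chunks() {   # args: buffer_dir
  find "$1" -name 'buffer.*' \( -size 0 -o -mtime +1 \) -print
}

# Example (default td-agent layout):
# suspect_chunks /var/log/td-agent/buffer/
```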

Common Cause 7: High Input Rate

Incoming log volume exceeds processing capacity.

Error pattern:

```
emit transaction error: buffer overflow, emit_records=5000
```

Diagnosis:

```bash
# Inspect input plugins and their stats
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | select(.config["@type"]=="tail" or .config["@type"]=="forward")'

# Check log file growth rate
ls -lh /var/log/application.log

# Monitor emit rate
watch -n 5 'curl -s http://localhost:24220/api/plugins.json | jq ".plugins[] | {id: .plugin_id, emit_count, emit_records}"'
```
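To turn "high input rate" into an actual number, measure how fast the source file grows over an interval (uses GNU `stat`; the path and interval are placeholders):

```shell
# Bytes appended to a file over an interval, as a rough intake rate.
growth_rate() {   # args: file seconds
  before=$(stat -c%s "$1")
  sleep "$2"
  after=$(stat -c%s "$1")
  echo "$(( (after - before) / $2 )) bytes/sec"
}

# Example:
# growth_rate /var/log/application.log 10
```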

Solution:

Increase parallelism and optimize input:

```xml
# Input configuration
<source>
  @type tail
  path /var/log/application.log
  pos_file /var/log/td-agent/application.log.pos
  tag app.logs

  # Read more lines per I/O loop (default is 1000)
  read_lines_limit 5000

  # Parse configuration
  <parse>
    # Prefer a simple/fast format; complex regexps are CPU-heavy
    @type none
  </parse>
</source>

# In the output's <match> block, use more buffer threads
<buffer>
  flush_thread_count 16
  chunk_limit_size 32MB
  queue_limit_length 512
</buffer>
```

Consider adding a secondary Fluentd instance:

```xml
# Forward to an aggregator for better throughput
<match **>
  @type forward
  <server>
    host aggregator-fluentd
    port 24224
  </server>

  <buffer>
    chunk_limit_size 16MB
    flush_interval 1s
    flush_thread_count 4
  </buffer>
</match>
```

Verification

After fixing buffer issues, verify everything works:

```bash
# Check Fluentd is running
systemctl status td-agent

# Monitor buffer usage
curl -s http://localhost:24220/api/plugins.json | jq '.plugins[] | {id: .plugin_id, buffer_queue_length, buffer_total_queued_size}'

# Verify data is flowing to the output
curl -s 'http://elasticsearch:9200/_cat/indices?v' | grep logs

# Check for any remaining errors
tail -n 50 /var/log/td-agent/td-agent.log | grep -iE "error|overflow|fail"

# Test end-to-end flow
echo "Test log at $(date)" >> /var/log/application.log
sleep 5
curl -s 'http://elasticsearch:9200/logs*/_search?q=test%20log' | jq '.hits.total'
```

Prevention

Set up buffer monitoring:

```yaml
groups:
  - name: fluentd_health
    rules:
      # Metric names below come from fluent-plugin-prometheus
      # (prometheus_output_monitor); thresholds are examples - tune them.
      - alert: FluentdBufferOverflow
        expr: fluentd_output_status_buffer_total_bytes > 900000000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fluentd buffer near overflow on {{ $labels.instance }}"

      - alert: FluentdBufferFull
        expr: fluentd_output_status_buffer_queue_length > 100
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Fluentd buffer queue is full"

      - alert: FluentdFlushFailure
        expr: increase(fluentd_output_status_retry_count[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Fluentd experiencing flush failures"
```

Enable Fluentd Prometheus metrics:

```xml
# Expose a /metrics endpoint
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

# Collect output plugin and buffer metrics
<source>
  @type prometheus_output_monitor
  <labels>
    hostname ${hostname}
  </labels>
</source>
```

Buffer overflow usually results from insufficient buffer size, slow output destinations, or mismatched flush rates. Start by checking buffer usage and output connectivity, then adjust configuration accordingly.