Your OpenTelemetry Collector is failing to process telemetry data, showing errors in logs, or metrics/traces/logs aren't reaching their destinations. The Collector is central to your observability pipeline, so errors here cascade across all monitoring.

Understanding Collector Architecture

OpenTelemetry Collector has four main components:

  • Receivers: Accept telemetry from various sources
  • Processors: Transform/enrich telemetry data
  • Exporters: Send telemetry to backends
  • Extensions: Additional functionality (health check, zpages, etc.)

Error patterns:

```bash
failed to push telemetry: exporter is down
receiver error: connection refused
processor memory_limiter exceeded
pipeline not initialized: configuration error
```

Initial Diagnosis

Check Collector status and logs:

```bash
# Check Collector pod/service status
kubectl get pods -l app=otel-collector -n monitoring
systemctl status otel-collector

# Check Collector logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "error|fail"
journalctl -u otel-collector | grep -iE "error|fail"

# Check Collector metrics endpoint
curl -s http://localhost:8888/metrics | grep -E "otelcol_receiver|otelcol_exporter|otelcol_processor"

# Check Collector health (health_check extension)
curl -s http://localhost:13133/

# Get Collector configuration
kubectl get configmap otel-collector-config -n monitoring -o yaml

# Check internal telemetry
curl -s http://localhost:8888/metrics | grep otelcol_process_uptime
```

Common Cause 1: Configuration Syntax Errors

Invalid YAML or wrong configuration structure.

Error pattern:

```bash
failed to load config: yaml: line 10: mapping values are not allowed here
service::pipelines::traces: no receivers defined
```

Diagnosis:

```bash
# Check configuration file syntax
cat /etc/otelcol/config.yaml

# Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('/etc/otelcol/config.yaml'))"

# Or use yq
yq eval '.' /etc/otelcol/config.yaml

# Check Collector startup logs for config errors
kubectl logs -l app=otel-collector -n monitoring | grep -iE "config|error" | head -20

# Use the built-in validate subcommand
otelcol validate --config=/etc/otelcol/config.yaml
```

Solution:

Fix configuration syntax:

```yaml
# Correct OpenTelemetry Collector configuration structure
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:9090

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Common syntax fixes:

```yaml
# WRONG: Missing receivers in pipeline
service:
  pipelines:
    traces:
      processors: [batch]
      exporters: [otlp]

# CORRECT: Include receivers
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

# WRONG: Invalid field name
processors:
  memory_limiter:
    limit: 512   # Wrong field name

# CORRECT: Use limit_mib
processors:
  memory_limiter:
    limit_mib: 512
```
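Beyond YAML syntax, a frequent failure is a pipeline referencing a component that was never defined. This cross-reference check can be scripted; a minimal sketch in Python, operating on an already-parsed config dict (e.g. the result of `yaml.safe_load`; the sample config here is hypothetical):

```python
def undefined_components(config):
    """Return (pipeline, kind, name) for every pipeline reference
    that has no matching definition in the top-level sections."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pname, pipeline in pipelines.items():
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind, {})
            for ref in pipeline.get(kind, []):
                if ref not in defined:
                    problems.append((pname, kind, ref))
    return problems

# Parsed config with a typo in the traces pipeline's receiver reference
config = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}},
    "exporters": {"otlp": {}},
    "service": {"pipelines": {"traces": {
        "receivers": ["otlp_receivr"],   # WRONG: typo
        "processors": ["batch"],
        "exporters": ["otlp"],
    }}},
}
print(undefined_components(config))  # [('traces', 'receivers', 'otlp_receivr')]
```

The `otelcol validate` subcommand performs this check (and more) authoritatively; the script is only useful for catching typos in CI before a config reaches the Collector.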

Common Cause 2: Exporter Connection Failures

Exporters cannot reach destination backends.

Error pattern:

```bash
Exporter "otlp" failed to export items: connection refused
failed to push to endpoint: context deadline exceeded
```

Diagnosis:

```bash
# Check exporter metrics
curl -s http://localhost:8888/metrics | grep -E "otelcol_exporter.*failed|otelcol_exporter.*sent"

# Test backend connectivity
curl -v http://tempo:4317/v1/traces
curl -v http://prometheus:9090/-/healthy
curl -v http://loki:3100/ready

# Check from inside the Collector pod
kubectl exec -it otel-collector-pod -- curl http://tempo:4317/v1/traces

# Check exporter errors in logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "exporter|failed"

# Monitor export counters (compute the rate in Prometheus with
# rate(otelcol_exporter_sent_spans_total[5m]))
curl -s http://localhost:8888/metrics | grep otelcol_exporter_sent_spans_total
```

Solution:

Fix exporter connectivity:

```yaml
# OTLP exporter configuration
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
    timeout: 30s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    tls:
      insecure: true

  otlphttp/loki:
    endpoint: http://loki:3100/loki/api/v1/push
    tls:
      insecure: true
```

Verify network connectivity; for Kubernetes, check the service endpoints:

```bash
kubectl get endpoints tempo -n monitoring
kubectl get endpoints prometheus -n monitoring
```

Common Cause 3: Memory Limiter Exceeded

The memory_limiter processor refuses or drops data once the configured memory limit is hit.

Error pattern:

```bash
Memory limiter exceeded: dropping telemetry
otelcol_processor_refused_spans_total increasing
```

Diagnosis:

```bash
# Check memory limiter metrics
curl -s http://localhost:8888/metrics | grep otelcol_processor_memory_limiter

# Check refused items
curl -s http://localhost:8888/metrics | grep otelcol_processor_refused

# Monitor Collector memory usage
kubectl top pods -l app=otel-collector -n monitoring
ps aux | grep otelcol

# Check memory limits on the pod
kubectl describe pod -l app=otel-collector -n monitoring | grep -A 5 "Limits"

# Look for memory errors in logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "memory|limit"
```

Solution:

Adjust memory limits:

```yaml
# Memory limiter processor configuration
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024        # Increase based on available memory
    spike_limit_mib: 256

  # Increase batch sizes if needed
  batch:
    timeout: 10s
    send_batch_size: 1024
    send_batch_max_size: 2048

  # If memory is still exceeded, consider sampling
  probabilistic_sampler:
    sampling_percentage: 50   # Sample 50% of traces
```

For container deployments, also set container limits so the limiter has headroom:

```yaml
# Kubernetes deployment snippet
resources:
  limits:
    memory: 2Gi
  requests:
    memory: 1Gi
```
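A common sizing heuristic is to cap the limiter at a fraction of the container's memory limit (so the Collector throttles before the OOM killer fires) and allow spikes as a fraction of that cap. A sketch of the arithmetic; the 80%/20% percentages are an assumption to tune for your workload, not fixed defaults:

```python
def memory_limiter_settings(container_mib, limit_pct=80, spike_pct=20):
    """Suggest limit_mib / spike_limit_mib from a container memory limit.

    Heuristic: cap the limiter below the container limit so telemetry is
    refused before the kernel kills the process; allow short spikes as a
    fraction of the cap.
    """
    limit_mib = container_mib * limit_pct // 100
    spike_limit_mib = limit_mib * spike_pct // 100
    return limit_mib, spike_limit_mib

# For a 2Gi container limit
limit, spike = memory_limiter_settings(2048)
print(limit, spike)  # 1638 327
```

The memory_limiter also accepts `limit_percentage`/`spike_limit_percentage` fields that do this calculation from cgroup limits automatically.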

Common Cause 4: Receiver Protocol Issues

Receivers not accepting telemetry from sources.

Error pattern:

```bash
Receiver error: failed to accept connection
otlp receiver: grpc protocol error
```

Diagnosis:

```bash
# Check receiver metrics
curl -s http://localhost:8888/metrics | grep -E "otelcol_receiver.*accepted|otelcol_receiver.*refused"

# Check if receivers are listening
netstat -tlnp | grep 4317
netstat -tlnp | grep 4318

# Test receiver connectivity
grpcurl -plaintext localhost:4317 list
curl -v http://localhost:4318/v1/traces

# Check receiver errors
kubectl logs -l app=otel-collector -n monitoring | grep -iE "receiver|protocol"

# Check incoming telemetry counters (compute the rate in Prometheus with
# rate(otelcol_receiver_accepted_spans_total[5m]))
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans_total
```

Solution:

Fix receiver configuration:

```yaml
# OTLP receiver configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
        keepalive:
          server_parameters:
            max_connection_age: 120s
            max_connection_age_grace: 30s
          enforcement_policy:
            min_time_between_pings: 10s
            permit_without_stream: false
      http:
        endpoint: 0.0.0.0:4318
        max_request_body_size: 16777216   # bytes
        cors:
          allowed_origins:
            - "*"

  # Jaeger receiver
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
      thrift_binary:
        endpoint: 0.0.0.0:6832

  # Kafka receiver
  kafka:
    brokers: ["kafka:9092"]
    topic: "otel-spans"
    encoding: otlp_proto
    group_id: "otel-collector"
```

Common Cause 5: Pipeline Not Started

Pipelines fail to initialize due to missing components.

Error pattern:

```bash
Pipeline "traces" not started: no exporters
Service pipeline initialization failed
```

Diagnosis:

```bash
# Check pipeline status in logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "pipeline|initialized|started"

# Verify service configuration
grep -A 20 "service:" /etc/otelcol/config.yaml

# Check component registration
kubectl logs -l app=otel-collector -n monitoring --since=5m | grep -i "registered"

# Check metrics for pipeline activity (processor counters)
curl -s http://localhost:8888/metrics | grep otelcol_processor
```

Solution:

Fix pipeline definition:

```yaml
# Complete pipeline configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

# Service MUST reference defined components
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]    # Must be defined in receivers section
      processors: [batch]  # Must be defined in processors section
      exporters: [otlp]    # Must be defined in exporters section

# Each component name in a pipeline must match its definition exactly.
# Common mistake: typo in component name
service:
  pipelines:
    traces:
      receivers: [otlp_receivr]    # WRONG - typo
      exporters: [otlp_exporter]   # WRONG - name doesn't match
```

Common Cause 6: Batch Processor Timeout

Batch processor holding data too long before sending.

Error pattern:

```bash
Batch processor timeout exceeded
Data delayed in batch processor
```

Diagnosis:

```bash
# Check batch processor metrics
curl -s http://localhost:8888/metrics | grep otelcol_processor_batch

# Check whether batches flush on timeout or on size
curl -s http://localhost:8888/metrics | grep -E "timeout_trigger_send|batch_size_trigger_send"

# Monitor batch sizes
curl -s http://localhost:8888/metrics | grep otelcol_processor_batch_batch_send_size

# Check for timeout issues in logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "batch|timeout"
```

Solution:

Optimize batch processor:

```yaml
processors:
  # Balance between throughput and latency
  batch:
    timeout: 5s                  # Max wait before sending a batch
    send_batch_size: 512         # Send once this many items accumulate
    send_batch_max_size: 1024    # Never exceed this size

  # For high volume with acceptable latency
  batch/highvolume:
    timeout: 30s
    send_batch_size: 10000
    send_batch_max_size: 20000

  # For low-latency requirements
  batch/lowlatency:
    timeout: 1s
    send_batch_size: 100
    send_batch_max_size: 200
```
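The interplay of `timeout` and `send_batch_size` determines when a batch flushes: whichever trigger fires first wins. A simplified Python model of that decision logic (a sketch for intuition, not the Collector's actual implementation):

```python
class BatchModel:
    """Toy model of batch-processor flush triggers: size or timeout."""

    def __init__(self, timeout_s, send_batch_size):
        self.timeout_s = timeout_s
        self.send_batch_size = send_batch_size
        self.items = []
        self.age_s = 0.0
        self.flushes = []  # records (reason, batch_size) for each flush

    def add(self, n_items, elapsed_s):
        """Receive n_items after elapsed_s seconds, flushing if a trigger fires."""
        self.items.extend(range(n_items))
        self.age_s += elapsed_s
        if len(self.items) >= self.send_batch_size:
            self._flush("size")
        elif self.age_s >= self.timeout_s:
            self._flush("timeout")

    def _flush(self, reason):
        self.flushes.append((reason, len(self.items)))
        self.items = []
        self.age_s = 0.0

# timeout: 5s, send_batch_size: 512
b = BatchModel(timeout_s=5, send_batch_size=512)
b.add(600, elapsed_s=1)   # size trigger: 600 >= 512
b.add(10, elapsed_s=6)    # timeout trigger: 6s >= 5s
print(b.flushes)  # [('size', 600), ('timeout', 10)]
```

If the `timeout_trigger_send` counter dominates, traffic is too sparse for the configured batch size; if `batch_size_trigger_send` dominates under latency complaints, lower `send_batch_size`.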

Common Cause 7: Processor Transform Errors

Transform processor failing due to invalid operations.

Error pattern:

```bash
Transform processor error: invalid attribute operation
```

Diagnosis:

```bash
# Check processor metrics
curl -s http://localhost:8888/metrics | grep otelcol_processor

# Check transform errors
kubectl logs -l app=otel-collector -n monitoring | grep -iE "transform|processor"
```

Add the debug exporter to a pipeline to inspect the data flowing through it:

```yaml
exporters:
  debug:
    verbosity: detailed
```

Solution:

Fix transform processor configuration:

```yaml
processors:
  # Correct attribute transformations
  attributes:
    actions:
      - key: environment
        value: production
        action: insert
      - key: deployment.environment
        from_attribute: environment
        action: upsert
      - key: sensitive.data
        action: delete

  # Transform processor (OTTL) for trace data
  transform:
    trace_statements:
      - context: span
        statements:
          - set(attributes["service.name"], "my-service")
          - keep_keys(attributes, ["service.name", "operation"])
          - replace_match(attributes["http.url"], "http://*", "https://*")

  # Resource processor
  resource:
    attributes:
      - key: k8s.cluster.name
        value: production-cluster
        action: insert
      - key: service.instance.id
        from_attribute: pod.name
        action: upsert
```
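The `insert`, `update`, `upsert`, and `delete` actions differ only in whether an existing key blocks, or is required for, the write. Their semantics map cleanly onto dictionary operations; a sketch for intuition (the real processor operates on OTLP attributes, not Python dicts):

```python
def apply_action(attrs, key, value=None, action="upsert"):
    """Mimic the attributes processor's action semantics on a plain dict."""
    if action == "insert":       # write only if the key is absent
        attrs.setdefault(key, value)
    elif action == "update":     # write only if the key already exists
        if key in attrs:
            attrs[key] = value
    elif action == "upsert":     # write unconditionally
        attrs[key] = value
    elif action == "delete":     # remove the key if present
        attrs.pop(key, None)
    return attrs

attrs = {"environment": "staging"}
apply_action(attrs, "environment", "production", "insert")  # no-op: key exists
apply_action(attrs, "region", "us-east-1", "update")        # no-op: key absent
apply_action(attrs, "environment", "production", "upsert")  # overwrites
apply_action(attrs, "sensitive.data", action="delete")      # no-op: key absent
print(attrs)  # {'environment': 'production'}
```

Choosing `insert` where `upsert` was intended is a common source of "my attribute isn't being set" confusion: `insert` silently does nothing when the key already exists.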

Common Cause 8: TLS/Certificate Issues

TLS configuration problems blocking secure connections.

Error pattern:

```bash
TLS error: certificate verify failed
x509: certificate signed by unknown authority
```

Diagnosis:

```bash
# Test TLS connection
openssl s_client -connect tempo:4317 -showcerts

# Check certificate validity (skipping verification)
curl -k https://tempo:4317/v1/traces

# Check Collector TLS configuration
kubectl get configmap otel-collector-config -n monitoring -o yaml | grep -A 10 "tls"

# Look for certificate errors in logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "cert|tls|x509"
```

Solution:

Configure TLS properly:

```yaml
# Exporter with TLS
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: false
      ca_file: /etc/otelcol/certs/ca.crt
      cert_file: /etc/otelcol/certs/client.crt
      key_file: /etc/otelcol/certs/client.key
      min_version: "1.2"

  # For testing only: plaintext or unverified TLS
  otlp/insecure:
    endpoint: tempo:4317
    tls:
      insecure: true
      insecure_skip_verify: true

# Receiver TLS configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol/certs/server.crt
          key_file: /etc/otelcol/certs/server.key
```

Verification

After fixing Collector issues:

```bash
# Check Collector is healthy
curl -s http://localhost:13133/
curl -s http://localhost:8888/metrics | grep otelcol_process_uptime

# Verify receivers are accepting
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted

# Verify exporters are sending
curl -s http://localhost:8888/metrics | grep otelcol_exporter_sent

# Check refused items (should be 0 or minimal)
curl -s http://localhost:8888/metrics | grep otelcol_processor_refused

# Test telemetry flow: send a test trace
# (trace IDs must be 32 hex characters, span IDs 16 hex characters)
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"test"}}]},"scopeSpans":[{"spans":[{"traceId":"5b8aa5a2d2c872e8321cf37308d69df2","spanId":"051581bf3cb55c13","name":"test-operation"}]}]}]}'

# Verify the backend received it
curl -s http://tempo:3200/api/traces/5b8aa5a2d2c872e8321cf37308d69df2 | jq '.'

# Use zpages for diagnostics:
# navigate to http://localhost:55679/debug/servicez and check pipeline status
```
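These checks can be automated by parsing the Prometheus exposition-format output of `:8888/metrics`. A minimal parser sketch with a hypothetical sample scrape (real metric names carry different labels, and suffixes vary across Collector versions):

```python
def sum_counters(metrics_text, prefix):
    """Sum the values of all samples whose metric name starts with prefix."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{")[0]  # strip the label set, if any
        if name.startswith(prefix):
            total += float(value)
    return total

# Hypothetical scrape of http://localhost:8888/metrics
sample = """\
# TYPE otelcol_processor_refused_spans_total counter
otelcol_processor_refused_spans_total{processor="memory_limiter"} 12
otelcol_exporter_sent_spans_total{exporter="otlp"} 3400
otelcol_exporter_send_failed_spans_total{exporter="otlp"} 7
"""

print(sum_counters(sample, "otelcol_processor_refused"))     # 12.0
print(sum_counters(sample, "otelcol_exporter_send_failed"))  # 7.0
```

In practice you would fetch the text with `urllib.request.urlopen("http://localhost:8888/metrics")` and alert when the refused or failed totals grow between scrapes; absolute counter values alone only show lifetime totals.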

Prevention

Monitor Collector health:

```yaml
groups:
  - name: otel_collector_health
    rules:
      - alert: OpenTelemetryCollectorDown
        expr: up{job="otel-collector"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "OpenTelemetry Collector is down"

      - alert: OpenTelemetryCollectorExporterErrors
        expr: rate(otelcol_exporter_send_failed_spans_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OpenTelemetry Collector exporter errors"

      - alert: OpenTelemetryCollectorMemoryLimit
        expr: otelcol_processor_memory_limiter_refused_spans_total > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "OpenTelemetry Collector memory limiter dropping data"

      - alert: OpenTelemetryCollectorReceiverErrors
        expr: rate(otelcol_receiver_refused_spans_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OpenTelemetry Collector receiver refusing spans"
```

Regular Collector health check:

```bash
#!/bin/bash
# otelcol-health.sh

# Check health endpoint
curl -s http://localhost:13133/ && echo "Health OK" || echo "Health FAILED"

# Check refused metrics
REFUSED=$(curl -s http://localhost:8888/metrics | \
  grep "otelcol_processor_refused_spans_total" | \
  awk '{print $2}' | head -1)

if [ "${REFUSED:-0}" -gt 0 ]; then
  echo "WARNING: $REFUSED spans refused"
fi

# Check export errors
FAILED=$(curl -s http://localhost:8888/metrics | \
  grep "otelcol_exporter_send_failed_spans_total" | \
  awk '{print $2}' | head -1)

if [ "${FAILED:-0}" -gt 0 ]; then
  echo "WARNING: $FAILED spans failed to export"
fi

# Check uptime
UPTIME=$(curl -s http://localhost:8888/metrics | \
  grep "otelcol_process_uptime" | \
  awk '{print $2}' | head -1)
echo "Collector uptime: $UPTIME seconds"
```

OpenTelemetry Collector errors typically stem from configuration syntax, network connectivity, or resource limits. Validate configuration first, check receiver/exporter connectivity, and monitor memory usage to ensure reliable telemetry collection.