Your OpenTelemetry Collector is failing to process telemetry data, showing errors in logs, or metrics/traces/logs aren't reaching their destinations. The Collector is central to your observability pipeline, so errors here cascade across all monitoring.
Understanding Collector Architecture
OpenTelemetry Collector has four main components:
- Receivers: Accept telemetry from various sources
- Processors: Transform/enrich telemetry data
- Exporters: Send telemetry to backends
- Extensions: Additional functionality (health check, zpages, etc.)
Error patterns:
```
failed to push telemetry: exporter is down
receiver error: connection refused
processor memory_limiter exceeded
pipeline not initialized: configuration error
```
Initial Diagnosis
Check Collector status and logs:
```bash
# Check Collector pod/service status
kubectl get pods -l app=otel-collector -n monitoring
systemctl status otel-collector

# Check Collector logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "error|fail"
journalctl -u otel-collector | grep -iE "error|fail"

# Check Collector metrics endpoint
curl -s http://localhost:8888/metrics | grep -E "otelcol_receiver|otelcol_exporter|otelcol_processor"

# Check Collector health (health_check extension)
curl -s http://localhost:13133/

# Get Collector configuration
kubectl get configmap otel-collector-config -n monitoring -o yaml

# Check internal telemetry
curl -s http://localhost:8888/metrics | grep otelcol_process_uptime
```
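For quick triage, the scrapes above can be condensed into one summary. This sketch parses a saved copy of the internal metrics endpoint; the `triage` helper name is illustrative, while the metric names are the Collector's standard internal telemetry.

```shell
#!/usr/bin/env bash
# Sketch: summarize drop/failure counters from a saved metrics scrape,
# e.g. `curl -s http://localhost:8888/metrics > metrics.txt`.
triage() {
  awk '
    /^#/ { next }  # skip HELP/TYPE comment lines
    /otelcol_receiver_refused/     { refused += $NF }
    /otelcol_exporter_send_failed/ { failed  += $NF }
    END {
      printf "refused_by_receivers=%d\n", refused
      printf "failed_exports=%d\n", failed
    }' "$1"
}

# triage metrics.txt
```

Non-zero numbers point you at the receiver (Cause 4) or exporter (Cause 2) sections below.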
Common Cause 1: Configuration Syntax Errors
Invalid YAML or wrong configuration structure.
Error pattern:
```
failed to load config: yaml: line 10: mapping values are not allowed here
service::pipelines::traces: no receivers defined
```
Diagnosis:
```bash
# Check configuration file syntax
cat /etc/otelcol/config.yaml

# Validate YAML syntax
python3 -c "import yaml; yaml.safe_load(open('/etc/otelcol/config.yaml'))"

# Or use yq
yq eval '.' /etc/otelcol/config.yaml

# Check Collector startup logs for config errors
kubectl logs -l app=otel-collector -n monitoring | grep -iE "config|error" | head -20

# Use the built-in validation command if your build provides it
otelcol validate --config=/etc/otelcol/config.yaml
```
Solution:
Fix configuration syntax:
```yaml
# Correct OpenTelemetry Collector configuration structure
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:9090
  otlphttp:
    endpoint: http://loki:3100/otlp  # example log backend; adjust for yours

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]  # extensions must be listed here to run
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]  # memory_limiter should run first
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```
Common syntax fixes:
```yaml
# WRONG: missing receivers in pipeline
service:
  pipelines:
    traces:
      processors: [batch]
      exporters: [otlp]

# CORRECT: include receivers
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

# WRONG: invalid field name
processors:
  memory_limiter:
    limit: 512  # wrong field name

# CORRECT: use limit_mib
processors:
  memory_limiter:
    limit_mib: 512
```
Common Cause 2: Exporter Connection Failures
Exporters cannot reach destination backends.
Error pattern:
```
Exporter "otlp" failed to export items: connection refused
failed to push to endpoint: context deadline exceeded
```
Diagnosis:
```bash
# Check exporter metrics
curl -s http://localhost:8888/metrics | grep -E "otelcol_exporter.*failed|otelcol_exporter.*sent"

# Test backend connectivity (4317 is gRPC, so curl mainly confirms the port is reachable)
curl -v http://tempo:4317
curl -v http://prometheus:9090/-/healthy
curl -v http://loki:3100/ready

# Check from the Collector pod
kubectl exec -it otel-collector-pod -- curl -v http://tempo:4317

# Check exporter errors in logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "exporter|failed"

# Inspect the raw export counter; rate() is PromQL and must be run in Prometheus:
#   rate(otelcol_exporter_sent_spans_total[5m])
curl -s http://localhost:8888/metrics | grep otelcol_exporter_sent_spans_total
```
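Since `rate()` only exists in PromQL, you cannot grep it from the metrics endpoint; you can, however, approximate a per-second failure rate locally by diffing two scrapes. A minimal sketch, with illustrative helper names and the default internal-telemetry port assumed:

```shell
#!/usr/bin/env bash
# Sketch: approximate a counter rate from two scrapes of the Collector's
# own metrics endpoint.
scrape_counter() {
  # Sum every sample of the named counter, skipping HELP/TYPE lines
  curl -s http://localhost:8888/metrics \
    | awk -v m="$1" '$0 !~ /^#/ && $1 ~ m { sum += $NF } END { print sum + 0 }'
}

counter_rate() {
  # (later - earlier) / seconds, via awk so float counters work
  awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.2f\n", (b - a) / t }'
}

# Usage sketch:
# v1=$(scrape_counter otelcol_exporter_send_failed_spans_total); sleep 60
# v2=$(scrape_counter otelcol_exporter_send_failed_spans_total)
# counter_rate "$v1" "$v2" 60   # failed spans per second
```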
Solution:
Fix exporter connectivity:
```yaml
# OTLP exporter configuration
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
    timeout: 30s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    tls:
      insecure: true

  otlphttp/loki:
    endpoint: http://loki:3100/loki/api/v1/push
    tls:
      insecure: true
```

```bash
# Verify network connectivity
# For Kubernetes, check service endpoints
kubectl get endpoints tempo -n monitoring
kubectl get endpoints prometheus -n monitoring
```
Common Cause 3: Memory Limiter Exceeded
Memory limiter processor dropping data when memory limit hit.
Error pattern:
```
Memory limiter exceeded: dropping telemetry
otelcol_processor_refused_spans_total increasing
```
Diagnosis:
```bash
# Check memory limiter metrics
curl -s http://localhost:8888/metrics | grep otelcol_processor_memory_limiter

# Check refused items
curl -s http://localhost:8888/metrics | grep otelcol_processor_refused

# Monitor Collector memory usage
kubectl top pods -l app=otel-collector -n monitoring
ps aux | grep otelcol

# Check container memory limits
kubectl describe pod -l app=otel-collector -n monitoring | grep -A 5 -i memory

# Look for memory errors in logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "memory|limit"
```
Solution:
Adjust memory limits:
```yaml
# Memory limiter processor configuration
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024      # increase based on available memory
    spike_limit_mib: 256

# For container deployment, also set container limits
# (Kubernetes Deployment spec, separate file):
resources:
  limits:
    memory: 2Gi
  requests:
    memory: 1Gi

# Increase batch sizes if needed
processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
    send_batch_max_size: 2048

# If memory is still exceeded, consider sampling
processors:
  probabilistic_sampler:
    sampling_percentage: 50  # sample 50% of traces
```
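A quick way to pick the numbers: size `limit_mib` below the container limit and derive the spike limit from it. The 80%/25% ratios in this sketch are rule-of-thumb assumptions (leave headroom under the container limit; spike as a fraction of the soft limit), and the helper name is illustrative; tune for your workload.

```shell
#!/usr/bin/env bash
# Sketch: derive memory_limiter settings from a container memory limit
# in MiB. Ratios are assumptions, not official defaults.
suggest_memory_limiter() {
  local container_mib="$1"
  local limit=$(( container_mib * 80 / 100 ))  # headroom below container limit
  local spike=$(( limit * 25 / 100 ))          # spike as a fraction of limit
  echo "limit_mib: $limit"
  echo "spike_limit_mib: $spike"
}

suggest_memory_limiter 2048   # for a 2Gi container
```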
Common Cause 4: Receiver Protocol Issues
Receivers not accepting telemetry from sources.
Error pattern:
```
Receiver error: failed to accept connection
otlp receiver: grpc protocol error
```
Diagnosis:
```bash
# Check receiver metrics
curl -s http://localhost:8888/metrics | grep -E "otelcol_receiver.*accepted|otelcol_receiver.*refused"

# Check if receivers are listening
netstat -tlnp | grep 4317
netstat -tlnp | grep 4318

# Test receiver connectivity
grpcurl -plaintext localhost:4317 list
curl -v http://localhost:4318/v1/traces

# Check receiver errors
kubectl logs -l app=otel-collector -n monitoring | grep -iE "receiver|protocol"

# Inspect the raw acceptance counter; query the rate in Prometheus with
#   rate(otelcol_receiver_accepted_spans_total[5m])
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans_total
```
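On stripped-down container images, `netstat` and `grpcurl` are often missing. This sketch checks a listener with bash's `/dev/tcp` instead (a bash feature, not POSIX sh); the helper name is illustrative and the ports are the OTLP defaults.

```shell
#!/usr/bin/env bash
# Sketch: TCP listener check using bash's /dev/tcp pseudo-device.
port_open() {
  local host="$1" port="$2"
  # The subshell opens fd 3 to the target and closes it on exit
  if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# port_open localhost 4317   # OTLP gRPC receiver
# port_open localhost 4318   # OTLP HTTP receiver
```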
Solution:
Fix receiver configuration:
```yaml
# OTLP receiver configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
        keepalive:
          server_parameters:
            max_connection_age: 120s
            max_connection_age_grace: 30s
          enforcement_policy:
            min_time: 10s
            permit_without_stream: false
      http:
        endpoint: 0.0.0.0:4318
        max_request_body_size: 16777216  # 16 MiB, in bytes
        cors:
          allowed_origins:
            - "*"

# Jaeger receiver
receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
      thrift_binary:
        endpoint: 0.0.0.0:6832

# Kafka receiver
receivers:
  kafka:
    brokers: ["kafka:9092"]
    topic: "otel-spans"
    encoding: otlp_proto
    group_id: "otel-collector"
```
Common Cause 5: Pipeline Not Started
Pipelines fail to initialize due to missing components.
Error pattern:
```
Pipeline "traces" not started: no exporters
Service pipeline initialization failed
```
Diagnosis:
```bash
# Check pipeline status in logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "pipeline|initialized|started"

# Verify service configuration
grep -A 20 "service:" /etc/otelcol/config.yaml

# Check component registration
kubectl logs -l app=otel-collector -n monitoring --since=5m | grep -i registered

# Check metrics for pipeline activity
curl -s http://localhost:8888/metrics | grep otelcol_processor
```
Solution:
Fix pipeline definition:
```yaml
# Complete pipeline configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

# The service section MUST reference defined components
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]    # must be defined in the receivers section
      processors: [batch]  # must be defined in the processors section
      exporters: [otlp]    # must be defined in the exporters section

# Each component name in a pipeline must match exactly.
# Common mistake - a typo in the component name:
#   receivers: [otlp_receivr]    # WRONG: typo
#   exporters: [otlp_exporter]   # WRONG: name doesn't match any exporter
```
Common Cause 6: Batch Processor Timeout
Batch processor holding data too long before sending.
Error pattern:
```
Batch processor timeout exceeded
Data delayed in batch processor
```
Diagnosis:
```bash
# Check batch processor metrics
curl -s http://localhost:8888/metrics | grep otelcol_processor_batch

# Monitor batch sizes
curl -s http://localhost:8888/metrics | grep otelcol_processor_batch_batch_send_size

# Check how often the timeout (vs. batch size) triggers a send
curl -s http://localhost:8888/metrics | grep otelcol_processor_batch_timeout_trigger_send

# Check for timeout issues in logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "batch|timeout"
```
Solution:
Optimize batch processor:
```yaml
processors:
  batch:
    # Balance between throughput and latency
    timeout: 5s                # max wait before sending a batch
    send_batch_size: 512       # send when this many items accumulate
    send_batch_max_size: 1024  # never exceed this size

# For high volume with acceptable latency
processors:
  batch:
    timeout: 30s
    send_batch_size: 10000
    send_batch_max_size: 20000

# For low-latency requirements
processors:
  batch:
    timeout: 1s
    send_batch_size: 100
    send_batch_max_size: 200
```
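To reason about these settings: a batch is sent when `send_batch_size` items accumulate or `timeout` elapses, whichever comes first, so the effective flush interval is the smaller of the timeout and the time to fill a batch at your ingest rate. A sketch of that arithmetic (the helper name is illustrative):

```shell
#!/usr/bin/env bash
# Sketch: estimate how often batches actually flush, given an incoming
# item rate, the configured batch size, and the configured timeout.
effective_flush_seconds() {
  local rate_per_sec="$1" batch_size="$2" timeout_sec="$3"
  awk -v r="$rate_per_sec" -v s="$batch_size" -v t="$timeout_sec" \
    'BEGIN { fill = s / r; print (fill < t ? fill : t) }'
}

effective_flush_seconds 1000 512 5  # high volume: size fills first
effective_flush_seconds 10 512 5    # low volume: timeout fires first
```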
Common Cause 7: Processor Transform Errors
Transform processor failing due to invalid operations.
Error pattern:
```
Transform processor error: invalid attribute operation
```
Diagnosis:
```bash
# Check processor metrics
curl -s http://localhost:8888/metrics | grep otelcol_processor

# Check transform errors
kubectl logs -l app=otel-collector -n monitoring | grep -iE "transform|processor"
```

To inspect the data itself, route the pipeline through the debug exporter:

```yaml
exporters:
  debug:
    verbosity: detailed
```
Solution:
Fix transform processor configuration:
```yaml
processors:
  # Correct attribute transformations
  attributes:
    actions:
      - key: environment
        value: production
        action: insert
      - key: deployment.environment
        from_attribute: environment
        action: upsert
      - key: sensitive.data
        action: delete

  # Transform processor for trace transformations
  transform:
    trace_statements:
      - context: span
        statements:
          - set(attributes["service.name"], "my-service")
          - keep_keys(attributes, ["service.name", "operation"])
          - replace_match(attributes["http.url"], "http://*", "https://*")

  # Resource processor
  resource:
    attributes:
      - key: k8s.cluster.name
        value: production-cluster
        action: insert
      - key: service.instance.id
        from_attribute: pod.name
        action: upsert
```
Common Cause 8: TLS/Certificate Issues
TLS configuration problems blocking secure connections.
Error pattern:
```
TLS error: certificate verify failed
x509: certificate signed by unknown authority
```
Diagnosis:
```bash
# Test TLS connection
openssl s_client -connect tempo:4317 -showcerts

# Check certificate validity (skip verification to inspect; 4318 is the HTTP port)
curl -kv https://tempo:4318/v1/traces

# Check Collector TLS configuration
kubectl get configmap otel-collector-config -n monitoring -o yaml | grep -A 10 tls

# Look for certificate errors in logs
kubectl logs -l app=otel-collector -n monitoring | grep -iE "cert|tls|x509"
```
Solution:
Configure TLS properly:
```yaml
# Exporter with TLS
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: false
      ca_file: /etc/otelcol/certs/ca.crt
      cert_file: /etc/otelcol/certs/client.crt
      key_file: /etc/otelcol/certs/client.key
      min_version: "1.2"

# For testing only: disable TLS or skip verification
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
      insecure_skip_verify: true

# Receiver TLS configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol/certs/server.crt
          key_file: /etc/otelcol/certs/server.key
```
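Expired certificates are a common cause of sudden `x509` failures, so it is worth checking expiry before rotating anything. A sketch that reports days remaining on a certificate file (uses GNU `date -d`; the helper name and path in the usage line are examples):

```shell
#!/usr/bin/env bash
# Sketch: days until a certificate file expires, so rotation can happen
# before exporter handshakes start failing.
cert_days_left() {
  local cert="$1" end now
  # openssl prints "notAfter=<date>"; convert that date to epoch seconds
  end=$(date -d "$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)" +%s)
  now=$(date +%s)
  echo $(( (end - now) / 86400 ))
}

# cert_days_left /etc/otelcol/certs/ca.crt
```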
Verification
After fixing Collector issues:
```bash
# Check Collector is healthy
curl -s http://localhost:13133/
curl -s http://localhost:8888/metrics | grep otelcol_process_uptime

# Verify receivers are accepting
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted

# Verify exporters are sending
curl -s http://localhost:8888/metrics | grep otelcol_exporter_sent

# Check refused items (should be zero or minimal)
curl -s http://localhost:8888/metrics | grep otelcol_processor_refused

# Test the telemetry flow: send a test trace
# (trace IDs must be 32 hex characters, span IDs 16 hex characters)
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"test"}}]},"scopeSpans":[{"spans":[{"traceId":"5b8efff798038103d269b633813fc60c","spanId":"eee19b7ec3c1b174","name":"test-operation"}]}]}]}'

# Verify the backend received it
curl -s http://tempo:3200/api/traces/5b8efff798038103d269b633813fc60c | jq '.'

# Use zpages for diagnostics:
# navigate to http://localhost:55679/debug/servicez and check pipeline status
```
Prevention
Monitor Collector health:
```yaml
groups:
  - name: otel_collector_health
    rules:
      - alert: OpenTelemetryCollectorDown
        expr: up{job="otel-collector"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "OpenTelemetry Collector is down"

      - alert: OpenTelemetryCollectorExporterErrors
        expr: rate(otelcol_exporter_send_failed_spans_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OpenTelemetry Collector exporter errors"

      - alert: OpenTelemetryCollectorMemoryLimit
        expr: increase(otelcol_processor_refused_spans_total{processor="memory_limiter"}[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "OpenTelemetry Collector memory limiter dropping data"

      - alert: OpenTelemetryCollectorReceiverErrors
        expr: rate(otelcol_receiver_refused_spans_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OpenTelemetry Collector receiver refusing spans"
```
Regular Collector health check:
```bash
#!/bin/bash
# otelcol-health.sh

# Check health endpoint
curl -s http://localhost:13133/ > /dev/null && echo "Health OK" || echo "Health FAILED"

# Check refused metrics (sum across processors, skip HELP/TYPE lines)
REFUSED=$(curl -s http://localhost:8888/metrics | \
  awk '$0 !~ /^#/ && /otelcol_processor_refused_spans_total/ { sum += $NF } END { printf "%d\n", sum }')

if [ "$REFUSED" -gt 0 ]; then
  echo "WARNING: $REFUSED spans refused"
fi

# Check export errors
FAILED=$(curl -s http://localhost:8888/metrics | \
  awk '$0 !~ /^#/ && /otelcol_exporter_send_failed_spans_total/ { sum += $NF } END { printf "%d\n", sum }')

if [ "$FAILED" -gt 0 ]; then
  echo "WARNING: $FAILED spans failed to export"
fi

# Check uptime
UPTIME=$(curl -s http://localhost:8888/metrics | \
  awk '$0 !~ /^#/ && /otelcol_process_uptime/ { print $NF; exit }')
echo "Collector uptime: $UPTIME seconds"
```
OpenTelemetry Collector errors typically stem from configuration syntax, network connectivity, or resource limits. Validate configuration first, check receiver/exporter connectivity, and monitor memory usage to ensure reliable telemetry collection.