Thanos Query is returning errors when you try to query historical metrics, or the Thanos UI shows unavailable stores. Thanos extends Prometheus for long-term storage, but configuration issues can prevent queries from working properly.
Understanding Thanos Architecture
Thanos consists of several components:
- Sidecar: Connects to Prometheus, uploads data to object storage
- Store Gateway: Serves historical data from object storage
- Query: Unified query interface across all data sources
- Compactor: Downsamples and compacts historical data
- Receive: Receives remote writes (optional)
Common error patterns:
- store gateway unavailable: connection refused
- error querying store: context deadline exceeded
- partial response: no stores matched query
- sidecar not reachable: Prometheus not responding
Initial Diagnosis
Check Thanos Query status and connected stores:
```bash
# Check Thanos Query logs (grep needs -E for alternation)
kubectl logs -l app=thanos-query -n monitoring | grep -iE "error|warn|fail"

# Or for a direct installation
journalctl -u thanos-query -f | grep -i "error"

# Check the Query stores endpoint
curl -s http://thanos-query:19192/api/v1/stores | jq '.'

# Check which stores are healthy
curl -s http://thanos-query:19192/api/v1/stores | jq '.[] | {store: .store, health: .health}'

# Query Prometheus directly to compare
curl -s 'http://prometheus:9090/api/v1/query?query=up'

# Query Thanos to compare
curl -s 'http://thanos-query:19192/api/v1/query?query=up'

# Check Thanos component status
kubectl get pods -l app=thanos -n monitoring
kubectl get svc -l app=thanos -n monitoring
```
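Once you have the store list, the unhealthy entries can be filtered out mechanically. A minimal sketch, assuming the stores response has already been reduced to "store health" pairs (the sample listing below is hand-written, not real Thanos output):

```shell
# Flag stores that are not reporting healthy. The listing stands in for
# output like:
#   curl -s http://thanos-query:19192/api/v1/stores | jq -r '.[] | "\(.store) \(.health)"'
stores='thanos-sidecar:10905 UP
thanos-store:10905 DOWN'

# Print only the first column of rows whose health column is not UP
unhealthy=$(printf '%s\n' "$stores" | awk '$2 != "UP" {print $1}')
[ -n "$unhealthy" ] && echo "Unhealthy stores: $unhealthy"
```

Any store printed here is the one to investigate first in the causes below.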
Common Cause 1: Store Gateway Unavailable
Store Gateway serves historical data but is not reachable.
Error pattern:
```
store gateway unavailable: dial tcp 10.0.0.5:10905: connection refused
```
Diagnosis:
```bash
# Check the Store Gateway service
kubectl get svc thanos-store -n monitoring
kubectl get endpoints thanos-store -n monitoring

# Check Store Gateway pods
kubectl get pods -l app=thanos-store -n monitoring
kubectl logs -l app=thanos-store -n monitoring | grep -iE "error|start"

# Test the Store Gateway probe endpoints (served on the HTTP port, not gRPC)
curl -v http://thanos-store:10902/-/ready
curl -v http://thanos-store:10902/-/healthy

# Check the Store Gateway gRPC port
nc -zv thanos-store 10905

# Check the object storage connection from the Store Gateway
kubectl logs -l app=thanos-store -n monitoring | grep -iE "bucket|storage|object"
```
Solution:
Fix Store Gateway connectivity:
```bash
# Check the Store Gateway configuration
kubectl describe pod -l app=thanos-store -n monitoring

# Verify Store Gateway arguments
kubectl get pod -l app=thanos-store -n monitoring -o yaml | grep -A 20 args

# Ensure the Store Gateway is configured correctly.
# Typical Store Gateway arguments:
thanos store \
  --data-dir=/data \
  --objstore.config-file=/etc/thanos/bucket.yaml \
  --grpc-address=0.0.0.0:10905 \
  --http-address=0.0.0.0:10902 \
  --index-cache.config-file=/etc/thanos/index-cache.yaml

# Restart the Store Gateway if needed
kubectl rollout restart deployment/thanos-store -n monitoring

# Wait for ready
kubectl rollout status deployment/thanos-store -n monitoring
```
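After a restart it is more robust to poll readiness in a bounded loop than to sleep a fixed time. A small sketch of a generic retry helper; the endpoint URL in the comment is illustrative:

```shell
# Retry a command up to $1 times with $2 seconds between attempts.
# Usage against the Store Gateway readiness endpoint might look like:
#   retry 30 2 curl -sf http://thanos-store:10902/-/ready
retry() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0            # success: stop retrying
    i=$((i + 1))
    sleep "$delay"
  done
  return 1                       # exhausted all attempts
}
```

The helper returns 0 as soon as the wrapped command succeeds, so it can gate subsequent steps in a recovery script.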
Common Cause 2: Object Storage Access Issues
Thanos components cannot access the object storage bucket.
Error pattern:
```
bucket not accessible: NoSuchBucket
error accessing bucket: AccessDenied
```

Diagnosis:
```bash
# Check the bucket configuration
kubectl get secret thanos-bucket-config -n monitoring -o jsonpath='{.data.bucket\.yaml}' | base64 -d

# Test bucket access manually
# For S3
aws s3 ls s3://thanos-bucket/
aws s3api head-bucket --bucket thanos-bucket

# For GCS
gsutil ls gs://thanos-bucket/

# Check credentials
kubectl describe secret thanos-storage-secret -n monitoring

# Look for storage errors in logs
kubectl logs -l app=thanos-store -n monitoring | grep -iE "bucket|access|denied|error"
```
Solution:
Fix object storage configuration:
```yaml
# bucket.yaml for S3
type: S3
config:
  bucket: thanos-bucket
  endpoint: s3.amazonaws.com
  region: us-east-1
  access_key: AKIAIOSFODNN7EXAMPLE
  secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  insecure: false
  signature_version2: false
```

```yaml
# bucket.yaml for GCS
type: GCS
config:
  bucket: thanos-bucket
  service_account: /etc/thanos/gcs-credentials.json
```

```yaml
# bucket.yaml for Azure
type: AZURE
config:
  storage_account_name: thanosstorage
  storage_account_key: base64-encoded-key
  container: thanos-container
```

```bash
# Apply the corrected configuration
kubectl create secret generic thanos-bucket-config \
  --from-file=bucket.yaml=./bucket.yaml \
  -n monitoring --dry-run=client -o yaml | kubectl apply -f -
```
Verify IAM permissions for S3:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::thanos-bucket",
        "arn:aws:s3:::thanos-bucket/*"
      ]
    }
  ]
}
```
Common Cause 3: Sidecar Not Connected to Prometheus
Thanos Sidecar cannot reach its Prometheus instance.
Error pattern:
```
sidecar not reachable: Prometheus at localhost:9090 not responding
```
Diagnosis:
```bash
# Check Sidecar pod status
kubectl get pods -l app=thanos-sidecar -n monitoring

# Check Sidecar logs
kubectl logs -l app=thanos-sidecar -n monitoring | grep -iE "prometheus|error"

# Verify the Sidecar runs in the same pod as Prometheus (or can reach it)
kubectl describe pod prometheus-server-0 -n monitoring | grep -A 20 "Containers"

# Test the Prometheus connection from the Sidecar
kubectl exec -it prometheus-server-0 -c thanos-sidecar -n monitoring -- \
  curl http://localhost:9090/-/healthy

# Check Sidecar arguments
kubectl get pod prometheus-server-0 -n monitoring \
  -o jsonpath='{.spec.containers[?(@.name=="thanos-sidecar")].args}'
```
Solution:
Fix Sidecar configuration:
```yaml
# Prometheus pod with Sidecar
containers:
  - name: prometheus
    image: prom/prometheus:v2.40.0
    args:
      - --storage.tsdb.path=/data
      - --storage.tsdb.retention.time=24h
      - --storage.tsdb.min-block-duration=2h
      - --storage.tsdb.max-block-duration=2h
      - --web.enable-lifecycle
  - name: thanos-sidecar
    image: thanosio/thanos:v0.31.0
    args:
      - sidecar
      - --prometheus.url=http://localhost:9090
      - --grpc-address=0.0.0.0:10905
      - --http-address=0.0.0.0:10902
      - --objstore.config-file=/etc/thanos/bucket.yaml
      - --tsdb.path=/data
    volumeMounts:
      - name: prometheus-data
        mountPath: /data
      - name: thanos-config
        mountPath: /etc/thanos
```
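One sidecar-specific invariant worth checking: the Sidecar can only upload complete blocks when Prometheus's local compaction is disabled, which is why min-block-duration and max-block-duration are set equal. A sketch that verifies this from an args list (the list is inlined here for illustration; in practice you would feed it the output of the jsonpath query above):

```shell
# Verify min and max TSDB block durations match; if they differ, Prometheus
# compacts blocks locally and the Sidecar cannot upload them reliably.
args='--storage.tsdb.path=/data
--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=2h'

# Extract the value after each flag's "=" sign
min=$(printf '%s\n' "$args" | sed -n 's/^--storage\.tsdb\.min-block-duration=//p')
max=$(printf '%s\n' "$args" | sed -n 's/^--storage\.tsdb\.max-block-duration=//p')

if [ "$min" = "$max" ]; then
  echo "OK: local compaction disabled (block duration $min)"
else
  echo "WARNING: min ($min) != max ($max); Sidecar uploads will be unreliable"
fi
```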
Ensure Sidecar and Prometheus share storage:

```yaml
volumes:
  - name: prometheus-data
    emptyDir: {} # Or a persistent volume
  - name: thanos-config
    secret:
      secretName: thanos-bucket-config
```
Common Cause 4: Query Timeout Issues
Long-running queries exceed timeout limits.
Error pattern:
```
context deadline exceeded: query timeout
```
Diagnosis:
```bash
# Check the current timeout configuration
kubectl get deployment thanos-query -n monitoring -o yaml | grep -i timeout

# Time query execution
time curl -s 'http://thanos-query:19192/api/v1/query_range?query=up&start=1640000000&end=1640003600&step=60s'

# Check Query logs for timeout errors
kubectl logs -l app=thanos-query -n monitoring | grep -iE "timeout|deadline"

# Spot-check query responses
curl -s 'http://thanos-query:19192/api/v1/query?query=up' | jq '.status, .data.result'
```
Solution:
Adjust timeout settings:
```bash
# Thanos Query configuration
thanos query \
  --grpc-address=0.0.0.0:10905 \
  --http-address=0.0.0.0:19192 \
  --store=thanos-sidecar:10905 \
  --store=thanos-store:10905 \
  --query.timeout=5m \
  --query.lookback-delta=15m \
  --query.max-concurrent=20
# --query.timeout raises the query deadline; --query.max-concurrent
# controls how many queries run in parallel.

# For large responses from the Store Gateway, raise the gRPC message limits
thanos store \
  --grpc.grpc-max-send-msg-size=100MB \
  --grpc.grpc-max-recv-msg-size=100MB
```
Common Cause 5: Missing Store Endpoints
Query is not configured with all necessary stores.
Error pattern:
```
partial response: no stores matched query
```
Diagnosis:
```bash
# Check the current store configuration
kubectl get deployment thanos-query -n monitoring -o yaml | grep -A 5 "store"

# List configured stores via the API
curl -s http://thanos-query:19192/api/v1/stores | jq '.[].store'

# Check if stores match the query time range
curl -s 'http://thanos-query:19192/api/v1/stores' | jq '.[] | {store: .store, minTime: .minTime, maxTime: .maxTime}'

# Query a time range that should have data
curl -s 'http://thanos-query:19192/api/v1/query_range?query=up&start=1640000000&end=1640086400&step=3600s'
```
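A "no stores matched query" response often just means the requested window falls outside every store's advertised minTime/maxTime. The overlap arithmetic can be sketched as follows; the values are made up, and note that store times are reported in milliseconds while query parameters are in seconds:

```shell
# Does the store's advertised range cover the query window?
store_min_ms=1630000000000   # from /api/v1/stores (milliseconds)
store_max_ms=1650000000000
query_start_s=1640000000     # query_range start/end (seconds)
query_end_s=1640086400

covered=no
# Scale the query window to milliseconds before comparing
if [ $((query_start_s * 1000)) -ge "$store_min_ms" ] && \
   [ $((query_end_s * 1000)) -le "$store_max_ms" ]; then
  covered=yes
fi
echo "store covers query range: $covered"
```

If no store covers the window, either the data was never uploaded or a store serving that period is missing from the Query configuration.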
Solution:
Configure all necessary stores:
```bash
# Add all stores to the Query configuration
thanos query \
  --store=thanos-sidecar-0:10905 \
  --store=thanos-sidecar-1:10905 \
  --store=thanos-store:10905 \
  --store=thanos-receive:10905 \
  --query.auto-downsampling # Enable auto-downsampling for long ranges

# For Kubernetes, edit the Deployment and add one --store=<host:port>
# flag per store under the container's args
kubectl edit deployment/thanos-query -n monitoring
```
Add ServiceDiscovery for stores:
```bash
# Use DNS-based store discovery
thanos query \
  --store=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc.cluster.local \
  --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local \
  --store=dnssrv+_grpc._tcp.thanos-receive.monitoring.svc.cluster.local

# This automatically discovers all matching services
```
Common Cause 6: Compactor Issues
Compactor not running or failing to compact blocks.
Error pattern:
```
compactor failed: error compacting blocks
```
Diagnosis:
```bash
# Check Compactor status
kubectl get pods -l app=thanos-compactor -n monitoring

# Check Compactor logs
kubectl logs -l app=thanos-compactor -n monitoring | grep -iE "error|fail|compact"

# Check block metadata in the bucket (S3)
aws s3 ls s3://thanos-bucket/ --recursive | grep meta

# Check for downsampling activity
kubectl logs -l app=thanos-compactor -n monitoring | grep -i "downsample"

# Verify the Compactor holds an exclusive lock on the bucket
kubectl logs -l app=thanos-compactor -n monitoring | grep -i "lock"
```
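When inspecting individual blocks, the block's meta.json tells you how far compaction has progressed: compaction.level increases as the Compactor merges blocks, and thanos.downsample.resolution is 0 for raw data, 300000 (5m) or 3600000 (1h) for downsampled blocks. A sketch reading these fields from a sample meta.json (the content is inlined and illustrative):

```shell
# Inspect key fields of a block's meta.json without jq.
meta='{"version":1,"compaction":{"level":2},"thanos":{"downsample":{"resolution":300000}}}'

# Pull the numeric values out with sed capture groups
level=$(printf '%s' "$meta" | sed -n 's/.*"level":\([0-9]*\).*/\1/p')
resolution=$(printf '%s' "$meta" | sed -n 's/.*"resolution":\([0-9]*\).*/\1/p')

echo "compaction level: $level, downsample resolution: ${resolution}ms"
```

Blocks stuck at level 1 with resolution 0 long after their time range closed suggest the Compactor is not processing them.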
Solution:
Fix Compactor configuration:
```bash
# The Compactor must run as exactly one instance per bucket; it does not
# perform leader election, so concurrent instances can corrupt blocks.
thanos compact \
  --data-dir=/data \
  --objstore.config-file=/etc/thanos/bucket.yaml \
  --http-address=0.0.0.0:10902 \
  --wait \
  --wait-interval=5m
# Downsampling is enabled by default (turn it off only with
# --downsampling.disable); deduplication is configured via
# --deduplication.replica-label=<label>.
```

```yaml
# Kubernetes: run exactly one replica
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-compactor
spec:
  replicas: 1 # Must be exactly 1
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
        - name: thanos-compactor
          image: thanosio/thanos:v0.31.0
          args:
            - compact
            - --wait
            - --objstore.config-file=/etc/thanos/bucket.yaml
```
Common Cause 7: Time Range Gaps
Data exists but has gaps in time series.
Error pattern:
```
no data for queried time range
```
Diagnosis:
```bash
# Check the available time ranges in stores
curl -s 'http://thanos-query:19192/api/v1/stores' | \
  jq '.[] | select(.store | contains("store")) | {minTime: .minTime, maxTime: .maxTime}'

# Convert timestamps to readable dates
# minTime and maxTime are milliseconds since epoch; divide by 1000 first
date -d @1640000000

# Check block coverage in object storage
aws s3 ls s3://thanos-bucket/ --recursive | \
  awk '{print $4}' | grep meta | sort

# Query a specific time range
curl -s 'http://thanos-query:19192/api/v1/query_range?query=up&start=1640000000&end=1640086400&step=3600s' | \
  jq '.data.result[].values | length'
```
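Since minTime/maxTime come back in milliseconds, a tiny helper saves the repeated mental division (assumes GNU date):

```shell
# Convert a Thanos millisecond timestamp to a readable UTC date.
ms_to_date() {
  # Shell integer division drops the millisecond remainder
  date -u -d "@$(( $1 / 1000 ))" +'%Y-%m-%dT%H:%M:%SZ'
}

ms_to_date 1640000000000 # → 2021-12-20T11:33:20Z
```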
Solution:
Fill gaps or adjust query time range:
```bash
# Check whether data exists in raw blocks:
# download and inspect a block's metadata
aws s3 cp s3://thanos-bucket/01ABC123/meta.json ./meta.json
cat meta.json

# Run the Compactor to fill in missing downsampling
kubectl rollout restart deployment/thanos-compactor -n monitoring

# Query with an appropriate resolution
# For long time ranges, use a lower resolution (5m or 1h steps)
curl -s 'http://thanos-query:19192/api/v1/query_range?query=rate(cpu[5m])&start=1640000000&end=1640003600&step=300s'

# Enable --query.auto-downsampling on Query for automatic resolution selection
```
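For long ranges the step should scale with the window so the number of returned points stays bounded. A sketch of that calculation; the ~250-point target and the 60-second floor are illustrative choices, not Thanos defaults:

```shell
# Pick a query_range step (in seconds) targeting roughly 250 points per
# series, with a 60 s floor so short windows don't produce sub-minute steps.
pick_step() {
  range_s=$1
  step=$(( range_s / 250 ))
  [ "$step" -lt 60 ] && step=60
  echo "$step"
}

pick_step 3600    # 1 hour → 60
pick_step 604800  # 1 week → 2419
```

The computed value plugs directly into the `step=` parameter of the query_range calls above.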
Verification
After fixing, verify Thanos is working:
```bash
# Check all stores are healthy
curl -s http://thanos-query:19192/api/v1/stores | jq '.[] | {store, health}'

# Query the last 24 hours (the API expects Unix or RFC 3339 timestamps,
# not relative values like -24h)
curl -s "http://thanos-query:19192/api/v1/query_range?query=up&start=$(date -d '24 hours ago' +%s)&end=$(date +%s)&step=5m" | \
  jq '.data.result[].values | length'

# Compare Prometheus and Thanos results; they should match for recent data
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result'
curl -s 'http://thanos-query:19192/api/v1/query?query=up' | jq '.data.result'

# Check the Thanos UI at http://thanos-query:19192
# - Stores tab: all expected stores connected
# - Query tab: queries return results

# Check object storage for recent uploads
aws s3 ls s3://thanos-bucket/ --recursive | tail -20
```
Prevention
Monitor Thanos components:
```yaml
groups:
  - name: thanos_health
    rules:
      - alert: ThanosQueryStoreUnavailable
        expr: thanos_store_unavailable > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Thanos Query cannot reach stores"
      - alert: ThanosCompactorNotRunning
        expr: absent(thanos_compactor_up) == 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Thanos Compactor is not running"
      - alert: ThanosSidecarUploadFailure
        expr: rate(thanos_sidecar_upload_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Thanos Sidecar failing to upload blocks"
      - alert: ThanosBucketOperationError
        expr: rate(thanos_objstore_bucket_operation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Thanos bucket operations failing"
```
Thanos query errors usually stem from store gateway connectivity, object storage access, or configuration issues. Check store health first, then verify bucket access and ensure all components are properly configured to communicate.