Thanos Query is returning errors when you try to query historical metrics, or the Thanos UI shows unavailable stores. Thanos extends Prometheus for long-term storage, but configuration issues can prevent queries from working properly.
Understanding Thanos Architecture
Thanos consists of several components:
- Sidecar: Connects to Prometheus, uploads data to object storage
- Store Gateway: Serves historical data from object storage
- Query: Unified query interface across all data sources
- Compactor: Downsamples and compacts historical data
- Receive: Receives remote writes (optional)
Common error patterns:
- store gateway unavailable: connection refused
- error querying store: context deadline exceeded
- partial response: no stores matched query
- sidecar not reachable: Prometheus not responding
Initial Diagnosis
Check Thanos Query status and connected stores:
```bash
# Check Thanos Query logs (grep needs -E for alternation)
kubectl logs -l app=thanos-query -n monitoring | grep -iE "error|warn|fail"

# Or for a direct installation
journalctl -u thanos-query -f | grep -i "error"

# Check the Query stores endpoint
curl -s http://thanos-query:19192/api/v1/stores | jq '.'

# Check which stores are healthy
curl -s http://thanos-query:19192/api/v1/stores | jq '.[] | {store: .store, health: .health}'

# Query Prometheus directly to compare
curl -s 'http://prometheus:9090/api/v1/query?query=up'

# Query Thanos to compare
curl -s 'http://thanos-query:19192/api/v1/query?query=up'

# Check Thanos component status
kubectl get pods -l app=thanos -n monitoring
kubectl get svc -l app=thanos -n monitoring
```
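Once you have the store list, the unhealthy entries can be filtered out mechanically. A minimal sketch, assuming the stores response has already been reduced to "store health" pairs (the sample listing below is hand-written, not real Thanos output):

```shell
# Flag stores that are not reporting healthy. The listing stands in for
# output like:
#   curl -s http://thanos-query:19192/api/v1/stores | jq -r '.[] | "\(.store) \(.health)"'
stores='thanos-sidecar:10905 UP
thanos-store:10905 DOWN'

# Print only the first column of rows whose health column is not UP
unhealthy=$(printf '%s\n' "$stores" | awk '$2 != "UP" {print $1}')
[ -n "$unhealthy" ] && echo "Unhealthy stores: $unhealthy"
```

Any store printed here is the one to investigate first in the causes below.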
Common Cause 1: Store Gateway Unavailable
Store Gateway serves historical data but is not reachable.
Error pattern:
```
store gateway unavailable: dial tcp 10.0.0.5:10905: connection refused
```
Diagnosis:
```bash
# Check the Store Gateway service
kubectl get svc thanos-store -n monitoring
kubectl get endpoints thanos-store -n monitoring

# Check Store Gateway pods
kubectl get pods -l app=thanos-store -n monitoring
kubectl logs -l app=thanos-store -n monitoring | grep -iE "error|start"

# Test the Store Gateway probe endpoints (served on the HTTP port, not gRPC)
curl -v http://thanos-store:10902/-/ready
curl -v http://thanos-store:10902/-/healthy

# Check the Store Gateway gRPC port
nc -zv thanos-store 10905

# Check the object storage connection from the Store Gateway
kubectl logs -l app=thanos-store -n monitoring | grep -iE "bucket|storage|object"
```
Solution:
Fix Store Gateway connectivity:
```bash
# Check the Store Gateway configuration
kubectl describe pod -l app=thanos-store -n monitoring

# Verify Store Gateway arguments
kubectl get pod -l app=thanos-store -n monitoring -o yaml | grep -A 20 args

# Ensure the Store Gateway is configured correctly.
# Typical Store Gateway arguments:
thanos store \
  --data-dir=/data \
  --objstore.config-file=/etc/thanos/bucket.yaml \
  --grpc-address=0.0.0.0:10905 \
  --http-address=0.0.0.0:10902 \
  --index-cache.config-file=/etc/thanos/index-cache.yaml

# Restart the Store Gateway if needed
kubectl rollout restart deployment/thanos-store -n monitoring

# Wait for ready
kubectl rollout status deployment/thanos-store -n monitoring
```
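After a restart it is more robust to poll readiness in a bounded loop than to sleep a fixed time. A small sketch of a generic retry helper; the endpoint URL in the comment is illustrative:

```shell
# Retry a command up to $1 times with $2 seconds between attempts.
# Usage against the Store Gateway readiness endpoint might look like:
#   retry 30 2 curl -sf http://thanos-store:10902/-/ready
retry() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0            # success: stop retrying
    i=$((i + 1))
    sleep "$delay"
  done
  return 1                       # exhausted all attempts
}
```

The helper returns 0 as soon as the wrapped command succeeds, so it can gate subsequent steps in a recovery script.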
Common Cause 2: Object Storage Access Issues
Thanos components cannot access the object storage bucket.
Error pattern:
```
bucket not accessible: NoSuchBucket
error accessing bucket: AccessDenied
```

Diagnosis:
```bash
# Check the bucket configuration
kubectl get secret thanos-bucket-config -n monitoring -o jsonpath='{.data.bucket\.yaml}' | base64 -d

# Test bucket access manually
# For S3
aws s3 ls s3://thanos-bucket/
aws s3api head-bucket --bucket thanos-bucket

# For GCS
gsutil ls gs://thanos-bucket/

# Check credentials
kubectl describe secret thanos-storage-secret -n monitoring

# Look for storage errors in logs
kubectl logs -l app=thanos-store -n monitoring | grep -iE "bucket|access|denied|error"
```
Solution:
Fix object storage configuration:
```yaml
# bucket.yaml for S3
type: S3
config:
  bucket: thanos-bucket
  endpoint: s3.amazonaws.com
  region: us-east-1
  access_key: AKIAIOSFODNN7EXAMPLE
  secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  insecure: false
  signature_version2: false
```

```yaml
# bucket.yaml for GCS
type: GCS
config:
  bucket: thanos-bucket
  service_account: /etc/thanos/gcs-credentials.json
```

```yaml
# bucket.yaml for Azure
type: AZURE
config:
  storage_account_name: thanosstorage
  storage_account_key: base64-encoded-key
  container: thanos-container
```

```bash
# Apply the corrected configuration
kubectl create secret generic thanos-bucket-config \
  --from-file=bucket.yaml=./bucket.yaml \
  -n monitoring --dry-run=client -o yaml | kubectl apply -f -
```
Verify IAM permissions for S3:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::thanos-bucket",
        "arn:aws:s3:::thanos-bucket/*"
      ]
    }
  ]
}
```
Common Cause 3: Sidecar Not Connected to Prometheus
Thanos Sidecar cannot reach its Prometheus instance.
Error pattern:
```
sidecar not reachable: Prometheus at localhost:9090 not responding
```
Diagnosis:
```bash
# Check Sidecar pod status
kubectl get pods -l app=thanos-sidecar -n monitoring

# Check Sidecar logs
kubectl logs -l app=thanos-sidecar -n monitoring | grep -iE "prometheus|error"

# Verify the Sidecar runs in the same pod as Prometheus (or can reach it)
kubectl describe pod prometheus-server-0 -n monitoring | grep -A 20 "Containers"

# Test the Prometheus connection from the Sidecar
kubectl exec -it prometheus-server-0 -c thanos-sidecar -n monitoring -- \
  curl http://localhost:9090/-/healthy

# Check Sidecar arguments
kubectl get pod prometheus-server-0 -n monitoring \
  -o jsonpath='{.spec.containers[?(@.name=="thanos-sidecar")].args}'
```
Solution:
Fix Sidecar configuration:
```yaml
# Prometheus pod with Sidecar
containers:
  - name: prometheus
    image: prom/prometheus:v2.40.0
    args:
      - --storage.tsdb.path=/data
      - --storage.tsdb.retention.time=24h
      - --storage.tsdb.min-block-duration=2h
      - --storage.tsdb.max-block-duration=2h
      - --web.enable-lifecycle
  - name: thanos-sidecar
    image: thanosio/thanos:v0.31.0
    args:
      - sidecar
      - --prometheus.url=http://localhost:9090
      - --grpc-address=0.0.0.0:10905
      - --http-address=0.0.0.0:10902
      - --objstore.config-file=/etc/thanos/bucket.yaml
      - --tsdb.path=/data
    volumeMounts:
      - name: prometheus-data
        mountPath: /data
      - name: thanos-config
        mountPath: /etc/thanos
```
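One sidecar-specific invariant worth checking: the Sidecar can only upload complete blocks when Prometheus's local compaction is disabled, which is why min-block-duration and max-block-duration are set equal. A sketch that verifies this from an args list (the list is inlined here for illustration; in practice you would feed it the output of the jsonpath query above):

```shell
# Verify min and max TSDB block durations match; if they differ, Prometheus
# compacts blocks locally and the Sidecar cannot upload them reliably.
args='--storage.tsdb.path=/data
--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=2h'

# Extract the value after each flag's "=" sign
min=$(printf '%s\n' "$args" | sed -n 's/^--storage\.tsdb\.min-block-duration=//p')
max=$(printf '%s\n' "$args" | sed -n 's/^--storage\.tsdb\.max-block-duration=//p')

if [ "$min" = "$max" ]; then
  echo "OK: local compaction disabled (block duration $min)"
else
  echo "WARNING: min ($min) != max ($max); Sidecar uploads will be unreliable"
fi
```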
Ensure Sidecar and Prometheus share storage:

```yaml
volumes:
  - name: prometheus-data
    emptyDir: {} # Or a persistent volume
  - name: thanos-config
    secret:
      secretName: thanos-bucket-config
```
Common Cause 4: Query Timeout Issues
Long-running queries exceed timeout limits.
Error pattern:
```
context deadline exceeded: query timeout
```
Diagnosis:
```bash
# Check the current timeout configuration
kubectl get deployment thanos-query -n monitoring -o yaml | grep -i timeout

# Time query execution
time curl -s 'http://thanos-query:19192/api/v1/query_range?query=up&start=1640000000&end=1640003600&step=60s'

# Check Query logs for timeout errors
kubectl logs -l app=thanos-query -n monitoring | grep -iE "timeout|deadline"

# Spot-check query responses
curl -s 'http://thanos-query:19192/api/v1/query?query=up' | jq '.status, .data.result'
```
Solution:
Adjust timeout settings:
```bash
# Thanos Query configuration
thanos query \
  --grpc-address=0.0.0.0:10905 \
  --http-address=0.0.0.0:19192 \
  --store=thanos-sidecar:10905 \
  --store=thanos-store:10905 \
  --query.timeout=5m \
  --query.lookback-delta=15m \
  --query.max-concurrent=20
# --query.timeout raises the query deadline; --query.max-concurrent
# controls how many queries run in parallel.

# For large responses from the Store Gateway, raise the gRPC message limits
thanos store \
  --grpc.grpc-max-send-msg-size=100MB \
  --grpc.grpc-max-recv-msg-size=100MB
```
Common Cause 5: Missing Store Endpoints
Query is not configured with all necessary stores.
Error pattern:
```
partial response: no stores matched query
```
Diagnosis:
```bash
# Check the current store configuration
kubectl get deployment thanos-query -n monitoring -o yaml | grep -A 5 "store"

# List configured stores via the API
curl -s http://thanos-query:19192/api/v1/stores | jq '.[].store'

# Check if stores match the query time range
curl -s 'http://thanos-query:19192/api/v1/stores' | jq '.[] | {store: .store, minTime: .minTime, maxTime: .maxTime}'

# Query a time range that should have data
curl -s 'http://thanos-query:19192/api/v1/query_range?query=up&start=1640000000&end=1640086400&step=3600s'
```
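A "no stores matched query" response often just means the requested window falls outside every store's advertised minTime/maxTime. The overlap arithmetic can be sketched as follows; the values are made up, and note that store times are reported in milliseconds while query parameters are in seconds:

```shell
# Does the store's advertised range cover the query window?
store_min_ms=1630000000000   # from /api/v1/stores (milliseconds)
store_max_ms=1650000000000
query_start_s=1640000000     # query_range start/end (seconds)
query_end_s=1640086400

covered=no
# Scale the query window to milliseconds before comparing
if [ $((query_start_s * 1000)) -ge "$store_min_ms" ] && \
   [ $((query_end_s * 1000)) -le "$store_max_ms" ]; then
  covered=yes
fi
echo "store covers query range: $covered"
```

If no store covers the window, either the data was never uploaded or a store serving that period is missing from the Query configuration.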
Solution:
Configure all necessary stores:
```bash
# Add all stores to the Query configuration
thanos query \
  --store=thanos-sidecar-0:10905 \
  --store=thanos-sidecar-1:10905 \
  --store=thanos-store:10905 \
  --store=thanos-receive:10905 \
  --query.auto-downsampling # Enable auto-downsampling for long ranges

# For Kubernetes, edit the Deployment and add one --store=<host:port>
# flag per store under the container's args
kubectl edit deployment/thanos-query -n monitoring
```
Add ServiceDiscovery for stores:
```bash
# Use DNS-based store discovery
thanos query \
  --store=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc.cluster.local \
  --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local \
  --store=dnssrv+_grpc._tcp.thanos-receive.monitoring.svc.cluster.local

# This automatically discovers all matching services
```
Common Cause 6: Compactor Issues
Compactor not running or failing to compact blocks.
Error pattern:
```
compactor failed: error compacting blocks
```
Diagnosis:
```bash
# Check Compactor status
kubectl get pods -l app=thanos-compactor -n monitoring

# Check Compactor logs
kubectl logs -l app=thanos-compactor -n monitoring | grep -iE "error|fail|compact"

# Check block metadata in the bucket (S3)
aws s3 ls s3://thanos-bucket/ --recursive | grep meta

# Check for downsampling activity
kubectl logs -l app=thanos-compactor -n monitoring | grep -i "downsample"

# Verify the Compactor holds an exclusive lock on the bucket
kubectl logs -l app=thanos-compactor -n monitoring | grep -i "lock"
```
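When inspecting individual blocks, the block's meta.json tells you how far compaction has progressed: compaction.level increases as the Compactor merges blocks, and thanos.downsample.resolution is 0 for raw data, 300000 (5m) or 3600000 (1h) for downsampled blocks. A sketch reading these fields from a sample meta.json (the content is inlined and illustrative):

```shell
# Inspect key fields of a block's meta.json without jq.
meta='{"version":1,"compaction":{"level":2},"thanos":{"downsample":{"resolution":300000}}}'

# Pull the numeric values out with sed capture groups
level=$(printf '%s' "$meta" | sed -n 's/.*"level":\([0-9]*\).*/\1/p')
resolution=$(printf '%s' "$meta" | sed -n 's/.*"resolution":\([0-9]*\).*/\1/p')

echo "compaction level: $level, downsample resolution: ${resolution}ms"
```

Blocks stuck at level 1 with resolution 0 long after their time range closed suggest the Compactor is not processing them.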
Solution:
Fix Compactor configuration:
```bash
# The Compactor must run as exactly one instance per bucket; it does not
# perform leader election, so concurrent instances can corrupt blocks.
thanos compact \
  --data-dir=/data \
  --objstore.config-file=/etc/thanos/bucket.yaml \
  --http-address=0.0.0.0:10902 \
  --wait \
  --wait-interval=5m
# Downsampling is enabled by default (turn it off only with
# --downsampling.disable); deduplication is configured via
# --deduplication.replica-label=<label>.
```

```yaml
# Kubernetes: run exactly one replica
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-compactor
spec:
  replicas: 1 # Must be exactly 1
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
        - name: thanos-compactor
          image: thanosio/thanos:v0.31.0
          args:
            - compact
            - --wait
            - --objstore.config-file=/etc/thanos/bucket.yaml
```
Common Cause 7: Time Range Gaps
Data exists but has gaps in time series.
Error pattern:
```
no data for queried time range
```
Diagnosis:
```bash
# Check the available time ranges in stores
curl -s 'http://thanos-query:19192/api/v1/stores' | \
  jq '.[] | select(.store | contains("store")) | {minTime: .minTime, maxTime: .maxTime}'

# Convert timestamps to readable dates
# minTime and maxTime are milliseconds since epoch; divide by 1000 first
date -d @1640000000

# Check block coverage in object storage
aws s3 ls s3://thanos-bucket/ --recursive | \
  awk '{print $4}' | grep meta | sort

# Query a specific time range
curl -s 'http://thanos-query:19192/api/v1/query_range?query=up&start=1640000000&end=1640086400&step=3600s' | \
  jq '.data.result[].values | length'
```
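Since minTime/maxTime come back in milliseconds, a tiny helper saves the repeated mental division (assumes GNU date):

```shell
# Convert a Thanos millisecond timestamp to a readable UTC date.
ms_to_date() {
  # Shell integer division drops the millisecond remainder
  date -u -d "@$(( $1 / 1000 ))" +'%Y-%m-%dT%H:%M:%SZ'
}

ms_to_date 1640000000000 # → 2021-12-20T11:33:20Z
```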
Solution:
Fill gaps or adjust query time range:
```bash
# Check whether data exists in raw blocks:
# download and inspect a block's metadata
aws s3 cp s3://thanos-bucket/01ABC123/meta.json ./meta.json
cat meta.json

# Run the Compactor to fill in missing downsampling
kubectl rollout restart deployment/thanos-compactor -n monitoring

# Query with an appropriate resolution
# For long time ranges, use a lower resolution (5m or 1h steps)
curl -s 'http://thanos-query:19192/api/v1/query_range?query=rate(cpu[5m])&start=1640000000&end=1640003600&step=300s'

# Enable --query.auto-downsampling on Query for automatic resolution selection
```
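For long ranges the step should scale with the window so the number of returned points stays bounded. A sketch of that calculation; the ~250-point target and the 60-second floor are illustrative choices, not Thanos defaults:

```shell
# Pick a query_range step (in seconds) targeting roughly 250 points per
# series, with a 60 s floor so short windows don't produce sub-minute steps.
pick_step() {
  range_s=$1
  step=$(( range_s / 250 ))
  [ "$step" -lt 60 ] && step=60
  echo "$step"
}

pick_step 3600    # 1 hour → 60
pick_step 604800  # 1 week → 2419
```

The computed value plugs directly into the `step=` parameter of the query_range calls above.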
Verification
After fixing, verify Thanos is working:
```bash
# Check all stores are healthy
curl -s http://thanos-query:19192/api/v1/stores | jq '.[] | {store, health}'

# Query the last 24 hours (the API expects Unix or RFC 3339 timestamps,
# not relative values like -24h)
curl -s "http://thanos-query:19192/api/v1/query_range?query=up&start=$(date -d '24 hours ago' +%s)&end=$(date +%s)&step=5m" | \
  jq '.data.result[].values | length'

# Compare Prometheus and Thanos results; they should match for recent data
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result'
curl -s 'http://thanos-query:19192/api/v1/query?query=up' | jq '.data.result'

# Check the Thanos UI at http://thanos-query:19192
# - Stores tab: all expected stores connected
# - Query tab: queries return results

# Check object storage for recent uploads
aws s3 ls s3://thanos-bucket/ --recursive | tail -20
```
Prevention
Monitor Thanos components:
```yaml
groups:
  - name: thanos_health
    rules:
      - alert: ThanosQueryStoreUnavailable
        expr: thanos_store_unavailable > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Thanos Query cannot reach stores"
      - alert: ThanosCompactorNotRunning
        expr: absent(thanos_compactor_up) == 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Thanos Compactor is not running"
      - alert: ThanosSidecarUploadFailure
        expr: rate(thanos_sidecar_upload_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Thanos Sidecar failing to upload blocks"
      - alert: ThanosBucketOperationError
        expr: rate(thanos_objstore_bucket_operation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Thanos bucket operations failing"
```
Thanos query errors usually stem from store gateway connectivity, object storage access, or configuration issues. Check store health first, then verify bucket access and ensure all components are properly configured to communicate.