What's Actually Happening
Thanos Query returns timeout errors when querying metrics, and long-running queries fail before they complete.
The Error You'll See
```bash
$ curl 'http://thanos-query:9090/api/v1/query?query=up'
Error: context deadline exceeded
```
Store gateway timeout:

```
Error: fetching series: context deadline exceeded for store gateway
```

Query range timeout:

```
Error: query range: timeout after 2m0s
```

gRPC error:

```
Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
```

Why This Happens
1. Query too complex - the query scans too much data
2. Store gateway slow - remote store access is delayed
3. Network latency - high latency between components
4. Resource limits - Thanos pods are CPU- or memory-exhausted
5. Timeout too short - the configured query timeout is insufficient
6. Downsampling missing - long-range queries hit raw data instead of downsampled data
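As a quick triage aid, the error strings above can be mapped onto these causes. A minimal sketch; the mapping is a heuristic built from this article's causes, not something Thanos reports:

```shell
#!/bin/sh
# classify_thanos_error: map an error line to a likely cause (heuristic;
# patterns are checked in order, most specific first).
classify_thanos_error() {
  case "$1" in
    *"fetching series"*)       echo "store-gateway-slow" ;;
    *"query range: timeout"*)  echo "query-timeout-too-short" ;;
    *"DeadlineExceeded"*)      echo "grpc-deadline" ;;
    *"context deadline exceeded"*) echo "generic-timeout" ;;
    *)                         echo "unknown" ;;
  esac
}

classify_thanos_error "Error: fetching series: context deadline exceeded for store gateway"
classify_thanos_error "Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
```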
Step 1: Check Thanos Query Status
```bash
# Check Thanos Query pods
kubectl get pods -n monitoring -l app=thanos-query

# View Query logs
kubectl logs -n monitoring -l app=thanos-query

# Check for timeout errors (alternation needs extended regex)
kubectl logs -n monitoring -l app=thanos-query | grep -iE "timeout|deadline"

# Check Query configuration
kubectl get configmap -n monitoring thanos-query -o yaml

# Check Query service
kubectl get svc -n monitoring thanos-query

# Test Query endpoint
kubectl run test-curl --image=curlimages/curl --rm -it --restart=Never \
  -- curl -s http://thanos-query:9090/-/healthy

# Check Query UI
kubectl port-forward -n monitoring svc/thanos-query 9090:9090 &
# Open http://localhost:9090

# Check connected stores
curl http://thanos-query:9090/api/v1/stores | jq .
```
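If `jq` is unavailable in the debug container, store endpoints can still be pulled out of the `/api/v1/stores` response with grep and sed. A sketch against an illustrative payload (the sample JSON below is simplified, not a captured response):

```shell
#!/bin/sh
# Extract store endpoints from a /api/v1/stores-style response without jq.
# The payload is a made-up, simplified example.
stores_json='{"status":"success","data":{"store":[{"name":"thanos-store-gateway:10901"},{"name":"prometheus-0:10901"}]}}'

# Grab each "name":"..." pair, then strip the key and quotes with sed.
echo "$stores_json" \
  | grep -o '"name":"[^"]*"' \
  | sed 's/"name":"\(.*\)"/\1/'
```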
Step 2: Check Store Gateway Status
```bash
# Check Store Gateway pods
kubectl get pods -n monitoring -l app=thanos-store-gateway

# View Store Gateway logs
kubectl logs -n monitoring -l app=thanos-store-gateway

# Check for errors
kubectl logs -n monitoring -l app=thanos-store-gateway | grep -iE "error|fail"

# Check Store Gateway configuration
kubectl get configmap -n monitoring thanos-store-gateway -o yaml

# Test Store Gateway gRPC
kubectl run test-grpc --image=fullstorydev/grpcurl --rm -it --restart=Never \
  -- grpcurl -plaintext thanos-store-gateway:10901 grpc.health.v1.Health/Check

# Check bucket access
kubectl logs -n monitoring -l app=thanos-store-gateway | grep -i bucket

# Check object store config
kubectl get secret -n monitoring thanos-objectstorage -o yaml

# Verify bucket connectivity
kubectl run test-s3 --image=amazon/aws-cli --rm -it --restart=Never \
  -- aws s3 ls s3://thanos-bucket/
```
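Before debugging connectivity, it can be worth checking that the objstore config even contains the fields an S3 bucket needs. A rough sketch against an inline sample config (the config shown is made up, not from a live cluster):

```shell
#!/bin/sh
# Sanity-check an objstore config for keys an S3 bucket config requires.
# The config below is a fabricated example for illustration.
cfg='type: S3
config:
  bucket: thanos-bucket
  endpoint: s3.us-east-1.amazonaws.com'

missing=0
for key in "type:" "bucket:" "endpoint:"; do
  echo "$cfg" | grep -q "$key" || { echo "missing $key"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "objstore config looks complete"
```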
Step 3: Adjust Query Timeout
```bash
# Check current timeouts
kubectl get deploy thanos-query -n monitoring -o yaml | grep -A5 "args:"

# Relevant flags:
#   --query.timeout=2m            (total query timeout)
#   --store.response-timeout=1m   (per-store response timeout)

# Update the flags by editing the deployment:
kubectl edit deploy/thanos-query -n monitoring
```

```yaml
# Timeout configuration in the deployment
args:
  - query
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:9090
  - --query.timeout=5m
  - --query.max-concurrent=20
  - --query.max-samples=50000000
  - --store.response-timeout=3m
  - --store=thanos-store-gateway:10901
```

```bash
# Restart after the config change
kubectl rollout restart deploy/thanos-query -n monitoring
```
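As a sanity check, the store response timeout should normally be shorter than the overall query timeout, so a slow store fails before the whole query does. A minimal sketch for comparing the two durations (the `to_seconds` helper is hypothetical and only handles whole h/m/s values):

```shell
#!/bin/sh
# Convert a simple Go-style duration (e.g. 2m, 90s, 1h) to seconds so the
# two Thanos timeouts can be compared numerically.
to_seconds() {
  n=${1%?}           # strip the trailing unit character
  unit=${1#"$n"}     # what remains is the unit
  case "$unit" in
    h) echo $((n * 3600)) ;;
    m) echo $((n * 60)) ;;
    s) echo "$n" ;;
    *) echo 0 ;;
  esac
}

query_timeout=$(to_seconds 5m)    # --query.timeout=5m
store_timeout=$(to_seconds 3m)    # --store.response-timeout=3m
if [ "$store_timeout" -lt "$query_timeout" ]; then
  echo "ok: store timeout ($store_timeout s) < query timeout ($query_timeout s)"
fi
```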
Step 4: Optimize Query Performance
```bash
# Reduce the query time range.
# Instead of 30 days, query 7 days (start/end/step belong to query_range):
curl 'http://thanos-query:9090/api/v1/query_range?query=up&start=2024-01-01T00:00:00Z&end=2024-01-08T00:00:00Z&step=300'
```

```yaml
# Use recording rules for heavy queries (in Prometheus rules):
groups:
  - name: thanos_rules
    rules:
      - record: job:up:sum
        expr: sum(up) by (job)
```

```bash
# Query pre-aggregated data.
# Instead of: sum(rate(http_requests_total[5m])) by (job)
# Use the recording rule: job:http_requests:rate5m

# Limit label matchers -- more specific matchers scan fewer series:
# Bad:  up
# Good: up{job="prometheus", namespace="monitoring"}

# Allow partial responses (best effort):
curl 'http://thanos-query:9090/api/v1/query?query=up&partial_response=true'

# Deduplicate query results across Prometheus replicas:
#   --query.replica-label=replica
```
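One way to see why shorter ranges help: the number of points a range query returns per series is roughly `(end - start) / step`. A small sketch (the `range_points` helper is illustrative, not a Thanos tool):

```shell
#!/bin/sh
# Estimate samples per series for a query_range request:
# points = range_seconds / step_seconds. Fewer points means less work.
range_points() {  # args: range_seconds step_seconds
  echo $(( $1 / $2 ))
}

# 30 days at a 60s step vs. 7 days at a 300s step:
range_points $((30 * 86400)) 60     # 43200 points per series
range_points $((7 * 86400)) 300     # 2016 points per series
```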
Step 5: Check Resource Limits
```bash
# Check Thanos Query resource usage
kubectl top pods -n monitoring -l app=thanos-query

# Check resource limits
kubectl get deploy thanos-query -n monitoring \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'

# Increase CPU/memory
kubectl set resources deploy/thanos-query -n monitoring \
  --limits=cpu=4,memory=8Gi \
  --requests=cpu=2,memory=4Gi

# Check Store Gateway resources
kubectl top pods -n monitoring -l app=thanos-store-gateway

# Increase Store Gateway resources
kubectl set resources deploy/thanos-store-gateway -n monitoring \
  --limits=cpu=4,memory=16Gi \
  --requests=cpu=2,memory=8Gi

# Check for OOMKilled containers
kubectl describe pods -n monitoring -l app=thanos-query | grep -i oom

# Memory optimization flags:
#   --store.limits.max-item-size=64MB
#   --store.limits.max-items-per-query=500000
```
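To spot pods near a memory threshold from `kubectl top pods` output, a small awk filter helps. A sketch with fabricated sample output (the pod names, numbers, and 4096Mi threshold are made up):

```shell
#!/bin/sh
# Flag pods whose memory usage exceeds a threshold, from `kubectl top pods`
# style output. The sample below is fabricated for illustration.
top_output='NAME             CPU(cores)   MEMORY(bytes)
thanos-query-0   1200m        6914Mi
thanos-query-1   300m         1024Mi'

threshold_mi=4096
echo "$top_output" | awk -v t="$threshold_mi" '
  NR > 1 {
    mem = $3; sub(/Mi$/, "", mem)      # strip the Mi suffix
    if (mem + 0 > t) print $1, "over", t "Mi (" $3 ")"
  }'
```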
Step 6: Configure Caching
```bash
# Thanos Query itself does not cache responses; caching is handled by the
# Query Frontend component (`thanos query-frontend`) placed in front of it:
#   --query-range.response-cache-config-file=/etc/thanos/cache.yaml
```

```yaml
# /etc/thanos/cache.yaml -- in-memory response cache
type: IN-MEMORY
config:
  max_size: 1GB

# Memcached response cache:
# type: MEMCACHED
# config:
#   addresses: ["memcached:11211"]

# Redis response cache:
# type: REDIS
# config:
#   addr: redis:6379
```

```yaml
# Store Gateway caching bucket (reduces object storage reads), passed via
#   --store.caching-bucket.config-file=/etc/thanos/caching-bucket.yaml
type: MEMCACHED
config:
  addresses: ["memcached:11211"]
```

```yaml
# For production, run an external cache:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memcached
spec:
  selector:
    matchLabels:
      app: memcached
  template:
    metadata:
      labels:
        app: memcached
    spec:
      containers:
        - name: memcached
          image: memcached:1.6
          args:
            - "-m"
            - "4096"
            - "-I"
            - "5m"
```
Step 7: Check Network Latency
```bash
# Test HTTP latency to the Store Gateway (metrics are served on the
# HTTP port 10902, not the gRPC port 10901):
kubectl run test-latency --image=curlimages/curl --rm -it --restart=Never \
  -- curl -s -o /dev/null -w "Time: %{time_total}s\n" \
  http://thanos-store-gateway:10902/metrics

# Check gRPC latency
kubectl run test-grpc-latency --image=fullstorydev/grpcurl --rm -it --restart=Never \
  -- grpcurl -plaintext -vv thanos-store-gateway:10901 grpc.health.v1.Health/Check

# Measure bucket latency (time to read from object storage)
kubectl logs -n monitoring -l app=thanos-store-gateway | grep -iE "latency|duration"

# Check network policies
kubectl get networkpolicy -n monitoring

# Reduce cross-zone traffic:
# deploy the Store Gateway in the same zone as Thanos Query.

# Use a caching bucket to reduce object storage round trips (see Step 6).

# For multi-cluster setups, consider federation or downsampling.
```
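curl's `%{time_total}` is fractional seconds, so comparing it against a latency budget needs floating-point arithmetic. A minimal sketch (the `check_latency` helper and the 0.5s budget are assumptions for illustration):

```shell
#!/bin/sh
# Compare a curl-reported total time against a latency budget. Shell
# integer arithmetic cannot handle fractions, so delegate to awk.
check_latency() {  # args: measured_seconds budget_seconds
  awk -v m="$1" -v b="$2" 'BEGIN {
    if (m + 0 <= b + 0) print "ok"; else print "slow"
  }'
}

check_latency 0.042 0.5    # prints "ok"
check_latency 1.800 0.5    # prints "slow"
```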
Step 8: Configure Downsampling
```bash
# Downsampling reduces data granularity for long time ranges.

# Check that the Compactor, which performs downsampling, is running:
kubectl get pods -n monitoring -l app=thanos-compact

# View Compactor logs
kubectl logs -n monitoring -l app=thanos-compact
```

```yaml
# Compactor configuration
args:
  - compact
  - --wait
  - --data-dir=/data
  - --objstore.config-file=/config/bucket.yaml
  - --downsampling.disable=false
```

```bash
# Thanos stores three resolutions:
#   raw data
#   5m downsampled (created once blocks are ~40h old)
#   1h downsampled (created once blocks are ~10d old)

# Rough guide to which resolution serves a query:
#   < 7 days:   raw data
#   7-30 days:  5m downsampled
#   > 30 days:  1h downsampled

# Let Query pick the resolution automatically:
#   --query.auto-downsampling

# Check blocks (and their resolutions) in the bucket:
thanos tools bucket ls --objstore.config-file=bucket.yaml -o json

# Downsampling cannot be forced per block; it is driven by the Compactor.
# Ensure --downsampling.disable=false and that retention keeps source
# blocks long enough for downsampling to run.
```
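The range-to-resolution guide above can be sketched as a small helper (the day thresholds are this article's rule of thumb, not values enforced by Thanos):

```shell
#!/bin/sh
# Pick a downsampling resolution for a query range in days, following the
# rule-of-thumb ranges from this step (heuristic, not a Thanos policy).
resolution_for_days() {
  if [ "$1" -lt 7 ]; then echo raw
  elif [ "$1" -le 30 ]; then echo 5m
  else echo 1h
  fi
}

resolution_for_days 3     # raw
resolution_for_days 14    # 5m
resolution_for_days 90    # 1h
```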
Step 9: Check gRPC Configuration
```bash
# Check gRPC endpoint
kubectl get svc -n monitoring thanos-query-grpc

# Test gRPC
kubectl run test-grpc --image=fullstorydev/grpcurl --rm -it --restart=Never \
  -- grpcurl -plaintext thanos-query:10901 list

# gRPC max message size flags:
#   --grpc.max-recv-msg-size=100MB
#   --grpc.max-send-msg-size=100MB
```

```yaml
# Increase gRPC limits
args:
  - query
  - --grpc-address=0.0.0.0:10901
  - --grpc.max-recv-msg-size=100MB
  - --grpc.max-send-msg-size=100MB
```

```bash
# gRPC keepalive flags:
#   --grpc.keepalive.min-time=10s
#   --grpc.keepalive.timeout=10s

# Check for gRPC errors
kubectl logs -n monitoring -l app=thanos-query | grep -iE "grpc|rpc"

# Store Gateway gRPC errors
kubectl logs -n monitoring -l app=thanos-store-gateway | grep -i grpc
```
Step 10: Thanos Query Verification Script
```bash
# Create the verification script:
cat << 'EOF' > /usr/local/bin/check-thanos-query.sh
#!/bin/bash

NS=${1:-"monitoring"}

echo "=== Thanos Query Pods ==="
kubectl get pods -n $NS -l app=thanos-query

echo ""
echo "=== Thanos Store Gateway Pods ==="
kubectl get pods -n $NS -l app=thanos-store-gateway

echo ""
echo "=== Query Health ==="
kubectl run test-curl --image=curlimages/curl --rm -i --restart=Never \
  -- curl -s http://thanos-query.$NS:9090/-/healthy 2>/dev/null || echo "Query not healthy"

echo ""
echo "=== Connected Stores ==="
kubectl run test-curl --image=curlimages/curl --rm -i --restart=Never \
  -- curl -s http://thanos-query.$NS:9090/api/v1/stores 2>/dev/null | jq '.' || echo "Cannot list stores"

echo ""
echo "=== Query Timeout Errors ==="
kubectl logs -n $NS -l app=thanos-query --tail=50 | grep -iE "timeout|deadline" | tail -10

echo ""
echo "=== Query Configuration ==="
kubectl get deploy thanos-query -n $NS -o yaml | grep -A20 "args:"

echo ""
echo "=== Resource Usage ==="
kubectl top pods -n $NS -l app=thanos-query 2>/dev/null || echo "Metrics not available"

echo ""
echo "=== Store Gateway Logs ==="
kubectl logs -n $NS -l app=thanos-store-gateway --tail=20 | grep -iE "error|fail" | head -10

echo ""
echo "=== Compactor Status ==="
kubectl get pods -n $NS -l app=thanos-compact

echo ""
echo "=== Recommendations ==="
echo "1. Increase query timeout for slow queries"
echo "2. Configure caching for repeated queries"
echo "3. Enable downsampling for long time ranges"
echo "4. Optimize queries with specific label matchers"
echo "5. Increase resource limits if OOMKilled"
echo "6. Check network latency to store gateways"
echo "7. Use partial_response for best-effort queries"
EOF

chmod +x /usr/local/bin/check-thanos-query.sh

# Usage:
/usr/local/bin/check-thanos-query.sh monitoring
```
Thanos Query Timeout Checklist
| Check | Expected |
|---|---|
| Store Gateways | All connected |
| Query timeout | Adequate for queries |
| Caching | Enabled |
| Downsampling | Active |
| Resources | Within limits |
| Network | Low latency |
| Query optimized | Specific label matchers |
Verify the Fix
```bash
# After fixing the Thanos Query timeout:

# 1. Check Query is healthy
curl http://thanos-query:9090/-/healthy
# OK

# 2. Test a query
curl 'http://thanos-query:9090/api/v1/query?query=up'
# Returns results

# 3. Check stores are connected
curl http://thanos-query:9090/api/v1/stores | jq '.data'
# All stores listed

# 4. Run a long-range query
curl 'http://thanos-query:9090/api/v1/query_range?query=up&start=...&end=...'
# Returns without timeout

# 5. Check Query logs
kubectl logs -n monitoring -l app=thanos-query --tail=10
# No timeout errors

# 6. Verify caching: run the same query twice
# The second query should be faster
```
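After a rollout restart, the health endpoint may take a moment to come back. A small retry helper can poll it; the sketch below demonstrates with plain `true`/`false` in place of a live `curl`:

```shell
#!/bin/sh
# Retry a command until it succeeds or attempts run out. Useful while
# waiting for thanos-query to come back after a rollout restart, e.g.:
#   retry_until 10 curl -sf http://thanos-query:9090/-/healthy
retry_until() {  # args: max_attempts command...
  max=$1; shift
  i=1
  while [ "$i" -le "$max" ]; do
    if "$@"; then echo "succeeded on attempt $i"; return 0; fi
    i=$((i + 1))
  done
  echo "failed after $max attempts"; return 1
}

# Demo with plain shell commands instead of a live endpoint:
retry_until 3 true
retry_until 2 false || true
```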
Related Issues
- [Fix Prometheus Target Down](/articles/fix-prometheus-target-down)
- [Fix Prometheus Remote Write Queue Full](/articles/fix-prometheus-remote-write-queue-full)
- [Fix Grafana Datasource Error](/articles/fix-grafana-datasource-error)