What's Actually Happening
Thanos Query returns timeout errors when querying metrics, and long-running queries fail before they complete.
The Error You'll See
```bash
$ curl 'http://thanos-query:9090/api/v1/query?query=up'
Error: context deadline exceeded
```
Store gateway timeout:

```
Error: fetching series: context deadline exceeded for store gateway
```

Query range timeout:

```
Error: query range: timeout after 2m0s
```

gRPC error:

```
Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
```

Why This Happens
1. Query too complex - the query scans too much data
2. Store gateway slow - remote store access is delayed
3. Network latency - high latency between components
4. Resource limits - Thanos pods are CPU- or memory-exhausted
5. Timeout too short - the configured query timeout is insufficient
6. Downsampling missing - long-range queries hit raw data instead of downsampled data
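As a quick triage aid, the error strings above can be mapped onto these causes. A minimal sketch; the mapping is a heuristic built from this article's causes, not something Thanos reports:

```shell
#!/bin/sh
# classify_thanos_error: map an error line to a likely cause (heuristic;
# patterns are checked in order, most specific first).
classify_thanos_error() {
  case "$1" in
    *"fetching series"*)       echo "store-gateway-slow" ;;
    *"query range: timeout"*)  echo "query-timeout-too-short" ;;
    *"DeadlineExceeded"*)      echo "grpc-deadline" ;;
    *"context deadline exceeded"*) echo "generic-timeout" ;;
    *)                         echo "unknown" ;;
  esac
}

classify_thanos_error "Error: fetching series: context deadline exceeded for store gateway"
classify_thanos_error "Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
```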
Step 1: Check Thanos Query Status
```bash
# Check Thanos Query pods
kubectl get pods -n monitoring -l app=thanos-query

# View Query logs
kubectl logs -n monitoring -l app=thanos-query

# Check for timeout errors (alternation needs extended regex)
kubectl logs -n monitoring -l app=thanos-query | grep -iE "timeout|deadline"

# Check Query configuration
kubectl get configmap -n monitoring thanos-query -o yaml

# Check Query service
kubectl get svc -n monitoring thanos-query

# Test Query endpoint
kubectl run test-curl --image=curlimages/curl --rm -it --restart=Never \
  -- curl -s http://thanos-query:9090/-/healthy

# Check Query UI
kubectl port-forward -n monitoring svc/thanos-query 9090:9090 &
# Open http://localhost:9090

# Check connected stores
curl http://thanos-query:9090/api/v1/stores | jq .
```
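If `jq` is unavailable in the debug container, store endpoints can still be pulled out of the `/api/v1/stores` response with grep and sed. A sketch against an illustrative payload (the sample JSON below is simplified, not a captured response):

```shell
#!/bin/sh
# Extract store endpoints from a /api/v1/stores-style response without jq.
# The payload is a made-up, simplified example.
stores_json='{"status":"success","data":{"store":[{"name":"thanos-store-gateway:10901"},{"name":"prometheus-0:10901"}]}}'

# Grab each "name":"..." pair, then strip the key and quotes with sed.
echo "$stores_json" \
  | grep -o '"name":"[^"]*"' \
  | sed 's/"name":"\(.*\)"/\1/'
```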
Step 2: Check Store Gateway Status
```bash
# Check Store Gateway pods
kubectl get pods -n monitoring -l app=thanos-store-gateway

# View Store Gateway logs
kubectl logs -n monitoring -l app=thanos-store-gateway

# Check for errors
kubectl logs -n monitoring -l app=thanos-store-gateway | grep -iE "error|fail"

# Check Store Gateway configuration
kubectl get configmap -n monitoring thanos-store-gateway -o yaml

# Test Store Gateway gRPC
kubectl run test-grpc --image=fullstorydev/grpcurl --rm -it --restart=Never \
  -- grpcurl -plaintext thanos-store-gateway:10901 grpc.health.v1.Health/Check

# Check bucket access
kubectl logs -n monitoring -l app=thanos-store-gateway | grep -i bucket

# Check object store config
kubectl get secret -n monitoring thanos-objectstorage -o yaml

# Verify bucket connectivity
kubectl run test-s3 --image=amazon/aws-cli --rm -it --restart=Never \
  -- aws s3 ls s3://thanos-bucket/
```
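Before debugging connectivity, it can be worth checking that the objstore config even contains the fields an S3 bucket needs. A rough sketch against an inline sample config (the config shown is made up, not from a live cluster):

```shell
#!/bin/sh
# Sanity-check an objstore config for keys an S3 bucket config requires.
# The config below is a fabricated example for illustration.
cfg='type: S3
config:
  bucket: thanos-bucket
  endpoint: s3.us-east-1.amazonaws.com'

missing=0
for key in "type:" "bucket:" "endpoint:"; do
  echo "$cfg" | grep -q "$key" || { echo "missing $key"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "objstore config looks complete"
```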
Step 3: Adjust Query Timeout
```bash
# Check current timeouts
kubectl get deploy thanos-query -n monitoring -o yaml | grep -A5 "args:"

# Relevant flags:
#   --query.timeout=2m            (total query timeout)
#   --store.response-timeout=1m   (per-store response timeout)

# Update the flags by editing the deployment:
kubectl edit deploy/thanos-query -n monitoring
```

```yaml
# Timeout configuration in the deployment
args:
  - query
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:9090
  - --query.timeout=5m
  - --query.max-concurrent=20
  - --query.max-samples=50000000
  - --store.response-timeout=3m
  - --store=thanos-store-gateway:10901
```

```bash
# Restart after the config change
kubectl rollout restart deploy/thanos-query -n monitoring
```
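As a sanity check, the store response timeout should normally be shorter than the overall query timeout, so a slow store fails before the whole query does. A minimal sketch for comparing the two durations (the `to_seconds` helper is hypothetical and only handles whole h/m/s values):

```shell
#!/bin/sh
# Convert a simple Go-style duration (e.g. 2m, 90s, 1h) to seconds so the
# two Thanos timeouts can be compared numerically.
to_seconds() {
  n=${1%?}           # strip the trailing unit character
  unit=${1#"$n"}     # what remains is the unit
  case "$unit" in
    h) echo $((n * 3600)) ;;
    m) echo $((n * 60)) ;;
    s) echo "$n" ;;
    *) echo 0 ;;
  esac
}

query_timeout=$(to_seconds 5m)    # --query.timeout=5m
store_timeout=$(to_seconds 3m)    # --store.response-timeout=3m
if [ "$store_timeout" -lt "$query_timeout" ]; then
  echo "ok: store timeout ($store_timeout s) < query timeout ($query_timeout s)"
fi
```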
Step 4: Optimize Query Performance
```bash
# Reduce the query time range.
# Instead of 30 days, query 7 days (start/end/step belong to query_range):
curl 'http://thanos-query:9090/api/v1/query_range?query=up&start=2024-01-01T00:00:00Z&end=2024-01-08T00:00:00Z&step=300'
```

```yaml
# Use recording rules for heavy queries (in Prometheus rules):
groups:
  - name: thanos_rules
    rules:
      - record: job:up:sum
        expr: sum(up) by (job)
```

```bash
# Query pre-aggregated data.
# Instead of: sum(rate(http_requests_total[5m])) by (job)
# Use the recording rule: job:http_requests:rate5m

# Limit label matchers -- more specific matchers scan fewer series:
# Bad:  up
# Good: up{job="prometheus", namespace="monitoring"}

# Allow partial responses (best effort):
curl 'http://thanos-query:9090/api/v1/query?query=up&partial_response=true'

# Deduplicate query results across Prometheus replicas:
#   --query.replica-label=replica
```
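One way to see why shorter ranges help: the number of points a range query returns per series is roughly `(end - start) / step`. A small sketch (the `range_points` helper is illustrative, not a Thanos tool):

```shell
#!/bin/sh
# Estimate samples per series for a query_range request:
# points = range_seconds / step_seconds. Fewer points means less work.
range_points() {  # args: range_seconds step_seconds
  echo $(( $1 / $2 ))
}

# 30 days at a 60s step vs. 7 days at a 300s step:
range_points $((30 * 86400)) 60     # 43200 points per series
range_points $((7 * 86400)) 300     # 2016 points per series
```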
Step 5: Check Resource Limits
```bash
# Check Thanos Query resource usage
kubectl top pods -n monitoring -l app=thanos-query

# Check resource limits
kubectl get deploy thanos-query -n monitoring \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'

# Increase CPU/memory
kubectl set resources deploy/thanos-query -n monitoring \
  --limits=cpu=4,memory=8Gi \
  --requests=cpu=2,memory=4Gi

# Check Store Gateway resources
kubectl top pods -n monitoring -l app=thanos-store-gateway

# Increase Store Gateway resources
kubectl set resources deploy/thanos-store-gateway -n monitoring \
  --limits=cpu=4,memory=16Gi \
  --requests=cpu=2,memory=8Gi

# Check for OOMKilled containers
kubectl describe pods -n monitoring -l app=thanos-query | grep -i oom

# Memory optimization flags:
#   --store.limits.max-item-size=64MB
#   --store.limits.max-items-per-query=500000
```
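To spot pods near a memory threshold from `kubectl top pods` output, a small awk filter helps. A sketch with fabricated sample output (the pod names, numbers, and 4096Mi threshold are made up):

```shell
#!/bin/sh
# Flag pods whose memory usage exceeds a threshold, from `kubectl top pods`
# style output. The sample below is fabricated for illustration.
top_output='NAME             CPU(cores)   MEMORY(bytes)
thanos-query-0   1200m        6914Mi
thanos-query-1   300m         1024Mi'

threshold_mi=4096
echo "$top_output" | awk -v t="$threshold_mi" '
  NR > 1 {
    mem = $3; sub(/Mi$/, "", mem)      # strip the Mi suffix
    if (mem + 0 > t) print $1, "over", t "Mi (" $3 ")"
  }'
```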
Step 6: Configure Caching
```bash
# Thanos Query itself does not cache responses; caching is handled by the
# Query Frontend component (`thanos query-frontend`) placed in front of it:
#   --query-range.response-cache-config-file=/etc/thanos/cache.yaml
```

```yaml
# /etc/thanos/cache.yaml -- in-memory response cache
type: IN-MEMORY
config:
  max_size: 1GB

# Memcached response cache:
# type: MEMCACHED
# config:
#   addresses: ["memcached:11211"]

# Redis response cache:
# type: REDIS
# config:
#   addr: redis:6379
```

```yaml
# Store Gateway caching bucket (reduces object storage reads), passed via
#   --store.caching-bucket.config-file=/etc/thanos/caching-bucket.yaml
type: MEMCACHED
config:
  addresses: ["memcached:11211"]
```

```yaml
# For production, run an external cache:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memcached
spec:
  selector:
    matchLabels:
      app: memcached
  template:
    metadata:
      labels:
        app: memcached
    spec:
      containers:
        - name: memcached
          image: memcached:1.6
          args:
            - "-m"
            - "4096"
            - "-I"
            - "5m"
```
Step 7: Check Network Latency
```bash
# Test HTTP latency to the Store Gateway (metrics are served on the
# HTTP port 10902, not the gRPC port 10901):
kubectl run test-latency --image=curlimages/curl --rm -it --restart=Never \
  -- curl -s -o /dev/null -w "Time: %{time_total}s\n" \
  http://thanos-store-gateway:10902/metrics

# Check gRPC latency
kubectl run test-grpc-latency --image=fullstorydev/grpcurl --rm -it --restart=Never \
  -- grpcurl -plaintext -vv thanos-store-gateway:10901 grpc.health.v1.Health/Check

# Measure bucket latency (time to read from object storage)
kubectl logs -n monitoring -l app=thanos-store-gateway | grep -iE "latency|duration"

# Check network policies
kubectl get networkpolicy -n monitoring

# Reduce cross-zone traffic:
# deploy the Store Gateway in the same zone as Thanos Query.

# Use a caching bucket to reduce object storage round trips (see Step 6).

# For multi-cluster setups, consider federation or downsampling.
```
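curl's `%{time_total}` is fractional seconds, so comparing it against a latency budget needs floating-point arithmetic. A minimal sketch (the `check_latency` helper and the 0.5s budget are assumptions for illustration):

```shell
#!/bin/sh
# Compare a curl-reported total time against a latency budget. Shell
# integer arithmetic cannot handle fractions, so delegate to awk.
check_latency() {  # args: measured_seconds budget_seconds
  awk -v m="$1" -v b="$2" 'BEGIN {
    if (m + 0 <= b + 0) print "ok"; else print "slow"
  }'
}

check_latency 0.042 0.5    # prints "ok"
check_latency 1.800 0.5    # prints "slow"
```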
Step 8: Configure Downsampling
```bash
# Downsampling reduces data granularity for long time ranges.

# Check that the Compactor, which performs downsampling, is running:
kubectl get pods -n monitoring -l app=thanos-compact

# View Compactor logs
kubectl logs -n monitoring -l app=thanos-compact
```

```yaml
# Compactor configuration
args:
  - compact
  - --wait
  - --data-dir=/data
  - --objstore.config-file=/config/bucket.yaml
  - --downsampling.disable=false
```

```bash
# Thanos stores three resolutions:
#   raw data
#   5m downsampled (created once blocks are ~40h old)
#   1h downsampled (created once blocks are ~10d old)

# Rough guide to which resolution serves a query:
#   < 7 days:   raw data
#   7-30 days:  5m downsampled
#   > 30 days:  1h downsampled

# Let Query pick the resolution automatically:
#   --query.auto-downsampling

# Check blocks (and their resolutions) in the bucket:
thanos tools bucket ls --objstore.config-file=bucket.yaml -o json

# Downsampling cannot be forced per block; it is driven by the Compactor.
# Ensure --downsampling.disable=false and that retention keeps source
# blocks long enough for downsampling to run.
```
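The range-to-resolution guide above can be sketched as a small helper (the day thresholds are this article's rule of thumb, not values enforced by Thanos):

```shell
#!/bin/sh
# Pick a downsampling resolution for a query range in days, following the
# rule-of-thumb ranges from this step (heuristic, not a Thanos policy).
resolution_for_days() {
  if [ "$1" -lt 7 ]; then echo raw
  elif [ "$1" -le 30 ]; then echo 5m
  else echo 1h
  fi
}

resolution_for_days 3     # raw
resolution_for_days 14    # 5m
resolution_for_days 90    # 1h
```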
Step 9: Check gRPC Configuration
```bash
# Check gRPC endpoint
kubectl get svc -n monitoring thanos-query-grpc

# Test gRPC
kubectl run test-grpc --image=fullstorydev/grpcurl --rm -it --restart=Never \
  -- grpcurl -plaintext thanos-query:10901 list

# gRPC max message size flags:
#   --grpc.max-recv-msg-size=100MB
#   --grpc.max-send-msg-size=100MB
```

```yaml
# Increase gRPC limits
args:
  - query
  - --grpc-address=0.0.0.0:10901
  - --grpc.max-recv-msg-size=100MB
  - --grpc.max-send-msg-size=100MB
```

```bash
# gRPC keepalive flags:
#   --grpc.keepalive.min-time=10s
#   --grpc.keepalive.timeout=10s

# Check for gRPC errors
kubectl logs -n monitoring -l app=thanos-query | grep -iE "grpc|rpc"

# Store Gateway gRPC errors
kubectl logs -n monitoring -l app=thanos-store-gateway | grep -i grpc
```
Step 10: Thanos Query Verification Script
```bash
# Create the verification script:
cat << 'EOF' > /usr/local/bin/check-thanos-query.sh
#!/bin/bash

NS=${1:-"monitoring"}

echo "=== Thanos Query Pods ==="
kubectl get pods -n $NS -l app=thanos-query

echo ""
echo "=== Thanos Store Gateway Pods ==="
kubectl get pods -n $NS -l app=thanos-store-gateway

echo ""
echo "=== Query Health ==="
kubectl run test-curl --image=curlimages/curl --rm -i --restart=Never \
  -- curl -s http://thanos-query.$NS:9090/-/healthy 2>/dev/null || echo "Query not healthy"

echo ""
echo "=== Connected Stores ==="
kubectl run test-curl --image=curlimages/curl --rm -i --restart=Never \
  -- curl -s http://thanos-query.$NS:9090/api/v1/stores 2>/dev/null | jq '.' || echo "Cannot list stores"

echo ""
echo "=== Query Timeout Errors ==="
kubectl logs -n $NS -l app=thanos-query --tail=50 | grep -iE "timeout|deadline" | tail -10

echo ""
echo "=== Query Configuration ==="
kubectl get deploy thanos-query -n $NS -o yaml | grep -A20 "args:"

echo ""
echo "=== Resource Usage ==="
kubectl top pods -n $NS -l app=thanos-query 2>/dev/null || echo "Metrics not available"

echo ""
echo "=== Store Gateway Logs ==="
kubectl logs -n $NS -l app=thanos-store-gateway --tail=20 | grep -iE "error|fail" | head -10

echo ""
echo "=== Compactor Status ==="
kubectl get pods -n $NS -l app=thanos-compact

echo ""
echo "=== Recommendations ==="
echo "1. Increase query timeout for slow queries"
echo "2. Configure caching for repeated queries"
echo "3. Enable downsampling for long time ranges"
echo "4. Optimize queries with specific label matchers"
echo "5. Increase resource limits if OOMKilled"
echo "6. Check network latency to store gateways"
echo "7. Use partial_response for best-effort queries"
EOF

chmod +x /usr/local/bin/check-thanos-query.sh

# Usage:
/usr/local/bin/check-thanos-query.sh monitoring
```
Thanos Query Timeout Checklist
| Check | Expected |
|---|---|
| Store Gateways | All connected |
| Query timeout | Adequate for queries |
| Caching | Enabled |
| Downsampling | Active |
| Resources | Within limits |
| Network | Low latency |
| Query optimized | Specific label matchers |
Verify the Fix
```bash
# After fixing the Thanos Query timeout:

# 1. Check Query is healthy
curl http://thanos-query:9090/-/healthy
# OK

# 2. Test a query
curl 'http://thanos-query:9090/api/v1/query?query=up'
# Returns results

# 3. Check stores are connected
curl http://thanos-query:9090/api/v1/stores | jq '.data'
# All stores listed

# 4. Run a long-range query
curl 'http://thanos-query:9090/api/v1/query_range?query=up&start=...&end=...'
# Returns without timeout

# 5. Check Query logs
kubectl logs -n monitoring -l app=thanos-query --tail=10
# No timeout errors

# 6. Verify caching: run the same query twice
# The second query should be faster
```
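After a rollout restart, the health endpoint may take a moment to come back. A small retry helper can poll it; the sketch below demonstrates with plain `true`/`false` in place of a live `curl`:

```shell
#!/bin/sh
# Retry a command until it succeeds or attempts run out. Useful while
# waiting for thanos-query to come back after a rollout restart, e.g.:
#   retry_until 10 curl -sf http://thanos-query:9090/-/healthy
retry_until() {  # args: max_attempts command...
  max=$1; shift
  i=1
  while [ "$i" -le "$max" ]; do
    if "$@"; then echo "succeeded on attempt $i"; return 0; fi
    i=$((i + 1))
  done
  echo "failed after $max attempts"; return 1
}

# Demo with plain shell commands instead of a live endpoint:
retry_until 3 true
retry_until 2 false || true
```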
Related Issues
- [Fix Prometheus Target Down](/articles/fix-prometheus-target-down)
- [Fix Prometheus Remote Write Queue Full](/articles/fix-prometheus-remote-write-queue-full)
- [Fix Grafana Datasource Error](/articles/fix-grafana-datasource-error)