## Introduction
Service mesh rate limiting configuration errors occur when rate limit policies are misconfigured, the rate limit service is unreachable, or the enforcement logic fails. The result is either no rate limiting at all (traffic spikes overwhelm services) or overly aggressive limiting (429 Too Many Requests returned for legitimate traffic). Rate limiting in a service mesh can be enforced locally (per proxy) or globally (centralized across all proxies). Common causes include a misconfigured Envoy rate limit filter, Redis backend connection failures for global rate limiting, descriptor mismatches between Envoy and the rate limit service, quota policy syntax errors in Istio, an overly aggressive rate limit service timeout, misconfigured rate limit headers that confuse clients, local and global limits that conflict, and missing stats/metrics configuration that prevents monitoring. Fixing these issues requires understanding Envoy's rate limit architecture, proper descriptor configuration, the Redis backend setup for global limits, and Istio/Kubernetes CRD syntax. This guide provides production-proven troubleshooting for rate limiting across Istio, Linkerd, Envoy, and Consul Connect.
## Symptoms
- HTTP 429 Too Many Requests returned unexpectedly
- Rate limiting not enforced despite configuration
- `rate_limit_service_not_healthy` in Envoy stats
- `upstream_rq_timeout` for rate limit service calls
- Istio Mixer errors: `rate limit service unavailable`
- Inconsistent rate limiting across service instances
- Rate limit headers missing or incorrect (X-RateLimit-Limit, X-RateLimit-Remaining)
- Redis connection errors for global rate limiting
- High latency on first request after deploy (rate limit config loading)
- `InvalidArgument` or `BadRequest` from rate limit service
## Common Causes
- Envoy rate limit filter not in correct position in filter chain
- Rate limit service (RLS) not deployed or not reachable
- Redis backend for global rate limiting not configured or unreachable
- Descriptor key/value mismatch between Envoy and RLS config
- Rate limit configuration not loaded or stale
- Timeout for rate limit service call too short
- Local rate limit and global rate limit both enabled with conflicting thresholds
- Rate limit policy applied to wrong workload/namespace
- Istio EnvoyFilter YAML syntax errors
- Rate limit quota exhausted but not resetting
## Step-by-Step Fix
### 1. Diagnose rate limiting configuration
Check Envoy rate limit filter:
```bash
# Get Envoy bootstrap config from the sidecar
istioctl proxy-config bootstrap <pod-name>.<namespace>

# Check rate limit filter configuration
istioctl proxy-config listener <pod-name>.<namespace> -o json | \
  jq '.[] | select(.name | contains("rate_limit"))'

# Verify the rate limit cluster exists
istioctl proxy-config cluster <pod-name>.<namespace> | grep rate_limit

# Check Envoy stats for rate limiting
istioctl experimental envoy-stats <pod-name>.<namespace> | grep -E "ratelimit|rate_limit"

# Key stats to check:
# http.ratelimit.ok         - successful rate limit checks
# http.ratelimit.error      - rate limit service errors
# http.ratelimit.over_limit - requests that exceeded the limit
```
Test rate limit service connectivity:
```bash
# From within the cluster, test RLS connectivity
kubectl exec -it <pod-name> -n <namespace> -- curl -v http://rate-limit-service:8080/healthcheck

# Test the Redis backend (if using global rate limiting)
kubectl exec -it <redis-pod> -n <namespace> -- redis-cli ping

# Check Redis keys for rate limiting
kubectl exec -it <redis-pod> -n <namespace> -- redis-cli keys "ratelimit:*"
```
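Once you have the stats from the diagnostics above, the ratio of over-limit to admitted requests tells you whether limiting is firing at all (ratio 0 with heavy traffic suggests the filter is inactive) or firing too hard. A small sketch of that calculation — the stat names are Envoy's, but the sample values are made up for illustration:

```python
# Parse "name: value" stat lines and compute the over-limit ratio.
def overlimit_ratio(stats_text: str) -> float:
    stats = {}
    for line in stats_text.splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            stats[name.strip()] = int(value.strip())
    ok = stats.get("http.ratelimit.ok", 0)
    over = stats.get("http.ratelimit.over_limit", 0)
    total = ok + over
    return over / total if total else 0.0

# Hypothetical stats output for illustration
sample = """http.ratelimit.ok: 900
http.ratelimit.error: 3
http.ratelimit.over_limit: 100"""
print(overlimit_ratio(sample))  # 0.1 -> 10% of checked requests were throttled
```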
### 2. Fix Envoy rate limit filter
Local rate limit configuration:
```yaml
# EnvoyFilter for local rate limiting
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: local-rate-limit
  namespace: production
spec:
  workloadSelector:
    labels:
      app: my-service
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: "envoy.filters.network.http_connection_manager"
      patch:
        operation: INSERT_FIRST
        value:
          name: envoy.filters.http.local_ratelimit
          typed_config:
            "@type": type.googleapis.com/udpa.type.v1.TypedStruct
            type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
            value:
              stat_prefix: http_local_rate_limiter
              token_bucket:
                max_tokens: 100
                tokens_per_fill: 10
                fill_interval: 1s
              filter_enabled:
                runtime_key: local_rate_limit_enabled
                default_value:
                  numerator: 100
                  denominator: HUNDRED
              filter_enforced:
                runtime_key: local_rate_limit_enforced
                default_value:
                  numerator: 100
                  denominator: HUNDRED
              response_headers_to_add:
                - append: false
                  header:
                    key: x-local-rate-limit
                    value: "true"
```
Global rate limit with Redis:
```yaml
# EnvoyFilter for global rate limiting
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: global-rate-limit
  namespace: production
spec:
  workloadSelector:
    labels:
      app: my-service
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: "envoy.filters.network.http_connection_manager"
              subFilter:
                name: "envoy.filters.http.router"
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.ratelimit
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
            domain: production
            failure_mode_deny: false  # Allow traffic if the RLS is unavailable
            timeout: 0.5s
            rate_limit_service:
              transport_api_version: V3
              grpc_service:
                envoy_grpc:
                  cluster_name: outbound|8081||rate-limit-service.production.svc.cluster.local
                timeout: 0.5s
```
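The `failure_mode_deny` flag decides what happens when the RLS call errors or times out: with `false` the request is forwarded anyway (fail open), with `true` it is rejected. A minimal decision table for the three RLS outcomes — this is an illustration of the behavior, not Envoy code:

```python
def decide(rls_result: str, failure_mode_deny: bool) -> int:
    """rls_result: 'OK', 'OVER_LIMIT', or 'ERROR' (RLS timeout/unreachable)."""
    if rls_result == "OK":
        return 200  # check passed, request forwarded upstream
    if rls_result == "OVER_LIMIT":
        return 429  # rejected by the rate limit filter
    # RLS errored: fail closed rejects, fail open admits
    return 500 if failure_mode_deny else 200

print(decide("ERROR", failure_mode_deny=False))  # 200 -> traffic still flows
print(decide("ERROR", failure_mode_deny=True))   # 500 -> traffic blocked
```

This is why `failure_mode_deny: false` appears under Prevention below: an RLS outage should degrade to "no limiting", not to an outage of the service itself.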
### 3. Fix rate limit service configuration
Deploy rate limit service:
```yaml
# Rate limit service deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rate-limit-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rate-limit-service
  template:
    metadata:
      labels:
        app: rate-limit-service
    spec:
      containers:
        - name: rate-limit
          image: envoyproxy/ratelimit:latest
          ports:
            - containerPort: 8080
            - containerPort: 8081  # gRPC
          env:
            - name: USE_STATSD
              value: "false"
            - name: REDIS_SOCKET_TYPE
              value: "tcp"
            - name: REDIS_URL
              value: "redis:6379"
            - name: RUNTIME_ROOT
              value: "/data"
            - name: RUNTIME_SUBDIRECTORY
              value: "ratelimit"
          volumeMounts:
            - name: config
              # The service reads config from ${RUNTIME_ROOT}/${RUNTIME_SUBDIRECTORY}/config/
              mountPath: /data/ratelimit/config
      volumes:
        - name: config
          configMap:
            name: rate-limit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rate-limit-config
  namespace: production
data:
  config.yaml: |
    domain: production
    descriptors:
      - key: remote_address
        rate_limit:
          unit: second
          requests_per_unit: 100
      - key: header_match
        value: api-key
        rate_limit:
          unit: minute
          requests_per_unit: 1000
      - key: generic_key
        value: default
        rate_limit:
          unit: second
          requests_per_unit: 50
```
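A common failure mode here is a descriptor key/value mismatch: Envoy sends descriptors like `("remote_address", "10.0.0.1")`, and the RLS applies a limit only if the key matches a config entry and, when the entry pins a `value`, the value matches too. A sketch of that lookup against the `config.yaml` above (simplified to flat, single-entry descriptors; the real service also supports nested descriptors):

```python
# Mirrors the descriptors in the ConfigMap above.
CONFIG = [
    {"key": "remote_address", "unit": "second", "requests_per_unit": 100},
    {"key": "header_match", "value": "api-key", "unit": "minute", "requests_per_unit": 1000},
    {"key": "generic_key", "value": "default", "unit": "second", "requests_per_unit": 50},
]

def find_limit(descriptor_key: str, descriptor_value: str):
    for entry in CONFIG:
        if entry["key"] != descriptor_key:
            continue
        if "value" in entry and entry["value"] != descriptor_value:
            continue  # key matched, but the pinned value did not
        return (entry["requests_per_unit"], entry["unit"])
    return None  # no matching rule -> the request is not limited

print(find_limit("remote_address", "10.0.0.1"))  # (100, 'second'): key-only rule
print(find_limit("header_match", "api-key"))     # (1000, 'minute'): key+value rule
print(find_limit("generic_key", "other"))        # None: value mismatch, no limit
```

If requests are sailing through unlimited, compare the descriptors Envoy emits (visible with debug logging, step 7) against these keys and values character by character.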
### 4. Fix Istio quota policies
Legacy Istio Mixer rate limiting (pre-1.5):
```yaml
# Memquota handler
apiVersion: config.istio.io/v1alpha2
kind: memquota
metadata:
  name: handler
  namespace: istio-system
spec:
  quotas:
    - name: requestcount.quota.istio-system
      maxAmount: 1000
      validDuration: 1s
      overrides:
        - dimensions:
            destination: my-service
          maxAmount: 100
          validDuration: 1s
---
# QuotaSpec for the service
apiVersion: config.istio.io/v1alpha2
kind: QuotaSpec
metadata:
  name: request-count
  namespace: istio-system
spec:
  rules:
    - quotas:
        - charge: 1
          quota: requestcount
---
# Bind QuotaSpec to the service
apiVersion: config.istio.io/v1alpha2
kind: QuotaSpecBinding
metadata:
  name: request-count
  namespace: istio-system
spec:
  quotaSpecs:
    - name: request-count
      namespace: istio-system
  services:
    - name: my-service
      namespace: production
---
# Rule to enable rate limiting
apiVersion: config.istio.io/v1alpha2
kind: rule
metadata:
  name: quota
  namespace: istio-system
spec:
  actions:
    - handler: handler.memquota
      instances:
        - requestcount.quota
```
### 5. Fix Redis backend issues
Redis configuration for rate limiting:
```yaml
# Redis deployment for rate limiting
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "128Mi"
              cpu: "200m"
          command: ["redis-server"]
          args: ["--maxmemory", "64mb", "--maxmemory-policy", "allkeys-lru"]
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: production
spec:
  selector:
    app: redis
  ports:
    - port: 6379
      targetPort: 6379
```
Debug Redis connection:
```bash
# Check Redis connectivity from the rate limit service pod
kubectl exec -it <rls-pod> -n production -- redis-cli -h redis ping

# Check Redis memory usage
kubectl exec -it <redis-pod> -n production -- redis-cli info memory

# Monitor Redis keys being set
kubectl exec -it <redis-pod> -n production -- redis-cli monitor | grep ratelimit

# Clear rate limit keys (testing only; KEYS blocks Redis, avoid on busy instances).
# The pipeline must run inside the pod, hence the sh -c wrapper:
kubectl exec -it <redis-pod> -n production -- \
  sh -c 'redis-cli keys "ratelimit:*" | xargs -r redis-cli del'
```
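Under the hood, the Redis keys you see in `monitor` are per-window counters: the RLS derives a key from the domain, descriptor, and current time window, increments it, and compares the result to the configured limit. A toy version of that fixed-window scheme, with a dict standing in for Redis (the key layout here is illustrative, not the service's exact format):

```python
store = {}  # key -> counter; pretend Redis handles TTL expiry for us

def over_limit(domain, key, value, unit_seconds, limit, now):
    window = int(now) // unit_seconds  # fixed window index
    redis_key = f"ratelimit:{domain}_{key}_{value}_{window}"
    store[redis_key] = store.get(redis_key, 0) + 1  # INCR
    return store[redis_key] > limit

# Five requests from one client in the same 1-second window, limit 3:
hits = [over_limit("production", "remote_address", "10.0.0.1", 1, 3, now=42.5)
        for _ in range(5)]
print(hits)  # [False, False, False, True, True]
```

This also explains the symptom "rate limit quota exhausted but not resetting": if the key's TTL is missing or wrong, the counter survives past its window and keeps rejecting traffic.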
### 6. Fix rate limit headers
Configure response headers:
```yaml
# EnvoyFilter to add rate limit headers
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: rate-limit-headers
  namespace: production
spec:
  workloadSelector:
    labels:
      app: my-service
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
      patch:
        operation: MERGE
        value:
          name: envoy.filters.http.ratelimit
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
            domain: production
            rate_limit_service:
              grpc_service:
                envoy_grpc:
                  cluster_name: rate_limit_service
            # Emit X-RateLimit-* headers; this field is an enum, not a boolean
            enable_x_ratelimit_headers: DRAFT_VERSION_03
```
Expected rate limit headers:
```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1625140800
Retry-After: 60
```

`Retry-After` is only returned on 429 responses.
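Clients should use these headers to decide how long to pause before retrying: `Retry-After` when present, otherwise the reset hint. Note that implementations disagree on `X-RateLimit-Reset` semantics (seconds-until-reset in the draft spec vs. an epoch timestamp, as in the example above); the sketch below assumes the epoch-timestamp form shown here:

```python
def retry_delay(headers: dict, now: int) -> int:
    """Seconds to wait before retrying a throttled request."""
    if "Retry-After" in headers:
        return int(headers["Retry-After"])  # explicit server instruction wins
    if "X-RateLimit-Reset" in headers:
        # Assumes Reset is an epoch timestamp, as in the example headers
        return max(0, int(headers["X-RateLimit-Reset"]) - now)
    return 1  # no hint from the server: default to a short pause

print(retry_delay({"Retry-After": "60"}, now=1625140700))                # 60
print(retry_delay({"X-RateLimit-Reset": "1625140800"}, now=1625140700))  # 100
```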
### 7. Debug rate limiting in production
Enable debug logging:
```bash
# Increase the Envoy log level for the filter logger (covers HTTP filters)
istioctl proxy-config log <pod-name>.<namespace> --level filter:debug

# Or set it via the Envoy admin interface
curl -X POST "http://localhost:15000/logging?filter=debug"

# Watch rate limit stats
watch -n 1 'istioctl experimental envoy-stats <pod-name>.<namespace> | grep ratelimit'
```
Test rate limiting:
```bash
# Send a burst of requests to test rate limiting
for i in {1..150}; do
  curl -s -o /dev/null -w "%{http_code}\n" https://my-service.example.com/api/endpoint
done | sort | uniq -c

# Expected output if the rate limit is 100/s:
#   100 200
#    50 429

# Use ab (Apache Bench) for load testing
ab -n 1000 -c 10 https://my-service.example.com/api/endpoint

# Use vegeta for sustained load
echo "GET https://my-service.example.com/api/endpoint" | \
  vegeta attack -duration=30s -rate=100 | vegeta report
```
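On the client side, 429s from these tests should be absorbed by retry with exponential backoff rather than immediate retries (which only prolong the throttling). A minimal sketch — `send` is a hypothetical stand-in for the real HTTP call, and the `sleep` parameter is injectable so the logic is testable:

```python
import random
import time

def call_with_backoff(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry a callable returning an HTTP status code on 429, with full jitter."""
    status = 429
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        # Full jitter: wait somewhere in [0, base_delay * 2^attempt]
        sleep(random.uniform(0, base_delay * (2 ** attempt)))
    return status  # still throttled after all attempts

# Simulated server: throttled twice, then succeeds
responses = iter([429, 429, 200])
status = call_with_backoff(lambda: next(responses), sleep=lambda s: None)
print(status)  # 200
```

In production, prefer honoring a `Retry-After` header over the computed backoff when the server sends one.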
## Prevention
- Test rate limit configuration in staging before production
- Set `failure_mode_deny: false` so traffic is allowed (fail open) when the RLS is unavailable
- Monitor rate limit metrics (429 count, RLS latency, Redis latency)
- Use gradual rollout for rate limit policy changes
- Document rate limit thresholds and escalation procedures
- Set up alerts for unusual 429 rates
- Use distributed tracing to track rate limit decisions
- Implement client-side retry with exponential backoff
- Cache rate limit decisions locally to reduce RLS calls
- Regular load testing to validate rate limit effectiveness
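The "cache rate limit decisions locally" item deserves a sketch: reusing a recent OVER_LIMIT verdict for a descriptor for a short TTL means a hammering client costs one RLS round trip per TTL instead of one per request. A minimal illustrative cache (not a real proxy component; `now` is injectable for testing):

```python
import time

class DecisionCache:
    def __init__(self, ttl: float):
        self.ttl = ttl
        self._cache = {}  # descriptor tuple -> (verdict, expiry)

    def get(self, descriptor, now=None):
        now = time.monotonic() if now is None else now
        entry = self._cache.get(descriptor)
        if entry and entry[1] > now:
            return entry[0]
        return None  # miss or expired: caller must ask the RLS

    def put(self, descriptor, verdict, now=None):
        now = time.monotonic() if now is None else now
        self._cache[descriptor] = (verdict, now + self.ttl)

cache = DecisionCache(ttl=1.0)
cache.put(("remote_address", "10.0.0.1"), "OVER_LIMIT", now=0.0)
print(cache.get(("remote_address", "10.0.0.1"), now=0.5))  # OVER_LIMIT (cached)
print(cache.get(("remote_address", "10.0.0.1"), now=2.0))  # None (expired)
```

The trade-off: during the TTL a client may be throttled slightly longer (or shorter) than the true counter would dictate, so keep the TTL small relative to the limit window.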
## Related Errors
- **Service mesh sidecar injection failed**: Sidecar not injected into pod
- **Service mesh mTLS connection failed**: Certificate or policy mismatch
- **Service mesh destination rule configuration error**: Traffic policy misconfiguration
- **Envoy upstream connect timeout**: Backend service unreachable