Introduction

Service mesh rate limiting configuration errors occur when rate limit policies are misconfigured, the rate limit service is unreachable, or the enforcement logic fails. The result is either no rate limiting at all (traffic spikes overwhelm services) or overly aggressive limiting (429 Too Many Requests returned for legitimate traffic). Rate limiting in service meshes can be enforced locally (per proxy) or globally (centralized across all proxies).

Common causes include Envoy rate limit filter misconfiguration, Redis backend connection failures for global rate limiting, incorrect descriptor matching in rate limit rules, quota policy syntax errors in Istio, overly aggressive rate limit service timeouts, misconfigured rate limit headers that mislead clients, local rate limits conflicting with global limits, and missing stats/metrics configuration that prevents monitoring.

The fix requires understanding Envoy's rate limit architecture, proper descriptor configuration, Redis backend setup for global limits, and Istio/Kubernetes CRD syntax. This guide provides production-proven troubleshooting for rate limiting across Istio, Linkerd, Envoy, and Consul Connect.

Symptoms

  • HTTP 429 Too Many Requests returned unexpectedly
  • Rate limiting not enforced despite configuration
  • rate_limit_service_not_healthy in Envoy stats
  • upstream_rq_timeout for rate limit service calls
  • Istio Mixer errors: rate limit service unavailable
  • Inconsistent rate limiting across service instances
  • Rate limit headers missing or incorrect (X-RateLimit-Limit, X-RateLimit-Remaining)
  • Redis connection errors for global rate limiting
  • High latency on first request after deploy (rate limit config loading)
  • InvalidArgument or BadRequest from rate limit service

Common Causes

  • Envoy rate limit filter not in correct position in filter chain
  • Rate limit service (RLS) not deployed or not reachable
  • Redis backend for global rate limiting not configured or unreachable
  • Descriptor key/value mismatch between Envoy and RLS config
  • Rate limit configuration not loaded or stale
  • Timeout for rate limit service call too short
  • Local rate limit and global rate limit both enabled with conflicting thresholds
  • Rate limit policy applied to wrong workload/namespace
  • Istio EnvoyFilter YAML syntax errors
  • Rate limit quota exhausted but not resetting

Step-by-Step Fix

### 1. Diagnose rate limiting configuration

Check Envoy rate limit filter:

```bash
# Get the Envoy bootstrap config from the sidecar
istioctl proxy-config bootstrap <pod-name>.<namespace>

# Check rate limit filter configuration
istioctl proxy-config listener <pod-name>.<namespace> -o json | \
  jq '.[] | select(.name | contains("rate_limit"))'

# Verify the rate limit cluster exists
istioctl proxy-config cluster <pod-name>.<namespace> | grep rate_limit

# Check Envoy stats for rate limiting (via the sidecar's admin API)
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  pilot-agent request GET stats | grep -E "ratelimit|rate_limit"

# Key stats to check:
# ratelimit.ok         - successful rate limit checks
# ratelimit.error      - rate limit service errors
# ratelimit.over_limit - requests that exceeded the limit
```

Test rate limit service connectivity:

```bash
# From within the cluster, test RLS connectivity
# (envoyproxy/ratelimit serves its health check at /healthcheck on the HTTP port)
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -v http://rate-limit-service:8080/healthcheck

# Test the Redis backend (if using global rate limiting)
kubectl exec -it <redis-pod> -n <namespace> -- redis-cli ping

# Check Redis keys for rate limiting
kubectl exec -it <redis-pod> -n <namespace> -- redis-cli keys "ratelimit:*"
```

### 2. Fix Envoy rate limit filter

Local rate limit configuration:

```yaml
# EnvoyFilter for local rate limiting
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: local-rate-limit
  namespace: production
spec:
  workloadSelector:
    labels:
      app: my-service
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
    patch:
      operation: INSERT_FIRST
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/udpa.type.v1.TypedStruct
          type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          value:
            stat_prefix: http_local_rate_limiter
            token_bucket:
              max_tokens: 100
              tokens_per_fill: 10
              fill_interval: 1s
            filter_enabled:
              runtime_key: local_rate_limit_enabled
              default_value:
                numerator: 100
                denominator: HUNDRED
            filter_enforced:
              runtime_key: local_rate_limit_enforced
              default_value:
                numerator: 100
                denominator: HUNDRED
            response_headers_to_add:
            - append: false
              header:
                key: x-local-rate-limit
                value: "true"
```
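The `token_bucket` above admits a burst of up to `max_tokens` requests and a sustained rate of `tokens_per_fill` per `fill_interval`. A minimal Python sketch of that semantics (an illustration using the same parameter names, not Envoy's actual implementation):

```python
import time


class TokenBucket:
    """Token-bucket sketch matching the local_ratelimit config above:
    burst of max_tokens, refill of tokens_per_fill every fill_interval seconds."""

    def __init__(self, max_tokens, tokens_per_fill, fill_interval):
        self.max_tokens = max_tokens
        self.tokens_per_fill = tokens_per_fill
        self.fill_interval = fill_interval
        self.tokens = max_tokens          # the bucket starts full
        self.last_fill = time.monotonic()

    def _refill(self, now):
        # Credit whole elapsed fill intervals, capped at max_tokens.
        intervals = int((now - self.last_fill) / self.fill_interval)
        if intervals > 0:
            self.tokens = min(self.max_tokens,
                              self.tokens + intervals * self.tokens_per_fill)
            self.last_fill += intervals * self.fill_interval

    def allow(self, now=None):
        """Consume one token; False means the request would get a 429."""
        self._refill(time.monotonic() if now is None else now)
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


# With max_tokens=100, tokens_per_fill=10, fill_interval=1s,
# a burst of 150 simultaneous requests admits exactly 100.
bucket = TokenBucket(max_tokens=100, tokens_per_fill=10, fill_interval=1.0)
results = [bucket.allow(now=bucket.last_fill) for _ in range(150)]
print(sum(results))  # → 100
```

This makes the key tuning trade-off concrete: `max_tokens` sets burst tolerance, while `tokens_per_fill / fill_interval` sets the steady-state rate.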

Global rate limit with Redis:

```yaml
# EnvoyFilter for global rate limiting
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: global-rate-limit
  namespace: production
spec:
  workloadSelector:
    labels:
      app: my-service
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
            subFilter:
              name: "envoy.filters.http.router"
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.ratelimit
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
          domain: production
          failure_mode_deny: false  # Allow traffic if RLS unavailable
          timeout: 0.5s
          rate_limit_service:
            transport_api_version: V3
            grpc_service:
              envoy_grpc:
                cluster_name: outbound|8081||rate-limit-service.production.svc.cluster.local
              timeout: 0.5s
```
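The `failure_mode_deny` setting decides what happens when the RLS check itself errors or times out. The filter's decision table can be sketched as (an illustration of the semantics, not Envoy source code):

```python
from enum import Enum


class RlsStatus(Enum):
    OK = "ok"             # RLS answered: under the limit
    OVER_LIMIT = "over"   # RLS answered: over the limit -> 429
    ERROR = "error"       # RLS unreachable, timed out, or returned an error


def admit(status, failure_mode_deny):
    """Sketch of the global rate limit filter's admit/reject decision."""
    if status is RlsStatus.OK:
        return True
    if status is RlsStatus.OVER_LIMIT:
        return False
    # RLS error: fail open unless failure_mode_deny is set
    return not failure_mode_deny


print(admit(RlsStatus.ERROR, failure_mode_deny=False))  # → True (traffic allowed)
print(admit(RlsStatus.ERROR, failure_mode_deny=True))   # → False (traffic denied)
```

Fail-open (`failure_mode_deny: false`) is usually the safer production default: a flaky RLS degrades to "no rate limiting" rather than to an outage.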

### 3. Fix rate limit service configuration

Deploy rate limit service:

```yaml
# Rate Limit Service deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rate-limit-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rate-limit-service
  template:
    metadata:
      labels:
        app: rate-limit-service
    spec:
      containers:
      - name: rate-limit
        image: envoyproxy/ratelimit:latest
        ports:
        - containerPort: 8080
        - containerPort: 8081  # gRPC
        env:
        - name: USE_STATSD
          value: "false"
        - name: REDIS_SOCKET_TYPE
          value: "tcp"
        - name: REDIS_URL
          value: "redis:6379"
        - name: RUNTIME_ROOT
          value: "/data"
        - name: RUNTIME_SUBDIRECTORY
          value: "ratelimit"
        - name: RUNTIME_IGNOREDOTFILES
          value: "true"  # ConfigMap mounts create dotfile symlinks
        volumeMounts:
        - name: config
          # The service loads *.yaml from $RUNTIME_ROOT/$RUNTIME_SUBDIRECTORY/config
          mountPath: /data/ratelimit/config
      volumes:
      - name: config
        configMap:
          name: rate-limit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rate-limit-config
  namespace: production
data:
  config.yaml: |
    domain: production
    descriptors:
    - key: remote_address
      rate_limit:
        unit: second
        requests_per_unit: 100
    - key: header_match
      value: api-key
      rate_limit:
        unit: minute
        requests_per_unit: 1000
    - key: generic_key
      value: default
      rate_limit:
        unit: second
        requests_per_unit: 50
```
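Descriptor mismatches between Envoy and this config are the most common global-rate-limit bug: the RLS only applies a limit when the request's descriptor key (and value, if the rule specifies one) matches an entry. A simplified Python sketch of that lookup (hypothetical helper, not the service's actual code):

```python
# Config entries mirroring the ConfigMap above:
# (key, value-or-None) -> (requests_per_unit, unit)
CONFIG = {
    ("remote_address", None): (100, "second"),
    ("header_match", "api-key"): (1000, "minute"),
    ("generic_key", "default"): (50, "second"),
}


def lookup_limit(key, value):
    """Exact key+value rule wins; otherwise fall back to a key-only rule.
    None means no rule matched, so no limit is applied."""
    return CONFIG.get((key, value)) or CONFIG.get((key, None))


# remote_address has no `value` in the config, so every client IP matches:
print(lookup_limit("remote_address", "10.0.0.7"))  # → (100, 'second')
# header_match only matches the exact configured value:
print(lookup_limit("header_match", "api-key"))     # → (1000, 'minute')
print(lookup_limit("header_match", "other"))       # → None (unlimited!)
```

This is why a typo in a descriptor value silently disables the limit instead of failing loudly: unmatched descriptors simply return OK.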

### 4. Fix Istio quota policies

Legacy Istio Mixer rate limiting (pre-1.5; Mixer was deprecated in Istio 1.5 and removed in later releases, so use the EnvoyFilter approach above on current versions):

```yaml
# Memquota instance
apiVersion: config.istio.io/v1alpha2
kind: memquota
metadata:
  name: handler
  namespace: istio-system
spec:
  quotas:
  - name: requestcount.quota.istio-system
    max_amount: 1000
    valid_duration: 1s
    overrides:
    - dimensions:
        destination: my-service
      max_amount: 100
      valid_duration: 1s
---
# QuotaSpec for the service
apiVersion: config.istio.io/v1alpha2
kind: QuotaSpec
metadata:
  name: request-count
  namespace: istio-system
spec:
  rules:
  - quotas:
    - charge: 1
      quota: requestcount
---
# Bind QuotaSpec to service
apiVersion: config.istio.io/v1alpha2
kind: QuotaSpecBinding
metadata:
  name: request-count
  namespace: istio-system
spec:
  quotaSpecs:
  - name: request-count
    namespace: istio-system
  services:
  - name: my-service
    namespace: production
---
# Rule to enable rate limiting
apiVersion: config.istio.io/v1alpha2
kind: rule
metadata:
  name: quota
  namespace: istio-system
spec:
  actions:
  - handler: handler.memquota
    instances:
    - requestcount.quota
```

### 5. Fix Redis backend issues

Redis configuration for rate limiting:

```yaml
# Redis deployment for rate limiting
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        ports:
        - containerPort: 6379
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
        command: ["redis-server"]
        args: ["--maxmemory", "64mb", "--maxmemory-policy", "allkeys-lru"]
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: production
spec:
  selector:
    app: redis
  ports:
  - port: 6379
    targetPort: 6379
```
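Conceptually, the RLS uses Redis as a shared counter store: each (domain, descriptor, time-window) tuple becomes a key that is incremented per request, which is what makes global limits consistent across proxies. A minimal fixed-window sketch with an in-memory stand-in for Redis (`FakeRedis` and `over_limit` are hypothetical names for illustration):

```python
import time


class FakeRedis:
    """In-memory stand-in for the single Redis command this sketch needs."""

    def __init__(self):
        self.store = {}

    def incr(self, key):
        # Redis INCR: atomic increment, returns the new value.
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]


def over_limit(redis, domain, key, value, limit_per_second, now=None):
    """Fixed-window counter keyed by (domain, descriptor, current second) --
    the same shape of key a Redis-backed rate limit service writes."""
    now = time.time() if now is None else now
    window_key = f"{domain}_{key}_{value}_{int(now)}"
    return redis.incr(window_key) > limit_per_second


# 60 requests in one second against a 50/s limit: the last 10 are rejected.
r = FakeRedis()
hits = [over_limit(r, "production", "generic_key", "default", 50, now=1000.0)
        for _ in range(60)]
print(hits.count(True))  # → 10
```

This also shows why Redis failures matter: if `incr` is unreachable, the service cannot count at all and must choose to fail open or closed.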

Debug Redis connection:

```bash
# Check Redis connectivity from the rate limit service pod
kubectl exec -it <rls-pod> -n production -- redis-cli -h redis ping

# Check Redis memory usage
kubectl exec -it <redis-pod> -n production -- redis-cli info memory

# Monitor Redis keys being set
kubectl exec -it <redis-pod> -n production -- redis-cli monitor | grep ratelimit

# Clear rate limit keys (for testing) — run the whole pipeline inside the pod,
# otherwise the second redis-cli executes on your workstation
kubectl exec -it <redis-pod> -n production -- \
  sh -c 'redis-cli keys "ratelimit:*" | xargs -r redis-cli del'
```

### 6. Fix rate limit headers

Configure response headers:

```yaml
# EnvoyFilter to add rate limit headers
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: rate-limit-headers
  namespace: production
spec:
  workloadSelector:
    labels:
      app: my-service
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
    patch:
      operation: MERGE
      value:
        name: envoy.filters.http.ratelimit
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
          domain: production
          rate_limit_service:
            grpc_service:
              envoy_grpc:
                cluster_name: rate_limit_service
          # Enum, not a boolean: emits draft-RFC X-RateLimit-* headers
          enable_x_ratelimit_headers: DRAFT_VERSION_03
```

Expected rate limit headers:

```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1625140800
Retry-After: 60        # only on 429 responses
```
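Well-behaved clients should honor these headers rather than hammering a 429ing endpoint. A sketch of client-side 429 handling that prefers `Retry-After` and falls back to exponential backoff (`retry_after_429` and `do_request` are hypothetical names; the sleep function is injectable so the logic is testable):

```python
import time


def retry_after_429(do_request, sleep=time.sleep, max_attempts=5, base_delay=0.5):
    """Retry on 429: honor Retry-After when present, otherwise back off
    exponentially, capping the wait at 30s. `do_request` is any callable
    returning (status_code, headers)."""
    for attempt in range(max_attempts):
        status, headers = do_request()
        if status != 429:
            return status
        retry_after = headers.get("Retry-After")
        delay = float(retry_after) if retry_after else base_delay * (2 ** attempt)
        sleep(min(delay, 30.0))
    return 429  # still rate limited after all attempts


# Demo: two 429s (the second carrying Retry-After: 60), then success.
responses = iter([(429, {}), (429, {"Retry-After": "60"}), (200, {})])
delays = []
print(retry_after_429(lambda: next(responses), sleep=delays.append))  # → 200
print(delays)  # → [0.5, 30.0]
```

In production you would also add jitter to the backoff so that many clients rate-limited at the same instant do not retry in lockstep.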

### 7. Debug rate limiting in production

Enable debug logging:

```bash
# Increase the Envoy log level for the "filter" logger
# (rate limit filter logs are emitted under this logger; there is no
# dedicated "ratelimit" logger)
istioctl proxy-config log <pod-name>.<namespace> --level filter:debug

# Or set it via the Envoy admin interface (from inside the pod)
curl -X POST "http://localhost:15000/logging?filter=debug"

# Watch rate limit stats
watch -n 1 'kubectl exec <pod-name> -c istio-proxy -- \
  pilot-agent request GET stats | grep ratelimit'
```

Test rate limiting:

```bash
# Send a burst of requests to test rate limiting
for i in {1..150}; do
  curl -s -o /dev/null -w "%{http_code}\n" https://my-service.example.com/api/endpoint
done | sort | uniq -c

# Expected output if the rate limit is 100/s:
#   100 200
#    50 429

# Use ab (Apache Bench) for load testing
ab -n 1000 -c 10 https://my-service.example.com/api/endpoint

# Use vegeta for sustained load
echo "GET https://my-service.example.com/api/endpoint" | \
  vegeta attack -duration=30s -rate=100 | vegeta report
```

Prevention

  • Test rate limit configuration in staging before production
  • Set failure_mode_deny: false to allow traffic if RLS unavailable
  • Monitor rate limit metrics (429 count, RLS latency, Redis latency)
  • Use gradual rollout for rate limit policy changes
  • Document rate limit thresholds and escalation procedures
  • Set up alerts for unusual 429 rates
  • Use distributed tracing to track rate limit decisions
  • Implement client-side retry with exponential backoff
  • Cache rate limit decisions locally to reduce RLS calls
  • Regular load testing to validate rate limit effectiveness

Related Errors

  • **Service mesh sidecar injection failed**: Sidecar not injected into pod
  • **Service mesh mTLS connection failed**: Certificate or policy mismatch
  • **Service mesh destination rule configuration error**: Traffic policy misconfiguration
  • **Envoy upstream connect timeout**: Backend service unreachable