## Introduction

Cosmos DB returns HTTP 429 (Too Many Requests) when your application exceeds the provisioned Request Units per second (RU/s). Each 429 response includes an `x-ms-retry-after-ms` header indicating how long to wait before retrying, but without proper retry logic these responses surface as application errors.
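To make the backoff math concrete, here is a minimal sketch of computing the retry delay as the larger of an exponential backoff term and the server-supplied `RetryAfterMs` value. The `BackoffDelay` helper and its defaults are illustrative, not part of the Cosmos SDK (which handles this internally, as shown in the steps below):

```csharp
using System;

class BackoffDemo
{
    // Delay before retry attempt n: exponential backoff (baseMs * 2^attempt,
    // capped at capMs), floored at the server-provided RetryAfterMs value so
    // we never retry sooner than the service asked us to wait.
    public static TimeSpan BackoffDelay(int attempt, int serverRetryAfterMs,
                                        int baseMs = 100, int capMs = 30_000)
    {
        double exponential = Math.Min(baseMs * Math.Pow(2, attempt), capMs);
        return TimeSpan.FromMilliseconds(Math.Max(exponential, serverRetryAfterMs));
    }

    static void Main()
    {
        // Using RetryAfterMs: 1234 from the sample 429 body in Symptoms
        for (int attempt = 0; attempt < 5; attempt++)
            Console.WriteLine(
                $"attempt {attempt}: wait {BackoffDelay(attempt, 1234).TotalMilliseconds} ms");
        // attempts 0-3 wait 1234 ms (server floor dominates); attempt 4 waits 1600 ms
    }
}
```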

## Symptoms

- HTTP 429 response: `Request rate is large. RetryAfterMs: 1234`
- Azure portal shows Normalized RU Consumption at or near 100%
- Increased latency as client retries back off
- Application timeouts due to an exhausted retry budget

## Common Causes

- Provisioned RU/s too low for the workload
- Hot partition: a single logical partition receiving a disproportionate share of requests
- Cross-partition queries consuming excessive RUs
- Missing or misconfigured retry policy in the SDK
- Autoscale ceiling too low for peak traffic

## Step-by-Step Fix

1. **Check current RU consumption**:

   ```bash
   az monitor metrics list --resource <cosmos-resource-id> \
     --metric "NormalizedRUConsumption" "TotalRequestUnits" "ThrottledRequests" \
     --interval PT1M
   ```

2. **Increase provisioned RU/s or switch to autoscale**:

   ```bash
   az cosmosdb sql container throughput migrate \
     --account-name my-cosmos --database-name mydb --name mycontainer \
     --resource-group my-rg --throughput-type autoscale --max-throughput 10000
   ```

3. **Implement SDK retry with exponential backoff (C#)**:

   ```csharp
   var options = new CosmosClientOptions
   {
       ConnectionMode = ConnectionMode.Gateway,
       // Retry up to 10 times on 429s, waiting at most 30 seconds in total
       MaxRetryAttemptsOnRateLimitedRequests = 10,
       MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)
   };
   var client = new CosmosClient(connectionString, options);
   ```

4. **Use bulk execution for large operations**:

   ```csharp
   // AllowBulkExecution batches concurrent point operations into fewer,
   // larger requests, reducing per-operation overhead
   var options = new CosmosClientOptions { AllowBulkExecution = true };
   var client = new CosmosClient(connectionString, options);
   var container = client.GetContainer("mydb", "mycontainer");
   var tasks = items.Select(item => container.CreateItemAsync(item));
   await Task.WhenAll(tasks);
   ```
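One caveat with the bulk pattern: `Task.WhenAll` starts every operation at once, which can itself burst past the provisioned RU/s. A hedged sketch of bounding client-side concurrency with `SemaphoreSlim` follows; the `RunThrottled` helper and its parameters are illustrative, not SDK APIs (the simulated `Task.Delay` work stands in for `CreateItemAsync`):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottledBulk
{
    // Run an action over all items with at most maxConcurrency in flight,
    // so a large batch doesn't hit the container all at once.
    public static async Task RunThrottled<T>(
        IEnumerable<T> items, Func<T, Task> action, int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var tasks = items.Select(async item =>
        {
            await gate.WaitAsync();          // wait for a free slot
            try { await action(item); }
            finally { gate.Release(); }      // free the slot for the next item
        }).ToList();
        await Task.WhenAll(tasks);
    }

    static async Task Main()
    {
        int processed = 0;
        await RunThrottled(Enumerable.Range(0, 100),
            async i => { await Task.Delay(1); Interlocked.Increment(ref processed); },
            maxConcurrency: 8);
        Console.WriteLine(processed); // prints 100
    }
}
```

In real code, `action` would be `item => container.CreateItemAsync(item)`; tune `maxConcurrency` against your provisioned RU/s.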

## Prevention

- Monitor Normalized RU Consumption with an alert at 80%
- Use autoscale for variable workloads
- Design partition keys to distribute requests evenly
- Implement proper retry logic with exponential backoff
- Use server-side batching with stored procedures
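The 80% alert from the first bullet can be provisioned with the Azure CLI. A sketch of such a config fragment; the alert name is a placeholder, and the window and evaluation frequency are assumptions to adjust for your workload:

```bash
az monitor metrics alert create \
  --name cosmos-ru-80pct \
  --resource-group my-rg \
  --scopes <cosmos-resource-id> \
  --condition "max NormalizedRUConsumption > 80" \
  --window-size 5m --evaluation-frequency 1m \
  --description "Normalized RU Consumption above 80%"
```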