Introduction

Azure Load Balancer health probes monitor backend pool VMs to determine which instances should receive traffic. When probes fail, VMs are removed from the load balancing rotation, potentially causing capacity reduction or complete service failure. Health probe failures can result from endpoint issues, NSG rules, probe configuration, or application-level problems.

Symptoms

Error indicators in Azure portal and logs:

bash
Backend VM is unhealthy
Health probe failed: endpoint not responding
Probe path returned non-200 status
Backend pool has no healthy instances

Observable indicators: - VMs show "Unhealthy" status in Backend Pool - Load balancer returns 502/503 to clients - Traffic not distributed to expected VMs - DipAvailability metric shows failures - Application Insights shows probe failures

Common Causes

  1. 1.Probe path not accessible - Endpoint returns 404 or doesn't exist
  2. 2.NSG blocking probe traffic - Security rules block AzureLoadBalancer tag
  3. 3.Probe interval too short - Application can't respond in time
  4. 4.Wrong probe port - Different port than application listening
  5. 5.HTTPS probe on HTTP endpoint - Protocol mismatch
  6. 6.VM under heavy load - Cannot respond within probe timeout
  7. 7.Internal vs External probe - Probe source IP varies by LB type

Step-by-Step Fix

Step 1: Check Backend Pool Status

```bash # Get load balancer details az network lb show --name my-lb --resource-group my-rg --query "backendAddressPools"

# Check backend pool health az network lb show --name my-lb --resource-group my-rg \ --query "backendAddressPools[].backendIPConfigurations[].{Name:name,Provisioning:provisioningState}"

# List probe configuration az network lb probe list --lb-name my-lb --resource-group my-rg

# Get probe details az network lb probe show --lb-name my-lb --resource-group my-rg --name my-probe ```

Step 2: Test Probe Endpoint from VM

```bash # SSH to backend VM az vm ssh --name my-vm --resource-group my-rg

# Test probe endpoint locally curl -v http://localhost:80/health

# Test exact probe path curl -v http://127.0.0.1:80/probe-path

# Check response timing curl -w "Connect: %{time_connect}s TTFB: %{time_starttransfer}s\n" \ http://localhost/health

# Check if application is listening on probe port netstat -tlnp | grep :80 ss -tlnp | grep :80 ```

Step 3: Check NSG Rules

```bash # Get VM's NSG az vm show --name my-vm --resource-group my-rg --query "networkProfile.networkInterfaces[].id"

# Get NSG details az network nsg show --name my-nsg --resource-group my-rg

# List NSG rules az network nsg rule list --nsg-name my-nsg --resource-group my-rg --output table

# Check for AzureLoadBalancer source rule az network nsg rule list --nsg-name my-nsg --resource-group my-rg \ --query "[?source=='AzureLoadBalancer']" ```

bash
# Add rule to allow health probe
az network nsg rule create \
    --resource-group my-rg \
    --nsg-name my-nsg \
    --name AllowAzureLoadBalancer \
    --protocol Tcp \
    --direction Inbound \
    --source-address-prefix AzureLoadBalancer \
    --source-port-range "*" \
    --destination-address-prefix "*" \
    --destination-port-range 80 \
    --access Allow \
    --priority 100

Step 4: Fix Probe Configuration

```bash # Update probe settings az network lb probe update \ --lb-name my-lb \ --resource-group my-rg \ --name my-probe \ --path /health \ --port 80 \ --protocol Http \ --interval 30 \ --timeout 10 \ --threshold 3

# Create new probe az network lb probe create \ --lb-name my-lb \ --resource-group my-rg \ --name health-probe \ --protocol Http \ --port 80 \ --path /health \ --interval 30 \ --timeout 10 \ --threshold 3 ```

Step 5: Fix HTTP vs HTTPS Protocol

```bash # For HTTPS endpoints az network lb probe update \ --lb-name my-lb \ --resource-group my-rg \ --name my-probe \ --protocol Https \ --port 443 \ --path /health

# For TCP probes (no HTTP response check) az network lb probe update \ --lb-name my-lb \ --resource-group my-rg \ --name my-probe \ --protocol Tcp \ --port 80 ```

Step 6: Check Load Balancer Rule Association

```bash # Check if probe is associated with LB rule az network lb rule show \ --lb-name my-lb \ --resource-group my-rg \ --name my-rule \ --query "{Probe:probe.id,BackendPool:backendAddressPool.id}"

# Associate probe with rule az network lb rule update \ --lb-name my-lb \ --resource-group my-rg \ --name my-rule \ --probe my-probe ```

Step 7: Monitor Probe Health Metrics

```bash # Check DipAvailability metric az monitor metrics list \ --resource /subscriptions/sub/resourceGroups/rg/providers/Microsoft.Network/loadBalancers/my-lb \ --metric "DipAvailability" \ --interval PT1M \ --aggregation Average

# Check VipAvailability az monitor metrics list \ --resource /subscriptions/sub/resourceGroups/rg/providers/Microsoft.Network/loadBalancers/my-lb \ --metric "VipAvailability" ```

Step 8: Verify the Fix

```bash # Check backend pool health az network lb show --name my-lb --resource-group my-rg \ --query "backendAddressPools[].backendIPConfigurations"

# Test through load balancer curl -v http://lb-ip-address/health

# Monitor probe status az monitor activity-log list --resource-group my-rg \ --caller "Microsoft.Network" --max-events 10 ```

Advanced Diagnosis

Check Internal Load Balancer Probe Source

```bash # For internal LB, probe comes from 168.63.129.16 # This is Azure's health probe source IP

# Allow this IP in NSG az network nsg rule create \ --resource-group my-rg \ --nsg-name my-nsg \ --name AllowHealthProbeIP \ --protocol Tcp \ --direction Inbound \ --source-address-prefix 168.63.129.16 \ --destination-port-range 80 \ --access Allow \ --priority 100 ```

Check Application Logs

```bash # On VM, check application logs for probe requests grep -i "probe|health" /var/log/app/app.log

# Check if app is returning correct status curl -v http://localhost:80/health | grep "HTTP/" ```

Use TCP Probe for Non-HTTP

bash
# For non-HTTP services
az network lb probe create \
    --lb-name my-lb \
    --resource-group my-rg \
    --name tcp-probe \
    --protocol Tcp \
    --port 3306 \
    --interval 30 \
    --threshold 3

Diagnose with NSG Flow Logs

```bash # Enable NSG flow logs az network watcher flow-log configure \ --resource-group my-rg \ --nsg my-nsg \ --storage-account my-storage \ --enabled true

# Analyze flow logs for probe traffic az storage blob list --container-name nsg-flow-logs --account-name my-storage ```

Common Pitfalls

  • Missing AzureLoadBalancer NSG rule - Most common cause of probe failure
  • Probe interval too aggressive - Application can't respond fast enough
  • Wrong probe port - Port mismatch with application
  • HTTP probe expecting HTTPS - Protocol mismatch
  • Probe threshold too low - Single failure marks unhealthy
  • Internal LB probe IP - Different source IP for internal vs external
  • Application returning redirect - Probe expects direct 200 response

Best Practices

```bash # Complete probe configuration az network lb probe create \ --lb-name my-lb \ --resource-group my-rg \ --name app-health-probe \ --protocol Http \ --port 80 \ --path /health \ --interval 30 \ --timeout 10 \ --threshold 3

# Create comprehensive NSG rule az network nsg rule create \ --resource-group my-rg \ --nsg-name my-nsg \ --name AllowHealthProbes \ --protocol Tcp \ --direction Inbound \ --source-address-prefix AzureLoadBalancer \ --source-port-range "*" \ --destination-address-prefix "*" \ --destination-port-range 80-80 \ --access Allow \ --priority 100

# For internal LB, also allow probe IP az network nsg rule create \ --resource-group my-rg \ --nsg-name my-nsg \ --name AllowInternalProbe \ --protocol Tcp \ --direction Inbound \ --source-address-prefix 168.63.129.16 \ --destination-port-range "*" \ --access Allow \ --priority 101 ```

bicep
// Bicep template for load balancer with proper probe
resource loadBalancer 'Microsoft.Network/loadBalancers@2023-02-01' = {
  name: 'app-lb'
  location: location
  sku: {
    name: 'Standard'
  }
  properties: {
    frontendIPConfigurations: [...]
    backendAddressPools: [...]
    probes: [
      {
        name: 'health-probe'
        properties: {
          protocol: 'Http'
          port: 80
          requestPath: '/health'
          intervalInSeconds: 30
          numberOfProbes: 3
        }
      }
    ]
    loadBalancingRules: [
      {
        name: 'http-rule'
        properties: {
          frontendPort: 80
          backendPort: 80
          protocol: 'Tcp'
          probe: {
            id: resourceId('Microsoft.Network/loadBalancers/probes', 'app-lb', 'health-probe')
          }
        }
      }
    ]
  }
}
  • AWS ALB Health Check Failing
  • GCP Load Balancer Backend Down
  • HAProxy Health Check Failing
  • F5 BIG-IP Pool Member Down