Introduction

A gRPC streaming connection is lost unexpectedly when a long-lived bidirectional or server-side stream is terminated prematurely by an intermediary (proxy, load balancer) or by misconfigured keepalive settings. Unlike unary RPCs, which complete quickly, streaming RPCs hold persistent connections that can be killed by idle timeouts, missing keepalive pings, HTTP/2 protocol errors, or network interruptions. When a connection drops, clients receive UNAVAILABLE, DEADLINE_EXCEEDED, or INTERNAL errors, and streaming operations fail mid-execution.

Symptoms

  • Client receives UNAVAILABLE: connection closed or DEADLINE_EXCEEDED during streaming
  • Server logs show stream closed or client disconnected without completion
  • Connections drop after consistent idle period (30s, 60s, 300s - indicates timeout)
  • Streaming works locally but fails through load balancer or proxy
  • HTTP/2 GOAWAY frames received unexpectedly
  • Issue appears after deploying behind proxy, enabling mTLS, changing network config, or scaling to multiple server instances

Common Causes

  • HTTP/2 keepalive not configured or interval too long for proxy timeout
  • Load balancer idle timeout shorter than keepalive interval
  • Proxy (Envoy, Nginx) terminating idle HTTP/2 connections
  • Client or server deadline too short for streaming duration
  • Network equipment (firewall, NAT) dropping idle TCP connections
  • HTTP/2 max concurrent streams limit reached
  • mTLS certificate rotation breaking existing connections
  • Server shutting down without graceful connection draining

Step-by-Step Fix

### 1. Enable gRPC debug logging

Capture detailed connection events:

```bash
# Enable gRPC internal logging (Go)
export GRPC_GO_LOG_VERBOSITY_LEVEL=99
export GRPC_GO_LOG_SEVERITY_LEVEL=info

# Java gRPC logging (add to logging.properties)
# io.grpc.level=FINEST
# io.netty.level=FINE

# Python gRPC tracing (controlled by environment variables)
export GRPC_VERBOSITY=DEBUG
export GRPC_TRACE=http,connectivity_state

# Check logs for disconnection reason
journalctl -u myservice -f | grep -E "grpc|http2|stream|GOAWAY"
```

Key log patterns:

  • transport: http2Server.notifyError: connection error: Connection dropped
  • http2: sent GOAWAY: Server terminating connection
  • keepalive ping too many pings: Client sending pings too frequently
  • keepalive enforcement too many pings: Server rejecting client pings

### 2. Configure HTTP/2 keepalive

Keepalive prevents intermediaries from killing idle connections:

```go
// Server-side keepalive configuration (Go)
import "google.golang.org/grpc/keepalive"

var keepaliveParams = keepalive.ServerParameters{
    MaxConnectionIdle:     15 * time.Minute, // Max idle time before GOAWAY
    MaxConnectionAge:      30 * time.Minute, // Max connection age before GOAWAY
    MaxConnectionAgeGrace: 5 * time.Minute,  // Grace period for streams to finish
    Time:                  30 * time.Second, // Ping interval
    Timeout:               5 * time.Second,  // Ping timeout
}

var enforcePolicy = keepalive.EnforcementPolicy{
    MinTime:             10 * time.Second, // Minimum time between client pings
    PermitWithoutStream: true,             // Allow pings with no active streams
}

server := grpc.NewServer(
    grpc.KeepaliveParams(keepaliveParams),
    grpc.KeepaliveEnforcementPolicy(enforcePolicy),
)
```

Client-side keepalive:

```go
// Client keepalive configuration
import "google.golang.org/grpc/keepalive"

var keepaliveClientParams = keepalive.ClientParameters{
    Time:                30 * time.Second, // Send ping every 30s
    Timeout:             5 * time.Second,  // Wait 5s for response
    PermitWithoutStream: true,             // Ping even with no active streams
}

conn, err := grpc.Dial(
    "grpc.example.com:443",
    grpc.WithKeepaliveParams(keepaliveClientParams),
    // ... other options
)
```

Java configuration:

```java
// Server keepalive
NettyServerBuilder.forPort(8080)
    .keepAliveTime(30, TimeUnit.SECONDS)
    .keepAliveTimeout(5, TimeUnit.SECONDS)
    .maxConnectionAge(30, TimeUnit.MINUTES)
    .maxConnectionIdle(15, TimeUnit.MINUTES)
    .permitKeepAliveTime(10, TimeUnit.SECONDS)
    .permitKeepAliveWithoutCalls(true)
    .build();

// Client keepalive
NettyChannelBuilder.forAddress("grpc.example.com", 443)
    .keepAliveTime(30, TimeUnit.SECONDS)
    .keepAliveTimeout(5, TimeUnit.SECONDS)
    .keepAliveWithoutCalls(true)
    .build();
```

### 3. Check load balancer timeout configuration

Load balancers must allow longer idle timeouts for gRPC:

```yaml
# AWS ALB gRPC configuration
# Target Group Attributes
alb:
  target_group:
    attributes:
      - key: deregistration_delay.timeout_seconds
        value: "300"
      - key: stickiness.enabled
        value: "false"

# ALB idle timeout must exceed keepalive interval
# Console: EC2 > Load Balancers > Attributes > Idle timeout = 400 seconds
```

Nginx proxy configuration:

```nginx
# WRONG: plain proxy_pass over HTTP/1.1 breaks gRPC streams
location /grpc/ {
    proxy_pass http://backend;
    proxy_http_version 1.1;  # WRONG - gRPC requires HTTP/2
}

# CORRECT: gRPC-aware proxy configuration
# (the enclosing server block must also use: listen 443 ssl http2;)
location /grpc/ {
    # Proxy gRPC over HTTP/2
    grpc_pass grpc://backend;

    # Keep idle upstream sockets alive
    grpc_socket_keepalive on;

    # Timeouts must exceed the keepalive interval
    grpc_read_timeout 300s;
    grpc_send_timeout 300s;

    # Buffer settings
    grpc_buffer_size 512k;

    # Headers
    grpc_set_header Host $host;
    grpc_set_header X-Real-IP $remote_addr;
}
```

Envoy proxy configuration:

```yaml
# envoy.yaml
static_resources:
  clusters:
    - name: grpc_backend
      type: STRICT_DNS
      connect_timeout: 5s
      lb_policy: ROUND_ROBIN
      http2_protocol_options:
        connection_keepalive:
          interval: 30s
          timeout: 5s
      upstream_connection_options:
        tcp_keepalive:
          keepalive_time: 300
          keepalive_interval: 30
          keepalive_probes: 3

  listeners:
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: grpc_ingress
                http2_protocol_options:
                  connection_keepalive:
                    interval: 30s
                    timeout: 5s
                route_config:
                  virtual_hosts:
                    - name: grpc
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: /
                          route:
                            cluster: grpc_backend
                            timeout: 300s  # Must exceed expected stream duration
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

### 4. Check deadline configuration

Deadlines must accommodate streaming duration:

```go
// WRONG: Deadline too short for long streaming
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

stream, err := client.LongRunningStream(ctx)
// Will fail if stream runs longer than 10 seconds

// CORRECT: Set appropriate deadline or no deadline for indefinite streams
ctx := context.Background() // No deadline for indefinite streaming
stream, err := client.LongRunningStream(ctx)

// Or use a bounded per-stream timeout sized to the workload
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()

stream, err := client.BatchedStream(ctx)
```

Server-side deadline handling:

```go
func (s *server) LongStream(req *Request, stream Service_LongStreamServer) error {
    ctx := stream.Context()

    // Check deadline
    if deadline, ok := ctx.Deadline(); ok {
        log.Printf("Client deadline: %v", deadline)
    } else {
        log.Printf("No client deadline set - using server default")
        // Set server-side deadline to prevent resource exhaustion
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, 30*time.Minute)
        defer cancel()
    }

    for {
        select {
        case <-ctx.Done():
            return ctx.Err() // DEADLINE_EXCEEDED or CANCELED
        default:
            // Continue streaming
            if err := stream.Send(&Response{}); err != nil {
                return err
            }
            time.Sleep(1 * time.Second)
        }
    }
}
```

### 5. Check for HTTP/2 protocol errors

HTTP/2 issues cause connection termination:

```bash
# Capture HTTP/2 frames with tcpdump
# (frames are only readable on plaintext connections or with TLS session keys)
tcpdump -i any -s 0 -w grpc.pcap port 443

# Analyze with tshark
tshark -r grpc.pcap -Y "http2" -T fields -e http2.type -e http2.error_code

# Look for GOAWAY frames with error codes:
# 0 (NO_ERROR): Graceful shutdown
# 1 (PROTOCOL_ERROR): Protocol violation
# 2 (INTERNAL_ERROR): Internal bug
# 6 (FRAME_SIZE_ERROR): Frame size exceeded
# 7 (REFUSED_STREAM): Stream not processed
# 8 (CANCEL): Stream canceled
# 11 (ENHANCE_YOUR_CALM): Too many operations
```

Common HTTP/2 issues:

```go
// Check for max concurrent streams limit
// Default is often 100-256 concurrent streams per connection

// Server: Increase max concurrent streams
server := grpc.NewServer(
    grpc.MaxConcurrentStreams(1000), // Increase from default
)

// Client: Check connection is not saturated
var wg sync.WaitGroup
for i := 0; i < 1000; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        // Create stream
    }()
}
// If streams > max_concurrent_streams, new streams fail
```

### 6. Handle connection gracefully

Implement reconnection and backoff:

```go
// Exponential backoff on connection failure
import "google.golang.org/grpc/backoff"

connectParams := backoff.Config{
    BaseDelay:  1.0 * time.Second,
    Multiplier: 1.6,
    MaxDelay:   120 * time.Second,
}

conn, err := grpc.Dial(
    "grpc.example.com:443",
    grpc.WithConnectParams(grpc.ConnectParams{
        Backoff:           connectParams,
        MinConnectTimeout: 20 * time.Second,
    }),
)
```

Stream retry logic:

```go
func streamWithRetry(ctx context.Context, client Client) error {
    maxRetries := 5
    backoff := time.Second

    for attempt := 0; attempt < maxRetries; attempt++ {
        stream, err := client.MyStream(ctx)
        if err != nil {
            log.Printf("Stream creation failed: %v", err)
            time.Sleep(backoff)
            backoff *= 2 // Exponential backoff
            continue
        }

        for {
            resp, err := stream.Recv()
            if err == io.EOF {
                return nil // Stream completed successfully
            }
            if err != nil {
                log.Printf("Stream error: %v", err)

                // Check if retryable
                if status.Code(err) == codes.Unavailable ||
                    status.Code(err) == codes.DeadlineExceeded {
                    log.Printf("Retryable error, reconnecting...")
                    break // Retry outer loop
                }

                return err // Non-retryable error
            }

            // Process response
            handleResponse(resp)
        }

        time.Sleep(backoff)
        backoff *= 2
    }

    return fmt.Errorf("max retries exceeded")
}
```

### 7. Check TLS/mTLS configuration

Certificate issues break connections:

```bash
# Check certificate validity
openssl s_client -connect grpc.example.com:443 -servername grpc.example.com \
  < /dev/null 2>/dev/null | openssl x509 -noout -dates

# Check certificate chain
openssl s_client -connect grpc.example.com:443 -showcerts

# Verify mTLS client certificates
curl -v --cert client.crt --key client.key https://grpc.example.com/health
```

gRPC TLS configuration:

```go
// Client with TLS
import "google.golang.org/grpc/credentials"

creds, err := credentials.NewClientTLSFromFile(
    "ca-certificates.crt",
    "grpc.example.com", // Server name
)
if err != nil {
    log.Fatalf("Failed to create TLS credentials: %v", err)
}

conn, err := grpc.Dial("grpc.example.com:443", grpc.WithTransportCredentials(creds))

// Client with mTLS
cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
if err != nil {
    log.Fatalf("Failed to load client certificate: %v", err)
}

certPool, err := x509.SystemCertPool()
if err != nil {
    log.Fatalf("Failed to get system cert pool: %v", err)
}

tlsConfig := &tls.Config{
    Certificates: []tls.Certificate{cert},
    RootCAs:      certPool,
    MinVersion:   tls.VersionTLS13,
}

mtlsCreds := credentials.NewTLS(tlsConfig)
```

### 8. Monitor stream health

Add metrics for stream lifecycle:

```go
// Prometheus metrics for gRPC streams
var (
    streamsActive = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "grpc_streams_active",
        Help: "Number of active gRPC streams",
    })

    streamsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "grpc_streams_total",
        Help: "Total number of gRPC streams created",
    })

    streamErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "grpc_stream_errors_total",
        Help: "Total gRPC stream errors by code",
    }, []string{"code"})

    streamDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "grpc_stream_duration_seconds",
        Help:    "Duration of gRPC streams",
        Buckets: prometheus.DefBuckets,
    })
)

// Wrap stream handler with metrics
func (s *server) InstrumentedStream(req *Request, stream Service_StreamServer) error {
    streamsActive.Inc()
    streamsTotal.Inc()
    defer streamsActive.Dec()

    start := time.Now()
    err := s.MyStreamHandler(req, stream)
    duration := time.Since(start)

    streamDuration.Observe(duration.Seconds())

    if err != nil {
        code := status.Code(err).String()
        streamErrors.WithLabelValues(code).Inc()
    }

    return err
}
```

Alert thresholds:

  • grpc_streams_active dropping suddenly: Connection mass termination
  • grpc_stream_errors_total{code="Unavailable"} spike: Connectivity issues
  • grpc_stream_duration_seconds p99 increasing: Performance degradation
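These thresholds can be expressed as Prometheus alerting rules. A sketch, assuming the metric names from the instrumentation above; the expressions and numeric thresholds are illustrative and should be tuned to your traffic:

```yaml
# prometheus-rules.yaml (illustrative thresholds)
groups:
  - name: grpc-streams
    rules:
      - alert: GrpcStreamMassTermination
        expr: delta(grpc_streams_active[5m]) < -50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Active gRPC streams dropped sharply"
      - alert: GrpcUnavailableSpike
        expr: rate(grpc_stream_errors_total{code="Unavailable"}[5m]) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "gRPC UNAVAILABLE stream errors spiking"
```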

### 9. Check for network equipment timeouts

Firewalls and NAT devices drop idle connections:

```bash
# Check firewall idle timeout
# Common values:
# - AWS Security Groups: 350 seconds for TCP
# - Azure NSG: 4 minutes for TCP idle
# - GCP Firewall: 10 minutes for TCP
# - Corporate firewalls: Often 5-15 minutes

# Check NAT gateway timeout
# AWS NAT Gateway: 350 seconds
# Check CloudWatch: NatGatewayBytesOut, NatGatewayPacketsOut

# Test with long idle connection
timeout 400 bash -c 'exec 3<>/dev/tcp/grpc.example.com/443; sleep 360' &
# If connection drops before 360s, an intermediate device has a shorter timeout

# Check TCP keepalive at OS level
sysctl net.ipv4.tcp_keepalive_time
sysctl net.ipv4.tcp_keepalive_intvl
sysctl net.ipv4.tcp_keepalive_probes

# Enable TCP keepalive if not set
sudo sysctl -w net.ipv4.tcp_keepalive_time=300
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=30
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5
```

### 10. Implement graceful shutdown

Drain connections before server stop:

```go
type gracefulServer struct {
    server   *grpc.Server
    listener net.Listener
    quit     chan struct{}
}

func (gs *gracefulServer) Start() error {
    return gs.server.Serve(gs.listener)
}

func (gs *gracefulServer) Stop(timeout time.Duration) error {
    close(gs.quit)

    // GracefulStop blocks until all in-flight streams complete,
    // so run it in a goroutine and enforce the timeout ourselves
    done := make(chan struct{})
    go func() {
        gs.server.GracefulStop()
        close(done)
    }()

    select {
    case <-done:
        return nil
    case <-time.After(timeout):
        // Force stop if timeout exceeded
        gs.server.Stop()
        return fmt.Errorf("graceful shutdown timeout exceeded")
    }
}

// In main()
lis, err := net.Listen("tcp", ":8080")
if err != nil {
    log.Fatalf("listen: %v", err)
}

server := &gracefulServer{
    server:   grpc.NewServer(),
    listener: lis,
    quit:     make(chan struct{}),
}

// Handle SIGINT/SIGTERM
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM)

go func() {
    <-sigChan
    log.Println("Shutting down gracefully...")
    if err := server.Stop(30 * time.Second); err != nil {
        log.Printf("Shutdown error: %v", err)
    }
}()
```

Prevention

  • Set keepalive interval to 1/2 of shortest proxy timeout
  • Configure load balancer idle timeout > 5 minutes for gRPC
  • Use PermitWithoutStream: true for clients with long idle periods
  • Implement exponential backoff for reconnection
  • Monitor active stream count and error rates
  • Set appropriate deadlines based on expected stream duration
  • Use graceful shutdown with connection draining
  • Document network timeout requirements for operations team

Common Error Messages

  • **UNAVAILABLE: connection closed**: Connection terminated unexpectedly
  • **DEADLINE_EXCEEDED**: Stream exceeded its deadline
  • **RESOURCE_EXHAUSTED: too many concurrent streams**: Max streams limit reached
  • **INTERNAL: HTTP/2 protocol error**: Protocol violation detected