Introduction
The requests.ConnectionError with a timeout message is one of the most common failures in Python services that call external APIs. Unlike requests.Timeout which fires after a successful TCP connection with slow response, ConnectionError occurs when the TCP handshake itself fails -- meaning the remote server is unreachable, DNS resolution fails, or the connection is actively refused. In production environments with unreliable downstream services, wrapping every request in retry logic with exponential backoff is essential for resilience.
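The distinction shows up in the requests exception hierarchy itself: ConnectTimeout (the handshake never completed) is a subclass of both ConnectionError and Timeout, while ReadTimeout (the connection succeeded but the response stalled) is only a Timeout. A quick check:

```python
import requests.exceptions as exc

# ConnectTimeout: the TCP/TLS handshake timed out -- it inherits from both
# ConnectionError and Timeout, so `except ConnectionError` catches it.
print(issubclass(exc.ConnectTimeout, exc.ConnectionError))  # True
print(issubclass(exc.ConnectTimeout, exc.Timeout))          # True

# ReadTimeout: the connection was established but the server responded
# slowly -- it is a Timeout but NOT a ConnectionError.
print(issubclass(exc.ReadTimeout, exc.Timeout))             # True
print(issubclass(exc.ReadTimeout, exc.ConnectionError))     # False
```

This matters when ordering your except clauses: a handler for ConnectionError will also swallow connect timeouts.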
Symptoms
Your application logs show errors like:
```
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.external-service.com', port=443): Max retries exceeded with url: /v1/data (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9c2c1a3d90>: Failed to establish a new connection: [Errno 110] Connection timed out'))
```

Or on Windows:

```
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.external-service.com', port=443): Max retries exceeded with url: /v1/data (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001A2B3C4D5E0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time'))
```

The application hangs for 30+ seconds per request before failing, causing cascading timeouts in upstream callers.
Common Causes
- Network partitions: Temporary network glitches between your server and the external API
- Downstream service restart: The target service briefly goes down during a rolling deployment
- DNS resolution failures: DNS cache expired and the resolver is temporarily unreachable
- Connection limits hit: The remote server has hit its connection limit and drops new connections
- Firewall or security group rules: A recent infrastructure change blocked outbound traffic
- No retry configured: Using requests.get() directly without any retry mechanism
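Before reaching for library configuration, the core idea of the fix is just a retry loop with exponentially growing delays. A minimal, illustrative sketch (the function name is ours, not part of requests; real code would catch requests.exceptions.ConnectionError rather than the builtin):

```python
import time

def retry_with_backoff(func, max_attempts=3, base_delay=1.0):
    """Call func(); on connection failure, sleep base_delay * 2**attempt
    and try again. Illustrative only -- production code should use the
    urllib3 Retry configuration shown in Step 1 below."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The urllib3-based approach in Step 1 does the same thing, but also handles retryable HTTP status codes and per-method rules.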
Step-by-Step Fix
Step 1: Add retry with exponential backoff using urllib3
The requests library uses urllib3 under the hood. You can configure a Retry object and mount it to a session:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_resilient_session():
    session = requests.Session()

    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        # allowed_methods requires urllib3 >= 1.26; older versions
        # use the method_whitelist parameter instead.
        allowed_methods=["HEAD", "GET", "OPTIONS", "POST"],
        raise_on_status=False,
    )

    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)

    return session
```
With backoff_factor=1, urllib3 spaces retries exponentially, doubling the delay each time (capped at 120 seconds by default); depending on your urllib3 version, the first retry may fire with no delay. The status_forcelist triggers retries on transient HTTP errors. Be careful with POST in allowed_methods: POST is not idempotent in general, so include it only when the target API can safely handle a replayed request -- otherwise a retried POST can create duplicate resources.
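You can sanity-check the mounted retry configuration without making any network call, since Session.get_adapter returns the adapter that would handle a given URL. This sketch rebuilds a session with the same settings as above:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Build a session the same way create_resilient_session does.
session = requests.Session()
retry_strategy = Retry(total=3, backoff_factor=1,
                       status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retry_strategy))

# get_adapter resolves which adapter handles a URL -- no request is sent.
adapter = session.get_adapter("https://api.external-service.com/v1/data")
print(adapter.max_retries.total)           # 3
print(adapter.max_retries.backoff_factor)  # 1
```

This is a cheap assertion to add to a unit test so a refactor cannot silently drop the retry configuration.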
Step 2: Set explicit connect and read timeouts
Never use requests.get(url) without a timeout. Always specify both connect and read timeouts as a tuple:
```python
import logging

logger = logging.getLogger(__name__)

session = create_resilient_session()

try:
    response = session.get(
        "https://api.external-service.com/v1/data",
        timeout=(3.05, 10),  # (connect_timeout, read_timeout)
    )
    response.raise_for_status()
    data = response.json()
except requests.exceptions.ConnectionError as e:
    logger.error("Connection failed after retries: %s", e)
    raise
except requests.exceptions.Timeout as e:
    logger.error("Request timed out after retries: %s", e)
    raise
```
The connect timeout of 3.05 seconds follows a recommendation in the requests documentation: set connect timeouts slightly larger than a multiple of 3 seconds, which is the default TCP packet retransmission window.
Step 3: Use a circuit breaker for persistent failures
When a downstream service is completely down, retrying every request wastes resources. Add a circuit breaker:
```python
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=30)
def fetch_external_data(url):
    session = create_resilient_session()
    response = session.get(url, timeout=(3.05, 10))
    response.raise_for_status()
    return response.json()
```
After 5 consecutive failures the circuit opens, and for the next 30 seconds subsequent calls fail immediately without hitting the network.
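If pulling in the circuitbreaker package is not an option, the pattern itself is small enough to sketch by hand. The class and exception names below are illustrative, not from any library:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is short-circuited."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened

    def call(self, func, *args, **kwargs):
        # Open circuit: fail fast until recovery_timeout has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open, call skipped")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

A production implementation would also need thread safety and per-endpoint breaker instances, which the circuitbreaker package handles for you.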
Prevention
- Always use requests.Session() with retry adapters for production code
- Set both connect and read timeouts on every request -- the default is no timeout
- Monitor retry rates in your APM dashboard; a spike indicates downstream degradation
- Use circuit breakers to prevent cascading failures when a service is down
- Consider implementing a fallback cache that serves stale data during outages
- Set up alerting on requests.exceptions.ConnectionError rates exceeding baseline
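The fallback-cache idea above can be sketched as a thin wrapper that serves the last good response when a fetch fails. The names here are illustrative, and real code would catch requests.exceptions.ConnectionError rather than the builtin:

```python
_cache = {}  # url -> last successful payload

def fetch_with_fallback(url, fetch):
    """Fetch fresh data; on connection failure, serve the last good value.

    `fetch` is any callable returning the parsed payload (for example a
    wrapper around session.get(...).json()). Illustrative sketch only.
    """
    try:
        data = fetch(url)
    except ConnectionError:
        if url in _cache:
            return _cache[url]  # stale, but better than a hard outage
        raise  # nothing cached yet: propagate the error
    _cache[url] = data  # remember the last successful response
    return data
```

For real use you would bound the cache size and attach a timestamp so callers can tell how stale the fallback data is.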