Introduction

A SQLAlchemy connection pool "deadlock" manifests when every connection in the pool is checked out and no new connections can be created. Unlike a true OS-level deadlock, this is a resource-exhaustion scenario: threads or coroutines hold connections indefinitely while waiting for other connections, creating a circular wait condition. Under production load, this error brings the entire application to a halt as every database operation queues up waiting for a connection that will never become available.

Symptoms

The application hangs and eventually throws:

```text
sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 0 reached, connection timed out, timeout 30.00
```

Or with pool logging enabled:

```text
sqlalchemy.pool.impl.QueuePool INFO: Connection pool exhausted, waiting for available connection (3 threads waiting)
sqlalchemy.pool.impl.QueuePool INFO: Pool size: 10, Overflow: 0, Checked in: 0, Checked out: 10
```

Application monitoring shows database query latency spiking to the pool timeout value (30 seconds by default), with request queuing behind the database layer.
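When you suspect exhaustion, it helps to read the pool counters directly. A minimal sketch (the `pool_stats` helper is illustrative, demonstrated here against a throwaway SQLite engine rather than your production one):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

def pool_stats(engine):
    """Snapshot the engine's QueuePool counters."""
    pool = engine.pool
    return {
        "size": pool.size(),               # configured pool_size
        "checked_out": pool.checkedout(),  # connections currently in use
        "overflow": pool.overflow(),       # current overflow (negative until the pool fills)
    }

engine = create_engine("sqlite://", poolclass=QueuePool)
conn = engine.connect()
print(pool_stats(engine))  # checked_out is 1 while conn is held
conn.close()
```

If `checked_out` sits at `pool_size + max_overflow` while the application hangs, the pool is exhausted.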

Common Causes

  • Connections not returned to pool: Forgetting to call session.close() or not using context managers for sessions
  • Long-running transactions holding connections: A single transaction spanning multiple external API calls holds its connection the entire time
  • Pool size too small for concurrency: The default pool_size=5 (plus the default max_overflow=10) supports at most 15 concurrent connections, so 20 worker threads will start blocking under load
  • Connection leaks in error paths: An exception raised after engine.connect() but before conn.close() leaves the connection checked out
  • Deadlocked database transactions: Two transactions waiting on row locks in opposite order hold their connections while the database resolves the deadlock
  • Using NullPool accidentally: NullPool opens a new connection for every checkout and closes it on release; under high concurrency the resulting churn can exhaust the database server's max_connections

Step-by-Step Fix

Step 1: Configure pool sizing correctly for your workload

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/mydb",
    poolclass=QueuePool,   # The default pool class, shown explicitly
    pool_size=20,          # Persistent connections kept in the pool
    max_overflow=10,       # Extra connections allowed beyond pool_size
    pool_timeout=30,       # Seconds to wait before raising TimeoutError
    pool_recycle=1800,     # Recycle connections after 30 minutes
    pool_pre_ping=True,    # Verify connection liveness before each use
)
```

The rule of thumb: pool_size should match the number of concurrent database operations, not the number of threads. For web applications, start with pool_size = (worker_count * 2) + 1.
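The sizing arithmetic can be written down as a pair of helpers (the function names and the `reserved` margin are illustrative, not an official formula):

```python
def recommended_pool_size(worker_count: int) -> int:
    """Starting point from the rule of thumb above."""
    return worker_count * 2 + 1

def fits_database(pool_size: int, max_overflow: int,
                  db_max_connections: int, reserved: int = 5) -> bool:
    """True if the pool can never exceed the server's connection limit,
    keeping a few connections in reserve for admin tools and other apps."""
    return pool_size + max_overflow <= db_max_connections - reserved

print(recommended_pool_size(4))    # 9
print(fits_database(20, 10, 100))  # True: 30 <= 95
print(fits_database(60, 50, 100))  # False: 110 > 95
```

Remember that each application process holds its own pool: four Gunicorn workers each configured with pool_size=20 can open 80 connections between them.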

Step 2: Always use session context managers

```python
from contextlib import contextmanager
from datetime import datetime

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)

@contextmanager
def get_session():
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()  # Always returns the connection to the pool

# Usage - the connection is always returned to the pool, and the
# context manager commits on successful exit
with get_session() as session:
    user = session.query(User).filter_by(id=1).first()
    user.last_login = datetime.utcnow()
```

Step 3: Diagnose connection leaks with pool status logging

```python
import logging

from sqlalchemy import event

logging.basicConfig()
logging.getLogger("sqlalchemy.pool").setLevel(logging.INFO)

@event.listens_for(engine, "checkout")
def on_checkout(dbapi_conn, connection_rec, connection_proxy):
    logging.info(
        "Connection checked out. Pool status: size=%d, overflow=%d, checked_out=%d",
        engine.pool.size(),
        engine.pool.overflow(),
        engine.pool.checkedout(),
    )

@event.listens_for(engine, "checkin")
def on_checkin(dbapi_conn, connection_rec):
    logging.info(
        "Connection checked in. Pool status: size=%d, overflow=%d, checked_out=%d",
        engine.pool.size(),
        engine.pool.overflow(),
        engine.pool.checkedout(),
    )
```

This reveals which code paths are checking out connections without checking them back in.
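A common extension of this diagnosis (a sketch, not a built-in SQLAlchemy feature) is to record a stack trace at each checkout; whatever is left in the tracking dict when the application is idle points at the code path that leaked. Shown here against a throwaway SQLite engine:

```python
import traceback

from sqlalchemy import create_engine, event
from sqlalchemy.pool import QueuePool

engine = create_engine("sqlite://", poolclass=QueuePool)  # stand-in engine
active_checkouts = {}  # id of pool record -> stack trace at acquisition

@event.listens_for(engine, "checkout")
def remember_checkout(dbapi_conn, connection_rec, connection_proxy):
    active_checkouts[id(connection_rec)] = "".join(traceback.format_stack())

@event.listens_for(engine, "checkin")
def forget_checkout(dbapi_conn, connection_rec):
    active_checkouts.pop(id(connection_rec), None)

# Any entry still present after a request finishes is a candidate leak,
# and its value shows exactly where the connection was acquired.
```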

Step 4: Break long transactions into smaller units

```python
# BAD - holds a connection for the entire duration of the external call
with get_session() as session:
    user = session.query(User).get(user_id)
    external_data = call_slow_external_api(user.email)  # Connection held here
    user.profile = external_data

# GOOD - release the connection during the external call
with get_session() as session:
    user = session.query(User).get(user_id)
    user_data = {"email": user.email, "name": user.name}

external_data = call_slow_external_api(user_data["email"])

with get_session() as session:
    user = session.query(User).get(user_id)
    user.profile = external_data
```

Prevention

  • Set pool_pre_ping=True to detect stale connections before use
  • Monitor pool.checkedout() metric in production alerting
  • Use pool_timeout=10 (not 30) to fail fast rather than hanging for 30 seconds
  • Enable echo_pool=True on the engine in staging to trace connection checkout and checkin patterns
  • Use SHOW max_connections on PostgreSQL to ensure pool_size + max_overflow does not exceed database limits
  • Set pool_recycle below your database server's idle-connection timeout (e.g. MySQL's wait_timeout) to avoid stale connections
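The monitoring bullet above can be sketched as a background check that calls an alert hook when usage stays near capacity (`capacity` and the `alert` callback are assumptions; wire them to your own metrics or paging system):

```python
import threading

def start_pool_monitor(engine, capacity, alert, threshold=0.8, interval=15.0):
    """Every `interval` seconds, call alert(used, capacity) if the number of
    checked-out connections reaches threshold * capacity.
    Returns an Event; call set() on it to stop the monitor."""
    stop = threading.Event()

    def check():
        while not stop.wait(interval):
            used = engine.pool.checkedout()
            if used >= threshold * capacity:
                alert(used, capacity)

    threading.Thread(target=check, daemon=True).start()
    return stop

# Usage sketch (page_oncall is a hypothetical callback):
# stop = start_pool_monitor(engine, capacity=30, alert=page_oncall)
# ...
# stop.set()
```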