Introduction

The Python Global Interpreter Lock (GIL) ensures only one thread executes Python bytecode at a time. For I/O-bound tasks this is fine since threads release the GIL during network or disk operations. For CPU-bound tasks like numerical computation, data transformation, or image processing, threads provide no speedup and can actually slow things down due to context-switching overhead.
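The I/O-bound case is easy to demonstrate: a quick sketch using time.sleep() as a stand-in for a network or disk wait (the thread count and durations here are illustrative):

```python
import threading
import time

def fake_io():
    # time.sleep releases the GIL, just like a real network or disk wait
    time.sleep(0.2)

start = time.perf_counter()
threads = [threading.Thread(target=fake_io) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print(f"10 overlapping 0.2s waits finished in {elapsed:.2f}s")
# All ten waits overlap, so this takes roughly 0.2s, not 2s
```

Because each thread spends its time blocked outside the interpreter, the GIL never becomes the bottleneck here.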

Symptoms

  • Multithreaded CPU-bound code runs no faster than single-threaded
  • Adding more threads makes performance worse
  • top shows only one CPU core at 100% while others idle
  • Profiling reveals threads spending most of their time waiting to acquire the GIL
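One way to confirm the single-core symptom without watching top is to compare wall-clock time against process CPU time: under the GIL the two are roughly equal, because only one thread burns CPU at any moment. A minimal sketch (the loop size is arbitrary):

```python
import threading
import time

def burn(n=3_000_000):
    # Pure-Python arithmetic: holds the GIL for the entire loop
    total = 0
    for i in range(n):
        total += i * i
    return total

threads = [threading.Thread(target=burn) for _ in range(4)]
wall_start = time.perf_counter()
cpu_start = time.process_time()
for t in threads:
    t.start()
for t in threads:
    t.join()
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall: {wall:.2f}s  cpu: {cpu:.2f}s  cores busy: ~{cpu / wall:.1f}")
# Under the GIL the cpu/wall ratio stays near 1.0;
# with real 4-way parallelism it would approach 4.0
```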

```python
import threading
import time

def compute():
    result = 0
    for i in range(100_000_000):
        result += i * i
    return result

start = time.time()
threads = [threading.Thread(target=compute) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"4 threads: {time.time() - start:.2f}s")
# Result: ~12s, no faster than four sequential calls (one call alone
# takes ~3s); the GIL serializes the threads
```

Common Causes

  • Using threading.Thread for CPU-intensive calculations
  • Assuming Python threads work like OS threads for parallelism
  • Data processing pipelines using threads instead of processes
  • Image/video processing code using thread pools

Step-by-Step Fix

  1. Switch to ProcessPoolExecutor for CPU-bound work:

```python
from concurrent.futures import ProcessPoolExecutor

def compute():
    result = 0
    for i in range(100_000_000):
        result += i * i
    return result

# The __main__ guard is required on platforms that spawn workers
# (Windows, and macOS by default)
if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(compute) for _ in range(4)]
        results = [f.result() for f in futures]
    # Result: ~3s total (true parallelism across 4 cores)
```

  2. Use multiprocessing for shared data scenarios:

```python
from multiprocessing import Process, Array
import ctypes

def compute_chunk(shared_result, index, start, end):
    total = 0
    for i in range(start, end):
        total += i * i
    shared_result[index] = total  # write into the shared array, not a copy

if __name__ == '__main__':
    result = Array(ctypes.c_longlong, 4)
    chunk = 25_000_000
    # Pass the whole Array plus an index; passing result[i] would hand the
    # child a plain int, and the parent would never see the update
    procs = [Process(target=compute_chunk, args=(result, i, i * chunk, (i + 1) * chunk))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    total = sum(result[:])  # combine the per-chunk partial sums
```

  3. Use NumPy for vectorized operations (releases the GIL internally):

```python
import numpy as np

# float64 avoids silent overflow: the exact sum of i*i for i < 10**8
# is ~3.3e23, which wraps around in 64-bit integer arithmetic
arr = np.arange(100_000_000, dtype=np.float64)
result = np.sum(arr * arr)  # runs in C; the GIL is released during the loop
```

  4. Consider Cython or native extensions for critical paths:

```python
# cython_gil.pyx
# cython: boundscheck=False, wraparound=False
def compute_cython(long n):
    cdef long long result = 0
    cdef long long i
    with nogil:
        for i in range(n):
            result += i * i
    return result
```

Prevention

  • Check sys.getswitchinterval() to see how often the interpreter offers to switch threads; use a sampling profiler such as py-spy to confirm threads are contending for the GIL
  • Use concurrent.futures and choose ProcessPoolExecutor vs ThreadPoolExecutor deliberately, based on whether the work is CPU-bound or I/O-bound
  • Use time.perf_counter() to benchmark before and after threading changes
  • Consider multiprocessing.shared_memory for large data sharing between processes
  • For Python 3.13+, the free-threaded build (PEP 703) can eliminate the GIL entirely