Introduction
The Python Global Interpreter Lock (GIL) ensures only one thread executes Python bytecode at a time. For I/O-bound tasks this is fine since threads release the GIL during network or disk operations. For CPU-bound tasks like numerical computation, data transformation, or image processing, threads provide no speedup and can actually slow things down due to context-switching overhead.
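The I/O-bound case is easy to see in a small sketch, using `time.sleep` as a stand-in for a blocking network or disk call (the sleep duration and thread count are illustrative):

```python
import threading
import time

def io_task():
    # time.sleep releases the GIL, just like a real network or disk wait
    time.sleep(0.5)

start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print(f"4 I/O-bound threads: {elapsed:.2f}s")  # ~0.5s total, not ~2s
```

Because each thread spends its time waiting rather than executing bytecode, the four waits overlap and the total is roughly one sleep, not four.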
Symptoms
- Multithreaded CPU-bound code runs no faster than single-threaded
- Adding more threads makes performance worse
- `top` shows only one CPU core at 100% while others idle
- Profiling reveals threads spending their time waiting to acquire the GIL
```python
import threading
import time

def compute():
    result = 0
    for i in range(100_000_000):
        result += i * i
    return result

start = time.time()
threads = [threading.Thread(target=compute) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"4 threads: {time.time() - start:.2f}s")
# Result: ~12s (slower than sequential ~3s on single core)
```
Common Causes
- Using `threading.Thread` for CPU-intensive calculations
- Assuming Python threads work like OS threads for parallelism
- Data processing pipelines using threads instead of processes
- Image/video processing code using thread pools
Step-by-Step Fix
1. Switch to `ProcessPoolExecutor` for CPU-bound work:

```python
from concurrent.futures import ProcessPoolExecutor

def compute():
    result = 0
    for i in range(100_000_000):
        result += i * i
    return result

if __name__ == '__main__':
    # The guard is required on platforms that spawn workers (Windows, macOS)
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(compute) for _ in range(4)]
        results = [f.result() for f in futures]
    # Result: ~3s total (true parallelism across 4 cores)
```
2. Use `multiprocessing` when workers must write results into shared data:

```python
from multiprocessing import Process, Array
import ctypes

def compute_chunk(shared_result, idx, start, end):
    total = 0
    for i in range(start, end):
        total += i * i
    shared_result[idx] = total  # store this worker's partial sum

if __name__ == '__main__':
    result = Array(ctypes.c_longlong, 4)
    # Chunk kept small enough that each partial sum fits in a 64-bit integer
    chunk = 250_000
    procs = [Process(target=compute_chunk, args=(result, i, i * chunk, (i + 1) * chunk))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    total = sum(result[:])
```

Note that the shared `Array` (and its index) must be passed to the worker, not a single element: indexing the array in the parent yields a plain integer, which the child cannot write back through.
3. Use NumPy for vectorized operations (NumPy releases the GIL inside its C loops):

```python
import numpy as np

# float64 avoids int64 overflow: the sum of squares here is ~3.3e23
arr = np.arange(100_000_000, dtype=np.float64)
result = np.sum(arr * arr)  # runs in C; the GIL is released during the computation
```
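Because NumPy drops the GIL while it crunches arrays, even a *thread* pool can achieve real parallelism on large arrays. A minimal sketch (the chunk count and array size are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunk_sum_sq(a):
    # np.dot runs in C with the GIL released, so threads can overlap
    return float(np.dot(a, a))

data = np.arange(8_000_000, dtype=np.float64)
chunks = np.array_split(data, 4)
with ThreadPoolExecutor(max_workers=4) as ex:
    total = sum(ex.map(chunk_sum_sq, chunks))
```

This avoids the serialization cost of shipping large arrays to worker processes, which can dominate the runtime of a process-pool version.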
4. Consider Cython or native extensions for critical paths:

```python
# cython_gil.pyx
# cython: boundscheck=False, wraparound=False

def compute_cython(long n):
    cdef long long result = 0
    cdef long long i
    with nogil:
        for i in range(n):
            result += i * i
    return result
```
Prevention
- Check `sys.getswitchinterval()` to understand how often the interpreter switches between threads
- Use `concurrent.futures` with a clear separation: thread pools for I/O-bound work, process pools for CPU-bound work
- Benchmark with `time.perf_counter()` before and after threading changes
- Consider `multiprocessing.shared_memory` for sharing large data between processes
- For Python 3.13+, the free-threaded build (PEP 703) can eliminate the GIL entirely
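A minimal sketch of the `time.perf_counter()` benchmarking advice, comparing a thread pool and a process pool on the same CPU-bound function (the `bench` helper, pool sizes, and workload are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def compute(n):
    result = 0
    for i in range(n):
        result += i * i
    return result

def bench(executor_cls, n_tasks=4, n=1_000_000):
    # Time the same workload under the given executor type
    start = time.perf_counter()
    with executor_cls(max_workers=n_tasks) as ex:
        results = list(ex.map(compute, [n] * n_tasks))
    return time.perf_counter() - start, results

if __name__ == '__main__':
    t_threads, _ = bench(ThreadPoolExecutor)
    t_procs, _ = bench(ProcessPoolExecutor)
    print(f"threads:   {t_threads:.2f}s")
    print(f"processes: {t_procs:.2f}s")
```

On a multi-core machine the process-pool time should be substantially lower for this workload; if it is not, process startup and data-transfer overhead are dominating and the task is too small to parallelize this way.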