
Why do multiple **threads** in a single **Python** process execute sequentially on one CPU core despite the host system having multiple cores available?


Answer to the question

Python's Global Interpreter Lock (GIL) is a mutex that protects access to Python objects, ensuring that only one thread executes Python bytecode at any given moment. This design decision was made in CPython to simplify memory management and prevent race conditions on object reference counts. Consequently, even on multi-core processors, threads cannot run Python code in parallel; instead, they rapidly switch execution on a single core, making CPU-bound multithreading ineffective.
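The rapid switching mentioned above is governed by CPython's switch interval, which can be inspected and tuned at runtime (a small sketch; tuning it changes how often threads trade the lock, not whether they run in parallel):

```python
import sys

# CPython periodically asks the running thread to release the GIL so a
# waiting thread can take it; the default switch interval is 5 ms.
print(sys.getswitchinterval())  # 0.005 by default

# Tunable at runtime: a longer interval reduces switching overhead for
# CPU-bound threads, at the cost of responsiveness.
sys.setswitchinterval(0.01)
```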

```python
import threading
import multiprocessing
import time

def cpu_intensive_task(n):
    """Pure Python CPU-bound operation."""
    count = 0
    for i in range(n):
        count += i ** 2
    return count

if __name__ == "__main__":  # guard required for multiprocessing on spawn platforms
    # Demonstrating the threading limitation
    start = time.time()
    threads = [threading.Thread(target=cpu_intensive_task, args=(5_000_000,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"Threading time: {time.time() - start:.2f}s")
    # ~4x single-thread time due to GIL contention

    start = time.time()
    processes = [multiprocessing.Process(target=cpu_intensive_task, args=(5_000_000,))
                 for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f"Multiprocessing time: {time.time() - start:.2f}s")
    # ~1x single-thread time (close to a 4x speedup over threading)
```

A situation from practice

Problem: Our analytics platform needed to process 10GB of log files with complex regex extractions and statistical calculations. The engineering team implemented a threading-based worker pool using concurrent.futures.ThreadPoolExecutor with 16 threads on a 16-core server. Surprisingly, CPU usage remained at 6-7% (one core) and processing took 3 hours, while sequential processing took 45 minutes. The GIL was forcing sequential execution plus adding thread-switching overhead.

Solution 1: Optimized threading with C extensions. We evaluated moving the heavy computation into NumPy operations and other C-accelerated libraries that release the GIL during execution.

Pros: Minimal code changes required; shared memory eliminates serialization costs; lower memory footprint since threads share address space.

Cons: Limited to operations supported by NumPy; custom algorithms still require Python bytecode execution; debugging C extension interactions adds complexity.
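A minimal sketch of this approach (assuming NumPy is installed; the matrix size and thread count are illustrative): because NumPy's core loops run in C and release the GIL, several threads can genuinely occupy different cores.

```python
import threading
import numpy as np

def heavy_numpy_task(size):
    # NumPy releases the GIL inside its C-level loops, so threads
    # running this function can execute on separate cores in parallel.
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    return (a @ b).sum()

results = []
threads = [
    threading.Thread(target=lambda: results.append(heavy_numpy_task(400)))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 4
```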

Solution 2: Process-based parallelism with multiprocessing. We considered switching to multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor, each of which spawns separate Python interpreter processes.

Pros: True parallelism utilizing all CPU cores; linear scalability for CPU-bound tasks; isolation prevents GIL contention completely.

Cons: Higher memory usage (each process loads a separate Python interpreter, roughly 50-100MB); data must be pickled/unpickled for inter-process communication; process startup adds latency overhead.

Chosen Solution: We selected multiprocessing with chunked data processing. Logs were split into 16 segments, processed by ProcessPoolExecutor, and results merged. The chunking strategy minimized pickle overhead by reducing IPC frequency.
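The chunking strategy can be sketched as follows (a simplified illustration, not the production code: process_chunk stands in for the real regex and statistics work, the data is placeholder, and the worker count is reduced):

```python
from concurrent.futures import ProcessPoolExecutor

def process_chunk(lines):
    # Stand-in for the real per-chunk work (regex extraction, stats):
    # here we just count non-empty lines.
    return sum(1 for line in lines if line.strip())

def chunked(seq, n_chunks):
    """Split seq into n_chunks roughly equal contiguous segments."""
    size = max(1, len(seq) // n_chunks)
    return [seq[i:i + size] for i in range(0, len(seq), size)]

if __name__ == "__main__":
    log_lines = [f"event {i}" for i in range(10_000)]  # placeholder data
    with ProcessPoolExecutor(max_workers=4) as pool:
        # One task per chunk keeps pickling cheap: data crosses the
        # process boundary once per chunk, not once per line.
        total = sum(pool.map(process_chunk, chunked(log_lines, 4)))
    print(total)  # 10000
```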

Result: Processing time dropped from 3 hours to 4 minutes (45x speedup). CPU utilization reached 98% across all 16 cores. Memory usage increased by 800MB per process (12.8GB total), which was acceptable on our 128GB server. We implemented a process pool singleton to amortize startup costs across multiple batch jobs.

What candidates often miss


Why doesn't the GIL affect I/O-bound threading performance?

Many candidates mistakenly believe threads are useless in Python entirely. In fact, the GIL is released during blocking I/O operations (network requests, disk reads, database queries) and inside C extensions that explicitly release it (such as NumPy matrix operations). When a thread blocks on I/O, other threads can execute Python code. Thus, threading remains highly effective for concurrent I/O handling such as web scraping or serving many simultaneous connections (asyncio achieves similar I/O concurrency on a single thread via an event loop, without OS threads at all).
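This is easy to demonstrate with a sketch that simulates I/O waits: time.sleep releases the GIL just like a blocking socket read, so ten 0.2-second waits overlap instead of running back to back.

```python
import threading
import time

def fake_io_task(results, i):
    # time.sleep releases the GIL while waiting, just like a blocking
    # socket read or database query, so other threads run meanwhile.
    time.sleep(0.2)
    results.append(i)

results = []
start = time.time()
threads = [threading.Thread(target=fake_io_task, args=(results, i))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# Wall time is ~0.2s, not ~2s: the waits overlapped.
print(f"{elapsed:.2f}s for {len(results)} tasks")
```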


Do alternative Python implementations like PyPy or Jython have a GIL?

Candidates often assume removing the GIL is simply a matter of using a different interpreter. PyPy (the JIT-compiled Python) also implements a GIL to maintain thread safety, though its different object model can make thread switching more efficient. However, Jython (running on JVM) and IronPython (running on .NET CLR) do not have a GIL because they rely on the underlying virtual machine's garbage collection and threading primitives, enabling true thread-level parallelism on JVM threads.


Can you release the GIL manually without spawning new processes?

Many developers are unaware of manual GIL management in C extensions. When writing Cython or C code, you can explicitly release the GIL using the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros (or Cython's "with nogil:" block) around long-running computations that do not touch Python objects. Additionally, Python 3.12 introduced a per-interpreter GIL (PEP 684), allowing sub-interpreters with separate GILs within one process; in 3.12 this is exposed only through the C API (a standard-library interpreters module is proposed separately in PEP 734), and sub-interpreters do not share Python objects directly.
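As an illustrative fragment (not a complete, buildable extension module; the function name sum_squares is hypothetical), a CPython extension might release the GIL around a pure-C loop like this:

```c
#include <Python.h>

/* Hypothetical extension function: sum of squares below n.
   The GIL is released only around code that touches no Python objects. */
static PyObject *
sum_squares(PyObject *self, PyObject *args)
{
    long long n, i, total = 0;
    if (!PyArg_ParseTuple(args, "L", &n))
        return NULL;

    Py_BEGIN_ALLOW_THREADS      /* other threads may run Python here */
    for (i = 0; i < n; i++)
        total += i * i;
    Py_END_ALLOW_THREADS        /* re-acquire the GIL before returning */

    return PyLong_FromLongLong(total);
}
```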