A deep dive into GIL, concurrency, and parallelism in Python

Symphony

August 7th, 2024

Python is a high-level, general-purpose programming language known for its simplicity and readability. It is dynamically typed and garbage-collected. Its versatility allows it to be used in various domains, including web development, data analysis, artificial intelligence, and automation. An extensive standard library and a vast ecosystem of third-party packages further contribute to its popularity.

Just like any language, Python defines a set of syntax rules and guidelines that govern its usage. These rules cover aspects such as syntax structure, naming conventions, reserved keywords, data types, control flow structures, functions, and modules. Together they form a language specification rather than a single program, so they can be, and are, implemented in different programming languages.

The primary implementation of the Python programming language is CPython, written in C, which serves as the default interpreter for Python code. Often, when we are talking about Python, we are referring to this implementation. Other implementations include Jython, written in Java, running on the Java Virtual Machine (JVM) for seamless integration with Java code; IronPython, implemented in C#, which enables Python code to interact with .NET libraries and applications; and PyPy, written in a subset of Python called RPython, which boasts a just-in-time (JIT) compiler for improved performance.

Note that this article focuses on the CPython implementation, and any mention of Python refers to CPython.

Executing a Python program typically involves two to three main phases, depending on how the interpreter is invoked:

Initialization - involves setting up necessary structures for the Python process, especially important for non-interactive command prompt execution
Compiling - entails converting the source code into a form the Python interpreter can understand, including creating syntax trees, symbol tables, and generating code objects
Interpreting - involves the interpreter executing the bytecode generated from the compiled code objects within a specific context, following the program's logic. When you execute a Python script or interact with Python code through an interpreter, the Python Virtual Machine (PVM) reads the bytecode and executes it, producing the desired output or performing specified actions. During code execution on the PVM, we encounter the famous or infamous Global Interpreter Lock (GIL)
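The compiling and interpreting phases can be observed from within Python itself. The following is a minimal sketch using the built-in compile() and exec() functions and the standard dis module; the source string and the variable names are made up for illustration.

```python
import dis

# Illustrative source text; any valid Python statement would do.
source = "total = sum(n * n for n in range(4))"

# Compiling: turn the source text into a code object that holds bytecode.
code_obj = compile(source, "<example>", "exec")

# Interpreting: execute the code object in an explicit namespace.
namespace = {}
exec(code_obj, namespace)

if __name__ == "__main__":
    # Show the bytecode instructions the interpreter executes.
    dis.dis(code_obj)
    print(namespace["total"])
```

The exact instructions printed by dis.dis() differ between CPython versions, so treat the disassembly as informative rather than stable.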

Before discussing the GIL, I'd like to address another important topic: concurrent and parallel execution/processing. Concurrent and parallel processing are often used interchangeably, but they refer to slightly different concepts in computing.

Concurrent processing means making progress on multiple tasks over the same period of time, without necessarily executing them at the same instant. A concurrent system may interleave the execution of tasks to give the appearance of parallelism. On a single-core system, this is done through techniques like multitasking or time-sharing, where the processor rapidly switches between tasks. On a multi-core system, concurrent tasks may additionally run at the same instant on different cores.

An example of concurrent processing is a web server handling multiple requests concurrently. Requests might appear to be processed simultaneously, but they are actually interleaved by the server's scheduler.

Parallel processing, on the other hand, involves executing multiple tasks at the same instant, typically by utilizing multiple processing units or cores. In parallel processing, each task is assigned to a separate processing unit or core, so the tasks truly run simultaneously rather than being interleaved.

Global Interpreter Lock (GIL)

You may ask yourself, what is GIL and why do we have it? By definition, the Global Interpreter Lock is a mechanism used in computer language interpreters to synchronize the execution of threads so that only one native thread (per process) can execute basic operations (such as memory allocation and reference counting) at a time. As a general rule, an interpreter that uses the GIL will see only one thread execute at a time, even if it runs on a multi-core processor.

While this might seem like a bottleneck for multi-threaded performance, it simplifies memory management and makes it easier to integrate with existing C libraries. Additionally, this approach increases the speed of single-threaded program execution because there is no need to acquire or release locks on all data structures separately.

In Python version 3.2, the GIL logic was improved. Threads attempting to acquire the GIL initially wait on a condition variable until it's released, with a timeout set to the switch interval. Upon waking up, the requesting thread checks if any context switches have occurred. If not, it sets a volatile flag, gil_drop_request, shared among all threads, indicating its request to release the GIL. This process repeats until the thread successfully acquires the lock, re-requesting GIL drop after each delay when a new thread obtains it.

The thread holding the GIL aims to release it during blocking operations or, within the evaluation loop, checks if gil_drop_request is set, releasing the GIL if true. This action wakes up waiting threads, relying on the operating system for fair scheduling among them.

This approach provides a means to limit the time a thread holds the GIL by delaying the setting of gil_drop_request. Additionally, it allows the evaluation loop to process bytecode instructions at its own pace, minimizing overhead when no other thread requests the GIL.

A more technical explanation of the GIL can be found in the documentation:

The GIL is just a boolean variable (gil_locked) whose access is protected by a mutex (gil_mutex), and whose changes are signaled by a condition variable (gil_cond). The gil_mutex is taken for short periods of time, and is therefore mostly uncontended. In the GIL-holding thread, the main loop (PyEval_EvalFrameEx) must be able to release the GIL on demand by another thread. A volatile boolean variable (gil_drop_request) is used for that purpose and is checked at every turn of the eval loop. That variable is set after a wait of interval microseconds on gil_cond has timed out. [Actually, another volatile boolean variable (eval_breaker) is used, which ORs several conditions into one. Volatile booleans are sufficient as inter-thread signaling means since Python is run on cache-coherent architectures only.] A thread wanting to take the GIL will first let a given amount of time pass (interval microseconds) before setting gil_drop_request. This encourages a defined switching period, but does not enforce it since opcodes can take an arbitrary time to execute. The interval value is available for the user to read and modify using the Python API sys.{get,set}switchinterval(). When a thread releases the GIL and gil_drop_request is set, that thread ensures that another GIL-awaiting thread gets scheduled. It does so by waiting on a condition variable (switch_cond) until the value of gil_last_holder is changed to something else than its own thread state pointer, indicating that another thread has taken GIL. This prohibits the latency-adverse behavior on multi-core machines where one thread would speculatively release the GIL, but still, run and end up being the first to re-acquire it, making the “timeslices” much longer than expected.
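The switch interval mentioned above can be read and changed at runtime. The sketch below uses sys.getswitchinterval() and sys.setswitchinterval(), which work in seconds rather than microseconds; the helper name with_switch_interval is hypothetical and only there to show how to restore the previous value.

```python
import sys

def with_switch_interval(seconds, func):
    """Run func() with a temporary switch interval, then restore the old one.

    Hypothetical helper for illustration; `seconds` is a float in seconds.
    """
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        return func()
    finally:
        # Always restore the interpreter-wide setting.
        sys.setswitchinterval(previous)

if __name__ == "__main__":
    print("current interval:", sys.getswitchinterval())
    print("inside helper:", with_switch_interval(0.01, sys.getswitchinterval))
```

The setting is interpreter-wide, so changing it affects every thread. A shorter interval makes threads switch more often at the cost of more overhead.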

In short, when a thread wants to run, it needs to take the GIL. Blocking I/O operations cause the GIL to be dropped so that another thread can be executed. This is called cooperative multitasking. If the running thread does not release the GIL on its own, it is signaled to drop it after the switch interval has elapsed. This is called preemptive multitasking. This mechanism is very important because a CPU-bound thread could otherwise hold on to the GIL for long stretches and starve the other threads.

When the interpreter starts up, a single main thread of execution is created. With no other threads around, there is no contention for the GIL, so the main thread never has to wait for it. The GIL comes into play after other threads are spawned. Spawned threads can be classified into one of two categories:

CPU bound - these threads use the CPU intensively, performing tasks such as mathematical calculations, sorting algorithms, or data processing. In CPU-bound scenarios, the thread spends most of its time performing computations and requires minimal input/output (I/O) operations
I/O bound - these threads frequently get blocked because of I/O operations, leaving the CPU idle. Examples include reading from or writing to files, network communication, or database queries. In I/O-bound scenarios, the thread spends most of its time waiting for input/output operations to complete, rather than performing computations
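The difference between the two categories can be sketched with a small timing experiment. This is an illustration only: time.sleep() stands in for real I/O, the function names are made up, and the actual timings depend on the machine and the Python build.

```python
import threading
import time

def io_bound_task(delay):
    """Simulated I/O: time.sleep() blocks and releases the GIL while waiting."""
    time.sleep(delay)

def cpu_bound_task(n):
    """Pure-Python computation: needs the GIL for every bytecode it executes."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_in_threads(target, args, count):
    """Start `count` threads running target(*args); return elapsed seconds."""
    threads = [threading.Thread(target=target, args=args) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    io_time = run_in_threads(io_bound_task, (0.2,), 4)
    cpu_time = run_in_threads(cpu_bound_task, (2_000_000,), 4)
    print(f"4 I/O-bound threads: {io_time:.2f}s")
    print(f"4 CPU-bound threads: {cpu_time:.2f}s")
```

On a standard GIL-enabled build, the four I/O-bound threads should finish in roughly the time of a single sleep, while the four CPU-bound threads take about as long as running the computation four times in a row.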

It is important to note that Python 3.13 introduces the option to build and use Python without the GIL (--disable-gil). While this is a promising development supported by major companies, it is still in its early stages. Therefore, proceed with caution if you plan to use it in your projects. For additional details, refer to PEP 703.
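To find out which kind of interpreter a program is running on, something like the following sketch can help. It assumes CPython: Py_GIL_DISABLED is the build-time configuration variable set for free-threaded builds, and sys._is_gil_enabled() is a private helper that exists only on Python 3.13 and later, so the code falls back to assuming the GIL is on when it is missing. The function name gil_status is hypothetical.

```python
import sys
import sysconfig

def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; 0 or None elsewhere.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() is only available on CPython 3.13+.
    # Assumption: when it is missing, the GIL is enabled.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}

if __name__ == "__main__":
    print(gil_status())
```

Even on a free-threaded build the GIL can be re-enabled at runtime, which is why the two values are reported separately.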

Multithreading

For operations with threads in Python, we use the threading module. This module is loosely based on Java’s threading model. However, whereas Java makes locks and condition variables the basic behavior of every object, they are separate objects in Python. Python’s Thread class supports a subset of the behavior of Java’s Thread class; currently, there are no priorities, no thread groups, and threads cannot be destroyed, stopped, suspended, resumed, or interrupted.

As mentioned above, the key class of the threading module is the Thread class, which represents an activity run in a separate thread of control. There are two ways to specify the activity: by passing a callable object to the constructor, or by overriding the run() method in a subclass. Once a thread object is created, its activity must be started by calling the thread’s start() method. This invokes the run() method in a separate thread of control.

Once the thread’s activity is started, the thread is considered “alive”. It stops being alive when its run() method terminates, either normally, or by raising an unhandled exception.

Other threads can call a thread’s join() method. This blocks the calling thread until the thread whose join() method is called is terminated. A thread can be flagged as a “daemon thread”. The significance of this flag is that the entire Python program exits when only daemon threads are left.
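The two ways of specifying a thread's activity, together with start() and join(), can be sketched as follows. The names record, RecordingThread, and run_both_styles are made up for this example.

```python
import threading

results = []
results_lock = threading.Lock()

def record(label):
    """Append a label to the shared list; the lock guards the shared state."""
    with results_lock:
        results.append(label)

class RecordingThread(threading.Thread):
    """Second style: subclass Thread and override run()."""

    def __init__(self, label):
        super().__init__()
        self.label = label

    def run(self):
        record(self.label)

def run_both_styles():
    """Run one thread per style and wait for both to finish."""
    results.clear()
    # First style: pass a callable (and its arguments) to the constructor.
    by_callable = threading.Thread(target=record, args=("callable",))
    by_subclass = RecordingThread("subclass")
    for t in (by_callable, by_subclass):
        t.start()   # invokes run() in a separate thread of control
    for t in (by_callable, by_subclass):
        t.join()    # blocks until that thread has terminated
    return sorted(results), by_callable.is_alive()

if __name__ == "__main__":
    print(run_both_styles())
```

A thread can be marked as a daemon by passing daemon=True to the constructor. In that case the program does not wait for it on exit, so it should not be holding resources that need a clean shutdown.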

When we combine the aforementioned GIL and threads, the execution is illustrated in the diagram below. The GIL permits only one thread to execute Python bytecode at a time, because a thread must acquire the GIL before it runs. A thread releases the lock while it waits for I/O operations (such as reading/writing data from/to disk). CPU-bound work written in pure Python keeps the lock until the interpreter asks the thread to drop it, although some C extensions, such as NumPy, release the GIL around long-running native computations (like vector/matrix multiplication).

In reality, we achieve cooperative computing rather than parallel computing. This cooperative approach improves CPU utilization and responsiveness in multi-threaded Python programs by letting other threads run while one thread waits on I/O or executes native code that has released the GIL. While it doesn't achieve true parallelism for Python bytecode, it can still improve throughput in many scenarios.

When your program involves tasks that primarily wait for input/output operations, such as reading from or writing to files, network communication, or database queries, threading can be highly beneficial. By using threads, you can overlap I/O operations, improving overall performance and responsiveness. Keep in mind that the GIL limits the parallelism of CPU-bound tasks, so for those you should consider alternative approaches like multiprocessing. Even the official threading documentation has a notice regarding this:

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
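As a rough illustration of the advice quoted above, a CPU-bound function can be handed to concurrent.futures.ProcessPoolExecutor. This is a sketch rather than a tuned solution; the function names and the worker count are arbitrary choices for the example.

```python
from concurrent.futures import ProcessPoolExecutor

def sum_of_squares(n):
    """CPU-bound work: pure-Python arithmetic with no I/O."""
    return sum(i * i for i in range(n))

def parallel_sums(inputs, max_workers=2):
    """Evaluate sum_of_squares for each input in separate worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of the inputs in its results.
        return list(pool.map(sum_of_squares, inputs))

if __name__ == "__main__":
    # The guard matters: with some start methods, worker processes
    # import this module again and must not re-run the pool setup.
    print(parallel_sums([10, 100, 1000]))
```

Each worker is a separate interpreter process with its own GIL, so the calls can run on different cores. For small inputs like these, the process startup cost will outweigh any gain.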

Multiprocessing

To bypass the GIL limitations, a commonly used method is to adopt a multiprocessing approach instead of multithreading. In multiprocessing, several processes are utilized rather than threads.

Python has a native multiprocessing package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively sidestepping the GIL by using subprocesses instead of threads.

When you're starting processes with the multiprocessing package, you have a few options. You can use “spawn”, which starts a fresh interpreter process. The child process will only inherit the resources necessary to run the process object’s run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Alternatively, you can use “fork”, which makes child processes identical to the parent and is available on UNIX-like systems. There's also “forkserver”, which sets up a server process to handle creating new processes. Whenever a new process is needed, the parent process connects to the server and requests that it fork a new process.

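Python 3.12 started emitting a DeprecationWarning when “fork” is used in a multi-threaded process, and its documentation announced that the default start method on POSIX systems would move away from “fork” in Python 3.14. To make sure your code behaves the same across versions and platforms, specify the start method you want explicitly.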
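One way to be explicit about the start method without touching the global default is multiprocessing.get_context(). The sketch below assumes the "spawn" method, which is available on all major platforms; the helper names square and run_with are made up.

```python
import multiprocessing as mp

def square(x):
    return x * x

def run_with(method, values):
    """Run square() in a pool created from an explicit start-method context."""
    ctx = mp.get_context(method)  # does not change the global default
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    # The start methods offered on this platform.
    print(mp.get_all_start_methods())
    print(run_with("spawn", [1, 2, 3]))
```

Using a context object keeps the choice local to the code that needs it. Calling multiprocessing.set_start_method() instead sets the default for the whole program and can only be done once.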

In some cases, we need to enable communication between processes (IPC). For this, the multiprocessing package provides a couple of mechanisms. These mechanisms allow processes to exchange data, coordinate their execution, and synchronize their activities. We will not go into details about all of them, just mention the three most important: pipes, queues, and manager objects.

A common method of communication is through pipes, which provide a channel for data flow between two processes. With the Pipe() function, Python developers can create a pair of connection objects representing the endpoints of the pipe. By default the pipe is duplex, so each end can both send and receive; passing duplex=False creates a unidirectional pipe in which the first connection can only receive and the second can only send.
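A small request/reply exchange over a pipe might look like the sketch below. Pipe() returns a two-way connection pair by default; the worker and ping functions are hypothetical names for this example.

```python
import multiprocessing as mp

def worker(conn):
    """Receive one message, reply with its upper-cased form, then close."""
    message = conn.recv()
    conn.send(message.upper())
    conn.close()

def ping(message):
    """Send a message to a child process over a pipe and return its reply."""
    parent_conn, child_conn = mp.Pipe()  # duplex by default
    process = mp.Process(target=worker, args=(child_conn,))
    process.start()
    parent_conn.send(message)
    reply = parent_conn.recv()
    process.join()
    return reply

if __name__ == "__main__":
    print(ping("hello"))
```

Objects passed through send() are pickled, so only picklable values can travel over the connection.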

Queues are thread- and process-safe data structures that support multiple producers and consumers. Processes can put items into the queue using the put() method and retrieve them using get(), enabling easy exchange of data between concurrent tasks.
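A producer/consumer exchange over a queue can be sketched as follows. Using None as an end-of-stream marker is a convention chosen for this example, not something the multiprocessing package prescribes, and the function names are made up.

```python
import multiprocessing as mp

SENTINEL = None  # assumption: None never appears as a real item

def producer(queue, items):
    """Put every item on the queue, then signal the end of the stream."""
    for item in items:
        queue.put(item)
    queue.put(SENTINEL)

def consume_all(items):
    """Start a producer process and collect everything it sends."""
    queue = mp.Queue()
    process = mp.Process(target=producer, args=(queue, items))
    process.start()
    received = []
    while True:
        item = queue.get()  # blocks until an item is available
        if item is SENTINEL:
            break
        received.append(item)
    process.join()
    return received

if __name__ == "__main__":
    print(consume_all(["a", "b", "c"]))
```

Draining the queue before calling join() is deliberate: a process that has put items on a queue may not exit until those items have been consumed.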

To facilitate sharing more complex Python objects, such as lists or dictionaries, between processes, Python offers Manager objects. A manager starts a server process that holds the Python objects and hands out proxies through which other processes can manipulate them. Manager objects provide a convenient way to share stateful data structures across multiple processes. Individual operations on a proxy are handled by the server one at a time, but compound operations such as read-modify-write still need an explicit lock.
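The following sketch shares a dictionary between processes through a manager. The word-counting task and the function names are invented for illustration; the lock is there because incrementing a counter through a proxy is a read followed by a write.

```python
import multiprocessing as mp

def count_words(shared_counts, lock, words):
    """Increment a per-word counter in the shared mapping."""
    for word in words:
        with lock:  # read-modify-write is not atomic through a proxy
            shared_counts[word] = shared_counts.get(word, 0) + 1

def count_in_processes(chunks):
    """Count words from several chunks, one process per chunk."""
    with mp.Manager() as manager:
        shared_counts = manager.dict()
        lock = manager.Lock()
        processes = [
            mp.Process(target=count_words, args=(shared_counts, lock, chunk))
            for chunk in chunks
        ]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        # Copy into a plain dict before the manager shuts down.
        return shared_counts.copy()

if __name__ == "__main__":
    print(count_in_processes([["a", "b"], ["a", "c"]]))
```

Every access to the shared dictionary is a round trip to the manager's server process, so this style suits modest amounts of shared state rather than hot loops.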

While using the multiprocessing package allows you to leverage multiple CPU cores more effectively, especially for CPU-bound tasks, it comes with additional costs compared to threading. Each created process has its own memory space. This means that if you're dealing with large amounts of data, multiprocessing will consume much more memory compared to threads, which share memory within the same process.

Inter-process communication (IPC) is typically more expensive than inter-thread communication. When using multiprocessing, data needs to be serialized and deserialized when passing between processes, which can introduce overhead, especially for large data structures.
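The serialization step can be made visible with the standard pickle module, which multiprocessing uses under the hood for objects sent between processes. The helper below is a hypothetical illustration of that round trip inside a single process.

```python
import pickle

def send_through_pickle(obj):
    """Serialize and deserialize obj, as happens when it crosses a process boundary.

    Returns the reconstructed copy and the size of the serialized payload in bytes.
    """
    payload = pickle.dumps(obj)
    return pickle.loads(payload), len(payload)

if __name__ == "__main__":
    data = list(range(100_000))
    copy, size = send_through_pickle(data)
    print(f"equal: {copy == data}, same object: {copy is data}, bytes: {size}")
```

The receiving side always gets a copy, never the original object, and both the time and the payload size grow with the data being sent.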

Creating a new process typically takes longer than creating a new thread due to the overhead involved in spawning a new process, which, depending on the start method, includes duplicating the parent process's memory space or starting and initializing a fresh interpreter. Multiprocessing also introduces additional complexity compared to threading, as you need to manage separate processes and potentially deal with synchronization and communication between them.

Summary

When deciding between multithreading and multiprocessing for task execution, it's essential to consider the nature of the task and the specific requirements of your application.

For I/O-bound tasks, such as reading from or writing to files, network communication, or database queries, threading is often a suitable choice. Threads can efficiently overlap I/O operations, improving overall performance and responsiveness without significant overhead.

On the other hand, multiprocessing is generally preferred for CPU-bound tasks that involve heavy computation. Unlike threads, which are limited by Python's GIL, separate processes can run in parallel on multiple CPU cores, effectively utilizing available hardware resources. This makes multiprocessing ideal for tasks like mathematical computations, data processing, or image manipulation, where parallelism can significantly improve performance.

About the author

Armin, an experienced Software Engineer from our Sarajevo Branch, excels in full-stack development with proficiency in Python, Java, Node.js, Angular, and JavaScript. He has successfully delivered N-tier applications for startups and multi-tenant microservices for corporates on both cloud and on-premises. Armin is known for adopting new technologies and producing high-quality, business-aligned software with a focus on code simplicity and clear separation of concerns.
