Dave Beazley's mondo computer blog. [ homepage | archive ]

Friday, August 12, 2011


An Inside Look at the GIL Removal Patch of Lore

As most Python programmers know, people love to hate the Global Interpreter Lock (GIL). Why can't it simply be removed? What's the problem? However, if you've been around the Python community long enough, you might also know that the GIL was already removed once before--specifically, by Greg Stein who created a patch against Python 1.4 in 1996. In fact, here's a link:

This patch often gets mentioned in discussions regarding the GIL--especially as justification for keeping the GIL. For example, see the Python FAQ, the forum post It Isn't Easy to Remove the GIL or this mailing list discussion on Free Threading. These discussions usually point out that the patch made the performance of single-threaded applications much worse--so much so that the patch couldn't be adopted. However, beyond that, technical details about the patch are somewhat sketchy.

Despite using Python since early 1996, I will freely admit that I never really knew the details of Greg's GIL removal patch. For the most part, I just vaguely knew that someone had attempted to remove the GIL, that it apparently killed the performance of single-threaded apps, and that it subsequently faded into oblivion. However, given my recent interest in making the GIL better, I thought it might be a fun challenge to turn on the time machine, see if I could actually find the patch, and compile a version of GIL-less Python in order to take a peek under the covers.

So, in this post, I'll do just that and try to offer some commentary on some of the more interesting and subtle aspects of the patch--in particular, aspects of it that seem especially tricky or problematic. Given the increased interest in concurrency, the GIL, and other matters, I hope that this information might be of some use to others, or at the very least, help explain why Python still has a GIL. Plus, as the saying goes, those who don't study the past are doomed to repeat it. So, let's jump in.

Python 1.4

In order to play with the patch, you must first download and build Python-1.4. You can find it on python.org if you look long enough. After a bit of Makefile twiddling, I was able to build it and try a few things out on a Linux system.

Using Python-1.4 is a big reminder of how much Python has changed. It has none of the nice features you're used to using now (list comprehensions, sets, string methods, sum(), enumerate(), etc.). In playing with it, I realized that about half of everything I typed resulted in some kind of error. But I digress.

Thread support in Python-1.4 is equally minimal. The threading.py module that most people know doesn't yet exist. Instead, there is just the lower-level thread module which simply lets you launch a thread, allocate a mutex lock, and not much else. Here's a small sample:

import thread
import time

def countdown(n):
    while n > 0:
        print "T-minus", n
        n = n - 1
        time.sleep(1)

# Launch the function in its own thread
thread.start_new_thread(countdown, (10,))

Under the covers though, the implementation is remarkably similar to modern Python. Each thread consists of a C function that runs the specified Python callable and there is a GIL that guards access to some critical global interpreter state.

A Reentrant Interpreter?

If you read the threading.README file included in the patch, you will find this description:

These patches enable Python to be "free threaded" or, in other words, fully reentrant across multiple threads. This is particularly important when Python is embedded within a C program.

The stated goal of making Python "fully reentrant" is important so let's discuss.

All Python code gets compiled down to an intermediate "machine language" before it executes. For example, consider a simple function like this:

import time

def countdown(n):
    while n > 0:
        print "T-minus", n
        n -= 1
        time.sleep(1)
    print "Blastoff!"

You can view the underlying low-level instructions using the dis module.

>>> import dis
>>> dis.dis(countdown)
  2           0 SETUP_LOOP              48 (to 51)
        >>    3 LOAD_FAST                0 (n)
              6 LOAD_CONST               1 (0)
              9 COMPARE_OP               4 (>)
             12 POP_JUMP_IF_FALSE       50

  3          15 LOAD_CONST               2 ('T-minus')
             18 PRINT_ITEM          
             19 LOAD_FAST                0 (n)
             22 PRINT_ITEM          
             23 PRINT_NEWLINE       

  4          24 LOAD_FAST                0 (n)
             27 LOAD_CONST               3 (1)
             30 INPLACE_SUBTRACT    
             31 STORE_FAST               0 (n)

  5          34 LOAD_GLOBAL              0 (time)
             37 LOAD_ATTR                1 (sleep)
             40 LOAD_CONST               3 (1)
             43 CALL_FUNCTION            1
             46 POP_TOP             
             47 JUMP_ABSOLUTE            3
        >>   50 POP_BLOCK           

  6     >>   51 LOAD_CONST               4 ('Blastoff!')
             54 PRINT_ITEM          
             55 PRINT_NEWLINE       
             56 LOAD_CONST               0 (None)
             59 RETURN_VALUE        

Now, here's the critical bit--as a general rule, most low-level interpreter instructions tend to execute atomically. That is, while executing an instruction, the GIL is held, and no preemption is possible until completion.

For a large majority of interpreter instructions, the lack of preemption is rarely a concern because they execute almost immediately. However, every now and then, a program may execute a long-running operation. Typically this happens if an operation is performed on a huge amount of data or if Python calls out to long-running C code that doesn't release the GIL.

Here is a simple example you can try to see it. Launch the above countdown function in its own thread like this (using the legacy thread module).

>>> import thread
>>> thread.start_new_thread(countdown,(20,))
T-minus 20
T-minus 19

Now, while that is running, do this:

>>> max(xrange(1000000000))

If you do this, everything should just grind to a halt. You will see no output from the countdown thread at all and Ctrl-C will be frozen. In a nutshell, this is the issue with preemption (or lack thereof). Due to the GIL, it's possible for one thread to temporarily block the progress of all other threads on the system.

The lack of preemption together with the GIL presents certain challenges when Python is embedded into a multithreaded C program. In such applications, the Python interpreter might be used for high-level scripting, but also for things such as event handling and callbacks. For example, maybe you have some C thread that's part of a game or visualization package that's calling into the Python interpreter to trigger event-handler methods in real time. In such code, control flow might pass from Python, to C, back to Python, and so forth.

Needless to say, it's possible that in such an environment, you might have a collection of C or C++ threads that compete for use of the Python interpreter and are forced to synchronize on the GIL. This means that the interpreter might become a bottleneck of the whole system. If, somehow, you could get rid of the GIL, then any thread would be allowed to use the interpreter without worrying about other threads. For example, a C++ program triggering a Python event callback, wouldn't have to concern itself with other Python threads---the callback would simply run without being blocked. This is what you get by making the interpreter fully reentrant.

It is in this context of embedding that the GIL removal patch should probably be viewed. At the time it was created, significantly more Python users were involved with integrating Python with C/C++ applications. In my own area of scientific computing, people were using Python to build interactive data visualization tools which often involved heavy amounts of CPU computation. I knew others who were trying to use Python for internal scripting of commercial PC video games. For all of these groups, removal of the GIL (if possible) was viewed as desirable because doing so would simplify programming and improve the user experience (better responsiveness of the GUI, fewer stalls, etc.). If you've ever been sitting there staring at the spinning beachball on your Mac wishing it would just go away, well, then you get the general idea.

Exploring the Patch

If you download the free-threading patch, you will find that it is a relatively small set of files that replace and add functionality to Python-1.4. Here is a complete list of the modified files included in the patch:

listobject.h	pymutex.h	sysmodule.h
object.h	pypooledlock.h	threadstate.h

signalmodule.c	threadmodule.c

complexobject.c	intobject.c	longobject.c	stringobject.c
frameobject.c	listobject.c	mappingobject.c	tupleobject.c

Makefile.in	importdl.c	pythonrun.c	traceback.c
ceval.c		pymutex.c	sysmodule.c
errors.c	pypooledlock.c	threadstate.c

As you can see, it's certainly not a rewrite of the entire interpreter. In fact, if you run a diff across the Python-1.4 source and the patch, you'll find that the changes amount to about 1000 lines of code (in contrast, the complete source code to Python-1.4 is about 82000 lines as measured by 'wc').

I won't go into the details of applying or compiling the patch except to say that detailed instructions are included in the README should you want to build it yourself.

Initial Impressions

With the patch applied, I tried to do a few rough performance tests (note: I ran these under Ubuntu 8.10 on a dual-core VMWare Fusion instance running on my Mac). First, let's just write a simple spin-loop and see what happens:

import time

def countdown(n):
    while n > 0:
        n = n - 1

start = time.time()
countdown(10000000)      # time 10 million iterations
end = time.time()
print end-start

Using the original version of Python-1.4 (with the GIL), this code runs in about 1.9 seconds. Using the patched GIL-less version, it runs in about 12.7 seconds. That's about 6.7 times slower. Yow!

Just to further confirm that finding, I ran the included Tools/scripts/pystone.py benchmark (modified to run slightly longer in order to get more accurate timings). First, with the GIL:

$ python1.4g Tools/scripts/pystone.py
Pystone(1.0) time for 100000 passes = 3.09
This machine benchmarks at 32362.5 pystones/second

Now, without the GIL:

$ python1.4ng Tools/scripts/pystone.py
Pystone(1.0) time for 100000 passes = 12.73
This machine benchmarks at 7855.46 pystones/second

Here, the GIL-less Python is only about 4 times slower. Now, I'm just slightly more impressed.

To test threads, I wrote a small sample that subdivided the work across two worker threads in an embarrassingly parallel manner (note: this code is a little wonky due to the fact that Python-1.4 doesn't implement thread joining--meaning that you have to do it yourself with the included binary-semaphore lock).

import thread
import time

def countdown(n, lck):
    while n > 0:
        n = n - 1
    lck.release()        # Signal termination

lck_1 = thread.allocate_lock()
lck_2 = thread.allocate_lock()
lck_1.acquire()          # Prime the locks so the acquires below block
lck_2.acquire()
start = time.time()
thread.start_new_thread(countdown, (5000000, lck_1))
thread.start_new_thread(countdown, (5000000, lck_2))
lck_1.acquire()          # Wait for termination
lck_2.acquire()
end = time.time()
print end-start

If you run this code with the GIL, the execution time is about 2.5 seconds or approximately 1.3 times slower than the single-threaded version (1.9 seconds). Using the GIL-less Python, the execution time is 18.5 seconds or approximately 1.45 times slower than the single-threaded version (12.7 seconds). Just to emphasize, the GIL-less Python running with two-threads is running more than 7 times slower than the version with a GIL.

Ah, but what about preemption you ask? If you return to the example above in the section about reentrancy, you will find that removing the GIL does, indeed, allow free threading and long-running calculations to be preempted. Success!

Needless to say, there might be a few reasons why the patch quietly disappeared.

Under the Covers

Okay, the performance is terrible, but what is actually going on inside? Are there any lessons to be learned? A look at the source code and related documentation reveals all.

Capturing Thread State

For free-threading to work, each thread has to isolate its interpreter state and not rely on C global variables. The threading patch does this by defining a new data structure such as the following:

/* Include/threadstate.h */

typedef struct PyThreadState_s {
    PyFrameObject *		current_frame;		/* ceval.c */
    int				recursion_depth;	/* ceval.c */
    int				interp_ticker;		/* ceval.c */
    int				tracing;		/* ceval.c */

    PyObject *			sys_profilefunc;	/* sysmodule.c */
    PyObject *			sys_tracefunc;		/* sysmodule.c */
    int				sys_checkinterval;	/* sysmodule.c */

    PyObject *			last_exception;		/* errors.c */
    PyObject *			last_exc_val;		/* errors.c */
    PyObject *			last_traceback;		/* traceback.c */

    PyObject *			sort_comparefunc;	/* listobject.c */

    char			work_buf[120];		/* <anywhere> */
    int				c_error;		/* complexobject.c */
} PyThreadState;

Essentially, all global variables in the interpreter have been picked up and moved into a per-thread data structure. Some of these values are obvious candidates such as exception information, tracing, and profiling hooks. Other parts are semi-random. For example, there is storage for the compare callback function used by list sorting and a global error handling variable (c_error) used to propagate errors across internal functions in the implementation of complex numbers.

To manage multiple threads, the interpreter builds a linked-list of all active threads. This linked list contains the thread-identifier along with the corresponding PyThreadState structure. For example:

/* Python/threadstate.c */

typedef struct PyThreadStateLL_s {
    long                        thread_id;
    struct PyThreadStateLL_s *  next;
    PyThreadState               state;
} PyThreadStateLL;

Whenever a thread wants to get its per-thread state information, it simply calls a function PyThreadState_Get(). This function scans the linked list searching for the caller's thread identifier. When found, the matching thread state structure is moved to the front of the linked list and returned. Here is a short excerpt from the list implementation that shows a typical use:

/* Objects/listobject.c */
static int
cmp(v, w)
	const ANY *v, *w;
{
	object *t, *res;
	long i;
	PyThreadState *pts = PyThreadState_Get();

	if (err_occurred())
		return 0;

	if (pts->sort_comparefunc == NULL)
		return cmpobject(* (object **) v, * (object **) w);

	/* Call the user-supplied comparison function */
	t = mkvalue("(OO)", * (object **) v, * (object **) w);
	if (t == NULL)
		return 0;
	res = call_object(pts->sort_comparefunc, t);

Bits and pieces of the thread state code live on in Python 3.2 today. In particular, per-thread state is captured in a similar data structure and there are functions for obtaining the state. In fact, there is even a function called PyThreadState_Get().
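To see how such a lookup behaves, here is a tiny model of the move-to-front list in modern Python (my sketch with invented names--the real code is C):

```python
import threading

class ThreadStateList:
    # Toy model of the patch's linked list of per-thread state.
    # Nodes are [thread_id, state, next]; a found node is moved to
    # the front so repeated lookups by the same thread terminate
    # immediately at the head.
    def __init__(self):
        self._head = None

    def get(self):
        tid = threading.get_ident()
        prev, node = None, self._head
        while node is not None:
            if node[0] == tid:
                if prev is not None:          # move-to-front
                    prev[2] = node[2]
                    node[2] = self._head
                    self._head = node
                return node[1]
            prev, node = node, node[2]
        # First lookup from this thread: create fresh state at the head
        node = [tid, {"recursion_depth": 0, "last_exception": None}, self._head]
        self._head = node
        return node[1]

tsl = ThreadStateList()
tsl.get()["recursion_depth"] = 1
print(tsl.get()["recursion_depth"])           # 1
```

Note that the lookup itself is O(n) in the number of threads; the move-to-front heuristic is what makes it cheap for the common case of one thread calling repeatedly.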

Fine-Grained Locking of Reference Counting

Memory management of Python objects relies on reference counting. In the C API, there are macros for manipulating reference counts:

/* Include/object.h */
#define Py_INCREF(op) (_Py_RefTotal++, (op)->ob_refcnt++)
#define Py_DECREF(op) \
        if (--_Py_RefTotal, --(op)->ob_refcnt != 0) \
                ; \
        else \
                _Py_Dealloc(op)

These macros, along with their NULL-pointer safe variants Py_XINCREF and Py_XDECREF are used throughout the Python source. A quick search of the Python-1.4 source reveals about 250 uses.

With free-threading, reference counting operations lose their thread-safety. Thus, the patch introduces a global reference-counting mutex lock along with atomic operations for updating the count. On Unix, locking is implemented using a standard pthread_mutex_t lock (wrapped inside a PyMutex structure) and the following functions:

/* Python/pymutex.c */
PyMutex * _Py_RefMutex;

int _Py_SafeIncr(pint)
    int *pint;
{
    int result;

    PyMutex_Lock(_Py_RefMutex);
    result = ++*pint;
    PyMutex_Unlock(_Py_RefMutex);
    return result;
}

int _Py_SafeDecr(pint)
    int *pint;
{
    int result;

    PyMutex_Lock(_Py_RefMutex);
    result = --*pint;
    PyMutex_Unlock(_Py_RefMutex);
    return result;
}

The Py_INCREF and Py_DECREF macros are then redefined to use these thread-safe functions.

On Windows, fine-grained locking is achieved by redefining Py_INCREF and Py_DECREF to use InterlockedIncrement and InterlockedDecrement calls (see Interlocked Variable Access [MSDN]).

It must be emphasized that, on Unix, a simple reference-count manipulation has been replaced by no fewer than three function calls, plus the overhead of the actual locking. It's far more expensive.
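Here is a rough model of the scheme in modern Python (my sketch, not the patch's C code): every reference-count update on every object funnels through one global lock.

```python
import threading

# One global lock guarding *all* reference counts, standing in for
# the patch's _Py_RefMutex. (An illustrative model only.)
_ref_mutex = threading.Lock()

def safe_incr(obj):
    with _ref_mutex:
        obj["refcnt"] += 1
        return obj["refcnt"]

def safe_decr(obj):
    with _ref_mutex:
        obj["refcnt"] -= 1
        return obj["refcnt"]

# Four threads hammering one object's count: the total stays exact,
# but every single update serializes on the one lock.
obj = {"refcnt": 0}
threads = [threading.Thread(target=lambda: [safe_incr(obj) for _ in range(10000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(obj["refcnt"])        # 40000
```

The count comes out exact, which is the whole point--but so does the contention: every INCREF/DECREF anywhere in the interpreter competes for the same mutex.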

As a performance experiment, I decided to comment out the PyMutex_Lock and PyMutex_Unlock calls and run the interpreter in an unsafe mode. With this change, the performance of my single-threaded 'spin-loop' dropped from 12.7 seconds to about 3.9 seconds. The threaded version dropped from 18.5 seconds to about 4 seconds. [ Note: corrected due to an unnoticed build-error when trying this experiment initially ].

Clearly fine-grained locking of reference counts is the major culprit behind the poor performance, but even if you take away the locking, the reference counting performance is still very sensitive to any kind of extra overhead (e.g., function call, etc.). In this case, the performance is still about twice as slow as Python with the GIL.

Locking of Mutable Builtins

Mutable builtins such as lists and dicts need their own locking to synchronize modifications. Thus, these objects grow an extra lock attribute per instance. For example:

/* Include/listobject.h */
typedef struct {
	PyObject_VAR_HEAD
	PyObject **ob_item;
	PyPooledLock *ob_lock;	/* lock drawn from the shared pool */
} PyListObject;

Virtually all underlying methods (append, insert, setitem, getitem, repr, etc.) then use the per-instance lock under the covers.

An interesting aspect of the implementation is the way that locks are allocated. Instead of allocating a dedicated mutex lock for each list or dictionary, the interpreter simply keeps a small pool of available locks. When a list or dict first needs to be locked, a lock is taken from the pool and used for as long as it is needed (until the instance is no longer being manipulated by any threads). At that point, the lock is simply released back to the pool.

This scheme greatly reduces the number of needed locks. Generally speaking, the number of locks is about the same as the number of active threads. Deeply nested data structures (e.g., lists of lists of lists) may also increase the number of locks needed if certain recursive operations are invoked. For example, printing a deeply nested data structure might cause a lock to be allocated for each level of nesting.

Locking of mutable containers does not involve anything more sophisticated than a mutex lock. For example, no attempt has been made to utilize reader-writer locks.
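As a rough illustration of the pooling idea, here is a small sketch in modern Python (the class and names are mine, not the patch's): locks are borrowed from a shared free list while an object is in use and returned when the last holder lets go.

```python
import threading

class LockPool:
    # Toy model of pooled locks: an object borrows a lock from the
    # pool while any thread holds it, and the lock is recycled once
    # the last holder releases it.
    def __init__(self):
        self._free = []            # idle locks, reused across objects
        self._in_use = {}          # id(obj) -> [lock, holder_count]
        self._guard = threading.Lock()

    def acquire(self, obj):
        with self._guard:
            entry = self._in_use.get(id(obj))
            if entry is None:
                lock = self._free.pop() if self._free else threading.Lock()
                entry = self._in_use[id(obj)] = [lock, 0]
            entry[1] += 1
        entry[0].acquire()

    def release(self, obj):
        with self._guard:
            entry = self._in_use[id(obj)]
            entry[0].release()
            entry[1] -= 1
            if entry[1] == 0:      # last holder: lock goes back to the pool
                del self._in_use[id(obj)]
                self._free.append(entry[0])

pool = LockPool()
nums = [1, 2, 3]
pool.acquire(nums)                 # e.g., around a mutation
nums.append(4)
pool.release(nums)
print(len(pool._free))             # 1: the lock went back to the pool
```

This captures why the lock count tracks the number of active threads rather than the number of containers: an idle list holds no lock at all.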

Locking of Various Internal Operations

In various places throughout the interpreter, there are low-level book-keeping operations, often related to memory management and optimization. Their implementation often relies upon the use of unsafe C static variables.

For this, the patch defines a mutex lock dedicated to short critical sections, along with a pair of C macros for locking and unlocking it:

/* Include/pymutex.h */
#define Py_CRIT_LOCK()          PyMutex_Lock(_Py_CritMutex)
#define Py_CRIT_UNLOCK()        PyMutex_Unlock(_Py_CritMutex)

As an example of use, consider the int object. Due to the fact that integers are used so frequently, the underlying C code uses a custom memory allocator and other tricks to avoid excessive calls to C's malloc and free functions. Here is an example of the kind of thing you see in the patch (changes highlighted):

/* Objects/intobject.c */

static intobject *volatile free_list = NULL;

object *
newintobject(ival)
	long ival;
{
	register intobject *v;
	Py_CRIT_LOCK();
	if (free_list == NULL) {
		if ((free_list = fill_free_list()) == NULL) {
			Py_CRIT_UNLOCK();
			return err_nomem();
		}
	}
	v = free_list;
	free_list = *(intobject **)free_list;
	Py_CRIT_UNLOCK();
	...

A quick analysis of the patch shows that there are about 20 such critical sections that have to be protected. This includes low-level code in the implementation of ints, tuples, dicts as well as code generally related to the runtime of the interpreter (e.g., module imports, signal handling, sys module, etc.).

Although all of these sections share the same lock, the associated overhead appears to be negligible compared to that of reference counting.
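The pattern being protected is easy to model in modern Python (a toy sketch with invented names, not the patch's code): a shared free list that every allocation and release must touch inside one critical section.

```python
import threading

_crit = threading.Lock()        # stands in for Py_CRIT_LOCK()/Py_CRIT_UNLOCK()
_free_list = []                 # recycled "int objects"

def alloc_int(value):
    # Pop a recycled object if one exists, else make a new one;
    # the free list itself is only touched inside the critical section.
    with _crit:
        box = _free_list.pop() if _free_list else {}
    box["value"] = value
    return box

def free_int(box):
    with _crit:
        _free_list.append(box)

a = alloc_int(1)
free_int(a)
b = alloc_int(2)
print(b is a)                   # True: the allocation was recycled
```

Without the lock, two threads could pop the same node or corrupt the list; with it, every int allocation pays a small synchronization toll.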

Other Tricky Bits

Although the patch addresses some fundamental issues needed to make free-threading work, it is only a small start. In particular, no effort has been made to verify the thread-safety of any standard library modules. This includes a large body of C code that would have to be audited in detail to identify and fix potential race conditions.

Certain parts of the Python implementation also remain problematic. For example, certain low-level C API functions such as PyList_GetItem() and PyDict_GetItem() return borrowed references (e.g., objects without an increased reference count). Although it seems remote, there is a possibility that such functions could return a reference to an object that then gets destroyed by another thread.

Final Words

Looking at the patch has been an interesting trip through history, but is there anything to learn from it? This is by no means an exhaustive list, but a few thoughts come to mind:

That's about it for now. I hope you found this trip through time interesting. In a future installment, I'll explore the problem of pushing locked reference counting as far as it can possibly go. As a preview, a simple patch involving less than a dozen lines of code makes the whole GIL-less Python run more than twice as fast. However, can it go even faster? Stay tuned.

A bitbucket or github repo containing Python 1.4 with the free threading patch applied and build fixes for modern systems would be greatly appreciated! (Doubly so if it includes the future experiments you allude to in improving performance.)

Thanks for a fantastic article!
You are right that reference counting is a problem for multi-threaded performance, but mutexes are not the problem. If it were, we could just as well use atomic read/write when accessing them.

The major problem in this context is "false sharing". As one processor updates a reference count, the cache line containing it is dirtied for all other processors using it. Thus, they must stop whatever they are doing to synchronize their cache with RAM. Even if we take the GIL away, "false sharing" will act as a hidden GIL behind the scenes.

There are three possible solutions:

- Use another GC mechanism, i.e. a tracing garbage collector similar to Java.

- Use another threading model, i.e. no shared objects (message-passing only).

- Use processes instead of threads for parallel computing (e.g. the multiprocessing module).

Any serious number crunching (which is the only situation for which the GIL really matters) is likely to use C or Fortran performance libraries. They can release the GIL or spawn threads internally. Using Python gives a 200x to 1000x performance hit compared to C for compute-bound tasks. So there are good reasons for considering using libraries for this. Using libraries for speed-critical bottlenecks has a much larger potential for speed-ups than multi-threading, and it also takes the GIL problem with threads away.

Cython gives us manual control over the GIL if we need it. Thus, porting performance-critical parts from Python to C is not very tedious. The most recent Cython also supports OpenMP for parallel computing.

The GIL is also no problem if we use multiple processes (e.g. multiprocessing, os.fork or MPI). In addition we avoid any contention from shared memory. Most people write lousy multi-threaded programs, so processes nearly always scale better.

Nobody likes the GIL, but it does not do what most of the FUD claims. Python is used on some of the world's largest parallel computers. It's merely that the GIL is not well understood in the Python community. It does not prevent us from using more than one CPU. It prevents threads from concurrently accessing the CPython interpreter. But processes can still run in parallel (they don't share a Python interpreter, nor do they share a GIL) and libraries can still spawn threads or release the GIL while they do lengthy tasks.

It's time to stop all the uneducated FUD about Python's GIL. It's annoying, but it does not do what most people think. Python threads are not fibers, nor do they behave like fibers, and Python programs can use multi-core CPUs.
This isn't FUD. Python's lack of preemption means it is not possible to write a reliable low latency server (really important for web services) that has a single shared Python interpreter in the critical path of serving any response.

This isn't just about being able to consume multiple cores and scale raw computations up. It limits your design choices for user facing services.

It is quite easy to write a python statement with GIL-held execution time determined by the data itself that works fine in most cases but explodes and causes a painfully slow unresponsive server in others.

The defense to that kind of problem in today's CPython VM is to destroy the beauty of the language and always code in size checks in front of any operation, avoid generators, etc.

Or give up and use multiple processes or, more likely, give up and don't write your server in Python.

If you think the purpose of this article is to spread FUD about the GIL, then you obviously don't know me very well. The whole point of this post was to simply look at a piece of interesting historical work.


P.S. I'm not disagreeing with you, but I practically invented the use of Python with Supercomputers back in 1996 (and presented it at the Python conference). As such, I really don't need to be told about it now. Thanks.
Having preemptible instructions, even without any performance gain, would already be very useful.

An example of the data visualisation you mention is IPython, which uses PyOS_InputHook to try and provide interactivity with Matplotlib plots. Only, when you execute a long-running for-loop, the plot is frozen.
> Python's lack of preemption means it is not possible to write a reliable low latency server (really important for web services) that has a single shared Python interpreter in the critical path of serving any response.

Not true. Consider Twisted and other asynchronous networking libraries. You must write your application in a non-blocking style (libraries aid this, but they are not magic, discipline must be used). Anything that is long running must either be some C library that you have verified releases the GIL (C libraries for database access do this, etc.) and have at least one thread for using such C libraries, or you must make use of at least one other process running another Python run-time.

And, people do this. People ship today making use of asynchronous coding in Python as described above. It is only willful ignorance that prevents anyone from communicating with communities that ship such applications to ask for help.

That said, it is fascinating to read an exploration of the exact engineering trade-offs (and historical baggage to be dealt with) involved in implementing other ways for ensuring robust efficient multi-threaded code.

So far, it seems that if the CPython interpreter was designed to be able to drop in other garbage-collectors, that would have facilitated more options for multi-threaded code. But I would not trade really-existing CPython (a version exists for anyplace OS threads are supported) for a speculative perfect CPython (existing only in comment threads).
No threading on the superboard, huh? I guess that's why you had to fall back to plan B (the mac)! Poor superboard. Poor, poor superboard.
Hey, for me the GIL is not a problem and isn't a bottleneck; for my purposes the current speed of Python and the threading module is acceptable. For me the problem is crashes of my app with embedded Python... I know the crashes happen because of my potentially wrong usage of the C API... cases like mine are one more source of FUD: people get errors, people go and blame the GIL. The current Python C API docs are very declarative; that's good, but I'm experiencing a lack of tutorials about usage of the C API... Dave mentioned the holy-grail features, "event handling and callbacks", which I'm unsuccessfully attempting to address.

very nice article, and research. Looking forward to the next installments.

There are modern methods in which atomic reference counting is also very fast. Qt and the Linux kernel prove that atomic reference counting can be very fast.

A side note... if objects are only accessed by the same thread then you don't have the cache issues. Note, any language which accesses the same object from multiple cpus will be slower. It's best if you limit the contention on data, and that can be done at the application level design of programs.
Has anyone tried using lock-free semantics when removing the gil? Seems like a CAS would be better suited to reference counting than the windows synchronization primitives.
I bet you're going for STM for reference count, it will eliminate all the locking in typical case.

With Python 2.x, we only have cooperative multitasking with Python threads. We do have preemptive multitasking when using multiprocessing on operating systems that support it (e.g. Windows, Linux, FreeBSD).

With multiprocessing we have multiple Python interpreters running in parallel, each with its own GIL, but is that really a problem?

Using multiple processes instead of threads is usually a good design. Cf. the Apache web server, Chrome web browser, and IE 9. Multiple Python processes not only give us pre-emptive multitasking, they give less contention (no shared objects) and no "false sharing" (unless we explicitly use shared memory for IPC). Multiple processes are also inherently safer, particularly for network clients and servers. On the world's largest supercomputers, multiple processes (MPI) is the only game in town. Granted, many of them are clusters, but even on SMPs multiple processes (MPI) tend to beat the crap out of multithreading (OpenMP) performance-wise. That is due to what threads and their shared memory tend to do to cache consistency. Instead of computing, the CPUs get busy synchronizing cache with RAM (the shared memory), as all CPUs are reading and writing to the same memory. In the worst case, the memory bus will be jammed and we are better off using a single thread. With multiple processes we don't have to think about that, unless we then use shared memory.
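As a minimal sketch (illustrative only, written for modern Python), here is the article's countdown split over two worker processes, each interpreter with its own GIL:

```python
import multiprocessing

def countdown(n):
    # Pure CPU-bound work: the worst case for threads under the GIL
    while n > 0:
        n -= 1
    return n

if __name__ == "__main__":
    # Two processes, two interpreters, two GILs: the work is not
    # serialized by any shared lock.
    with multiprocessing.Pool(2) as pool:
        results = pool.map(countdown, [5000000, 5000000])
    print(results)              # [0, 0]
```

The only cost is pickling the arguments and results across the process boundary, which is negligible here.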

IPC between processes (pipes, Unix domain sockets, Windows named pipes) has an overhead which is hardly more than a memcpy in the kernel. Shared memory for IPC has the same overhead as a memcpy in user space. Multiple processes can even synchronize with a spinlock on a shared memory address, and it is not slower than multiple threads spinlocking on a shared variable. In both cases it's just an atomic read or write to a volatile variable, and a Sleep(0) (in Windows API lingo) to release the remainder of the time slice. I don't know who started the rumor that "IPC is slow", but he should have been flogged. IPC is very fast. The major communication overhead in Python's multiprocessing module is the Queue serializing Python objects with pickle. But if we need fast IPC, we can tailor a fast queue to our needs. An example is the IPC in MPI, where mpi4py uses shared memory (or whatever the MPI implementation uses) to send NumPy arrays between processes.

Another way to construct low-latency servers using Python 2.x is asynchronous design. Twisted is a huge library for that. There are other options as well, such as poll/select on Linux (in Python standard lib) or I/O completion ports on Windows (e.g. using pywin32).
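A toy taste of that style using nothing but the standard library (my names, not Twisted's): select() reports readiness, so a single thread never blocks inside recv().

```python
import select
import socket

def echo_once(server, client, payload):
    # Send from the client side, then use select() to wait for
    # readiness instead of blocking in recv().
    client.sendall(payload)
    readable, _, _ = select.select([server], [], [], 5.0)
    assert server in readable
    server.sendall(server.recv(1024))       # echo it back
    readable, _, _ = select.select([client], [], [], 5.0)
    assert client in readable
    return client.recv(1024)

# A socketpair stands in for a real client connection.
a, b = socket.socketpair()
print(echo_once(b, a, b"ping"))             # b'ping'
```

A real server would keep a list of many sockets in the select() call and dispatch whichever ones are ready; the point is that no single slow peer can stall the loop.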

The "new GIL" in Python 3.x is actually designed to allow pre-emptive multitasking with threads. That means, the intention was to eventually allow one thread to preempt another from the interpreter. I don't remember if preemptive multitasking is implemented yet (it's been a couple of years since I looked at it), but it was one of the things discussed when the new GIL for Python 3 was designed. Another issue was to prevent "GIL battles" by reducing contention for the GIL. Preemptive multitasking with Python threads is coming to Python, even with the GIL. It is the implementation of the GIL in Python 2 that mandated cooperative multitasking for Python threads, not a "global interpreter lock" per se. Even if the interpreter only allows one thread, there is nothing that dictates this thread cannot be preempted by a higher-priority thread. What we need for that is a way to let a higher-prioritized thread steal the GIL. And for this purpose, the GIL is no longer implemented as a simple mutex; it is a much more complex structure involving an event object. Introducing preemptive multitasking for Python threads is much easier than removing the GIL. The new GIL in Python 3 is designed to make this possible. And if it's not there yet, it is coming to future versions of CPython.

It is not that easy, because a single heap allocator (libc malloc) means that objects can lie in the same cache line even though they aren't shared between threads. To reduce this kind of contention (false sharing), each Python thread would need to allocate from a private heap.

This is a common design pattern on Windows (cf. the HeapCreate/HeapAlloc API calls) but not on Linux/Unix, because fork() has traditionally been used instead of threads. Implementing multiple heaps on Unix would mean more work than on Windows, because Unix applications traditionally just use one global heap.

Yes, it is possible to design Python's memory allocator to use multiple heaps (e.g. one per thread). It would make it easier to finally remove the GIL. It would even improve multithreaded performance in the presence of the GIL. But currently, Python threads share the global process heap through malloc.

Thread-local heaps are one of the things I mentioned in the recent discussion on the python-dev list.

Also, it was a bit futile to argue for a Windows design pattern on python-dev, where almost everybody uses Linux...

With all due respect, your description of the new Python 3 GIL is pretty far off the mark. For one, it does not address any of the low-level preemption issues described in this post (i.e. having fully preemptable interpreter ticks). As such, it does not simplify removal of the GIL in any way whatsoever.

The only real difference between the Python 2 and Python 3.2 GIL is a change in the mechanism used to preempt CPU-bound threads. In Python 2, it's based on interpreter ticks (e.g., thread switching after every 100 ticks). In Python 3.2, it's based on time (e.g., thread switching after every 5ms). The latter approach solves some multicore GIL-battle issues, but other than that, the two GILs are virtually identical in what they do.
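The time-based mechanism is directly visible from Python 3.2 onward, where the tick-based sys.setcheckinterval() is superseded by sys.setswitchinterval(). A small sketch (the function is mine, just wrapping the standard calls):

```python
import sys

def switch_interval_roundtrip(seconds):
    """Set the Python 3.2+ GIL switch interval and read it back."""
    old = sys.getswitchinterval()      # default is 0.005 (5 ms)
    sys.setswitchinterval(seconds)
    new = sys.getswitchinterval()
    sys.setswitchinterval(old)         # restore the previous setting
    return new
```

Lowering the interval makes CPU-bound threads hand off the GIL more often, trading throughput for responsiveness; raising it does the opposite.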

Also, the "preemption" you refer to with the new Python 3 GIL pertains to a possible priority scheme that allows I/O bound threads to preempt CPU-bound threads. That is certainly a useful thing to have, but it's not in any way related to the low-level preemption issues being discussed in the blog post.
Dave, please read this:


Particularly the part about 'forced thread switching' and 'priority requests'. While not implemented in Python yet, these are the building blocks for preemptive thread scheduling.

That was later discussed on python-dev, and it is one of the reasons the 'new GIL' is designed with 'time intervals' and 'wake-up calls'.

As I remember, Antoine removed preemptive 'priority requests' from the 'new GIL' in Python 3.2. That doesn't mean 'priority requests' cannot be introduced in a later version after more thorough testing.

I am quite familiar with all of the discussion, implementation, and other details of the new GIL. The forced preemption and priorities you're talking about have nothing whatsoever to do with making a fully reentrant interpreter or preemptible ticks as discussed in this post. Sorry.
Guido recently posted this message that describes, in more detail, problems with C and embedding. http://mail.python.org/pipermail/python-dev/2011-August/112854.html.

This is the primary issue that the GIL removal patch seems to be addressing (i.e., full reentrancy of the interpreter).
This is a fantastic post, Dave! Thanks for digging in and writing it up.

Some context for why I did this may explain some of the choices: we were writing a Python-based server on Windows. Thus, we were pretty much forced into threads for performance (rather than processes, as Sturla recommends). This server was embedded within a C-based server (which ran as an NT Service) and communicated with an upstream IIS. When a request arrived, it was passed down into the Python application. Each of these requests was independent, so I did not have to worry about objects shared across threads (with the resulting issues, as you note). I was concerned about allocating a mutex for every Python object (each maps to a HANDLE in Windows), so I came up with the pools to minimize the number of active locks.

On the performance, note that Windows' Interlocked primitives are pretty darned fast. Non-Windows atomic primitives were not readily available back in 1996, so the portable solution needed to use pthread mutexes (to great harm, as you noticed). Today, atomics are pretty universal (even built into recent GCCs), so the INCREF/DECREF could switch to those with great benefit.

In our performance testing, we noted that one core was slower, but two cores were faster than a GIL-based Python. But performance on 3 and 4 cores was slower and heading downwards pretty fast. My conclusion was lock contention somewhere, but I didn't have time to find out where. Recall that we speak of "cores" today, but in 1996 these were multiprocessor machines. Not super common, but the primary point was to provide a reliable response across requests rather than make each as fast as possible. Our application was generally network- and database-bound. The performance of the interpreter was a small portion of the overall mix, so its speed wasn't super critical.

Guido picked up the ThreadState concept and incorporated it into Python 1.5.2 (i.e., long before your observation of the state in 3.x :-). That is also where we got sys.exc_info(), since sys.last_* has race conditions accessing your per-thread exception information.
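A quick sketch of why the per-thread behavior matters (the helper names here are mine): each thread reading sys.exc_info() sees only the exception it raised itself, where a single global sys.last_*-style value would race between threads:

```python
import sys
import threading

def record_exception(results, key):
    try:
        raise ValueError(key)
    except ValueError:
        # sys.exc_info() reads per-thread state, so each thread sees
        # only its own exception here -- no cross-thread races.
        results[key] = sys.exc_info()[1].args[0]

def run_threads(n):
    results = {}
    threads = [threading.Thread(target=record_exception, args=(results, i))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a global exception slot, a thread could easily observe another thread's ValueError instead of its own.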

Removing the GIL in modern times will be much, much harder. There are many more pools of objects (like the ints and tuples that I had to deal with). At one point, I thought about a generalized facility for those caches. *shrug*

For extension modules, you could allocate a flag that indicates it is "safe" to run within a GIL-less Python. The free-threading interpreter could refuse to load any module that has not been specifically marked. I don't have a ready answer for Python modules, however. I might suggest an import hook that whitelists reviewed modules.

Again... thanks for writing up this post. Wonderful stuff!

(and feel free to reach me at gstein at gmail, if you've got any questions; I'll try to keep an eye on this post's comments for a while, too...)
I do appreciate the work you did for this article. And I like the comments from the others and you that help clarify issues you raise. It was very enlightening. Thanks.
This may be a 'left-field' question, but coming from the opposite direction: how much would a 'backward-incompatible Python' need to change/lose in order to gain a) GIL-lessness and b) higher performance by being more JIT-friendly, like Lua?
Also, in a hundred-core-minimum future, does the GIL become more or less relevant?
@shak I can't really comment on JITs, but you should definitely check out PyPy for more information. As for cores, I suspect that by the time Python sheds its GIL, people will have also discovered that programming with threads is a really lousy way to write scalable parallel software (something that parallel computing folks already knew in the 1990s ;-).

Personally, I think it's fine if the GIL stays although there are still some improvements that could be made.
I experimented with interlocked increment and decrement on Linux yesterday (in gcc they are called __sync_add_and_fetch(int*, 1) and __sync_sub_and_fetch(int*, 1)).
On an Atom N270, the pystone benchmark scored 6500 (with the patch) vs. 7500 (using the GIL). A single-threaded test script takes 11 seconds (patch) vs. 8 seconds (GIL). A multithreaded script with the patched Python takes 7 seconds! It is even faster than with the GIL.
I have also done some experiments with the gcc atomics as you describe. I don't get quite the same numbers (for me, the GIL-less Python is still slower both with and without threads). However, the overall performance of the patched version improves dramatically doing this. I hope to say more in a future post.
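For anyone wanting to reproduce this kind of comparison, a minimal CPU-bound timing harness (my own sketch, not the pystone benchmark itself) looks something like this:

```python
import threading
import time

def countdown(n):
    # Pure CPU-bound loop: exactly the workload that stresses
    # per-opcode refcounting overhead versus the GIL.
    while n:
        n -= 1

def timed_single(total):
    """Time the workload run sequentially in one thread."""
    t0 = time.time()
    countdown(total)
    return time.time() - t0

def timed_threaded(total, nthreads=2):
    """Time the same workload split across several threads."""
    t0 = time.time()
    threads = [threading.Thread(target=countdown, args=(total // nthreads,))
               for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - t0
```

On a stock GIL-based CPython, timed_threaded is typically no faster (often slower) than timed_single for this workload, which is what makes it a useful probe for atomic-refcount patches.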
FWIW, any similarity between Greg's code in that patch and the thread state structure in current Python is not accidental -- Greg gave me that code as a set of patches around the time of his GIL-removal patch.
From a user perspective: why aren't the Stackless efforts ported to CPython? Doesn't Stackless solve all this stuff? PyPy still uses a GIL.
As part of my pyparallel work, I created a github repo for Python 1.4 then applied Greg's patch in a no-gil branch via this commit: https://github.com/tpn/python-1.4/commit/5c17e33224e2b57948ab594280a006d7648347e6
