Making Python Run Fast
The disparity between the fast development time of dynamically-typed languages like Python and their slow run-time performance has been a constant source of both frustration and investigation. I did some experiments to measure a prototype “Python to C++” compiler and report on my findings. I believe there is an interesting avenue for investigation of a new kind of language where a few type declarations for objects and primitive types (int, real, string) could deliver dramatic performance improvements with minor impact on the usability or “feel” of these otherwise dynamically typed languages.
The title is a bit of a teaser. This post is about making Python run fast, but the bottom line is that pure Python does not run nearly as fast as C++. It is worth looking into why this is so and whether we can get much higher performance and still get the “quality” of Python, whatever that is.
My real interest is Serpent, a Python-like language I developed for real-time computer music applications. The advantages of Serpent are mainly the ability to run multiple instances in a single process (Python now supports multi-processing, but the mechanism is much heavier than Serpent’s) and a real-time garbage collector. Other things I like about Serpent are “real” object methods (this does not appear as a parameter, and you access members with membername instead of this.membername), implicit colons in most cases (I’m always forgetting the colon after if, def, and for statements, and who needs them?), and usually no need for a backslash (\) when statements wrap to the next line (the compiler figures it out). Python and Serpent are similar enough that whatever works to speed up Serpent will work for Python and vice versa, although Python is larger, so more work is involved.
My rule of thumb is that Python is about 100x slower than C or C++. This is just a rough figure and it depends on what you are doing. In particular, if your program in C++ would spend 99% of the time in compiled library routines (e.g. networking, graphics, machine learning), then rewriting the other 1% in Python makes your program only 0.99 + 0.01 × 100 = 1.99x ≈ 2x slower. That may not be significant. But let’s assume speed matters and see what we can do about it.
A real advance in Python implementation is PyPy, which claims to be 4.8x faster than the more common CPython implementation. I'm not sure why PyPy is not the “standard” Python version. But still, 4.8x faster leaves you about 20x slower than C++, so is that the best we can hope for?
Compared to C++, there are several features of Python that make it slower. Chief among them is dynamic typing: for example, the + operator in Python gets compiled to the same byte code no matter what its operands are, but the implementation of that byte code must decide at runtime whether to add two integers, two doubles, an integer and a double, concatenate two arrays, splice two strings, or something else. That is a lot of decision making that happens for every operator.
Given these problems in Python, it is no wonder C++ is much faster. We have to ask if there are ways to eliminate or reduce these problems, either by clever implementation or by language design changes. We could then decide if the implementation effort or language changes give us an overall improvement. E.g. if you believe type hints in Python are a good thing, maybe you would be even happier making type specification a requirement of the language, especially if it would give a large speedup. So let’s look at some options and see what kind of speedup can be achieved.
To measure speedup, we need a benchmark, a deterministic program that we can implement in different ways and see how fast it runs. Ideally, we could have a whole suite of benchmarks emphasizing different aspects of the language or different kinds of computation. I am using a very small benchmark, the n-body program from the Computer Language Benchmarks Game. I cannot defend this as the “right” benchmark, but at least it has a good mix of structure or object field access, conditionals, floating point and integer operations, and array operations. It does not emphasize function calls, but in my opinion, if function calls are your big problem, you can probably in-line function calls in the time-critical code, so I am not so concerned about this. Similarly, you could worry about string manipulation, but a lot of string manipulation takes place in libraries for pattern matching, searching, etc., and that is something I am willing to delegate to C++, e.g. by making the functions built-in operations of the language.
My first thought was: How bad would it be to require type specifications everywhere, enabling a more-or-less direct translation to C++? I wrote a prototype cross-compiler from Serpent (with types) to C++. Writing a compiler is not a small task, but by leveraging C++, I was able to avoid a complete language parser, and instead just convert whole lines of Serpent code to C++. For C++, you need to add braces around sub-blocks and take care of a number of other language differences. The compiler is about 900 lines of Serpent.
The main idea that makes this work is adding type specifications to Serpent. For example, this Serpent code constructs all pair-wise combinations of the elements of list l:
    def combinations(l):
        result = []
        for i = 0 to len(l) - 1:
            var x = l[i]
            for j = i + 1 to len(l):
                result.append(Pair(l[i], l[j]))
        return result
And here is the same program rewritten with type specifications; this version serves as input to our “CSerpent to C++” compiler:
    Pairs combinations(Bodies l):
        Pairs result = array(Pairs, 0)
        for Int i = 0 to len(l) - 1:
            Body x = l[i]
            for Int j = i + 1 to len(l):
                result.append(Pair(l[i], l[j]))
        return result
This compiles to this C++ code:
    Pairs combinations(Bodies l) {
        Pairs result = Pairs_create(0, 0);
        for (Int i = 0; i < len(l) - 1; i += 1) {
            Body x = l[i];
            for (Int j = i + 1; j < len(l); j += 1) {
                result->append(Pair_create(l[i], l[j]));
            }
        }
        return result;
    }
The complete implementations can be found here for Python, Serpent, CSerpent (i.e., Serpent code with types), C++ compiled from CSerpent, and a reference hand-coded C++ version.
Let’s run them and see what performance we get:
Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
Although the experiment compiling typed Serpent (I will call it CSerpent) to C++ gives impressive speedup, the typing system seems ugly and cumbersome. Perhaps it is just a question of language design, but constructing and naming all the types, especially arrays and dictionaries of objects, is awkward. E.g. in my CSerpent implementation, all types must be simple identifiers, so you use a new command typedef and constructors like array_of(Body) to write a definition such as:

    typedef Bodies = array_of(Body)

You can see the use of Bodies in the CSerpent example above. This does not seem so bad, but there is a Dictionary that maps planet names (strings) to objects of type Body. What do I name that? I used String_body_di, but that is about as ugly as the equivalent C++ type name; maybe Body_table or Body_dict would be more acceptable. I am on the fence here: Can a language have the nice feel of Python if everything is statically typed?
This got me to thinking that arrays and dictionaries could have elements of any type (something like Objective-C containers, which simply contain object references), while types could be used elsewhere. Or maybe types could be optional. Either approach requires the implementation of a single type that can store numbers, strings and objects. I name the type Any and give a few details below.

My next thought was that if I can implement Any, why not go back to dynamic typing by declaring everything to be of type Any? I could use the same compiler from CSerpent to C++ and gain some speedup, but avoid the need to declare types. At least it is a good experiment. We can see how much speed is lost due to dynamic typing versus other design decisions.
My type Any uses “nan-boxing,” similar to the values in Serpent’s implementation. A value is represented by a 64-bit unsigned int. The high-order bits determine whether the value represents a 50-bit signed integer, a 48-bit object pointer (whose class is indicated by a pointer in the object), a 48-bit string pointer, or a double. On current machines, the high-order bits of addresses are zero, so 48 bits is enough. The main trick here is that doubles do not use all possible bit patterns in the high-order bits, so the unused patterns can serve as codes indicating that the low-order bits should be interpreted as pointers of different kinds or as integers. Serpent even has a special case for short strings (up to 5 bytes and a zero terminating byte) contained entirely within the 64-bit value. Of course, there is overhead to decode the type and convert the value, e.g. ints must be sign-extended to form a 64-bit signed int and pointers must have their high-order bits cleared to form valid addresses.
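To make this concrete, here is a minimal nan-boxing sketch in C++. It is not Serpent’s actual implementation (the tag values, short-string support, and NaN handling are all simplified), but it shows how doubles, small integers, and 48-bit pointers can share one 64-bit word:

    // A simplified nan-boxing sketch (not Serpent's actual encoding).
    // Doubles are stored as-is; integers and pointers are stored in bit
    // patterns that ordinary double arithmetic never produces (NaNs with
    // unusual payloads). Assumes 48-bit addresses and that hardware NaNs
    // stay below QNAN_BASE.
    #include <cstdint>
    #include <cstring>

    class Any {
        uint64_t bits;
        static constexpr uint64_t QNAN_BASE = 0xFFF9000000000000ULL;
        static constexpr uint64_t TAG_MASK  = 0xFFFF000000000000ULL;
        static constexpr uint64_t TAG_INT   = 0xFFFA000000000000ULL;
        static constexpr uint64_t TAG_PTR   = 0xFFFB000000000000ULL;
        static constexpr uint64_t PAYLOAD   = 0x0000FFFFFFFFFFFFULL;
    public:
        Any(double d)  { std::memcpy(&bits, &d, sizeof d); }
        Any(int64_t i) { bits = TAG_INT | ((uint64_t) i & PAYLOAD); }  // assumes i fits in 47 bits
        Any(void *p)   { bits = TAG_PTR | (uint64_t) (uintptr_t) p; }  // assumes a 48-bit address

        bool is_double() const { return bits < QNAN_BASE; }
        bool is_int()    const { return (bits & TAG_MASK) == TAG_INT; }
        bool is_ptr()    const { return (bits & TAG_MASK) == TAG_PTR; }

        double as_double() const { double d; std::memcpy(&d, &bits, sizeof d); return d; }
        // decoding has a cost: sign-extend the 48-bit payload to 64 bits
        int64_t as_int() const { return (int64_t) ((bits & PAYLOAD) << 16) >> 16; }
        // decoding a pointer: clear the tag in the high-order bits
        void *as_ptr() const { return (void *) (uintptr_t) (bits & PAYLOAD); }
    };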
The implementation of Any also requires operator overloading so we can write a + b even if a and/or b is of type Any. Similarly, built-in functions for array indexing, dictionary lookup, etc., need to have versions that take parameters of type Any, check for valid types, decode the types, and run the function.
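For example, building on the Any sketch above, an overloaded + might look something like this (again just a sketch; a real runtime would also handle strings, arrays, and so on):

    #include <stdexcept>

    // Runtime dispatch for a + b when operand types are not known until
    // runtime; every arithmetic operator pays for these checks.
    Any operator+(Any a, Any b) {
        if (a.is_int() && b.is_int()) {
            return Any(a.as_int() + b.as_int());
        }
        if ((a.is_int() || a.is_double()) && (b.is_int() || b.is_double())) {
            double x = a.is_double() ? a.as_double() : (double) a.as_int();
            double y = b.is_double() ? b.as_double() : (double) b.as_int();
            return Any(x + y);
        }
        // string concatenation, array splicing, etc. would be handled here
        throw std::runtime_error("operator+: unsupported operand types");
    }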
Finally, the compiler cannot compile method calls or instance variable access on values of type Any. My solution is to translate, say, foo.fn(x) into foo.call(FN_MSG, x), where FN_MSG is a unique value (I used a global char * initialized to "fn"). Every class implements its own call method that uses if/else if/else if/... code to select the proper method. My prototype classes have few methods, so this linear search is probably optimal. Of course, a more sophisticated dictionary could be implemented. Instance variables are handled in the same way, but within methods, the variables are simply named, and C++ accesses the variable at a known offset, so there is no need to search for it.
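Here is a rough sketch of what such a call method might look like for a simple class. The class, the messages, and the exact signature are illustrative, not the actual generated code:

    #include <stdexcept>

    // Unique message values: dispatch compares addresses, not string contents.
    const char *GET_X_MSG = "get_x";
    const char *SET_X_MSG = "set_x";

    class Body_class {
    public:
        double x;
        // the compiler turns foo.set_x(v) into foo.call(SET_X_MSG, v)
        Any call(const char *msg, Any arg = Any(0.0)) {
            if (msg == GET_X_MSG) {            // pointer comparison, not strcmp
                return Any(x);
            } else if (msg == SET_X_MSG) {
                x = arg.as_double();
                return arg;
            } else {
                throw std::runtime_error("message not understood");
            }
        }
    };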
I do not have a fully working compiler for this approach, but I was able to construct a working benchmark in a way that could be translated automatically. Here is what the compiled combinations routine looks like in this version:
    Any combinations(Any l) {
        Any result(new Array_class);
        int ub = l.call(SIZE_MSG).must_get_int();
        for (int i = 0; i < ub - 1; i += 1) {
            Any any_i(i);
            int ub2 = l.call(SIZE_MSG).must_get_int();
            for (int j = (any_i + 1).must_get_int(); j < ub2; j += 1) {
                Any any_j(j);
                result.call(APPEND_MSG, Pair_create(l[any_i], l[any_j]));
            }
        }
        return result;
    }
Notice everything is declared Any except for for loop variables, which are introduced by the compiler and guaranteed to be integers by the language definition. (Loops work a little differently in Serpent than Python, but I think the general approach would work in either language.) Since the actual type represented by Any is variable, functions require dynamic lookup and dispatch. For example, result.append(...) is compiled to result.call(APPEND_MSG, ...).
The run time for this approach, added to our previous table, gives us:

Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
C++ (from CSerpent with dynamic type Any) | 129 sec | 32x | 4.0x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
As shown in the middle row, this compiled dynamic typing approach gives a solid improvement over Python, but it is a long way from the performance with static typing.
In Serpent, I often find myself declaring a local variable with int by accident (all locals in Serpent should be declared with var, unlike Python’s implicit declaration of locals upon assignment). What would happen if we declared float and integer types but left everything else to dynamic typing? This would be a minimal burden on the programmer.
I manually altered the fully dynamic version based on type Any by declaring all integers and floats as such. The new row (third from the top) in this updated table shows the results:
Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
C++ (from CSerpent with mixed Any, double, and int types) | 95 sec | 22x | 5.8x |
C++ (from CSerpent with dynamic type Any) | 129 sec | 32x | 4.0x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
There is some speedup with declarations for float and integer types, but the improvement is limited by the many field (instance variable) accesses, which go through object references of type Any. These require dynamic lookup of fields, and since the fields could be declared to be any type in any object, the returned result must be Any. So we not only pay for dynamic lookup but also to encode and decode the float from type double to type Any and back. Converting back requires a type check (consider that the value could be an int and we could be validly multiplying by an int). Also notice we are already faster than PyPy’s claimed 4.8x speedup over CPython. A little typing can go a long way.
But this version is still 22x slower than C++ and not even 10x faster than Serpent or Python.
Going back to the 5 reasons Python is slow, what can we say about them now?
Regarding byte-code interpretation: comparing the interpreter to compiled code that still uses full dynamic typing (type Any for everything), we see a ratio of 4x (this is actually the ratio of Python to compiled Serpent, but there is only a 4% difference between Serpent and Python on this benchmark, so I will consider Serpent and Python to be equal). It’s surprising to me that the virtual machine apparently accounts for 75% of Python compute time given how much additional time is required for conditional code depending on operand types. A lot of work has gone into making byte-code interpreters faster, and the Labels as Values extension supported by GCC and Clang might help create faster virtual machines without resorting to assembly. But the improvements have an upper bound of 4x unless the dynamic typing problem is addressed.
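For reference, here is a toy example of the Labels-as-Values dispatch style. It is a GCC/Clang extension, and the opcodes and tiny virtual machine here are made up purely for illustration:

    #include <cstdio>

    enum Op { OP_INC, OP_DEC, OP_HALT };

    // Direct-threaded dispatch: each handler jumps straight to the next
    // handler through a table of label addresses, avoiding a central
    // switch/loop and giving the branch predictor more to work with.
    long run(const Op *pc) {
        static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
        long acc = 0;
        goto *dispatch[*pc++];
    op_inc:  acc++; goto *dispatch[*pc++];
    op_dec:  acc--; goto *dispatch[*pc++];
    op_halt: return acc;
    }

    int main() {
        Op program[] = { OP_INC, OP_INC, OP_INC, OP_DEC, OP_HALT };
        std::printf("%ld\n", run(program));   // prints 2
    }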
Regarding dynamic typing of numbers: when I declared mixed Any, int and double types, the improvement over using Any for all types was slight (speed increased 45%, from 4x to 5.8x over Python). In this benchmark, classes are used as structures containing fields x, y, z, vx, vy, vz and m, and even though these are stored as type double, the compiler does not have the type information to avoid decoding the object reference from Any, dynamically finding the instance variable, and encoding it into an Any which must then be decoded immediately to a double. If code is rewritten to use all local variables and if we introduce pure arrays or tuples of doubles, then we can get much more speedup. Alternatively, a just-in-time compiler that generates specialized code to handle the all-doubles case and even in-lines access to instances should perform well here.
Regarding field access: an expression like some_object.some_field is a bigger problem. First, if you do not know the type of some_object, you need to decode it. Second, since you did not know the object type, you could not look up the location of some_field at compile time, so a second dynamic lookup is required. And third, after all that work, the compiler does not know the field (instance variable) type, so that type must be decoded to perform any operation with it.
This is the main difference between the “compiled from CSerpent” and the “from CSerpent with mixed Any, double and int types” versions. In the former, we had object types and therefore we could resolve instance variable locations and types. In the latter, every expression like b1.x required two dynamic lookups and resulted in type Any, which required a further conditional to test for a double. The difference in speed was a factor of 6, so we can say, roughly, that 84% of the runtime is devoted to dynamic typing of objects and instance variables.
Notice that we could replace all instance variable accesses of the form b1.x with “getter” functions like b1.get_x(), because within the method, we can know the offset of x at compile time. This does not help, though, because now we have to do a dynamic lookup to find the get_x method.
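To see where the time goes, here is a sketch of the two code paths for computing dx = b1.x - b2.x. It builds loosely on the sketches above; assume Any has a call method that decodes the object reference and forwards to the class’s own call, and that Body_class and GET_X_MSG are as before (all of this is illustrative, not the real generated code):

    // Fully dynamic: decode the object reference, dynamically find the field,
    // get the value boxed as an Any, unbox it, then box the result again.
    Any dx_dynamic(Any b1, Any b2) {
        Any x1 = b1.call(GET_X_MSG);   // dynamic decode + dynamic dispatch
        Any x2 = b2.call(GET_X_MSG);
        double d1 = x1.as_double();    // type check + unbox
        double d2 = x2.as_double();
        return Any(d1 - d2);           // box the result again
    }

    // With b1 and b2 declared as objects of a known class (and x declared
    // double): two ordinary loads and a native subtract.
    double dx_static(Body_class *b1, Body_class *b2) {
        return b1->x - b2->x;
    }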
One way to regain some lost performance is to make type declarations for classes optional, e.g. consider this Serpent from nbody.srp:
    for pair in PAIRS:
        var b1 = pair.b1
        var b2 = pair.b2
        var dx = b1.x - b2.x
        var dy = b1.y - b2.y
        var dz = b1.z - b2.z
        e = e - (b1.m * b2.m) / pow(dx * dx + dy * dy + dz * dz, 0.5)
We could make a small addition to declare pair to be an instance of Pair:
    for Pair pair in PAIRS:
        var b1 = pair.b1
        var b2 = pair.b2
        var dx = b1.x - b2.x
        var dy = b1.y - b2.y
        var dz = b1.z - b2.z
        e = e - (b1.m * b2.m) / pow(dx * dx + dy * dy + dz * dz, 0.5)
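As a hand-written sketch (not actual output of my prototype compiler, and with hypothetical names like Pair_class and must_get_pair), the C++ generated for this loop could then look roughly like this, with only one dynamic decode per pair:

    for (int i = 0; i < len(PAIRS); i += 1) {
        Pair_class *pair = PAIRS[i].must_get_pair();  // the one remaining dynamic check
        Body_class *b1 = pair->b1;                    // from here on, ordinary C++
        Body_class *b2 = pair->b2;
        double dx = b1->x - b2->x;
        double dy = b1->y - b2->y;
        double dz = b1->z - b2->z;
        e = e - (b1->m * b2->m) / pow(dx * dx + dy * dy + dz * dz, 0.5);
    }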
With this simple change, the compiler can determine from Pair that pair.b1 is a Body (if the instance variable b1 is declared that way), and that b1.x and the others are floats. This cuts the number of dynamic lookups from 10 to 1, so that 84% overhead is cut to 8.4%, and extrapolating to the whole program, we might get down to 5.3x C++ performance, or a 25x speedup over Python. This deserves another table:
Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (from CSerpent with object and primitive type declarations, estimated) | 21 sec | 5.3x | 25x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
C++ (from CSerpent with mixed Any, double, and int types) | 95 sec | 22x | 5.8x |
C++ (from CSerpent with dynamic type Any) | 129 sec | 32x | 4.0x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
Two other forms of overhead, reference counting and indirection to objects, seem to be the main difference between the reference C++ implementation and my compiled typed CSerpent implementation of the n-body problem. Does that mean reference counting and object indirection cost 3.7x? There are other optimizations in nbodyref.cpp that I suspect allow the compiler to compute expressions like bodies[j].mass more efficiently by avoiding vector templates, but this requires further investigation.
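To show where reference-counting overhead comes from, here is a minimal intrusive reference-counting sketch (not my actual implementation, just an illustration): every copy, assignment, and destruction of a reference touches the count, even when the program is only reading data.

    // Minimal intrusive reference counting: the count lives in the object,
    // and a smart-pointer wrapper adjusts it on every copy and destruction.
    struct RefCounted {
        long refcount = 0;
        virtual ~RefCounted() {}
    };

    template <class T> class Ref {
        T *ptr;
    public:
        Ref(T *p = nullptr) : ptr(p) { if (ptr) ptr->refcount++; }
        Ref(const Ref &r) : ptr(r.ptr) { if (ptr) ptr->refcount++; }
        Ref &operator=(const Ref &r) {
            if (r.ptr) r.ptr->refcount++;   // increment first: safe for self-assignment
            if (ptr && --ptr->refcount == 0) delete ptr;
            ptr = r.ptr;
            return *this;
        }
        ~Ref() { if (ptr && --ptr->refcount == 0) delete ptr; }
        T *operator->() const { return ptr; }
    };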
There are faster techniques than my implementation of reference counting (which is derived from code by Ahmed S. Hefny). First, there are some ingenious algorithms and data structures for reference counting that could lead to faster code. Another option is garbage collection (or tracing garbage collection if you prefer that terminology), which Serpent demonstrates can offer sub-millisecond pauses and reclaim cyclical structures. This is tricky though, and Serpent can rely on total control over a virtual machine that only manages Serpent objects. That allows us to identify all pointers and figure out what they point to. A similar collector could be implemented for compiled code, but you would have to either (1) keep an explicit stack of objects implementing local variables so you can track down all object references, but this would deny the compiler the ability to keep pointers in registers intermixed with other data (XLisp does this), or (2) devise a scheme to safely trace untyped register data in the C++ stack. This area needs more thought and investigation.
What have I learned? First, I learned that byte-code interpreters are really slow compared to compiled code. This is not the impression I had from many papers showing the per-instruction overhead could be quite low. My quick-and-dirty compiler runs 4x faster than my interpreter (and to be clear, the interpreter is running compiled instructions on an abstract virtual machine as opposed to even slower direct interpretation of source code).
Second, merely decorating local variables with types, which I thought might really speed things up, does not help much because there are so many dynamic lookups of objects and fields, e.g. b1.x, whose types cannot be resolved at compile time.
Third, a promising approach is to mix dynamic types with type specification. Even if we keep generic collections like arrays and dictionaries, merely declaring the class of a few object references (and verifying them at run time) can give the compiler enough information to get a 25x speedup in my experiment. This is promising because it places a minimal burden on the programmer, preserves the “flavor” of Python/Serpent programming (in my opinion), and the optional annotations offer a bit of useful documentation, which might even be considered a language improvement.
Fourth, if you want to really go for broke, you can declare everything including arrays and dictionaries, making them homogeneous collections. This delivered a whopping 36x performance improvement, but note that this is only 36/25 = 1.44x faster than what seems to be a much more Python-like approach where you only declare object types where they seem useful.
Overall, I now think that adding the ability to declare object types and compiling to C++ eliminates the two main sources of overhead in running Python: byte code interpretation and resolving object types and instance variables. This seems like a “sweet spot” for language design, and I do not know of anything similar. Objective-C at least uses a dynamic type for object references in collections, but otherwise it is not very Python-like.