Roger B. Dannenberg


Making Python Run Fast


Summary

The disparity between the fast development time of dynamically-typed languages like Python and their slow run-time performance has been a constant source of both frustration and investigation. I did some experiments to measure a prototype “Python to C++” compiler and report on my findings here. I believe there is an interesting avenue for investigation: a new kind of language where a few type declarations for objects and primitive types (int, real, string) could deliver dramatic performance improvements with minor impact on the usability or “feel” of these otherwise dynamically typed languages.

About the Author

Roger B. Dannenberg is Emeritus Professor of Computer Science at Carnegie Mellon University. He is known for a broad range of research in Computer Music, including the creation of interactive computer accompaniment systems, languages for computer music, music understanding systems, and music composing software. He is a co-creator of Audacity, perhaps the most widely used music editing software.

The title is a bit of a teaser. This post is about making Python run fast, but the bottom line is that pure Python does not run nearly as fast as C++. It is worth looking into why this is so and whether we can get much higher performance while keeping the “quality” of Python, whatever that is.

My real interest is Serpent, a Python-like language I developed for real-time computer music applications. The advantages of Serpent are mainly the ability to run multiple instances in a single process (Python now supports multi-processing, but the mechanism is much heavier than Serpent’s) and a real-time garbage collector. Other things I like about Serpent are “real” object methods (this does not appear as a parameter, and you access members with membername instead of this.membername), implicit colons in most cases (I’m always forgetting the colon after if, def, and for statements, and who needs them?), and usually no need for backslash (\) when statements wrap to the next line (the compiler figures it out). Python and Serpent are similar enough that whatever works to speed up Serpent will work for Python and vice versa, although Python is larger so more work is involved.

How Slow Is Python?

My rule of thumb is that Python is about 100x slower than C or C++. This is just a rough figure, and it depends on what you are doing. In particular, if your program in C++ would spend 99% of the time in compiled library routines (e.g. networking, graphics, machine learning), then rewriting the other 1% in Python makes your program only 0.99 + 0.01 × 100 = 1.99x ≈ 2x slower. That may not be significant. But let’s assume speed matters and see what we can do about it.

A real advance in Python implementation is PyPy, which claims to be 4.8x faster than the more common CPython implementation. I'm not sure why PyPy is not the “standard” Python version. But still, 4.8x faster leaves you about 20x slower than C++, so is that the best we can hope for?

Why is Python Slow?

Compared to C++, there are several features of Python that make it slower:

Given these problems in Python, it is no wonder C++ is much faster. We have to ask if there are ways to eliminate or reduce these problems, either by clever implementation or by language design changes. We could then decide if the implementation effort or language changes give us an overall improvement. E.g. if you believe type hints in Python are a good thing, maybe you would be even happier making type specification a requirement of the language, especially if it would give a large speedup. So let’s look at some options and see what kind of speedup can be achieved.

Benchmark

To measure speedup, we need a benchmark, a deterministic program that we can implement in different ways and see how fast it runs. Ideally, we could have a whole suite of benchmarks emphasizing different aspects of the language or different kinds of computation. I am using a very small benchmark, the n-bodies program, from Benchmark Games. I cannot defend this as the “right” benchmark, but at least it has a good mix of structure or object field access, conditionals, floating point and integer operations, and array operations. It does not emphasize function calls, but in my opinion, if function calls are your big problem, you can probably in-line function calls in the time-critical code, so I am not so concerned about this. Similarly, you could worry about string manipulation, but a lot of string manipulation takes place in libraries for pattern matching, searching, etc., and that is something I am willing to delegate to C++, e.g. by making the functions built-in operations of the language.

Complete Typing

My first thought was: How bad would it be to require type specifications everywhere, enabling a more-or-less direct translation to C++? I wrote a prototype cross-compiler from Serpent (with types) to C++. Writing a compiler is not a small task, but by leveraging C++, I was able to avoid a complete language parser, and instead just convert whole lines of Serpent code to C++. For C++, you need to add braces around sub-blocks and take care of a number of other language differences. The compiler is about 900 lines of Serpent.

The main idea that makes this work is adding type specifications to Serpent. For example, this Serpent code constructs all pair-wise combinations of the elements of list l:

def combinations(l):
    result = []
    for i = 0 to len(l) - 1:
        var x = l[i]
        for j = i + 1 to len(l):
            result.append(Pair(l[i], l[j]))
    return result

And here is the same program rewritten with type specifications; this version serves as input to our “CSerpent to C++” compiler:

Pairs combinations(Bodies l):
    Pairs result = array(Pairs, 0)
    for Int i = 0 to len(l) - 1:
        Body x = l[i]
        for Int j = i + 1 to len(l):
            result.append(Pair(l[i], l[j]))
    return result

This compiles to this C++ code:

Pairs combinations(Bodies l) {
    Pairs result = Pairs_create(0, 0);
    for (Int i = 0; i < len(l) - 1; i += 1) {
        Body x = l[i];
        for (Int j = i + 1; j < len(l); j += 1) {
            result->append(Pair_create(l[i], l[j]));
        }
    }
    return result;
}
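
The compiled code relies on some supporting declarations from the CSerpent runtime: the typedefs for Pairs and Bodies, the create functions, and len. Here is a hedged sketch of roughly what those could look like. The names follow the example above, but the details (std::vector storage, handle classes with operator[] and operator->, ignoring the create arguments) are my illustrative assumptions, not the actual generated code.

#include <vector>

typedef long Int;

class Body_class { public: double mass, x, y, z, vx, vy, vz; };
typedef Body_class *Body;               // object variables hold references

class Pair_class { public: Body first, second; };
typedef Pair_class *Pair;

inline Pair Pair_create(Body a, Body b) {
    Pair p = new Pair_class; p->first = a; p->second = b; return p;
}

// typedef Pairs = array_of(Pair): a small handle class, so compiled code
// can write both result->append(...) and result[i]
class Pairs_array {
public:
    std::vector<Pair> items;
    void append(Pair p) { items.push_back(p); }
};

class Pairs {
public:
    Pairs_array *ptr;
    Pairs_array *operator ->() const { return ptr; }
    Pair operator [](Int i) const { return ptr->items[i]; }
};

inline Pairs Pairs_create(Int length, Int reserve) {
    Pairs p; p.ptr = new Pairs_array; return p;   // length/reserve ignored here
}

// typedef Bodies = array_of(Body), same idea
class Bodies_array { public: std::vector<Body> items; };

class Bodies {
public:
    Bodies_array *ptr;
    Body operator [](Int i) const { return ptr->items[i]; }
};

inline Int len(Bodies b) { return (Int) b.ptr->items.size(); }
inline Int len(Pairs p) { return (Int) p->items.size(); }

With declarations along these lines, the compiled combinations above reads as ordinary C++.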

The complete implementations can be found here for Python, Serpent, CSerpent (i.e. Serpent code with types), C++ compiled from CSerpent, and a reference hand-coded C++ version.

Let’s run them and see what performance we get:

Language                        Run Time   Relative to C++   Speedup Over Python
C++                             4.0 sec    1x                131x
C++ (compiled from CSerpent)    14.7 sec   3.7x              36x
Python                          522 sec    131x              1x
Serpent                         543 sec    136x              0.96x

Although the experiment compiling typed Serpent (I will call it CSerpent) to C++ gives an impressive speedup, the typing system seems ugly and cumbersome. Perhaps it is just a question of language design, but constructing and naming all the types, especially arrays and dictionaries of objects, is a burden. E.g. in my CSerpent implementation, all types must be simple identifiers, so you use a new command typedef and constructors like array_of(Body) to write a definition such as:

typedef Bodies = array_of(Body)

You can see the use of Bodies in the CSerpent example above. This does not seem so bad, but there is a Dictionary that maps planet names (strings) to objects of type Body. What do I name that? I used String_body_di, but that is as ugly as writing out the C++ map type. In retrospect, I could have called it Body_table or Body_dict and it might be more acceptable. I am on the fence here: Can a language have the nice feel of Python if everything is statically typed? This got me to thinking that arrays and dictionaries could have elements of any type (something like Objective-C containers, which simply contain object references), while types could be used elsewhere. Or maybe types could be optional. Either approach requires the implementation of a single type that can store numbers, strings, and objects. I call this type Any and give a few details below.

Compiled with Dynamic Typing

My next thought was that if I can implement Any, why not go back to dynamic typing by declaring everything to be of type Any? I could use the same compiler from CSerpent to C++ and gain some speedup, but avoid the need to declare types.

At least it is a good experiment. We can see how much speed is lost due to dynamic typing versus other design decisions.

My type Any uses “nan-boxing”, similar to values in Serpent’s implementation. A value is represented by a 64-bit unsigned int. The high-order bits determine whether the value represents a 50-bit signed integer, a 48-bit object pointer (whose class is indicated by a pointer in the object), a 48-bit string pointer, or a double. On current machines, the high-order bits of addresses are zero, so 48 bits is enough. The main trick here is that doubles do not use all possible bit patterns in the high-order bits, so the unused patterns can serve as codes indicating that the low-order bits should be interpreted as pointers of different kinds or as integers. Serpent even has a special case for short strings (up to 5 bytes and a zero terminating byte) contained entirely within the 64-bit value. Of course, there is overhead to decode and convert values, e.g. ints must be sign-extended to form a 64-bit signed int, and pointers must have their high-order bits cleared to form valid addresses.
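
To make the idea concrete, here is a minimal C++ sketch of a nan-boxed value. The layout is simplified relative to Serpent (48-bit integer payloads rather than 50, no short-string case, and tag values I chose for illustration), so treat it as a sketch of the technique rather than Serpent’s actual encoding.

#include <cstdint>
#include <cstring>

// Bit patterns at or above TAG_BASE are not produced by ordinary floating
// point arithmetic (they would be negative quiet NaNs with extra payload
// bits set), so they are free to encode tagged integers and pointers.
const uint64_t TAG_BASE = 0xFFFC000000000000ULL;
const uint64_t TAG_MASK = 0xFFFF000000000000ULL;
const uint64_t INT_TAG  = 0xFFFC000000000000ULL;
const uint64_t OBJ_TAG  = 0xFFFD000000000000ULL;
const uint64_t STR_TAG  = 0xFFFE000000000000ULL;
const uint64_t PAYLOAD  = 0x0000FFFFFFFFFFFFULL;  // low 48 bits

class Any {  // simplified version of the Any type described in the text
public:
    uint64_t bits;

    Any() : bits(INT_TAG) {}                       // default to integer 0
    bool is_double() const { return bits < TAG_BASE; }
    bool is_int()    const { return (bits & TAG_MASK) == INT_TAG; }
    bool is_obj()    const { return (bits & TAG_MASK) == OBJ_TAG; }

    static Any from_double(double d) {
        Any a; std::memcpy(&a.bits, &d, sizeof d); return a;
    }
    double as_double() const {
        double d; std::memcpy(&d, &bits, sizeof d); return d;
    }
    static Any from_int(int64_t i) {               // caller ensures i fits
        Any a; a.bits = INT_TAG | ((uint64_t) i & PAYLOAD); return a;
    }
    int64_t as_int() const {                       // sign-extend 48-bit payload
        return ((int64_t) (bits << 16)) >> 16;
    }
    static Any from_obj(void *p) {                 // high 16 address bits are 0
        Any a; a.bits = OBJ_TAG | ((uint64_t) (uintptr_t) p & PAYLOAD); return a;
    }
    void *as_obj() const {                         // clear the tag bits
        return (void *) (uintptr_t) (bits & PAYLOAD);
    }
};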

The implementation of Any also requires operator overloading so we can write a + b even if a and/or b is of type Any. Similarly, built-in functions for array indexing, dictionary lookup, etc., need to have versions that take parameters of type Any, check for valid types, decode the types and run the function.
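
As a hedged example, an overloaded + for the Any sketched above might look like this (string concatenation and integer overflow handling are omitted):

#include <stdexcept>

Any operator +(Any a, Any b) {
    if (a.is_int() && b.is_int()) {
        return Any::from_int(a.as_int() + b.as_int());
    } else if ((a.is_int() || a.is_double()) &&
               (b.is_int() || b.is_double())) {
        // at least one operand is a double: promote and add as doubles
        double x = a.is_double() ? a.as_double() : (double) a.as_int();
        double y = b.is_double() ? b.as_double() : (double) b.as_int();
        return Any::from_double(x + y);
    } else {
        // string concatenation, etc., would be handled here
        throw std::runtime_error("bad operand types for +");
    }
}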

Finally, the compiler cannot compile method calls or instance variable access on values of type Any. My solution is to translate, say, foo.fn(x) into foo.call(FN_MSG, x) where FN_MSG is a unique value (I used a global char * initialized to "fn"). Every class implements its own call method that uses if/else if/else if/... code to select the proper method. My prototype classes have few methods, so this linear search is probably optimal. Of course, a more sophisticated dictionary could be implemented. Instance variables are implemented in the same way, but within methods, the variables are simply named, and C++ accesses the variable at a known offset, so there is no need to search for it.
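
Here is a hedged sketch of that per-class dispatch, again building on the Any sketch above; the class and method names are invented for illustration.

#include <stdexcept>

// unique message values: the compiler emits one global per method name,
// so dispatch compares pointers instead of calling strcmp
const char *GETX_MSG = "getx";
const char *SETX_MSG = "setx";

class Point_class {
public:
    double x;   // within methods, instance variables are ordinary members

    Any getx() { return Any::from_double(x); }
    Any setx(Any v) { x = v.as_double(); return v; }  // real code type-checks v

    // every class implements call(); a linear if/else chain picks the method
    Any call(const char *msg, Any arg = Any()) {
        if (msg == GETX_MSG) {
            return getx();
        } else if (msg == SETX_MSG) {
            return setx(arg);
        } else {
            throw std::runtime_error("no such method");
        }
    }
};

A call on an Any value, such as foo.call(FN_MSG, x), would first decode the object pointer and then forward to the object’s own call method.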

I do not have a fully working compiler for this approach, but I was able to construct a working benchmark in a way that could be translated automatically. Here is what the compiled combinations routine looks like in this version:

Any combinations(Any l) {
    Any result(new Array_class);
    int ub = l.call(SIZE_MSG).must_get_int();
    for (int i = 0; i < ub - 1; i += 1) {
        Any any_i(i);
        int ub2 = l.call(SIZE_MSG).must_get_int();
        for (int j = (any_i + 1).must_get_int(); j < ub2; j += 1) {
            Any any_j(j);
            result.call(APPEND_MSG, Pair_create(l[any_i], l[any_j]));
        }
    }
    return result;
}

Notice that everything is declared Any except for the for-loop variables, which are introduced by the compiler and guaranteed to be integers by the language definition. (Loops work a little differently in Serpent than in Python, but I think the general approach would work in either language.) Since the actual type represented by Any is variable, method calls require dynamic lookup and dispatch. For example, result.append(...) is compiled to result.call(APPEND_MSG, ...);.

The run time for this approach, added to our previous table, gives us:

Language                                     Run Time   Relative to C++   Speedup Over Python
C++                                          4.0 sec    1x                131x
C++ (compiled from CSerpent)                 14.7 sec   3.7x              36x
C++ (from CSerpent with dynamic type Any)    129 sec    32x               4.0x
Python                                       522 sec    131x              1x
Serpent                                      543 sec    136x              0.96x

As shown in the middle row, this compiled dynamic typing approach gives a solid improvement over Python, but it is a long way from the performance with static typing.

Declared ints and floats

In Serpent, I often catch myself writing int instead of var when declaring a local variable (all locals in Serpent are declared with var, unlike Python's implicit declaration of locals upon assignment). What would happen if we declared float and integer types but left everything else to dynamic typing? This would be a minimal burden on the programmer.

I manually altered the fully dynamic version based on type Any by declaring all integers and floats as such. The new row in this updated table shows the results:

Language                                                     Run Time   Relative to C++   Speedup Over Python
C++                                                          4.0 sec    1x                131x
C++ (compiled from CSerpent)                                 14.7 sec   3.7x              36x
C++ (from CSerpent with mixed Any, double, and int types)    95 sec     22x               5.8x
C++ (from CSerpent with dynamic type Any)                    129 sec    32x               4.0x
Python                                                       522 sec    131x              1x
Serpent                                                      543 sec    136x              0.96x

There is some speedup with declarations for float and integer types, but the improvement is limited by the many field (instance variable) accesses, which go through object references of type Any. These require dynamic lookup of fields, and since the fields could be declared to be any type in any object, the returned result must be Any. So we not only pay for dynamic lookup but also for encoding and decoding the float between type double and type Any. Converting back requires a type check (consider that the value could be an int and we could be validly multiplying by an int). Also notice that we are already faster than PyPy's claimed 4.8x over CPython. A little typing can go a long way.

But this version is still 22x slower than C++ and not even 10x faster than Serpent or Python.

Analysis and Discussion

Going back to the 5 reasons Python is slow, what can we say about them now?

What about Javascript?

A huge amount of effort has gone into making JavaScript and Node.js very efficient. Using numbers for the n-body benchmark from Benchmark Games, Node.js is about 1.45x slower than C++, which is quite impressive. One could even imagine generating that code from something like CSerpent if you want the feel of Python and the speed of C++. Some reservations I have are the size (25MB to 40MB for the Node.js binary), the instability of Node.js, with very frequent and sometimes incompatible library updates, and the lack of real-time guarantees for garbage collection. Still, the fact that the just-in-time compiler can resolve JavaScript’s very dynamic object property lookups, as well as property data types, in this intense floating-point computation is very impressive and suggests that the upper limit for Python speed is well above what any current implementation achieves.

Conclusions

What have I learned? First, I learned that byte-code interpreters are really slow compared to compiled code. This is not the impression I had from many papers showing the per-instruction overhead could be quite low. My quick-and-dirty compiler runs 4x faster than my interpreter (and to be clear, the interpreter is running compiled instructions on an abstract virtual machine as opposed to even slower direct interpretation of source code).

Second, merely decorating local variables with types, which I thought might really speed things up, does not help much because there are so many dynamic lookups of objects and fields, e.g. b1.x, whose types cannot be resolved at compile time.

Third, a promising approach is to mix dynamic types with type specification. Even if we keep generic collections like arrays and dictionaries, merely declaring the class of a few object references (and verifying them at run time) can give the compiler enough information to get an estimated 25x speedup in my experiment. This is promising because it places a minimal burden on the programmer, preserves the “flavor” of Python/Serpent programming (in my opinion), and the optional annotations offer a bit of useful documentation, which might even be considered a language improvement.

Fourth, if you want to really go for broke, you can declare everything, including arrays and dictionaries, making them homogeneous collections. This delivered a whopping 36x performance improvement, but note that it is only 36/25 = 1.44x faster than the estimate for a much more Python-like language where you only declare object types where they seem useful.

Overall, I now think that adding the ability to declare object types and compiling to C++ eliminates the two main sources of overhead in running Python: byte code interpretation and resolving object types and instance variables. This seems like a “sweet spot” for language design, and I do not know of anything similar. Objective-C at least uses a dynamic type for object references in collections, but otherwise it is not very Python-like.

Caveats

All numbers should be interpreted qualitatively, especially because everything is based on one small benchmark program. Also, keep in mind that making Python faster may have almost zero impact on some applications where most of the work is in libraries implemented in C++ (e.g., PyTorch applications are largely in that category). The performance of the promising “sweet spot” just mentioned has only been estimated from other measurements. Reference counting overhead in my experiments seems to be quite high, so optimization in this area could make a significant difference. Although I am really impressed by PyPy, my impression, considering my experiments and the Node.js benchmarks, is that PyPy could run a lot faster. The fact that it does not, given that the PyPy developers are obviously experts, casts some doubt on everything I have written. The final word is not in yet.