Making Python Run Fast
The disparity between the fast development time of dynamically-typed languages like Python and their slow run-time performance has been a constant source of both frustration and investigation. I did some experiments to measure a prototype “Python to C++” compiler and report on my findings. I believe there is an interesting avenue for investigation of a new kind of language where a few type declarations for objects and primitive types (int, real, string) could deliver dramatic performance improvements with minor impact on the usability or “feel” of these otherwise dynamically typed languages.
The title is a bit of a teaser. This post is about making Python run fast, but the bottom line is that pure Python does not run nearly as fast as C++. It is worth looking into why this is so and whether we can get much higher performance and still get the “quality” of Python, whatever that is.
My real interest is Serpent, a Python-like language I developed for real-time computer music applications. The advantages of Serpent are mainly the ability to run multiple instances in a single process (Python now supports multi-processing, but the mechanism is much heavier than Serpent’s) and a real-time garbage collector. Other things I like about Serpent are “real” object methods (this does not appear as a parameter, and you access members with membername instead of this.membername), implicit colons in most cases (I’m always forgetting the colon after if, def, and for statements, and who needs them?), and usually no need for a backslash (\) when statements wrap to the next line (the compiler figures it out). Python and Serpent are similar enough that whatever works to speed up Serpent will work for Python and vice versa, although Python is larger, so more work is involved.
My rule of thumb is that Python is about 100x slower than C or C++. This is just a rough figure and it depends on what you are doing. In particular, if your program in C++ would spend 99% of the time in compiled library routines (e.g. networking, graphics, machine learning), then rewriting the other 1% in Python makes your program only 0.99 + 0.01 × 100 = 1.99x ≈ 2x slower. That may not be significant. But let’s assume speed matters and see what we can do about it.
A real advance in Python implementation is PyPy, which claims to be 4.8x faster than the more common CPython implementation. I'm not sure why PyPy is not the “standard” Python version. But still, 4.8x faster leaves you about 20x slower than C++, so is that the best we can hope for?
Compared to C++, there are several features of Python that make it slower. Chief among them is dynamic typing: for example, the + operator in Python gets compiled to the same byte code no matter what its operands are, but the implementation of that byte code must decide at runtime whether to add two integers, two doubles, an integer and a double, concatenate two arrays, splice two strings, or something else. That is a lot of decision making that happens for every operator.
Given these problems in Python, it is no wonder C++ is much faster. We have to ask if there are ways to eliminate or reduce these problems, either by clever implementation or by language design changes. We could then decide if the implementation effort or language changes give us an overall improvement. E.g. if you believe type hints in Python are a good thing, maybe you would be even happier making type specification a requirement of the language, especially if it would give a large speedup. So let’s look at some options and see what kind of speedup can be achieved.
To measure speedup, we need a benchmark, a deterministic program that we can implement in different ways and see how fast it runs. Ideally, we could have a whole suite of benchmarks emphasizing different aspects of the language or different kinds of computation. I am using a very small benchmark, the n-body program from the Computer Language Benchmarks Game. I cannot defend this as the “right” benchmark, but at least it has a good mix of structure or object field access, conditionals, floating point and integer operations, and array operations. It does not emphasize function calls, but in my opinion, if function calls are your big problem, you can probably in-line function calls in the time-critical code, so I am not so concerned about this. Similarly, you could worry about string manipulation, but a lot of string manipulation takes place in libraries for pattern matching, searching, etc., and that is something I am willing to delegate to C++, e.g. by making the functions built-in operations of the language.
My first thought was: How bad would it be to require type specifications everywhere, enabling a more-or-less direct translation to C++? I wrote a prototype cross-compiler from Serpent (with types) to C++. Writing a compiler is not a small task, but by leveraging C++, I was able to avoid a complete language parser, and instead just convert whole lines of Serpent code to C++. For C++, you need to add braces around sub-blocks and take care of a number of other language differences. The compiler is about 900 lines of Serpent.
The main idea that makes this work is adding type specifications to Serpent. For example, this Serpent code constructs all pair-wise combinations of the elements of list l:
    def combinations(l):
        result = []
        for i = 0 to len(l) - 1:
            var x = l[i]
            for j = i + 1 to len(l):
                result.append(Pair(l[i], l[j]))
        return result
And here is the same program rewritten with type specifications; this version serves as input to our “CSerpent to C++” compiler:
    Pairs combinations(Bodies l):
        Pairs result = array(Pairs, 0)
        for Int i = 0 to len(l) - 1:
            Body x = l[i]
            for Int j = i + 1 to len(l):
                result.append(Pair(l[i], l[j]))
        return result
This compiles to this C++ code:
    Pairs combinations(Bodies l) {
        Pairs result = Pairs_create(0, 0);
        for (Int i = 0; i < len(l) - 1; i += 1) {
            Body x = l[i];
            for (Int j = i + 1; j < len(l); j += 1) {
                result->append(Pair_create(l[i], l[j]));
            }
        }
        return result;
    }
The complete implementations can be found here for Python, Serpent, CSerpent (i.e., Serpent code with types), C++ compiled from CSerpent, and a reference hand-coded C++ version.
Let’s run them and see what performance we get:
Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
Although the experiment compiling typed Serpent (I will call it CSerpent) to C++ gives impressive speedup, the typing system seems ugly and cumbersome. Perhaps it is just a question of language design, but constructing and naming all the types, especially arrays and dictionaries of objects, is awkward. E.g. in my CSerpent implementation, all types must be simple identifiers, so you use a new command typedef and constructors like array_of(Body) to write a definition such as:

    typedef Bodies = array_of(Body)

You can see the use of Bodies in the CSerpent example above. This does not seem so bad, but there is a Dictionary that maps planet names (strings) to objects of type Body. What do I name that? I used String_body_di, but that is about as ugly as the equivalent C++ type name; maybe Body_table or Body_dict would be more acceptable. I am on the fence here: Can a language have the nice feel of Python if everything is statically typed?
This got me to thinking that arrays and dictionaries could have elements of any type (something like Objective-C containers, which simply contain object references), while types could be used elsewhere. Or maybe types could be optional. Either approach requires the implementation of a single type that can store numbers, strings and objects. I name the type Any and give a few details below.

My next thought was that if I can implement Any, why not go back to dynamic typing by declaring everything to be of type Any? I could use the same compiler from CSerpent to C++ and gain some speedup, but avoid the need to declare types. At least it is a good experiment. We can see how much speed is lost due to dynamic typing versus other design decisions.
My type Any uses “nan-boxing,” similar to the values in Serpent’s implementation. A value is represented by a 64-bit unsigned int. The high-order bits determine whether the value represents a 50-bit signed integer, a 48-bit object pointer (whose class is indicated by a pointer in the object), a 48-bit string pointer, or a double. On current machines, the high-order bits of addresses are zero, so 48 bits is enough. The main trick here is that doubles do not use all possible bit patterns in the high-order bits, so the unused patterns can serve as codes indicating that the low-order bits should be interpreted as pointers of different kinds or as integers. Serpent even has a special case for short strings (up to 5 bytes and a zero terminating byte) contained entirely within the 64-bit value. Of course, there is overhead to decode the type and convert the value, e.g. ints must be sign-extended to form a 64-bit signed int and pointers must have their high-order bits cleared to form valid addresses.
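To make this concrete, here is a minimal nan-boxing sketch in C++. It is not Serpent’s actual implementation (the tag values, short-string support, and NaN handling are all simplified), but it shows how doubles, small integers, and 48-bit pointers can share one 64-bit word:

    // A simplified nan-boxing sketch (not Serpent's actual encoding).
    // Doubles are stored as-is; integers and pointers are stored in bit
    // patterns that ordinary double arithmetic never produces (NaNs with
    // unusual payloads). Assumes 48-bit addresses and that hardware NaNs
    // stay below QNAN_BASE.
    #include <cstdint>
    #include <cstring>

    class Any {
        uint64_t bits;
        static constexpr uint64_t QNAN_BASE = 0xFFF9000000000000ULL;
        static constexpr uint64_t TAG_MASK  = 0xFFFF000000000000ULL;
        static constexpr uint64_t TAG_INT   = 0xFFFA000000000000ULL;
        static constexpr uint64_t TAG_PTR   = 0xFFFB000000000000ULL;
        static constexpr uint64_t PAYLOAD   = 0x0000FFFFFFFFFFFFULL;
    public:
        Any(double d)  { std::memcpy(&bits, &d, sizeof d); }
        Any(int64_t i) { bits = TAG_INT | ((uint64_t) i & PAYLOAD); }  // assumes i fits in 47 bits
        Any(void *p)   { bits = TAG_PTR | (uint64_t) (uintptr_t) p; }  // assumes a 48-bit address

        bool is_double() const { return bits < QNAN_BASE; }
        bool is_int()    const { return (bits & TAG_MASK) == TAG_INT; }
        bool is_ptr()    const { return (bits & TAG_MASK) == TAG_PTR; }

        double as_double() const { double d; std::memcpy(&d, &bits, sizeof d); return d; }
        // decoding has a cost: sign-extend the 48-bit payload to 64 bits
        int64_t as_int() const { return (int64_t) ((bits & PAYLOAD) << 16) >> 16; }
        // decoding a pointer: clear the tag in the high-order bits
        void *as_ptr() const { return (void *) (uintptr_t) (bits & PAYLOAD); }
    };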
The implementation of Any also requires operator overloading so we can write a + b even if a and/or b is of type Any. Similarly, built-in functions for array indexing, dictionary lookup, etc., need to have versions that take parameters of type Any, check for valid types, decode the types, and run the function.
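For example, building on the Any sketch above, an overloaded + might look something like this (again just a sketch; a real runtime would also handle strings, arrays, and so on):

    #include <stdexcept>

    // Runtime dispatch for a + b when operand types are not known until
    // runtime; every arithmetic operator pays for these checks.
    Any operator+(Any a, Any b) {
        if (a.is_int() && b.is_int()) {
            return Any(a.as_int() + b.as_int());
        }
        if ((a.is_int() || a.is_double()) && (b.is_int() || b.is_double())) {
            double x = a.is_double() ? a.as_double() : (double) a.as_int();
            double y = b.is_double() ? b.as_double() : (double) b.as_int();
            return Any(x + y);
        }
        // string concatenation, array splicing, etc. would be handled here
        throw std::runtime_error("operator+: unsupported operand types");
    }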
Finally, the compiler cannot compile method calls or instance variable access on values of type Any. My solution is to translate, say, foo.fn(x) into foo.call(FN_MSG, x), where FN_MSG is a unique value (I used a global char * initialized to "fn"). Every class implements its own call method that uses if/else if/else if/... code to select the proper method. My prototype classes have few methods, so this linear search is probably optimal. Of course, a more sophisticated dictionary could be implemented. Instance variables are handled in the same way, but within methods, the variables are simply named, and C++ accesses the variable at a known offset, so there is no need to search for it.
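Here is a rough sketch of what such a call method might look like for a simple class. The class, the messages, and the exact signature are illustrative, not the actual generated code:

    #include <stdexcept>

    // Unique message values: dispatch compares addresses, not string contents.
    const char *GET_X_MSG = "get_x";
    const char *SET_X_MSG = "set_x";

    class Body_class {
    public:
        double x;
        // the compiler turns foo.set_x(v) into foo.call(SET_X_MSG, v)
        Any call(const char *msg, Any arg = Any(0.0)) {
            if (msg == GET_X_MSG) {            // pointer comparison, not strcmp
                return Any(x);
            } else if (msg == SET_X_MSG) {
                x = arg.as_double();
                return arg;
            } else {
                throw std::runtime_error("message not understood");
            }
        }
    };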
I do not have a fully working compiler for this approach, but I was able to construct a working benchmark in a way that could be translated automatically. Here is what the compiled combinations routine looks like in this version:
    Any combinations(Any l) {
        Any result(new Array_class);
        int ub = l.call(SIZE_MSG).must_get_int();
        for (int i = 0; i < ub - 1; i += 1) {
            Any any_i(i);
            int ub2 = l.call(SIZE_MSG).must_get_int();
            for (int j = (any_i + 1).must_get_int(); j < ub2; j += 1) {
                Any any_j(j);
                result.call(APPEND_MSG, Pair_create(l[any_i], l[any_j]));
            }
        }
        return result;
    }
Notice everything is declared Any except for for loop variables, which are introduced by the compiler and guaranteed to be integers by the language definition. (Loops work a little differently in Serpent than Python, but I think the general approach would work in either language.) Since the actual type represented by Any is variable, functions require dynamic lookup and dispatch. For example, result.append(...) is compiled to result.call(APPEND_MSG, ...).
The run time for this approach, added to our previous table, gives us:

Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
C++ (from CSerpent with dynamic type Any) | 129 sec | 32x | 4.0x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
As shown in the middle row, this compiled dynamic typing approach gives a solid improvement over Python, but it is a long way from the performance with static typing.
In Serpent, I often find myself declaring a local variable with int by accident (all locals in Serpent should be declared with var, unlike Python’s implicit declaration of locals upon assignment). What would happen if we declared float and integer types but left everything else to dynamic typing? This would be a minimal burden on the programmer.
I manually altered the fully dynamic version based on type Any by declaring all integers and floats as such. The new row (third from the top) in this updated table shows the results:
Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
C++ (from CSerpent with mixed Any, double, and int types) | 95 sec | 22x | 5.8x |
C++ (from CSerpent with dynamic type Any) | 129 sec | 32x | 4.0x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
There is some speedup with declarations for float and integer types, but the improvement is limited by the many field (instance variable) accesses, which go through object references of type Any. These require dynamic lookup of fields, and since the fields could be declared to be any type in any object, the returned result must be Any. So we not only pay for dynamic lookup but also to encode and decode the float from type double to type Any and back. Converting back requires a type check (consider that the value could be an int and we could be validly multiplying by an int). Also notice we are already faster than PyPy’s claimed 4.8x speedup over CPython. A little typing can go a long way.
But this version is still 22x slower than C++ and not even 10x faster than Serpent or Python.
Going back to the 5 reasons Python is slow, what can we say about them now?
Regarding byte-code interpretation: comparing the interpreter to compiled code that still uses full dynamic typing (type Any for everything), we see a ratio of 4x (this is actually the ratio of Python to compiled Serpent, but there is only a 4% difference between Serpent and Python on this benchmark, so I will consider Serpent and Python to be equal). It’s surprising to me that the virtual machine apparently accounts for 75% of Python compute time given how much additional time is required for conditional code depending on operand types. A lot of work has gone into making byte-code interpreters faster, and the Labels as Values extension supported by GCC and Clang might help create faster virtual machines without resorting to assembly. But the improvements have an upper bound of 4x unless the dynamic typing problem is addressed.
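For reference, here is a toy example of the Labels-as-Values dispatch style. It is a GCC/Clang extension, and the opcodes and tiny virtual machine here are made up purely for illustration:

    #include <cstdio>

    enum Op { OP_INC, OP_DEC, OP_HALT };

    // Direct-threaded dispatch: each handler jumps straight to the next
    // handler through a table of label addresses, avoiding a central
    // switch/loop and giving the branch predictor more to work with.
    long run(const Op *pc) {
        static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
        long acc = 0;
        goto *dispatch[*pc++];
    op_inc:  acc++; goto *dispatch[*pc++];
    op_dec:  acc--; goto *dispatch[*pc++];
    op_halt: return acc;
    }

    int main() {
        Op program[] = { OP_INC, OP_INC, OP_INC, OP_DEC, OP_HALT };
        std::printf("%ld\n", run(program));   // prints 2
    }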
Regarding dynamic typing of numbers: when I declared mixed Any, int and double types, the improvement over using Any for all types was slight (speed increased 45%, from 4x to 5.8x over Python). In this benchmark, classes are used as structures containing fields x, y, z, vx, vy, vz and m, and even though these are stored as type double, the compiler does not have the type information to avoid decoding the object reference from Any, dynamically finding the instance variable, and encoding it into an Any which must then be decoded immediately to a double. If code is rewritten to use all local variables and if we introduce pure arrays or tuples of doubles, then we can get much more speedup. Alternatively, a just-in-time compiler that generates specialized code to handle the all-doubles case and even in-lines access to instances should perform well here.
Regarding field access: an expression like some_object.some_field is a bigger problem. First, if you do not know the type of some_object, you need to decode it. Second, since you did not know the object type, you could not look up the location of some_field at compile time, so a second dynamic lookup is required. And third, after all that work, the compiler does not know the field (instance variable) type, so that type must be decoded to perform any operation with it.
This is the main difference between the “compiled from CSerpent” and the “from CSerpent with mixed Any, double and int types” versions. In the former, we had object types and therefore we could resolve instance variable locations and types. In the latter, every expression like b1.x required two dynamic lookups and resulted in type Any, which required a further conditional to test for a double. The difference in speed was a factor of 6, so we can say, roughly, that 84% of the runtime is devoted to dynamic typing of objects and instance variables.
Notice that we could replace all instance variable accesses of the form b1.x with “getter” functions like b1.get_x(), because within the method, we can know the offset of x at compile time. This does not help, though, because now we have to do a dynamic lookup to find the get_x method.
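To see where the time goes, here is a sketch of the two code paths for computing dx = b1.x - b2.x. It builds loosely on the sketches above; assume Any has a call method that decodes the object reference and forwards to the class’s own call, and that Body_class and GET_X_MSG are as before (all of this is illustrative, not the real generated code):

    // Fully dynamic: decode the object reference, dynamically find the field,
    // get the value boxed as an Any, unbox it, then box the result again.
    Any dx_dynamic(Any b1, Any b2) {
        Any x1 = b1.call(GET_X_MSG);   // dynamic decode + dynamic dispatch
        Any x2 = b2.call(GET_X_MSG);
        double d1 = x1.as_double();    // type check + unbox
        double d2 = x2.as_double();
        return Any(d1 - d2);           // box the result again
    }

    // With b1 and b2 declared as objects of a known class (and x declared
    // double): two ordinary loads and a native subtract.
    double dx_static(Body_class *b1, Body_class *b2) {
        return b1->x - b2->x;
    }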
One way to regain some lost performance is to make type declarations for classes optional, e.g. consider this Serpent from nbody.srp:
    for pair in PAIRS:
        var b1 = pair.b1
        var b2 = pair.b2
        var dx = b1.x - b2.x
        var dy = b1.y - b2.y
        var dz = b1.z - b2.z
        e = e - (b1.m * b2.m) / pow(dx * dx + dy * dy + dz * dz, 0.5)
We could make a small addition to declare pair to be an instance of Pair:
    for Pair pair in PAIRS:
        var b1 = pair.b1
        var b2 = pair.b2
        var dx = b1.x - b2.x
        var dy = b1.y - b2.y
        var dz = b1.z - b2.z
        e = e - (b1.m * b2.m) / pow(dx * dx + dy * dy + dz * dz, 0.5)
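As a hand-written sketch (not actual output of my prototype compiler, and with hypothetical names like Pair_class and must_get_pair), the C++ generated for this loop could then look roughly like this, with only one dynamic decode per pair:

    for (int i = 0; i < len(PAIRS); i += 1) {
        Pair_class *pair = PAIRS[i].must_get_pair();  // the one remaining dynamic check
        Body_class *b1 = pair->b1;                    // from here on, ordinary C++
        Body_class *b2 = pair->b2;
        double dx = b1->x - b2->x;
        double dy = b1->y - b2->y;
        double dz = b1->z - b2->z;
        e = e - (b1->m * b2->m) / pow(dx * dx + dy * dy + dz * dz, 0.5);
    }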
With this simple change, the compiler can determine from Pair that pair.b1 is a Body (if the instance variable b1 is declared that way), and that b1.x and the others are floats. This cuts the number of dynamic lookups from 10 to 1, so that 84% overhead is cut to 8.4%, and extrapolating to the whole program, we might get down to 5.3x C++ performance, or a 25x speedup over Python. This deserves another table:
Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (from CSerpent with object and primitive type declarations, estimated) | 21 sec | 5.3x | 25x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
C++ (from CSerpent with mixed Any, double, and int types) | 95 sec | 22x | 5.8x |
C++ (from CSerpent with dynamic type Any) | 129 sec | 32x | 4.0x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
Two other forms of overhead, reference counting and indirection to objects, seem to be the main difference between the reference C++ implementation and my compiled typed CSerpent implementation of the n-body problem. Does that mean reference counting and object indirection cost 3.7x? There are other optimizations in nbodyref.cpp that I suspect allow the compiler to compute expressions like bodies[j].mass more efficiently by avoiding vector templates, but this requires further investigation.
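To show where reference-counting overhead comes from, here is a minimal intrusive reference-counting sketch (not my actual implementation, just an illustration): every copy, assignment, and destruction of a reference touches the count, even when the program is only reading data.

    // Minimal intrusive reference counting: the count lives in the object,
    // and a smart-pointer wrapper adjusts it on every copy and destruction.
    struct RefCounted {
        long refcount = 0;
        virtual ~RefCounted() {}
    };

    template <class T> class Ref {
        T *ptr;
    public:
        Ref(T *p = nullptr) : ptr(p) { if (ptr) ptr->refcount++; }
        Ref(const Ref &r) : ptr(r.ptr) { if (ptr) ptr->refcount++; }
        Ref &operator=(const Ref &r) {
            if (r.ptr) r.ptr->refcount++;   // increment first: safe for self-assignment
            if (ptr && --ptr->refcount == 0) delete ptr;
            ptr = r.ptr;
            return *this;
        }
        ~Ref() { if (ptr && --ptr->refcount == 0) delete ptr; }
        T *operator->() const { return ptr; }
    };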
There are faster techniques than my implementation of reference counting (which is derived from code by Ahmed S. Hefny). First, there are some ingenious algorithms and data structures for reference counting that could lead to faster code. Another option is garbage collection (or tracing garbage collection if you prefer that terminology), which Serpent demonstrates can offer sub-millisecond pauses and reclaim cyclical structures. This is tricky though, and Serpent can rely on total control over a virtual machine that only manages Serpent objects. That allows us to identify all pointers and figure out what they point to. A similar collector could be implemented for compiled code, but you would have to either (1) keep an explicit stack of objects implementing local variables so you can track down all object references, but this would deny the compiler the ability to keep pointers in registers intermixed with other data (XLisp does this), or (2) devise a scheme to safely trace untyped register data in the C++ stack. This area needs more thought and investigation.
What have I learned? First, I learned that byte-code interpreters are really slow compared to compiled code. This is not the impression I had from many papers showing the per-instruction overhead could be quite low. My quick-and-dirty compiler runs 4x faster than my interpreter (and to be clear, the interpreter is running compiled instructions on an abstract virtual machine as opposed to even slower direct interpretation of source code).
Second, merely decorating local variables with types, which I thought might really speed things up, does not help much because there are so many dynamic lookups of objects and fields, e.g. b1.x, whose types cannot be resolved at compile time.
Third, a promising approach is to mix dynamic types with type specification. Even if we keep generic collections like arrays and dictionaries, merely declaring the class of a few object references (and verifying them at run time) can give the compiler enough information to get a 25x speedup in my experiment. This is promising because it places a minimal burden on the programmer, preserves the “flavor” of Python/Serpent programming (in my opinion), and the optional annotations offer a bit of useful documentation, which might even be considered a language improvement.
Fourth, if you want to really go for broke, you can declare everything including arrays and dictionaries, making them homogeneous collections. This delivered a whopping 36x performance improvement, but note that this is only 36/25 = 1.44x faster than what seems to be a much more Python-like approach where you only declare object types where they seem useful.
Overall, I now think that adding the ability to declare object types and compiling to C++ eliminates the two main sources of overhead in running Python: byte code interpretation and resolving object types and instance variables. This seems like a “sweet spot” for language design, and I do not know of anything similar. Objective-C at least uses a dynamic type for object references in collections, but otherwise it is not very Python-like.