Roger B. Dannenberg
Making Python Run Fast
The disparity between the fast development time of dynamically-typed languages like Python and their slow run-time performance has been a constant source of both frustration and investigation. I did some experiments to measure a prototype “Python to C++” compiler and report on my findings here. I believe there is an interesting avenue for investigation: a new kind of language where a few type declarations for objects and primitive types (int, real, string) could deliver dramatic performance improvements with minor impact on the usability or “feel” of these otherwise dynamically typed languages.
The title is a bit of a teaser. This post is about making Python run fast, but the bottom line is that pure Python does not run nearly as fast as C++. It is worth looking into why this is so and whether we can get much higher performance and still get the “quality” of Python, whatever that is.
My real interest is Serpent, a Python-like language I developed for real-time computer music applications. The advantages of Serpent are mainly the ability to run multiple instances in a single process (Python now supports multi-processing, but the mechanism is much heavier than Serpent’s) and a real-time garbage collector. Other things I like about Serpent are “real” object methods (this does not appear as a parameter, and you access members with membername instead of this.membername), implicit colons in most cases (I’m always forgetting the colon after if, def, and for statements, and who needs them?), and usually no need for a backslash (\) when statements wrap to the next line (the compiler figures it out).

Python and Serpent are similar enough that whatever works to speed up Serpent will work for Python and vice versa, although Python is larger, so more work is involved.
My rule of thumb is Python is about 100x slower than C or C++. This is just a rough figure and it depends on what you are doing. In particular, if your program in C++ would spend 99% of the time in compiled library routines (e.g. networking, graphics, machine learning), then rewriting the other 1% in Python makes your program only 1.99x ≈ 2x slower. That may not be significant. But let’s assume speed matters and see what we can do about it.
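The arithmetic behind that estimate is simple Amdahl’s-law-style accounting; here is a quick sketch, where the 100x penalty is just the rule-of-thumb figure above:

```python
def relative_time(library_fraction, penalty=100.0):
    """Total run time relative to all-C++, when the non-library
    fraction of the work is rewritten in Python with the given
    slowdown penalty."""
    return library_fraction + (1.0 - library_fraction) * penalty

# 99% of time in compiled libraries: 0.99 + 0.01 * 100 = 1.99x slower
print(round(relative_time(0.99), 2))
```

With only 50% of the time in libraries, the same formula gives 50.5x, which is why the "just call fast libraries" strategy only works when the Python fraction is tiny.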
A real advance in Python implementation is PyPy, which claims to be 4.8x faster than the more common CPython implementation. I'm not sure why PyPy is not the “standard” Python version. But still, 4.8x faster leaves you about 20x slower than C++, so is that the best we can hope for?
Compared to C++, there are several features of Python that make it slower:

1. Byte-code interpretation: Python compiles to byte code for a virtual machine, and interpreting byte code is slower than running native machine code.
2. Dynamic operand types: every use of the + operator in Python gets compiled to the same byte code, but the implementation of that byte code must decide at runtime whether to add two integers, two doubles, an integer and a double, concatenate two arrays, splice two strings, or something else. That is a lot of decision making that happens for every operator.
3. Dynamic attribute lookup: an access such as some_object.some_field must locate the field at runtime, because the class of some_object is not known at compile time.
4. Boxing: since types are known only at runtime, values carry type information and must be encoded and decoded as they are used.
5. Memory management: objects live on the heap and are reclaimed by reference counting, which adds overhead to many operations.
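To make the operator problem concrete, here is a schematic model of the decisions hidden behind a single + (this is illustrative only, not CPython's actual implementation):

```python
def binary_add(a, b):
    # Schematic version of the runtime checks a byte-code interpreter
    # performs for one ADD instruction; real interpreters differ in detail.
    if isinstance(a, int) and isinstance(b, int):
        return a + b          # integer addition
    if isinstance(a, float) or isinstance(b, float):
        return a + b          # promote to floating point
    if isinstance(a, list) and isinstance(b, list):
        return a + b          # list concatenation
    if isinstance(a, str) and isinstance(b, str):
        return a + b          # string concatenation
    return a.__add__(b)       # fall back to the operand's own method

assert binary_add(2, 3) == 5
```

A C++ compiler, by contrast, knows the operand types statically and emits a single add instruction with none of these branches.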
Given these problems in Python, it is no wonder C++ is much faster. We have to ask if there are ways to eliminate or reduce these problems, either by clever implementation or by language design changes. We could then decide if the implementation effort or language changes give us an overall improvement. E.g. if you believe type hints in Python are a good thing, maybe you would be even happier making type specification a requirement of the language, especially if it would give a large speedup. So let’s look at some options and see what kind of speedup can be achieved.
To measure speedup, we need a benchmark, a deterministic program that we can implement in different ways and see how fast it runs. Ideally, we could have a whole suite of benchmarks emphasizing different aspects of the language or different kinds of computation. I am using a very small benchmark, the n-bodies program, from Benchmark Games. I cannot defend this as the “right” benchmark, but at least it has a good mix of structure or object field access, conditionals, floating point and integer operations, and array operations. It does not emphasize function calls, but in my opinion, if function calls are your big problem, you can probably in-line function calls in the time-critical code, so I am not so concerned about this. Similarly, you could worry about string manipulation, but a lot of string manipulation takes place in libraries for pattern matching, searching, etc., and that is something I am willing to delegate to C++, e.g. by making the functions built-in operations of the language.
My first thought was: How bad would it be to require type specifications everywhere, enabling a more-or-less direct translation to C++? I wrote a prototype cross-compiler from Serpent (with types) to C++. Writing a compiler is not a small task, but by leveraging C++, I was able to avoid a complete language parser, and instead just convert whole lines of Serpent code to C++. For C++, you need to add braces around sub-blocks and take care of a number of other language differences. The compiler is about 900 lines of Serpent.
The main idea that makes this work is adding type specifications to Serpent. For example, this Serpent code constructs all pair-wise combinations of the elements of list l:

def combinations(l):
    result = []
    for i = 0 to len(l) - 1:
        var x = l[i]
        for j = i + 1 to len(l):
            result.append(Pair(l[i], l[j]))
    return result
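For comparison (and to show how close Serpent is to Python), here is the same function in plain Python; note that Serpent's "for i = a to b" excludes b, like Python's range, and I use a tuple where the Serpent code constructs a Pair:

```python
def combinations(l):
    # Python equivalent of the Serpent code above
    result = []
    for i in range(len(l) - 1):
        for j in range(i + 1, len(l)):
            result.append((l[i], l[j]))   # tuple in place of Pair
    return result

assert combinations([1, 2, 3]) == [(1, 2), (1, 3), (2, 3)]
```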
And here is the same program rewritten with type specifications, which serves as input to our “CSerpent to C++” compiler:
Pairs combinations(Bodies l):
    Pairs result = array(Pairs, 0)
    for Int i = 0 to len(l) - 1:
        Body x = l[i]
        for Int j = i + 1 to len(l):
            result.append(Pair(l[i], l[j]))
    return result
This compiles to the following C++ code:
Pairs combinations(Bodies l) {
    Pairs result = Pairs_create(0, 0);
    for (Int i = 0; i < len(l) - 1; i += 1) {
        Body x = l[i];
        for (Int j = i + 1; j < len(l); j += 1) {
            result->append(Pair_create(l[i], l[j]));
        }
    }
    return result;
}
The complete implementation can be found here for Python, Serpent, CSerpent (i.e., Serpent code with types), C++ compiled from CSerpent, and a reference hand-coded C++ version.
Let’s run them and see what performance we get:
Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
Although the experiment compiling typed Serpent (I will call it CSerpent) to C++ gives impressive speedup, the typing system seems ugly and cumbersome. Perhaps it is just a question of language design, but constructing and naming all the types, especially arrays and dictionaries of objects, is awkward. E.g., in my CSerpent implementation, all types must be simple identifiers, so you use a new command typedef and constructors like array_of(Body) to write a definition such as:

typedef Bodies = array_of(Body)
You can see the use of Bodies in the CSerpent example above. This does not seem so bad, but there is also a Dictionary that maps planet names (strings) to objects of type Body. What do I name that? I used String_body_di, but that is about as ugly as the underlying C++ type. Perhaps MapBody_table or Body_dict would be more acceptable. I am on the fence here: Can a language have the nice feel of Python if everything is statically typed?
This got me to thinking that arrays and dictionaries could have elements of any type (something like Objective-C containers, which simply contain object references), while types could be used elsewhere. Or maybe types could be optional. Either approach requires the implementation of a single type that can store numbers, strings, and objects. I name the type Any and give a few details below.
Compiled with Dynamic Typing
My next thought was: if I can implement Any, why not go back to dynamic typing by declaring everything to be of type Any? I could use the same compiler from CSerpent to C++ and gain some speedup, but avoid the need to declare types. At least it is a good experiment: we can see how much speed is lost due to dynamic typing versus other design decisions.
My type Any uses “nan-boxing”, similar to values in Serpent’s implementation. A value is represented by a 64-bit unsigned int. The high-order bits determine whether the value represents a 50-bit signed integer, a 48-bit object pointer (whose class is indicated by a pointer in the object), a 48-bit string pointer, or a double. On current machines, the high-order bits of addresses are zero, so 48 bits is enough. The main trick here is that doubles do not use all possible bit patterns in the high-order bits, so the unused patterns can serve as codes indicating that the low-order bits should be interpreted as pointers of different kinds or as integers. Serpent even has a special case for short strings (up to 5 bytes and a zero terminating byte) contained entirely within the 64-bit value. Of course, there is overhead to decode and convert the type, e.g. ints must be sign-extended to form a 64-bit signed int, and pointers must have the high-order bits cleared to form valid addresses.
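A minimal sketch of the nan-boxing idea follows. The tag value and payload widths here are my own illustrative assumptions, not Serpent’s actual encoding (I use a 48-bit integer payload for simplicity, where the text above describes 50 bits):

```python
import struct

# Assumed layout: real doubles are stored as-is; other types are packed
# into unused NaN bit patterns. 0xFFF9... is a quiet-NaN pattern that no
# ordinary double produces, so it can tag a 48-bit integer payload.
TAG_INT = 0xFFF9_0000_0000_0000
TAG_MASK = 0xFFFF_0000_0000_0000
PAYLOAD_MASK = 0x0000_FFFF_FFFF_FFFF

def box_double(x):
    # reinterpret the double's bits as a 64-bit unsigned int
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def box_int(i):
    assert -2**47 <= i < 2**47
    return TAG_INT | (i & PAYLOAD_MASK)

def unbox(bits):
    if (bits & TAG_MASK) == TAG_INT:
        payload = bits & PAYLOAD_MASK
        # sign-extend the 48-bit payload, as described above
        return payload - 2**48 if payload & (1 << 47) else payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

assert unbox(box_double(3.25)) == 3.25
assert unbox(box_int(-42)) == -42
```

The branch and the sign-extension in unbox are exactly the per-operation decoding overhead discussed in the text.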
The implementation of Any also requires operator overloading so we can write a + b even if a and/or b is of type Any. Similarly, built-in functions for array indexing, dictionary lookup, etc., need versions that take parameters of type Any, check for valid types, decode the types, and run the function.
Finally, the compiler cannot compile method calls or instance variable access on values of type Any. My solution is to translate, say, foo.fn(x) into foo.call(FN_MSG, x), where FN_MSG is a unique value (I used a global char * initialized to "fn"). Every class implements its own call method that uses if/else if/else if/... code to select the proper method. My prototype classes have few methods, so this linear search is probably optimal. Of course, a more sophisticated dictionary could be implemented. Instance variables are handled in the same way, but within methods, the variables are simply named, and C++ accesses the variable at a known offset, so there is no need to search for it.
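The dispatch scheme can be sketched as follows. This is a schematic model of the generated code, not the actual C++; message tokens are plain strings here rather than unique char* values:

```python
APPEND_MSG = "append"   # stands in for the global char* message tokens
SIZE_MSG = "size"

class ArrayAny:
    # Models a compiled class: 'call' selects a method by message token
    # using a linear if/else chain, as in the prototype compiler.
    def __init__(self):
        self.items = []

    def call(self, msg, *args):
        if msg == APPEND_MSG:
            self.items.append(args[0])
            return None
        elif msg == SIZE_MSG:
            return len(self.items)
        else:
            raise TypeError("message not understood: " + msg)

a = ArrayAny()
a.call(APPEND_MSG, 42)     # what foo.append(42) compiles to
assert a.call(SIZE_MSG) == 1
```

Every class would provide its own call chain, so a lookup costs one comparison per method tried.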
I do not have a fully working compiler for this approach, but I was able to construct a working benchmark in a way that could be translated automatically. Here is what the compiled combinations routine looks like in this version:

Any combinations(Any l) {
    Any result(new Array_class);
    int ub = l.call(SIZE_MSG).must_get_int();
    for (int i = 0; i < ub - 1; i += 1) {
        Any any_i(i);
        int ub2 = l.call(SIZE_MSG).must_get_int();
        for (int j = (any_i + 1).must_get_int(); j < ub2; j += 1) {
            Any any_j(j);
            result.call(APPEND_MSG, Pair_create(l[any_i], l[any_j]));
        }
    }
    return result;
}
Notice everything is declared Any except for the for loop variables, which are introduced by the compiler and guaranteed to be integers by the language definition. (Loops work a little differently in Serpent than Python, but I think the general approach would work in either language.) Since the actual type represented by Any is variable, functions require dynamic lookup and dispatch. For example, result.append(...) is compiled to result.call(APPEND_MSG, ...);.
The run time for this approach, added to our previous table, gives us:

Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
C++ (from CSerpent with dynamic type Any) | 129 sec | 32x | 4.0x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |

As shown in the middle row, this compiled dynamic typing approach gives a solid improvement over Python, but it is a long way from the performance with static typing.
Declared ints and floats
In Serpent, I often find myself declaring a local variable with int by accident (all locals in Serpent should be declared with var, unlike Python's implicit declaration of locals upon assignment). What would happen if we declared float and integer types but left everything else to dynamic typing? This would be a minimal burden on the programmer.
I manually altered the fully dynamic version based on type Any by declaring all integers and floats as such. The new row in this updated table shows the results:

Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
C++ (from CSerpent with mixed Any, double, and int types) | 95 sec | 22x | 5.8x |
C++ (from CSerpent with dynamic type Any) | 129 sec | 32x | 4.0x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
There is some speedup with declarations for float and integer types, but the improvement is limited by the many field (instance variable) accesses, which are to type Any object references. These require dynamic lookup of fields, and since the fields could be declared to be any type in any object, the returned result must be Any. So we not only pay for dynamic lookup but also to encode and decode the float from type double to type Any and back. Converting back requires a type check (consider that it could be an int and we could be validly multiplying by an int). Also notice that we are already faster than PyPy's claimed 4.8x over CPython; a little typing can go a long way. But this version is still 22x slower than C++ and not even 10x faster than Serpent or Python.
Analysis and Discussion
Going back to the 5 reasons Python is slow, what can we say about them now?

First, byte-code interpretation: comparing interpreted Serpent to the compiled version with dynamic typing (Any for everything), we see a ratio of 4x (this is actually the ratio of Python to compiled Serpent, but there is only a 4% difference between Serpent and Python on this benchmark, so I will consider Serpent and Python to be equal). It's surprising to me that the virtual machine apparently accounts for 75% of Python compute time, given how much additional time is required for conditional code depending on operand types. A lot of work has gone into making byte-code interpreters faster, and the relatively new “Labels as Values” compiler feature might help create faster virtual machines without resorting to assembly. But the improvements have an upper bound of 4x unless the dynamic typing problem is addressed.
Second, dynamic operand types: even with mixed Any, int, and double types, the improvement over using Any for all types was slight (speed increased 45%, from 4x to 5.8x over Python). In this benchmark, classes are used as structures containing fields x, y, z, vx, vy, vz, and m, and even though these are stored as type double, the compiler does not have the type information to avoid decoding the object reference from Any, dynamically finding the instance variable, and encoding it into an Any which must then be decoded immediately to a double. If code is rewritten to use all local variables and if we introduce pure arrays or tuples of doubles, then we can get much more speedup. Alternatively, a just-in-time compiler that generates specialized code to handle the case of all doubles and even in-line access to instances should perform well here.
Third, instance variable access: an expression like some_object.some_field is a bigger problem. First, if you do not know the type of some_object, you need to decode it. Second, since you did not know the object type, you could not look up the location of some_field at compile time, so a second dynamic lookup is required. And third, after all that work, the compiler does not know the field (instance variable) type, so that type must be decoded to perform any operation with it. This is the main difference between the “compiled from CSerpent” and the “from CSerpent with mixed Any, double and int types” versions. In the former, we had object types and therefore we could resolve instance variable locations and types. In the latter, every expression like b1.x required two dynamic lookups and resulted in type Any, which required a further conditional to test for a double. The difference in speed was a factor of 6, so we can say, roughly, that 84% of the runtime is devoted to dynamic typing of objects and instance variables. Notice that we could replace all instance variable accesses of the form b1.x with “getter” functions like b1.get_x(), because within the method, we can know the offset of x at compile time. This does not help, though, because now we have to do a dynamic lookup to find the get_x method.
One way to regain some lost performance is to make type declarations for classes optional. E.g., consider this Serpent from nbody.srp:

for pair in PAIRS:
    var b1 = pair.b1
    var b2 = pair.b2
    var dx = b1.x - b2.x
    var dy = b1.y - b2.y
    var dz = b1.z - b2.z
    e = e - (b1.m * b2.m) / pow(dx * dx + dy * dy + dz * dz, 0.5)
We could make a small addition to declare pair to be an instance of Pair:

for Pair pair in PAIRS:
    var b1 = pair.b1
    var b2 = pair.b2
    var dx = b1.x - b2.x
    var dy = b1.y - b2.y
    var dz = b1.z - b2.z
    e = e - (b1.m * b2.m) / pow(dx * dx + dy * dy + dz * dz, 0.5)
With this simple change, the compiler can determine from Pair that pair.b1 is a Body (if the instance variable b1 is declared that way), and that b1.x and the others are floats. This cuts the number of dynamic lookups from 10 to 1, so that 84% overhead is cut to 8.4%, and extrapolating to the whole program, we might get down to 5.3x C++ performance, or a 25x speedup over Python. This deserves another table:
Language | Run Time | Relative to C++ | Speedup Over Python |
---|---|---|---|
C++ | 4.0 sec | 1x | 131x |
C++ (compiled from CSerpent) | 14.7 sec | 3.7x | 36x |
C++ (from CSerpent with object and primitive type declarations - estimated) | 21 sec | 5.3x | 25x |
C++ (from CSerpent with mixed Any, double, and int types) | 95 sec | 22x | 5.8x |
C++ (from CSerpent with dynamic type Any) | 129 sec | 32x | 4.0x |
Python | 522 sec | 131x | 1x |
Serpent | 543 sec | 136x | 0.96x |
Two remaining forms of overhead, reference counting and indirection to objects, seem to be the main difference between the reference C++ implementation and my compiled typed CSerpent implementation of the n-body problem. Does that mean reference counting and object indirection cost 3.7x? There are other optimizations in nbodyref.cpp that I suspect allow the compiler to compute expressions like bodies[j].mass more efficiently by avoiding vector templates, but this requires further investigation.

There are faster techniques than my implementation of reference counting (which is derived from code by Ahmed S. Hefny). First, there are some ingenious algorithms and data structures for reference counting that could lead to faster code. Another option is garbage collection (or tracing garbage collection if you prefer that terminology), which Serpent demonstrates can offer sub-millisecond pauses and reclaim cyclical structures. This is tricky, though: Serpent can rely on total control over a virtual machine that only manages Serpent objects, which allows us to identify all pointers and figure out what they point to. A similar collector could be implemented for compiled code, but you would have to either (1) keep an explicit stack of objects implementing local variables so you can track down all object references, which would deny the compiler the ability to keep pointers in registers intermixed with other data (XLisp does this), or (2) devise a scheme to safely trace untyped register data in the C++ stack. This area needs more thought and investigation.
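For readers unfamiliar with where the reference-counting cost comes from, here is a toy model (my own sketch, not the Hefny-derived code) of the bookkeeping a compiled runtime performs on every assignment:

```python
class RefCounted:
    # Toy model of a heap object managed by reference counting.
    def __init__(self):
        self.refcount = 0
        self.freed = False

    def incref(self):
        self.refcount += 1

    def decref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True   # stands in for returning memory

def assign(slot, new_obj):
    # slot is a one-element list standing in for a variable or field.
    # Every assignment pays for an incref and a decref, plus branches.
    if new_obj is not None:
        new_obj.incref()        # protect the new referent first
    old = slot[0]
    if old is not None:
        old.decref()            # drop the old referent
    slot[0] = new_obj

a, b = RefCounted(), RefCounted()
slot = [None]
assign(slot, a)                 # a now has one reference
assign(slot, b)                 # a is dropped and reclaimed; b is kept
assert a.freed and not b.freed
```

Two counter updates and two branches per assignment is exactly the kind of overhead that "ingenious algorithms and data structures" for reference counting try to eliminate.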
What about Javascript?
A huge amount of effort has gone into making JavaScript and Node.js very efficient. Using numbers for the n-body benchmark from Benchmark Games, Node.js is about 1.45x slower than C++, which is quite impressive. One could even imagine generating that code from something like CSerpent if you want the feel of Python and the speed of C++. Some reservations I have are the size (25MB to 40MB for the Node.js binary), the instability of Node.js, with very frequent and sometimes incompatible library updates, and the lack of real-time guarantees for garbage collection. Still, the fact that the just-in-time compiler can resolve JavaScript’s very dynamic object property lookups as well as property data types in this intense floating point computation is very impressive, and it suggests that the upper limit for Python speed is well above any current implementation.
Conclusions
What have I learned? First, I learned that byte-code interpreters are really slow compared to compiled code. This is not the impression I had from many papers showing that per-instruction overhead could be quite low. My quick-and-dirty compiler runs 4x faster than my interpreter (and to be clear, the interpreter is running compiled instructions on an abstract virtual machine, as opposed to even slower direct interpretation of source code).

Second, merely decorating local variables with types, which I thought might really speed things up, does not help much, because there are so many dynamic lookups of objects and fields, e.g. b1.x, whose types cannot be resolved at compile time.

Third, a promising approach is to mix dynamic types with type specification. Even if we keep generic collections like arrays and dictionaries, merely declaring the class of a few object references (and verifying them at run time) can give the compiler enough information to get a 25x speedup in my experiment. This is promising because it places a minimal burden on the programmer, preserves the “flavor” of Python/Serpent programming (in my opinion), and the optional annotations offer a bit of useful documentation, which might even be considered a language improvement.

Fourth, if you want to really go for broke, you can declare everything, including arrays and dictionaries, making them homogeneous collections. This delivered a whopping 36x performance improvement, but note that that is only 36/25 = 1.44x better than what seems to be a much more Python-like language where you only declare object types where they seem useful.

Overall, I now think that adding the ability to declare object types and compiling to C++ eliminates the two main sources of overhead in running Python: byte code interpretation and resolving object types and instance variables. This seems like a “sweet spot” for language design, and I do not know of anything similar. Objective-C at least uses a dynamic type for object references in collections, but otherwise it is not very Python-like.
Caveats
All numbers should be interpreted qualitatively, especially because everything is based on one small benchmark program. Also, keep in mind that making Python faster may have almost zero impact on some applications where most of the work is in libraries implemented in C++ (e.g., PyTorch applications are largely in that category). The performance of the promising “sweet spot” just mentioned has only been estimated from other measurements. Reference counting overhead in my experiments seems to be quite high, so optimization in this area could make a significant difference. Although I am really impressed by PyPy, my impression, considering my experiments and Node.js benchmarks, is that PyPy could run a lot faster. The fact that it does not, given that PyPy developers are obviously experts, casts some doubt on everything I have written. The final word is not in yet.