1. Preface
The goal of these notes is to introduce the reader to the following:

- the basic concepts of parallel computing,
- some basic parallel algorithm design principles and techniques,
- real-world performance and efficiency concerns in writing parallel software and techniques for dealing with them, and
- parallel programming in C++.
For parallel programming in C++, we use a library, called PASL, that we have been developing over the past 5 years. The implementation of the library uses advanced scheduling techniques to run parallel programs efficiently on modern multicores and provides a range of utilities for understanding the behavior of parallel programs.
PASL stands for Parallel Algorithm Scheduling Library. It also sounds a bit like the French phrase "pas seul" (pronounced "pa-sole"), meaning "not alone".
We expect the material in this book, together with PASL, to allow the reader to write performant parallel programs at a relatively high level (essentially at the level of ordinary C++ code) without having to worry too much about lower-level details, such as machine-specific optimizations, that might otherwise be necessary.
All code associated with this book can be found in the GitHub repository linked by the following URL:
This code base includes the examples presented in the book; see the file minicourse/examples.hpp.
Some of the material in this book is based on the course 15-210, co-taught with Guy Blelloch at CMU.
This book does not focus on the design and analysis of parallel algorithms; the interested reader can find more details on this topic in this book.
2. C++ Background
The material is entirely based on C++ and a library for writing parallel programs in C++. We use recent features of C++, such as lambda expressions (closures) and templates. A deep understanding of these topics is not necessary to follow the course notes, because we explain them at a high level as we go, but such prior knowledge might be helpful; some pointers are provided below.
2.1. Template metaprogramming
Templates are C++'s way of providing parametric polymorphism, which allows the same code to be used at multiple types. For example, in modern functional languages such as the ML family or Haskell, you can write a function $\lambda~x.x$ as an identity function that returns its argument for any type of $x$. You don't have to write the function separately for every type at which you plan to apply it. Since functional languages such as ML and Haskell rely on type inference and have powerful type systems, they can infer from your code the most general type (within the constraints of the type system). For example, the function $\lambda~x.x$ can be given the type $\forall \alpha. \alpha \rightarrow \alpha$. This type says that the function works for any type $\alpha$ and, given an argument of type $\alpha$, returns a value of type $\alpha$.
C++ provides polymorphism with templates. In its most basic form, a template is a class declaration or a function declaration that is explicitly stated to be polymorphic by making the type variable explicit. Since C++ does not in general perform type inference (in a rigorous sense of the word), it requires some help from the programmer.
For example, the code below defines an array class that is parametric in the type of its elements. The declaration template <class T> says that the declaration of class array, which follows, is parameterized by the identifier T. The definition of class array then uses T as a type variable. For example, the array holds a pointer to a sequence of elements of type T, and the sub function returns an element of type T.
template <class T>
class array {
public:
  array (int size) { a = new T[size]; }
  T sub (int i) { return a[i]; }
private:
  T* a;
};
Note that the only part of the syntax template <class T> that is changeable is the identifier T. In other words, you should think of the syntax template <class T> as a binding form that allows you to pick an identifier (in this case T). You might ask why the type identifier/variable T is introduced by the keyword class. This is a good question. The authors find it most helpful to not think much about such questions, especially in the context of the C++ language.
Once defined, a template class can be instantiated at different types by using the < > syntax. For example, we can define different arrays such as the following.
array<int> myFavoriteNumbers(7);
array<char*> myFavoriteNames(7);
Again, since C++ does not perform type inference for class instances, the C++ compiler expects the programmer to eliminate parametricity explicitly by specifying the argument type.
It is also possible to define polymorphic or generic functions. For example, the following declaration defines a generic identity function.
template <class T>
T identity(T x) { return x; }
Once defined, this function can be used without explicitly specializing it at various types. In contrast to templated classes, C++ does provide some type inference for calls to templated functions, so generic functions can be specialized implicitly, as shown in the examples below.
int i = identity(3);
const char* s = identity("template programming can be ugly");
This brief summary of templates should suffice for the purposes of the material covered in this book. Templates are covered in significant detail by many books, blogs, and discussion boards; we refer the interested reader to those sources for further information.
2.2. Lambda expressions
The C++11 reference provides good documentation on lambda expressions.
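As a minimal refresher, the short program below (an illustrative sketch of ours, not code from the book's repository) shows the capture-by-reference syntax [&] that is used throughout these notes.

#include <iostream>

int main() {
  int x = 10;
  // Capture x by reference: the lambda reads the current value of x at call time.
  auto add_x = [&] (int y) { return x + y; };
  std::cout << add_x(5) << std::endl;  // prints 15
  x = 20;
  std::cout << add_x(5) << std::endl;  // prints 25, because the capture is by reference
  return 0;
}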
3. Chapter: Fork-join parallelism
Fork-join parallelism, a fundamental model in parallel computing, dates back to 1963 and has been widely used ever since. In fork-join parallelism, computations create opportunities for parallelism by branching at certain points that are specified by annotations in the program text.
Each branching point forks the control flow of the computation into two or more logical threads. When control reaches the branching point, the branches start running. When all branches complete, the control joins back to unify the flows from the branches. Results computed by the branches are typically read from memory and merged at the join point. Parallel regions can fork and join recursively in the same manner that divide and conquer programs split and join recursively. In this sense, fork join is the divide and conquer of parallel computing.
As we will see, it is often possible to extend an existing language with support for fork-join parallelism by providing libraries or compiler extensions that support a few simple primitives. Such extensions to a language make it easy to derive a sequential program from a parallel program by syntactically substituting the parallelism annotations with corresponding serial annotations. This in turn enables reasoning about the semantics or the meaning of parallel programs by essentially "ignoring" parallelism.
PASL is a C++ library that enables writing implicitly parallel programs. In PASL, fork join is expressed by application of the fork2() function. The function expects two arguments: one for each of the two branches. Each branch is specified by one C++ lambda expression.
In the sample code below, the first branch writes the value 1 into the cell b1 and the second writes 2 into b2; at the join point, the sum of the contents of b1 and b2 is written into the cell j.
long b1 = 0;
long b2 = 0;
long j  = 0;

fork2([&] {
  // first branch
  b1 = 1;
}, [&] {
  // second branch
  b2 = 2;
});
// join point
j = b1 + b2;

std::cout << "b1 = " << b1 << "; b2 = " << b2 << "; ";
std::cout << "j = " << j << ";" << std::endl;
Output:
b1 = 1; b2 = 2; j = 3;
When this code runs, the two branches of the fork join are both run to completion. The branches may or may not run in parallel (i.e., on different cores). In general, whether any two such branches run in parallel is decided by the PASL runtime system. The join point is scheduled to run by the PASL runtime only after both branches complete. Before both branches complete, the join point is effectively blocked. Later, we will explain in more detail the scheduling algorithms that PASL uses to handle such load-balancing and synchronization duties.
In fork-join programs, a thread is a sequence of instructions that does not contain calls to fork2(). A thread is essentially a piece of sequential computation. The two branches passed to fork2() in the example above correspond, for example, to two independent threads. Moreover, the statement following the join point (i.e., the continuation) is also a thread.
Note: If the syntax in the code above is unfamiliar, it might be a good idea to review the notes on lambda expressions in C++11. In a nutshell, the two branches of fork2() are provided as lambda expressions in which all free variables are captured by reference.
Note: Fork join of arbitrary arity is readily derived by repeated application of binary fork join; in this sense, binary fork join is universal, because it is powerful enough to express fork join of arbitrary arity. A sketch of a three-way fork built this way is given below.
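For instance, a three-way fork can be derived as shown in the following sketch. This is illustrative code of ours, assuming that PASL's fork2() accepts arbitrary callables; the name fork3 is not part of PASL.

// Three-way fork built from nested binary forks (illustrative sketch).
template <class F1, class F2, class F3>
void fork3(const F1& branch1, const F2& branch2, const F3& branch3) {
  fork2([&] { branch1(); },
        [&] {
          fork2([&] { branch2(); },
                [&] { branch3(); });
        });
}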
All writes performed by the branches of the binary fork join are guaranteed by the PASL runtime to be committed to memory before the join statement runs. In terms of our code snippet, all writes performed by the two branches of fork2 are committed to memory before the join point is scheduled. The PASL runtime guarantees this property by using a local barrier. Such barriers are efficient, because they involve just a single dynamic synchronization point between at most two processors.
In the example below, both writes into b1 and b2 are guaranteed to be performed before the print statement.
long b1 = 0;
long b2 = 0;

fork2([&] {
  b1 = 1;
}, [&] {
  b2 = 2;
});

std::cout << "b1 = " << b1 << "; b2 = " << b2 << std::endl;
Output:
b1 = 1; b2 = 2
PASL provides no guarantee on the visibility of writes between any two parallel branches. In the code just above, for example, writes performed by the first branch (e.g., the write to b1) may or may not be visible to the second, and vice versa.
3.1. Parallel Fibonacci
Now we have all the tools we need to describe our first parallel code: the recursive Fibonacci function. Although inefficient, and therefore useless as a practical program, this example is the "hello world" program of parallel computing.
Recall that the $n^{th}$ Fibonacci number is defined by the recurrence relation $F(n) = F(n-1) + F(n-2)$ with base cases $F(0) = 0$ and $F(1) = 1$.
Let us start by considering a sequential algorithm. Following the definition of Fibonacci numbers, we can write the code for (inefficiently) computing the $n^{th}$ Fibonacci number as follows. This function is inefficient because the algorithm takes exponential time, whereas there exist dynamic programming solutions that take linear time.
long fib_seq (long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    a = fib_seq(n-1);
    b = fib_seq(n-2);
    result = a + b;
  }
  return result;
}
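For reference, one such linear-time solution can be sketched as follows; this function is illustrative (the name fib_seq_linear is ours) and is not used in the rest of the book.

// Iterative Fibonacci: linear time, constant space.
long fib_seq_linear(long n) {
  if (n < 2)
    return n;
  long prev = 0, curr = 1;
  for (long i = 2; i <= n; i++) {
    long next = prev + curr;  // F(i) = F(i-1) + F(i-2)
    prev = curr;
    curr = next;
  }
  return curr;
}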
To write a parallel version, we remark that the two recursive calls are completely independent: they do not depend on each other (neither uses a piece of data generated or written by the other). We can therefore perform the recursive calls in parallel. In general, any two independent functions can be run in parallel. To indicate that two functions can be run in parallel, we use fork2().
long fib_par(long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    fork2([&] {
      a = fib_par(n-1);
    }, [&] {
      b = fib_par(n-2);
    });
    result = a + b;
  }
  return result;
}
3.2. Incrementing an array, in parallel
Suppose that we wish to map an array to another by incrementing each element by one. We can write the code for a function map_incr that performs this computation serially.
void map_incr(const long* source, long* dest, long n) {
  for (long i = 0; i < n; i++)
    dest[i] = source[i] + 1;
}
The code below illustrates an example use of map_incr.
const long n = 4;
long xs[n] = { 1, 2, 3, 4 };
long ys[n];
map_incr(xs, ys, n);
for (long i = 0; i < n; i++)
  std::cout << ys[i] << " ";
std::cout << std::endl;
Output:
2 3 4 5
The serial map_incr is, of course, not a parallel algorithm, but it is not difficult to give a parallel algorithm for incrementing an array. The code for such an algorithm is given below.
void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}
It is easy to see that this algorithm has $O(n)$ work and $O(\log{n})$ span.
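To see why, note (as a brief sketch of the standard argument) that the work and span satisfy the recurrences $W(n) = 2W(n/2) + O(1)$ and $S(n) = S(n/2) + O(1)$, since the two recursive calls operate on halves of the array and the two branches of the fork can proceed in parallel; these recurrences solve to $O(n)$ and $O(\log{n})$, respectively.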
3.3. The sequential elision
In the Fibonacci example, we started with a sequential algorithm and derived a parallel algorithm by annotating independent functions. It is also possible to go the other way and derive a sequential algorithm from a parallel one. As you have probably guessed, this direction is easier, because all we have to do is remove the calls to the fork2 function. The sequential elision of our parallel Fibonacci code can be written by replacing the call to fork2() with a statement that performs the two calls (the arguments of fork2()) sequentially, as follows.
long fib_par(long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    ([&] {
      a = fib_par(n-1);
    })();
    ([&] {
      b = fib_par(n-2);
    })();
    result = a + b;
  }
  return result;
}
Note: Although this code is slightly different from the sequential version that we wrote, it is not too far away, because the only difference is the creation and application of the lambda expressions. An optimizing compiler for C++ can easily "inline" such computations. Indeed, after an optimizing compiler applies certain optimizations, the performance of this code is the same as the performance of fib_seq.
The sequential elision is often useful for debugging and for optimization. It is useful for debugging because it is usually easier to find bugs in sequential runs of parallel code than in parallel runs of the same code. It is useful in optimization because the sequentialized code helps us to isolate the purely algorithmic overheads that are introduced by parallelism. By isolating these costs, we can more effectively pinpoint inefficiencies in our code.
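One way to obtain the sequential elision mechanically is to substitute for fork2() a serial version that simply runs the two branches one after the other. The sketch below is illustrative: the name fork2_seq is ours, and PASL's build system has its own mechanism for producing elisions.

// Serial substitute for fork2(): run the branches one after the other
// (illustrative sketch; not PASL's actual mechanism).
template <class Branch1, class Branch2>
void fork2_seq(const Branch1& f1, const Branch2& f2) {
  f1();
  f2();
}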
3.4. Executing fork-join algorithms
We defined fork-join programs as a subclass of multithreaded programs. Let's see more precisely how we can "map" a fork-join program to a multithreaded program. As our running example, let's use map_incr_rec, whose code is reproduced below.
void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}
Since a fork-join program does not explicitly manipulate threads, it is not immediately clear what a thread refers to. To define threads, we can partition a fork-join computation into pieces of serial computation, each of which constitutes a thread. What we mean by a serial computation is a computation that runs serially and that does not involve any synchronization with other threads except at the start and at the end. More specifically, for fork-join programs, we can define a piece of serial computation to be a thread if it executes without performing parallel operations (fork2) except perhaps as its last action. When partitioning the computation into threads, it is important for threads to be maximal; technically, a thread can be as small as a single instruction.
[Figure: The dag for an execution of map_incr_rec. Each vertex corresponds to a call to map_inc_rec (excluding the fork2) or to the continuation of the fork2 (which is empty), and is annotated with the interval of the input array that it operates on (its argument).]
The figure above illustrates the dag for an execution of map_incr_rec. We partition each invocation of this function into two threads, labeled "M" and "C" respectively. The thread labeled $M\lbrack i,j \rbrack$ corresponds to the part of the invocation of map_incr_rec with arguments lo and hi set to $i$ and $j$ respectively; this first part corresponds to the part of the execution up to and including the fork2 (or all of the function if this is a base case). The thread labeled $C\lbrack i,j \rbrack$ corresponds to the "continuation" of the fork2, which in this case includes no computation.
Based on this dag, we can create another dag, where each thread is replaced by the sequence of instructions that it represents. This would give us a picture similar to the dag we drew before for general multithreaded programs. Such a dag representation, in which each instruction is represented by a vertex, gives us a direct way to calculate the work and span of the computation. If we want to calculate work and span on the dag of threads, we can label each vertex with a weight that corresponds to the number of instructions in that thread.
Note: The term thread is very much overused in computer science. There are system threads, which are threads that are known to the operating system and can perform a variety of operations; the Pthreads library, for example, enables creating such system threads and programming with them. There are also many libraries for programming with user-level threads, which are threads that exist at the application level without the operating system knowing about them. Then there are threads that are much more specific, such as those that we have defined for fork-join programs. Such threads can be mapped to system or user-level threads, but since they are more specific, they are usually implemented in a custom fashion, usually in user/application space. For this reason, some authors prefer to use a different term for such threads, e.g., spark, strand, or task.
Let’s observe a few properties of fork-join computations and their dags.
- The computation dag of a fork-join program applied to an input unfolds dynamically as the program executes. For example, when we run map_inc_rec on an input with $n$ elements, the dag initially contains just the root vertex (thread) corresponding to the first call to map_inc_rec, but it grows as the execution proceeds.
- An execution of a fork-join program can generate a massive number of threads. For example, our map_inc_rec function generates approximately $4n$ threads for an input with $n$ elements.
- The work/span of each thread can vary from a small amount to a very large amount, depending on the algorithm. In our example, each thread performs either a conditional, sometimes an addition, and a fork operation, or performs no actual computation (continuation threads).
Suppose now we are given a computation dag and we wish to execute the dag by mapping each thread to one of the $P$ processors available on the hardware. To do this, we can use an online scheduling algorithm.
The following is a schedule for the dag shown in the figure above, assuming that each thread takes unit time.

| Time Step | Processor 1 | Processor 2 |
|---|---|---|
| 1 | M [0,8) | |
| 2 | M [0,4) | M [4,8) |
| 3 | M [0,2) | M [4,6) |
| 4 | M [0,1) | M [4,5) |
| 5 | M [1,2) | M [5,6) |
| 6 | C [0,2) | C [4,6) |
| 7 | M [2,4) | M [6,8) |
| 8 | M [2,3) | M [6,7) |
| 9 | M [3,4) | M [7,8) |
| 10 | C [2,4) | C [6,8) |
| 11 | C [0,4) | C [4,8) |
| 12 | C [0,8) | |
4. Race Conditions
A race condition is any behavior in a program that is determined by some feature of the system that cannot be controlled by the program, such as the timing of the execution of instructions. In PASL, a race condition can occur whenever two parallel branches access the same location in memory and at least one of the two branches performs a write. When this situation occurs, the behavior of the program may be undefined, leading to (often) buggy behavior. We used the word "may" here because certain programs use race conditions in a careful way as a means to improve performance. Special attention to race conditions is crucial because they introduce an especially pernicious form of error that can confound beginners and experts alike.
Race conditions can be highly problematic because, owing to their nondeterministic behavior, they are often extremely hard to debug. To make matters worse, it is also quite easy to create race conditions without even knowing it.
In the code below, both branches of fork2 are writing into b. What should the output of this program be?
long b = 0;

fork2([&] {
  b = 1;
}, [&] {
  b = 2;
});

std::cout << "b = " << b << std::endl;
At the time of the print, the contents of b are determined by the last write. Thus, depending on which of the two branches performs its write last, we can see either of the two possibilities:
Output:
b = 1
Output:
b = 2
Consider the following alternative implementation of the Fibonacci function. By "inlining" the plus operation into both branches, the programmer got rid of the addition operation after the fork2.
long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    }, [&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}
This code is not correct because it has a race condition.
As the example shows, separate threads update the value of result. It might look like this is not a race condition, because the update consists of an addition operation, which reads the value and then writes to result. The race condition might be easier to see if we expand out the applications of the += operator.
long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      long a1 = fib_par_racy(n-1);
      long a2 = result;
      result = a1 + a2;
    }, [&] {
      long b1 = fib_par_racy(n-2);
      long b2 = result;
      result = b1 + b2;
    });
  }
  return result;
}
When written in this way, it is clear that these two parallel threads are not independent: they both read result and write to result. Thus the outcome depends on the order in which these reads and writes are performed, as shown in the next example.
The following table takes us through one possible execution trace of the call fib_par_racy(2). The number to the left of each instruction describes the time at which the instruction is executed. Note that since this is a parallel program, multiple instructions can be executed at the same time. The particular execution that we have in this example gives us a bogus result: the result is 0, not 1 as it should be.
| Time step | Thread 1 | Thread 2 |
|---|---|---|
| 1 | a1 = fib_par_racy(1) | b1 = fib_par_racy(0) |
| 2 | a2 = result | b2 = result |
| 3 | result = a1 + a2 | |
| 4 | | result = b1 + b2 |
The reason we get a bogus result is that both threads read the initial value of result at the same time and thus do not see each other's writes. In this example, the second thread "wins the race" and writes into result. The value 1 written by the first thread is effectively lost by being overwritten by the second thread.
4.1. Synchronization Hardware
Since mutual exclusion is a common problem in computer science, many hardware systems provide specific synchronization operations that can help solve instances of the problem. These operations may allow, for example, testing the contents of a (machine) word and then modifying it, perhaps by swapping it with another word. Such operations are sometimes called atomic read-modify-write, or RMW for short, operations.
A handful of different RMW operations have been proposed. They include operations such as load-link/store-conditional, fetch-and-add, and compare-and-swap. They typically take a memory location x and a value v, and replace the value stored at x with f(x,v). For example, the fetch-and-add operation takes the location x and an increment amount, and atomically increments the value at that location by the specified amount, i.e., f(x,v) = *x + v.
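In standard C++, for example, fetch-and-add is available on std::atomic objects through the fetch_add member function. The small example below is illustrative and independent of PASL.

#include <atomic>
#include <iostream>

int main() {
  std::atomic<long> counter(0);
  // fetch_add atomically adds 2 to counter and returns the previous value.
  long old = counter.fetch_add(2);
  std::cout << "old = " << old << "; counter = " << counter.load() << std::endl;
  // prints: old = 0; counter = 2
  return 0;
}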
The compare-and-swap operation takes the location x and a pair of values (a,b) as its second argument, and stores b into x if the value in x is a, i.e., f(x,(a,b)) = if *x = a then b else *x; the operation returns a boolean indicating whether it successfully stored a new value in x. Compare-and-swap is a reasonably powerful synchronization operation: it can be used by arbitrarily many threads to agree (reach consensus) on a value. This instruction is therefore frequently provided by modern parallel architectures, such as Intel's x86.
In C++, the atomic class can be used to perform synchronization. Objects of this type are guaranteed to be free of race conditions; in fact, in C++, they are the only objects that are guaranteed to be free of race conditions. The contents of an atomic object can be accessed by load operations, updated by store operations, and also updated by compare_exchange_weak and compare_exchange_strong operations, which implement the compare-and-swap operation.
Access to the contents of any given cell is achieved by the load() and store() methods.
std::atomic<bool> flag;

flag.store(false);
std::cout << flag.load() << std::endl;
flag.store(true);
std::cout << flag.load() << std::endl;
Output:
0
1
The key operation that helps with race conditions is the compare-and-exchange operation.
std::atomic<bool> flag;

flag.store(false);
bool expected = false;
bool was_successful = flag.compare_exchange_strong(expected, true);
std::cout << "was_successful = " << was_successful << "; flag = " << flag.load() << std::endl;
bool expected2 = false;
bool was_successful2 = flag.compare_exchange_strong(expected2, true);
std::cout << "was_successful2 = " << was_successful2 << "; flag = " << flag.load() << std::endl;
Output:
was_successful = 1; flag = 1
was_successful2 = 0; flag = 1
As another example use of the atomic class, recall our Fibonacci example with the race condition. In that example, the race condition arises because of concurrent writes to the result variable. We can eliminate this kind of race condition by using different memory locations, or by using an atomic class and a compare_exchange_strong operation.
The following implementation of Fibonacci is not safe because the variable result is shared and updated by multiple threads.
long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    }, [&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}
We can solve this problem by declaring result to be an atomic type and using a standard busy-waiting protocol based on compare-and-swap.
long fib_par_atomic(long n) {
  std::atomic<long> result(0);
  if (n < 2) {
    result.store(n);
  } else {
    fork2([&] {
      long r = fib_par_atomic(n-1);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp, exp+r);
        if (flag) { break; }
      }
    }, [&] {
      long r = fib_par_atomic(n-2);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp, exp+r);
        if (flag) { break; }
      }
    });
  }
  return result.load();
}
The idea behind the solution is to load the current value of result and atomically update result only if it has not been modified (by another thread) since it was loaded. This guarantees that result is always updated (read and modified) correctly without missing an update from another thread.
The example above illustrates a typical use of the compare-and-swap operation. In this particular example, we can probably prove our code is correct. But this is not always as easy due to a problem called the "ABA problem."
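As an aside, because the update in this particular example is a plain addition, the busy-waiting loop can also be avoided entirely by using fetch-and-add. The sketch below is ours, not the book's code; it assumes fork2() from PASL and the <atomic> header.

// Fibonacci with atomic accumulation via fetch_add (illustrative sketch).
long fib_par_fetch_add(long n) {
  if (n < 2)
    return n;
  std::atomic<long> result(0);
  fork2([&] { result.fetch_add(fib_par_fetch_add(n-1)); },
        [&] { result.fetch_add(fib_par_fetch_add(n-2)); });
  return result.load();
}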
4.2. ABA problem
While reasonably powerful, compare-and-swap suffers from the so-called ABA problem. To see this, consider the following scenario, in which a shared variable result is updated by multiple threads in parallel: a thread, say $T$, reads result and stores its current value, say 2, in current. In the meantime, some other thread also reads result and performs some operations on it, setting it back to 2 when it is done. Now thread $T$ takes its turn again and attempts to store a new value into result, using 2 as the old value, and succeeds in doing so, because the value stored in result appears not to have changed. The trouble is that the value has actually changed and has been changed back to the value that it used to be. Compare-and-swap was not able to detect this change because it relies only on a simple, shallow notion of equality. If, for example, the value stored in result were a pointer, the fact that the pointer remains the same does not mean that the values reachable from the pointer have not been modified; if, for example, the pointer led to a tree structure, an update deep in the tree could leave the pointer unchanged, even though the tree has changed.
This problem is called the ABA problem because it involves cycling the atomic memory cell between three values: $A$, $B$, and again $A$. The ABA problem is an important limitation of compare-and-swap: a load followed later by a compare-and-swap is not atomic as a whole, but it can behave as if it were atomic, provided that the equality test on the memory cell suffices to ensure correctness.
In an example like the one above, where the memory cell holds a plain numeric value, the ABA problem may happen (if the value is modified and then restored to its old value in between a load and a compare-and-swap), but it is impossible to observe because it is harmless. If, however, the compare-and-swap were applied to a memory object holding references, the ABA problem could have observable effects.
The ABA problem can be exploited to give seemingly correct implementations that are in fact incorrect. To reduce the chances of bugs due to the ABA problem, memory objects subject to compare-and-swap are usually tagged with an additional field that counts the number of updates. This solves the basic problem, but only up to a point, because the counter itself can wrap around. The load-link/store-conditional operation solves this problem by performing the write only if the memory location has not been updated since the last read (load), but practical implementations of it are hard to come by.
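One way to implement the update-counting idea mentioned above is to pack the value together with a counter and perform the compare-and-swap on the pair, so that a value that cycles from A to B and back to A still fails the comparison. The sketch below is illustrative: the type tagged and the function tagged_cas are our names, and a 16-byte compare_exchange_strong may compile to a double-width instruction or fall back to a lock, depending on the platform.

#include <atomic>
#include <cstdint>

// A value tagged with an update counter (illustrative sketch).
struct tagged {
  int64_t value;
  uint64_t count;  // incremented on every successful update
};

bool tagged_cas(std::atomic<tagged>& cell, int64_t expected_value, int64_t new_value) {
  tagged expected = cell.load();
  if (expected.value != expected_value)
    return false;
  tagged desired = { new_value, expected.count + 1 };
  // Succeeds only if neither the value nor the counter has changed.
  return cell.compare_exchange_strong(expected, desired);
}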
5. Chapter: Executing parallel algorithms
Implicit parallelism allows writing parallel programs at a high level of abstraction. In this section, we discuss techniques for executing such parallel programs on hardware-shared-memory computers, such as multicore computers. As our running example, we use map_incr_rec, whose code is reproduced below.
void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}
The basic idea is to partition a computation, that is, a run of a parallel algorithm on a specified input, into pieces of serial computation, called threads, and to map them to available processors while observing the dependencies between them. The task of mapping the threads to available processors is called thread scheduling, or simply scheduling.
We call a piece of serial computation a thread if it executes without performing parallel operations (fork2) except perhaps as its last action. The term thread is short for user-level thread (as opposed to an operating-system thread). When partitioning the computation into threads, it is important for threads to be maximal in order to minimize scheduling overhead (technically, a thread can be as small as a single instruction).
When scheduling a parallel computation, it is important that we don't alter the intended meaning of the computation. Specifically, if a thread depends on another thread, because, for example, it reads a piece of data generated by the latter, it cannot be executed before the thread that it depends on. We can conservatively approximate such dependencies by observing the fork2 expressions and by organizing dependencies consistently with them. More specifically, we can represent a computation as a graph where each vertex represents a thread and each edge represents a dependency. Vertices and edges are created by the execution of fork2. Each fork2 creates two threads (vertices) corresponding to the two branches and inserts an edge between each branch and the thread that performs the fork2; in addition, each fork2 creates a join or continuation thread (vertex) that depends on the two branches. Since such a graph cannot contain cycles, it is a Directed Acyclic Graph (DAG).
[Figure [fig:parallel-inc-dag]: The DAG for an execution of map_incr_rec. Each vertex corresponds to a call to map_inc_rec (excluding the fork2) or to the continuation of the fork2 (which is empty), and is annotated with the interval of the input array that it operates on (its argument).]
Figure [fig:parallel-inc-dag] illustrates the DAG for an execution of map_incr_rec. We partition each invocation of this function into two threads, labeled "M" and "C" respectively. The thread labeled $M \lbrack i,j \rbrack$ corresponds to the part of the invocation of map_incr_rec with arguments lo and hi set to $i$ and $j$ respectively; this first part corresponds to the part of the execution up to and including the fork2 (or all of the function if this is a base case). The thread labeled $C \lbrack i,j \rbrack$ corresponds to the "continuation" of the fork2, which in this case includes no computation.
There is an important connection between computation DAGs and work and span. Suppose that we assign to each vertex a weight of at least $1$ that corresponds to the work of that vertex (since threads are serial, the work and span of each vertex are the same). We can then calculate the total weight and total depth of the DAG by summing up weights. The total weight of the DAG corresponds to the work of the computation, and the depth corresponds to its span. In our example, each vertex has weight $O(1)$. Thus, for an array with $n$ elements, the total weight (work) is $O(n)$ and the depth (span) is $O(\log{n})$.
Having talked about DAGs, we are now ready to talk about how to map parallel computations to actual hardware so as to minimize their run time, i.e., scheduling. But before we move on to scheduling, let us observe a few properties of implicitly parallel computations.
- The computation DAG of a parallel algorithm applied to an input unfolds dynamically as the algorithm executes. For example, when we run map_inc_rec on an input with $n$ elements, the DAG initially contains just the root vertex (thread) corresponding to the first call to map_inc_rec, but it grows as the execution proceeds.
- An execution of a parallel algorithm can generate a massive number of threads. For example, our map_inc_rec function generates approximately $4n$ threads for an input with $n$ elements.
- The work/span of each thread can vary from a small amount to a very large amount, depending on the algorithm. In our example, each thread performs either a conditional, sometimes an addition, and a fork operation, or performs no actual computation (continuation threads).
Suppose now we are given a computation DAG and we wish to execute the DAG by mapping each thread to one of the $P$ processors available on the hardware. When a thread is mapped to a processor, it is executed, requiring time proportional to the work (weight) of the thread. Once the processor completes the thread, we can map another thread to the processor, so that the processor does not idle unnecessarily. As we map threads to processors, it is important that we observe the dependencies between threads, i.e., that we don't execute a thread before its parents.
The goal of scheduling is to minimize critical resources such as time. Computing the shortest schedule for a DAG turns out to be highly nontrivial; in fact, the related decision problem is NP-complete. It is possible, however, to give a good approximation algorithm for the offline version of the problem that generates a 2-factor approximation. Such an approximation yields a schedule for a given DAG within a factor of two of the shortest schedule. In the online version of the problem, where the DAG unfolds as the computation executes, we don't know the DAG a priori, and we have to account for scheduling costs such as migrating threads between processors and finding work. To execute parallel programs, we need a solution to this online version of the problem.
An online scheduler, or simply a scheduler, is an algorithm that performs scheduling by mapping threads to available processors. For example, if only one processor is available, a scheduler can map all threads to that one processor. If two processors are available, then the scheduler can divide the threads between the two processors as evenly as possible, in an attempt to keep the two processors as busy as possible by load balancing.
The following is a valid schedule for the DAG shown in the figure above, assuming that each thread takes unit time.

| Time Step | Processor 1 | Processor 2 |
|---|---|---|
| 1 | M [0,8) | |
| 2 | M [0,4) | M [4,8) |
| 3 | M [0,2) | M [4,6) |
| 4 | M [0,1) | M [4,5) |
| 5 | M [1,2) | M [5,6) |
| 6 | C [0,2) | C [4,6) |
| 7 | M [2,4) | M [6,8) |
| 8 | M [2,3) | M [6,7) |
| 9 | M [3,4) | M [7,8) |
| 10 | C [2,4) | C [6,8) |
| 11 | C [0,4) | C [4,8) |
| 12 | C [0,8) | |
We say that a scheduler is greedy if, whenever there is a processor available and a thread ready to be executed, then the scheduler assigns the thread to the processor and starts running the thread immediately. Greedy schedulers have a nice property that is summarized by the following theorem.
Theorem (Greedy scheduling). A greedy scheduler executes a computation with work $W$ and span $S$ on $P$ processors in time $T_P \leq \frac{W}{P} + S$.

This simple statement is powerful. To see this, note that the time to execute the computation is at least $\frac{W}{P}$, because we have a total of $W$ work and the best possible execution strategy is to divide it evenly among the processors. Furthermore, the execution time cannot be less than $S$, since $S$ represents the longest chain of sequential dependencies. Thus we have: $ T_P \geq \max\left(\frac{W}{P},S\right). $
This means that a greedy scheduler yields a schedule that is within a factor of two of optimal: $\frac{W}{P} + S$ is never more than twice $\max(\frac{W}{P},S)$. Furthermore, when $\frac{W}{P} \gg S$, the difference between the greedy scheduler and the optimal scheduler is very small. In fact, we can rewrite the bound above in terms of the average parallelism $\mathbb{P} = W/S$ as follows:

$ T_P \leq \frac{W}{P} + S = \frac{W}{P} + \frac{W}{\mathbb{P}} = \frac{W}{P}\left(1 + \frac{P}{\mathbb{P}}\right). $
Therefore, as long as $\mathbb{P} \gg P$ (the parallelism is much greater than the number of processors), we obtain near-perfect speedup. (Speedup is $W/T_P$, and perfect speedup would be $P$.)
The quantity $\mathbb{P}$, sometimes called average parallelism, is usually quite large, because it usually grows polynomially with the input size.
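For example, consider a hypothetical computation with $W = 10^8$ and $S = 10^3$: its average parallelism is $\mathbb{P} = W/S = 10^5$. On $P = 40$ processors, the greedy bound gives $T_{40} \leq \frac{10^8}{40} + 10^3 = 2.5 \times 10^6 + 10^3$, which exceeds the lower bound $\frac{W}{P} = 2.5 \times 10^6$ by only $0.04\%$.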
We can give a simple greedy scheduler by using a queue of threads. At the start of the execution, the scheduler places the root of the DAG into the queue and then repeats the following step until the queue becomes empty: for each idle processor, take the vertex at the front of the queue and assign it to the processor; let each processor run for one step; if, at the end of the step, there is a vertex in the DAG whose parents have all completed their execution, insert that vertex at the tail of the queue. A sketch of this loop is given below.
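The sketch below only illustrates the structure of this loop; Vertex, Processor, processors, any_processor_busy, run_all_for_one_step, and newly_ready_vertices are hypothetical placeholders and are not part of PASL.

#include <queue>

// Centralized greedy scheduler (illustrative sketch).
void centralized_greedy_scheduler(Vertex* root) {
  std::queue<Vertex*> ready;
  ready.push(root);
  while (! ready.empty() || any_processor_busy(processors)) {
    // Assign ready vertices (threads) to idle processors.
    for (Processor& p : processors) {
      if (p.is_idle() && ! ready.empty()) {
        p.assign(ready.front());
        ready.pop();
      }
    }
    // Let every processor run for one step.
    run_all_for_one_step(processors);
    // A vertex becomes ready once all of its parents have completed.
    for (Vertex* v : newly_ready_vertices())
      ready.push(v);
  }
}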
The centralized scheduler with the global thread queue is a greedy scheduler that generates a greedy schedule under the assumption that the queue operations take zero time and that the DAG is given. This algorithm, however, does not work well for online scheduling, because the operations on the queue take time. In fact, since the thread queue is global, the algorithm can only insert and remove one thread at a time. For this reason, centralized schedulers do not scale beyond a handful of processors.
There has been much research on the problem of reducing friction in scheduling. This research shows that distributed scheduling algorithms can work quite well. In a distributed algorithm, each processor has its own queue and primarily operates on its own queue. A load-balancing technique is then used to balance the load among the processors by redistributing threads, usually on an as-needed basis. This strategy ensures that processors can operate in parallel to obtain work from their queues.
A specific kind of distributed scheduling technique that can lead to schedules that are close to optimal is work stealing. In a work-stealing scheduler, processors work on their own queues as long as there is work in them and, if not, go "steal" work from other processors by removing the thread at the tail end of the victim's queue. It has been proven that the randomized work-stealing algorithm, in which idle processors randomly select processors to steal from, delivers close-to-optimal schedules in expectation (in fact, with high probability) and furthermore incurs minimal friction. Randomized work stealing can also be implemented efficiently in practice. PASL uses a scheduling algorithm that is based on work stealing. A rough sketch of the per-processor loop is given below.
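All of the names used in this sketch (Deque, Thread, computation_done, random_other_processor, execute) are hypothetical placeholders; PASL's actual scheduler is considerably more involved.

#include <vector>

// Per-processor loop of a work-stealing scheduler (illustrative sketch).
void worker_loop(int my_id, std::vector<Deque>& deques) {
  while (! computation_done()) {
    // Work on the local deque first.
    Thread* t = deques[my_id].pop_bottom();
    if (t == nullptr) {
      // No local work: pick a random victim and steal its oldest thread.
      int victim = random_other_processor(my_id);
      t = deques[victim].steal_top();
    }
    if (t != nullptr)
      execute(t);  // may push newly created threads onto the local deque
  }
}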
6. Chapter: Experimenting with PASL
We are now going to study the practical performance of our parallel algorithms written with PASL on multicore computers.
To be concrete with our instructions, we assume that our username is pasl and that our home directory is /home/pasl/. You need to replace these settings with your own where appropriate.
6.1. Obtain source files
Let's start by downloading the PASL sources. The PASL sources that we are going to use are part of a branch that we created specifically for this course. You can access the sources either via the tarball linked by the GitHub webpage or, if you have git, via the command below.
$ cd /home/pasl
$ git clone -b edu https://github.com/deepsea-inria/pasl.git
6.2. Software Setup
You can skip this section if you are using a computer already setup by us or you have installed an image file containing our software. To skip this part and use installed binaries, see the heading "Starting with installed binaries", below.
6.2.1. Check for software dependencies
Currently, the software associated with this course supports Linux only. Any machine that is configured with a recent version of Linux and has access to at least two processors should be fine for the purposes of this course. Before we can get started, however, the following packages need to be installed on your system.
| Software dependency | Version | Nature of dependency |
|---|---|---|
| gcc | >= 4.9.0 | required to build PASL binaries |
| | >= 5.3.10 | required by PASL makefiles to build PASL binaries |
| | >= 4.0.0 | required to build the benchmarking tools (i.e., pbench and pview) |
| | >= 2.4.1 | required by benchmarking tools to generate reports in bar plot and scatter plot form |
| | recent | optional; required by benchmarking tools to generate reports in tabular form |
| git | recent | optional; can be used to access PASL source files |
| tcmalloc (gperftools) | >= 2.2 | optional; may be useful to improve performance of PASL binaries |
| hwloc | recent | optional; might be useful to improve performance on large systems with NUMA (see below) |
The rest of this section explains what the optional software dependencies are and how to configure PASL to use them. We are going to assume that all of these software dependencies have been installed in the folder /home/pasl/Installs/.
6.2.2. Use a custom parallel heap allocator
At the time of writing this document, the system-default
implementations of malloc
and free
that are provided by Linux
distributions do not scale well with even moderately large amounts of
concurrent allocations. Fortunately, for this reason, organizations,
such as Google and Facebook, have implemented their own scalable
allocators that serve as drop-in replacements for malloc
and
free
. We have observed the best results from Google’s allocator,
namely,
tcmalloc. Using
tcmalloc for your own experiements is easy. Just add to the
/home/pasl/pasl/minicourse
folder a file named settings.sh
with
the following contents.
We assume that the package that contains tcmalloc, namely gperftools, has been installed already in the folder /home/pasl/Installs/gperftools-install/. The following lines need to be in the settings.sh file in the /home/pasl/pasl/minicourse folder.
USE_ALLOCATOR=tcmalloc
TCMALLOC_PATH=/home/pasl/Installs/gperftools-install/lib/
Also, the runtime linker needs to be instructed where to find tcmalloc.
export LD_PRELOAD=/home/pasl/Installs/gperftools-install/lib/libtcmalloc.so
This assignment can be issued either at the command line or in the environment loader script, e.g., ~/.bashrc.
Warning: Changes to the settings.sh file take effect only after recompiling the binaries.
6.2.3. Use hwloc
If your system has a non-uniform memory architecture (i.e., NUMA), then you may improve the performance of PASL applications by using optional support for hwloc, which is a library that reports detailed information about the host system, such as the NUMA layout. Currently, PASL leverages hwloc to configure the NUMA allocation policy for the program. The particular policy that works best for our applications is round-robin NUMA page allocation. Do not worry if that term is unfamiliar: in effect, all it does is spread pages evenly across the NUMA nodes, neutralizing the effect of NUMA anyway.
Run the following command.
$ dmesg | grep -i numa
If the output that you see is something like the following, then your machine has NUMA. Otherwise, it probably does not.
[ 0.000000] NUMA: Initialized distance table, cnt=8
[ 0.000000] NUMA: Node 4 [0,80000000) + [100000000,280000000) -> [0,280000000)
We are going to assume that hwloc has been installed already and is located at /home/pasl/Installs/hwloc-install/. To configure PASL to use hwloc, add the following lines to the settings.sh file in the /home/pasl/pasl/minicourse folder.
USE_HWLOC=1
HWLOC_PATH=/home/pasl/Installs/hwloc-install/
6.3. Starting with installed binaries
At this point, you have either installed all the necessary software to work with PASL or it has been installed for you. In either case, make sure that your PATH variable makes the software visible. For setting up your PATH variable on the andrew.cmu domain, see below.
6.3.1. Specific set up for the andrew.cmu domain
We have installed much of the needed software on andrew.cmu.edu, so you only need to go through a relatively minimal setup.
First, set up your PATH variable to refer to the right directories. Using cshell:
setenv PATH /opt/rh/devtoolset-3/root/usr/bin:/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:./
The part added to the default PATH on andrew is /opt/rh/devtoolset-3/root/usr/bin. It is important that this is at the beginning of the PATH variable. To make interaction easier, we also added the relative path ./ to the PATH variable.
6.3.2. Fetch the benchmarking tools (pbench)
We are going to use two command-line tools to help us to run experiments and to analyze the data. These tools are part of a library that we developed, which is named pbench. The pbench sources are available via github.
$ cd /home/pasl
$ git clone https://github.com/deepsea-inria/pbench.git
The tarball of the sources can be downloaded from the github page.
6.3.3. Build the tools
The following command builds the tools, namely prun and pplot. The former handles the collection of data and the latter produces the human-readable output (e.g., plots, tables, etc.).
$ make -C /home/pasl/pbench/
Make sure that the build succeeded by checking the pbench directory for the files prun and pplot. If these files do not appear, then the build failed.
6.3.4. Create aliases
We recommend creating the following aliases.
$ alias prun '/home/pasl/pbench/prun'
$ alias pplot '/home/pasl/pbench/pplot'
It will be convenient for you to make these aliases persistent, so that next time you log in, the aliases will be set. Add the commands above to your shell configuration file.
6.3.5. Visualizer Tool
When we are tuning our parallel algorithms, it can be helpful to visualize their processor utilization over time, just in case there are patterns that help to assign blame to certain regions of code. Later, we are going to use the utilization visualizer that comes packaged along with PASL. To build the tool, run the following make command.
$ make -C /home/pasl/pasl/tools/pview pview
Let us create an alias for the tool.
$ alias pview '/home/pasl/pasl/tools/pview/pview'
We recommend that you make this alias persistent by putting it into your shell configuration file (as you did above for the pbench tools).
6.4. Using the Makefile
PASL comes equipped with a Makefile that can generate several different kinds of executables. These different kinds of executables and how they can be generated are described below for a benchmark program pgm.
- baseline: build the baseline with the command make pgm.baseline
- elision: build the sequential elision with the command make pgm.elision
- optimized: build the optimized binary with the command make pgm.opt
- log: build the log binary with the command make pgm.log
- debug: build the debug binary with the command make pgm.dbg
To speed up the build process, add to the make command the option -j (e.g., make -j pgm.opt). This option enables make to parallelize the build process. Note that, if the build fails, the error messages that are printed to the terminal may be somewhat garbled. As such, it is better to use -j only after the debugging process is complete.
6.5. Task 1: Run the baseline Fibonacci
We are going to start our experimentation with three different instances of the same program, namely bench. This program serves as a "driver" for the benchmarks that we have implemented. These implementations are good parallel codes that we expect to deliver good performance. We first build the baseline version.
$ cd /home/pasl/pasl/minicourse
$ make bench.baseline
Warning: The command-line examples that we show here assume that you have . in your $PATH. If not, you may need to prefix command-line calls to binaries with ./ (e.g., ./bench.baseline).
The file extension .baseline means that every benchmark in the binary uses the sequential-baseline version of the specified algorithm.
We can now run the baseline for one of our benchmarks, say Fibonacci, by using the -bench argument to specify the benchmark and the -n argument to specify the input value for the Fibonacci function.
$ bench.baseline -bench fib -n 39
On our machine, the output of this run is the following.
exectime 0.556
utilization 1.0000
result 63245986
The three lines above provide useful information about the run.
- The exectime field indicates the wall-clock time in seconds that is taken by the benchmark. In general, this time measures only the time taken by the benchmark under consideration; it does not include the time taken to generate the input data, for example.
- The utilization field relates to the utilization of the processors available to the program. In the present case, for a single-processor run, the utilization is by definition 100%. We will return to this measure soon.
- The result field reports a value computed by the benchmark. In this case, the value is the $39^{th}$ Fibonacci number.
6.6. Task 2: Run the sequential elision of Fibonacci
The .elision extension means that parallel algorithms (not sequential baseline algorithms) are compiled. However, all instances of fork2() are erased, as described in an earlier chapter.
$ make bench.elision
$ bench.elision -bench fib -n 39
The run time of the sequential elision in this case is similar to the run time of the sequential baseline because the two are similar codes. However, for most other algorithms, the baseline will typically be at least a little faster.
exectime 0.553
utilization 1.0000
result 63245986
6.7. Task 3: Run parallel Fibonacci
The .opt extension means that the program is compiled with full support for parallel execution. Unless specified otherwise, however, the parallel binary uses just one processor.
$ make bench.opt
$ bench.opt -bench fib -n 39
The output of this program is similar to the output of the previous two programs.
exectime 0.553
utilization 1.0000
result 63245986
Because our machine has 40 processors, we can run the same application using all available processors. Before running this command, please adjust the -proc option to match the number of cores that your machine has. Note that you can use any number of cores up to the number you have available. You can use nproc or lscpu to determine the number of cores your machine has.
$ bench.opt -bench fib -n 39 -proc 40
We see from the output of the 40-processor run that our program ran faster than the sequential runs. Moreover, the utilization field tells us that approximately 86% of the total time spent by the 40 processors was spent performing useful work, not idling.
exectime 0.019
utilization 0.8659
result 63245986
Warning: PASL allows the user to select the number of processors by the -proc key. The maximum value for this key is the number of processors that are available on the machine. PASL raises an error if the programmer asks for more processors than are available.
6.8. Measuring performance with "speedup"
We may ask at this point: What is the improvement that we just observed from the parallel run of our program? One common way to answer this question is to measure the "speedup".
Important: The importance of selecting a good baseline. Note that speedup is defined with respect to a baseline program. How exactly should this baseline program be chosen? One option is to take the sequential elision as a baseline. The speedup curve with such a baseline can be helpful in determining the scalability of a parallel algorithm, but it can also be misleading, especially if speedups are taken as an indicator of good performance, which they are not, because they are only relative to a specific baseline. For speedups to be a valid indication of good performance, they must be calculated against an optimized implementation of the best serial algorithm (for the same problem).
The speedup at a given number of processors is a good starting point on the way to evaluating the scalability of the implementation of a parallel algorithm. The next step typically involves considering speedups taken from varying numbers of processors available to the program. The data collected from such a speedup experiment yields a speedup curve, which is a curve that plots the trend of the speedup as the number of processors increases. The shape of the speedup curve provides valuable clues for performance and possibly for tuning: a flattening curve suggests lack of parallelism; a curve that arcs up and then downward suggests that processors may be wasting time by accessing a shared resource in an inefficient manner (e.g., false sharing); a speedup curve with a constant slope indicates at least some scaling.
The speedup $T_B/T_{40}$ equals $0.556/0.019 = 29.26$x. Although not linear (i.e., 40x), this speedup is decent considering factors such as: the capabilities of our machine; the overheads relating to parallelism; and the small size of the problem compared to the computing power that our machine offers.
6.8.1. Generate a speedup plot
Let us see what a speedup curve can tell us about our parallel Fibonacci program. We need to first get some data. The following command performs a sequence of runs of the Fibonacci program for varying numbers of processors. You can now run the command yourself.
$ prun speedup -baseline "bench.baseline" -parallel "bench.opt -proc 1,10,20,30,40" -bench fib -n 39
Here is another example on a 24-core machine.
$ prun speedup -baseline "bench.baseline" -parallel "bench.opt -proc 1,4,8,16,24" -bench fib -n 39
Run the following command to generate the speedup plot.
$ pplot speedup
If successful, the command generates a file named plots.pdf. The output should look something like the speedup plot below.
Starting to generate 1 charts.
Produced file plots.pdf.
The plot shows that our Fibonacci application scales well, up to about twenty processors. As expected, at twenty processors, the curve dips downward somewhat. We know that the problem size is the primary factor leading to this dip. How much does the problem size matter? The speedup plot in the figure below shows the trend clearly. As the problem size grows, the speedup improves, until, at the calculation of the $45^{th}$ Fibonacci number, the speedup curve is close to linear.
Note: The prun and pplot tools have many more features than those demonstrated here. For details, see the documentation provided with the tools in the file named README.md.
Warning: Noise in experiments. The run time that a given parallel program takes to solve the same problem can vary noticeably because of certain effects that are not under our control, such as OS scheduling, cache effects, paging, etc. We can consider such noise in our experiments random noise. Noise can be a problem for us because it can lead us to make incorrect conclusions when, say, comparing the performance of two algorithms that perform roughly the same. To deal with randomness, we can perform multiple runs for each data point that we want to measure and consider the mean over these runs.
6.8.2. Superlinear speedup
Suppose that, on our 40-processor machine, the speedup that we observe is larger than 40x. It might sound improbable or even impossible. But it can happen. Ordinary circumstances should preclude such a superlinear speedup, because, after all, we have only forty processors helping to speed up the computation. Superlinear speedups often indicate that the sequential baseline program is suboptimal. This situation is easy to check: just compare its run time with that of the sequential elision. If the sequential elision is faster, then the baseline is suboptimal. Other factors can cause superlinear speedup: sometimes parallel programs running on multiple processors with private caches benefit from the larger cache capacity. These issues are, however, outside the scope of this course. As a rule of thumb, superlinear speedups should be regarded with suspicion and the cause should be investigated.
6.9. Visualize processor utilization
The 29x speedup that we just calculated for our Fibonacci benchmark was a little disappointing, and the 86% processor utilization of the run leaves 14% room for improvement. We should be suspicious that the problem size that we chose, namely $n = 39$, although seemingly large, was probably a little too small to yield enough work to keep all the processors well fed. To put this hunch to the test, let us examine the utilization of the processors in our system. We need to first build a binary that collects and outputs logging data.
$ make bench.log
We run the program with the new binary in the same fashion as before.
$ bench.log -bench fib -proc 40 -n 39
The output looks something like the following.
exectime 0.019
launch_duration 0.019
utilization 0.8639
thread_send 205
thread_exec 4258
thread_alloc 2838
utilization 0.8639
result 63245986
We need to explain what the new fields mean.
-
The thread_send field tells us that 205 threads were exchanged between processors for the purpose of load balancing;
-
the thread_exec field tells us that 4258 threads were executed by the scheduler;
-
the thread_alloc field tells us that 2838 threads were freshly allocated.
Each of these fields can be useful for tracking down inefficiencies. The number of freshly allocated threads can be a strong indicator, because in C++ thread-allocation costs can add up to a significant overhead. In the present case, however, none of the values shown above is particularly suspicious, considering that they are all at most in the thousands.
Since we have not yet found the problem, let us look at the
visualization of the processor utilization using our pview
tool. To
get the necessary logging data, we need to run our program again, this
time passing the argument --pview
.
$ bench.log -bench fib -n 39 -proc 40 --pview
When the run completes, a binary log file named LOG_BIN
should be
generated in the current directory. Every time we run with --pview
this binary file is overwritten. To see the visualization of the log
data, we call the visualizer tool from the same directory.
$ pview
The output we see on our 40-processor machine is shown in the Figure below. The window shows one bar per processor. Time goes from left to right. Idle time is represented by red and time spent busy with work by grey. You can zoom in on any part of the plot by clicking on the region with the mouse. To reset to the original plot, press the space bar. From the visualization, we can see that most of the time, particularly in the middle, all of the processors are kept busy. However, there is a lot of idle time at the beginning and end of the run. This pattern suggests that there just is not enough parallelism in the early and late stages of our Fibonacci computation.
6.10. Strong versus weak scaling
We are pretty sure that our Fibonacci program is not scaling as well as it could. But poor scaling on one particular input for $n$ does not necessarily mean there is a problem with the scalability of our parallel Fibonacci program in general. What is important is to know more precisely what it is that we want our Fibonacci program to achieve. To this end, let us consider a distinction that is important in high-performance computing: the distinction between strong and weak scaling. So far, we have been studying the strong-scaling profile of the computation of the $39^{th}$ Fibonacci number. In general, strong scaling concerns how the run time varies with the number of processors for a fixed problem size. Sometimes strong scaling is either too ambitious, owing to hardware limitations, or not necessary, because the programmer is happy to live with a looser notion of scaling, namely weak scaling. In weak scaling, the programmer considers a fixed-size problem per processor. We are going to consider something similar to weak scaling. In the Figure below, we have a plot showing how processor utilization varies with the input size. The situation dramatically improves, from 12% idle time for the $39^{th}$ Fibonacci number, down to 5% idle time for the $41^{st}$, and finally to 1% for the $45^{th}$. At just 1% idle time, the utilization is excellent.
The scenario that we just observed is typical of multicore systems. For computations that perform relatively little work, such as the computation of the $39^{th}$ Fibonacci number, properties that are specific to the hardware, OS, and PASL load-balancing algorithm can noticeably limit processor utilization. For computations that perform lots of highly parallel work, such limitations are barely noticeable, because processors spend most of their time performing useful work. Let us return to the largest Fibonacci instance that we considered, namely the computation of the $45^{th}$ Fibonacci number, and consider its utilization plot.
$ bench.log -bench fib -n 45 -proc 40 --pview
$ pview
The utilization plot is shown in the Figure below. Compared to the utilization plot we saw in the Figure above for n=39, the red regions are much less prominent overall, and the idle regions at the beginning and end are barely noticeable.
6.11. Chapter Summary
We have seen in this chapter how to build, run, and evaluate our parallel programs. Concepts that we have seen, such as speedup curves, are going to be useful for evaluating the scalability of our future solutions. Strong scaling is the gold standard for a parallel implementation. But as we have seen, weak scaling is a more realistic target in most cases.
7. Chapter: Work efficiency
In many cases, a parallel algorithm which solves a given problem performs more work than the fastest sequential algorithm that solves the same problem. This extra work deserves careful consideration for several reasons. First, since it performs additional work with respect to the serial algorithm, a parallel algorithm will generally require more resources such as time and energy. By using more processors, it may be possible to reduce the time penalty, but only by using more hardware resources.
For example, if an algorithm performs an $O(\log{n})$-factor more work than the serial algorithm, then, assuming that the constant factor hidden by the asymptotic notation is $1$, when $n = 2^{20}$ it will perform $20$ times more actual work, consuming $20$ times more energy than the serial algorithm. Assuming perfect scaling, we can reduce the time penalty by using more processors. For example, with $40$ processors, the algorithm may require half the time of the serial algorithm, because the $20\times$ extra work is spread over $40$ processors.
Sometimes, a parallel algorithm has the same asymptotic complexity as the best serial algorithm for the problem but has larger constant factors. This is generally true because scheduling friction, especially the cost of creating threads, can be significant. In addition to friction, parallel algorithms can incur more communication overhead than serial algorithms because data and processors may be placed far apart in hardware. For example, it is not unusual for a parallel algorithm to incur a $10-100\times$ overhead over a similar serial algorithm because of scheduling friction and communication.
These considerations motivate considering the "work efficiency" of a parallel algorithm. Work efficiency is a measure of the extra work performed by the parallel algorithm with respect to the serial algorithm. We define two types of work efficiency: asymptotic work efficiency and observed work efficiency. The former relates the asymptotic performance of a parallel algorithm to that of the fastest sequential algorithm. The latter relates the running time of a parallel algorithm to that of the fastest sequential algorithm.
-
A parallel algorithm that comparison-sorts $n$ keys in span $O(\log^3 n)$ and work $O(n \log n)$ is asymptotically work efficient, because its work cost matches that of the best known sequential comparison-based sorting algorithm. However, a parallel sorting algorithm that takes $O(n\log^2{n})$ work is not asymptotically work efficient.
-
The parallel array-increment algorithm that we considered in an earlier chapter is asymptotically work efficient, because it performs linear work, which is optimal (any sequential algorithm must perform at least linear work).
To assess the practical effectiveness of a parallel algorithm, we define observed work efficiency, parameterized by a value $r$.
-
A parallel algorithm that runs $10\times$ slower on a single processor than the fastest sequential algorithm has an observed work efficiency factor of $10$. We consider such algorithms unacceptable, as they are too slow and wasteful.
-
A parallel algorithm that runs $1.1\times$ to $1.2\times$ slower on a single processor than the fastest sequential algorithm has an observed work efficiency factor of $1.2$. We consider such algorithms to be acceptable.
To obtain this measure, we first run the baseline version of our parallel-increment algorithm.
$ bench.baseline -bench map_incr -n 100000000
exectime 0.884
utilization 1.0000
result 2
We then run the parallel algorithm, which is the same exact code as
map_incr_rec
. We build this code by using the special optfp
"force parallel" file extension. This special file extension forces
parallelism to be exposed all the way down to the base cases. Later,
we will see how to use this special binary mode for other purposes.
$ make bench.optfp
$ bench.optfp -bench map_incr -n 100000000
exectime 45.967
utilization 1.0000
result 2
Our algorithm has an observed work efficiency factor of roughly $52\times$ ($45.967 / 0.884 \approx 52$). Such poor observed work efficiency suggests that the parallel algorithm would require more than an order of magnitude more energy and that it would not run faster than the serial algorithm even when using fewer than $52$ processors.
In practice, observed work efficiency is a major concern, because the whole effort of parallel computing is wasted if parallel algorithms consistently require significantly more work than the best sequential algorithms. In other words, in parallel computing, both asymptotic complexity and constant factors matter.
Based on these discussions, we define a good parallel algorithm as one that is asymptotically work efficient and whose observed work-efficiency factor is small, that is, close to $1$.
7.1. Improving work efficiency with granularity control
It is common for a parallel algorithm to be asymptotically and/or observably work inefficient. It is often possible to improve work efficiency, however, by noting that the overheads of parallelism grow with the amount of parallelism exposed; work efficiency can therefore be improved by limiting parallelism.
For example, a parallel algorithm that performs linear work and has logarithmic span has average parallelism of roughly $n / \log n$, which is about $50{,}000$ even at the modest input size of one million. For such a problem size, we usually would not need to employ thousands of processors. It would be sufficient to limit the parallelism so as to feed tens of processors and, as a result, reduce the impact of excess parallelism on work efficiency.
In many parallel algorithms, such as algorithms based on divide-and-conquer, there is a simple way to achieve this goal: switch from the parallel algorithm to a sequential one when the problem size falls below a certain threshold. This technique is sometimes called coarsening or granularity control.
But which code should we switch to? One idea is to simply switch to the sequential elision, which we always have available in PASL. If, however, the parallel algorithm is asymptotically work inefficient, this would be ineffective. In such cases, we can specify a separate sequential algorithm for small instances.
Optimizing the practical efficiency of a parallel algorithm by controlling its parallelism is sometimes called optimization, sometimes it is called performance engineering, and sometimes performance tuning or simply tuning. In the rest of this document, we use the term "tuning."
We can apply coarsening to the map_incr_rec
code by switching to the
sequential algorithm when the input falls below an established
threshold.
long threshold = Some Number;

void map_incr_rec(const long* source, long* dest,
                  long lo, long hi) {
  long n = hi - lo;
  if (n <= threshold) {
    for (long i = lo; i < hi; i++)
      dest[i] = source[i] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}
Note
|
Even in sequential algorithms, it is not uncommon to revert to a different algorithm for small instances of the problem. For example, it is well known that insertion sort is faster than other sorting algorithms for very small inputs containing 30 keys or less. Many optimized sorting algorithms therefore revert to insertion sort when the input size falls within that range. |
As can be seen below, after some tuning, the map_incr_rec program becomes highly work efficient. In fact, there is barely a difference between the serial and the parallel runs. The tuning is actually done automatically here, by using the automatic granularity-control technique described in the next chapter.
$ bench.baseline -bench map_incr -n 100000000
exectime 0.884
utilization 1.0000
result 2
$ bench.opt -bench map_incr -n 100000000
exectime 0.895
utilization 1.0000
result 2
In this case, we have $r = \frac{T_1}{T_{seq}} = \frac{0.895}{0.884} = 1.012$ observed work efficiency. Our parallel program on a single processor is one percent slower than the sequential baseline. Such work efficiency is excellent.
7.2. Determining the threshold
The basic idea behind coarsening or granularity control is to revert to a fast serial algorithm when the input size falls below a certain threshold. To determine the optimal threshold, we can simply perform a search by running the code with different threshold settings.
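As a concrete illustration, the search can be as simple as timing the tuned function for a handful of candidate thresholds and keeping the fastest. The sketch below assumes the map_incr_rec function and the global threshold variable from the code above; the helper names and candidate values are illustrative and not part of PASL.

#include <chrono>
#include <cstdio>

// Declarations of the pieces assumed from the previous section.
extern long threshold;
void map_incr_rec(const long* source, long* dest, long lo, long hi);

// Set the threshold to a candidate value and time one run.
double time_with_threshold(long candidate, const long* source, long* dest, long n) {
  threshold = candidate;
  auto start = std::chrono::steady_clock::now();
  map_incr_rec(source, dest, 0, n);
  std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

// Try a handful of candidate thresholds and report the fastest.
void search_thresholds(const long* source, long* dest, long n) {
  long candidates[] = { 64, 256, 1024, 4096, 16384 };
  long best = -1;
  double best_time = 0.0;
  for (long candidate : candidates) {
    double t = time_with_threshold(candidate, source, dest, n);
    std::printf("threshold %ld: %.3f s\n", candidate, t);
    if (best == -1 || t < best_time) {
      best = candidate;
      best_time = t;
    }
  }
  std::printf("best threshold: %ld\n", best);
}

In practice one would also repeat each measurement several times and take the mean, to smooth out the experimental noise discussed earlier.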
While this approach can help find the right threshold on the particular machine on which we performed the search, there is no guarantee that the same threshold would work on another machine. In fact, there are examples in the literature showing that such optimizations are not portable, i.e., a piece of code optimized for a particular architecture may behave poorly on another.
In the general case, determining the right threshold is even more
difficult. To see the difficulty consider a generic (polymorphic),
higher-order function such as map
that takes a sequence and a
function and applies the function to the sequence. The problem is
that the threshold depends both on the type of the elements of the
sequence and the function to be mapped over the sequence. For
example, if each element itself is a sequence (the sequence is
nested), the threshold can be relatively small. If, however, the
elements are integers, then the threshold will likely need to be
relatively large. This makes it difficult to determine the threshold
because it depends on arguments that are unknown at compile time.
Essentially the same argument applies to the function being mapped
over the sequence: if the function is expensive, then the threshold
can be relatively small, but otherwise it will need to be relatively
large.
As we describe in this chapter, it is sometimes possible to determine the threshold completely automatically.
8. Chapter: Automatic granularity control
There has been significant research into determining the right threshold for a particular algorithm. This problem, known as the granularity-control problem, turns out to be a rather difficult one, especially if we wish to find a technique that can ensure close-to-optimal performance across different architectures. In this section, we present a technique for automatically controlling granularity by using asymptotic cost functions.
8.1. Complexity functions
Our automatic granularity-control technique requires assistance from the application programmer: for each parallel region of the program, the programmer must annotate the region with a complexity function, which is simply a C++ function that returns the asymptotic work cost associated with running the associated region of code.
map_incr_rec
An application of our map_incr_rec
function on a given range
$[lo, hi)$ of an array has work cost $cn = c
(hi - lo)$ for some constant $c$. As such, the following
lambda expression is one valid complexity function for our parallel
map_incr_rec
program.
auto map_incr_rec_complexity_fct = [&] (long lo, long hi) { return hi - lo; };
In general, the value returned by the complexity function need only be
precise with respect to the asymptotic complexity class of the
associated computation. The following lambda expression is another
valid complexity function for function map_incr_rec
. The complexity
function above is preferable, however, because it is simpler.
const long k = 123; auto map_incr_rec_complexity_fct2 = [&] (long lo, long hi) { return k * (hi - lo); };
More generally, suppose that we know that a given algorithm has work cost of $W = n + \log n$. Although it would be fine to assign to this algorithm exactly $W$, we could just as well assign to the algorithm the cost $n$, because the second term is dominated by the first. In other words, when expressing work costs, we only need to be precise up to the asymptotic complexity class of the work.
8.2. Controlled statements
In PASL, a controlled statement, or cstmt
, is an annotation in
the program text that activates automatic granularity control for a
specified region of code. In particular, a controlled statement
behaves as a C++ statement that has the special ability to choose on
the fly whether or not the computation rooted at the body of the
statement spawns parallel threads. To support such automatic
granularity control, PASL uses a prediction algorithm to map the
asymptotic work cost (as returned by the complexity function) to
actual processor cycles. When the predicted processor cycles of a
particular instance of the controlled statement fall below a
threshold (determined automatically for the specific machine), then
that instance is sequentialized, by turning off the ability to spawn
parallel threads for the execution of that instance. If the predicted
processor cycle count is higher than the threshold, then the statement
instance is executed in parallel.
In other words, the reader can think of a controlled statement as a
statement that executes in parallel when the benefits of parallel
execution far outweigh its cost, and that otherwise executes
sequentially, in a way similar to the sequential elision of its
body. We note that while the sequential execution is similar to a
sequential elision, it is not exactly the same, because every call to
fork2 must still check whether it should create parallel threads or
run sequentially. Thus the execution may differ from the sequential
elision in terms of performance but not in terms of behavior or
semantics.
The code below uses a controlled statement to automatically select, at run time, the threshold size for our parallel array-increment function.
controller_type map_incr_rec_contr("map_incr_rec");

void map_incr_rec(const long* source, long* dest,
                  long lo, long hi) {
  long n = hi - lo;
  cstmt(map_incr_rec_contr, [&] { return n; }, [&] {
    if (n == 0) {
      // do nothing
    } else if (n == 1) {
      dest[lo] = source[lo] + 1;
    } else {
      long mid = (lo + hi) / 2;
      fork2([&] {
        map_incr_rec(source, dest, lo, mid);
      }, [&] {
        map_incr_rec(source, dest, mid, hi);
      });
    }
  });
}
The controlled statement takes three arguments, whose requirements are
specified below, and returns nothing (i.e., void
). The effectiveness
of the granularity controller may be compromised if any of the
requirements are not met.
-
The first argument is a reference to the controller object. The controller object is used by the controlled statement to collect profiling data from the program as the program runs. Every controller object is initialized with a string label (in the code above
"map_incr_rec"
). The label must be unique to the particular controller. Moreover, the controller must be declared as a global variable. -
The second argument is the complexity function. The type of the return value should be
long
. -
The third argument is the body of the controlled statement. The return type of the controlled statement should be
void
.
When the controlled statement chooses sequential evaluation for its
body, the effect is similar to what happens in the earlier code when
the input size falls below the threshold: the body and the recursion
tree rooted at it are sequentialized. When the controlled statement
chooses parallel evaluation, the calls to fork2() create parallel
threads.
8.2.1. Granularity control with alternative sequential bodies
It is not unusual for a divide-and-conquer algorithm to switch to a different algorithm at the leaves of its recursion tree. For example, sorting algorithms, such as quicksort, may switch to insertion sort at small problem sizes. In the same way, it is not unusual for parallel algorithms to switch to different sequential algorithms for handling small problem sizes. Such switching can be beneficial especially when the parallel algorithm is not asymptotically work efficient.
To provide such algorithmic switching, PASL offers an alternative form of controlled statement that accepts a fourth argument: the alternative sequential body. This alternative form of controlled statement behaves essentially the same way as the original described above, with the exception that when the PASL run time decides to sequentialize a particular instance of the controlled statement, it falls through to the provided alternative sequential body instead of the "sequential elision."
controller_type map_incr_rec_contr("map_incr_rec");

void map_incr_rec(const long* source, long* dest,
                  long lo, long hi) {
  long n = hi - lo;
  cstmt(map_incr_rec_contr, [&] { return n; }, [&] {
    if (n == 0) {
      // do nothing
    } else if (n == 1) {
      dest[lo] = source[lo] + 1;
    } else {
      long mid = (lo + hi) / 2;
      fork2([&] {
        map_incr_rec(source, dest, lo, mid);
      }, [&] {
        map_incr_rec(source, dest, mid, hi);
      });
    }
  }, [&] {
    for (long i = lo; i < hi; i++)
      dest[i] = source[i] + 1;
  });
}
Even though the parallel and sequential array-increment algorithms are
algorithmically identical, except for the calls to fork2()
, there is
still an advantage to using the alternative sequential body: the
sequential code does not pay for the parallelism overheads due to
fork2()
. Even when eliding fork2()
, the run-time-system has to
perform a conditional branch to check whether or not the context of
the fork2()
call is parallel or sequential. Because the cost of
these conditional branches adds up, the version with the sequential
body is going to be more work efficient. Another reason why a
sequential body may be more efficient is that it can be written more
simply, for example by using a for-loop instead of recursion, which is
faster in practice.
Note
|
Recommended style for programming with controlled statements
In general, we recommend that the code of the parallel body be written
so as to be completely self contained, at least in the sense that the
parallel body code contains the logic that is necessary to handle
recursion all the way down to the base cases. The code for map_incr_rec above is written in this style.
We recommend this style because such parallel codes can be debugged, verified, and tuned in isolation, without relying on alternative sequential codes. |
8.3. Controlled parallel-for loops
Let us add one more component to our granularity-control toolkit: the parallel-for loop. By using this loop construct, we can avoid having to express recursion trees explicitly over and over again. For example, the following function performs the same computation as the example function we defined in the first lecture, yet it is much more compact and readable. Moreover, this code takes advantage of our automatic granularity control, which can replace the parallel-for loop with a serial loop when parallelism is not worthwhile.
loop_controller_type map_incr_contr("map_incr");

void map_incr(const long* source, long* dest, long n) {
  parallel_for(map_incr_contr, (long)0, n, [&] (long i) {
    dest[i] = source[i] + 1;
  });
}
Underneath, the parallel-for loop uses a divide-and-conquer routine
whose structure is similar to the structure of the divide-and-conquer
routine of our map_incr_rec
. Because the parallel-for loop generates a logarithmic-height
recursion tree, the map_incr routine just above has the same span as
the map_incr_rec routine that we defined earlier: $\log n$, where $n$
is the size of the input array.
Notice that the code above specifies no complexity function. The
reason is that this particular instance of the parallel-for loop
implicitly defines a complexity function. The implicit complexity
function reports a linear-time cost for any given range of the
iteration space of the loop. In other words, the implicit complexity
function assumes that per iteration the body of the loop performs a
constant amount of work. Of course, this assumption does not hold in
general. If we want to specify explicitly the complexity function, we
can use the form shown in the example below. The complexity function
is passed to the parallel-for loop as the fourth argument. The
complexity function takes as argument the range [lo, hi)
. In this
case, the complexity is linear in the number of iterations. The
function simply returns the number of iterations as the complexity.
loop_controller_type map_incr_contr("map_incr");

void map_incr(const long* source, long* dest, long n) {
  auto linear_complexity_fct = [] (long lo, long hi) {
    return hi-lo;
  };
  parallel_for(map_incr_contr, linear_complexity_fct, (long)0, n, [&] (long i) {
    dest[i] = source[i] + 1;
  });
}
The following code snippet shows a more interesting case for the complexity function. In this case, we are performing a multiplication of a dense matrix by a dense vector. The outer loop iterates over the rows of the matrix. The complexity function gives each range of row-wise iterations a cost proportional to the number of scalars in the corresponding rows, namely $n$ per row.
loop_controller_type dmdvmult_contr("dmdvmult");

// mtx: n x n dense matrix, vec: length-n dense vector
// dest: length-n dense vector
void dmdvmult(double* mtx, double* vec, double* dest, long n) {
  auto compl_fct = [&] (long lo, long hi) {
    return (hi-lo)*n;
  };
  parallel_for(dmdvmult_contr, compl_fct, (long)0, n, [&] (long i) {
    ddotprod(mtx, vec, dest, i);
  });
}
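The helper ddotprod used above is not defined in the text. A plausible sequential version, which computes the dot product of row i of the row-major matrix with the vector and stores the result into dest[i], might look as follows. Note that it is written here to take the dimension n explicitly, whereas the call above presumably obtains it another way; this is an illustrative assumption, not PASL code.

// Dot product of row i of an n x n row-major matrix with a length-n vector.
void ddotprod(double* mtx, double* vec, double* dest, long i, long n) {
  double sum = 0.0;
  for (long j = 0; j < n; j++)
    sum += mtx[i*n + j] * vec[j];
  dest[i] = sum;
}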
Matrix multiplication has been widely used as an example for parallel computing since the early days of the field. There are good reasons for this. First, matrix multiplication is a key operation that can be used to solve many interesting problems. Second, it is an expensive computation that is nearly cubic in the size of the input; it can thus become very costly even with modest inputs.
Fortunately, matrix multiplication can be parallelized relatively easily as shown above. The figure below shows the speedup for a sample run of this code. Observe that the speedup is rather good, achieving nearly excellent utilization.
While parallel matrix multiplication delivers excellent speedups, this is not common for many other algorithms on modern multicore machines, where many computations quickly become limited by the available memory bandwidth. Matrix multiplication does not suffer as much from memory-bandwidth limitations because it performs significant work per memory operation: it touches $\Theta(n^2)$ memory cells while performing nearly $\Theta(n^3)$ work.
9. Simple Parallel Arrays
Arrays are a fundamental data structure in sequential and parallel computing. When computing sequentially, arrays can sometimes be replaced by linked lists, especially because linked lists are more flexible. Unfortunately, linked lists are deadly for parallelism, because they require serial traversals to find elements; this makes arrays all the more important in parallel computing.
Unfortunately, it is difficult to find a good treatment of parallel arrays in C++: the various array implementations provided by C++ have been designed primarily for sequential computing. Each one has various pitfalls for parallel use.
By default, C++ arrays that are created by the new[] operator are
initialized sequentially. Therefore, the work and span of the call
new[n] are both linear in n. But we can initialize an array in
logarithmic span in the number of items, as sketched below.
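The following is a minimal sketch of one way to do so: allocate raw storage, which involves no per-cell work, and then initialize the cells with the parallel_for construct described earlier. This gives linear work and logarithmic span. The controller name and the function name are illustrative, not part of PASL.

#include <cstdlib>

loop_controller_type init_array_contr("init_array");

// Allocate n longs without initializing them, then set every cell to
// the given value in parallel.
long* create_initialized_array(long n, long value) {
  long* data = (long*)std::malloc(n * sizeof(long));
  parallel_for(init_array_contr, (long)0, n, [&] (long i) {
    data[i] = value;
  });
  return data;
}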
The "vector" data structure that is provided by the Standard Template
Library (STL) has similar issues. The STL vector implements a
dynamically resizable array that provides push, pop, and indexing
operations. The push and pop operations take amortized constant time
and the indexing operation constant time. As with C++
arrays, initialization of vectors can require linear work and span.
The STL vector also provides the method resize(n)
which changes the
size of the array to be n
. The resize operation takes, in the worst
case, linear work and span in proportion to the new size, n
. In
other words, the resize function uses a sequential algorithm to fill
the cells in the vector. The resize
operation is therefore not
arrays.
Such sequential computations that exist behind the wall of abstraction of a language or library can harm parallelism by introducing implicit sequential dependencies. Finding the source of such sequential bottlenecks can be time consuming, because they are hidden behind the abstraction boundary of the native array abstraction that is provided by the programming language.
We can avoid such pitfalls by carefully designing our own array data structure. Because array implementations are quite subtle, we consider our own implementation of parallel arrays, which makes explicit the cost of array operations, allowing us to control them quite carefully. Specifically, we carefully control initialization and disallow implicit copy operations on arrays, because copy operations can harm observable work efficiency (their asymptotic work cost is linear).
9.1. Interface and cost model
The key components of our array data structure, sparray
, are shown
by the code snippet below. An sparray
can store 64-bit words only;
in particular, they are monomorphic and fixed to values of type
long
. Generalizing sparray
to store values of arbitrary types can
be achieved by use of C++ templates. We stick to
monomorphic arrays here to simplify the presentation.
using value_type = long;

class sparray {
public:
  // n: size to give to the array; by default 0
  sparray(unsigned long n = 0);
  // constructor from list
  sparray(std::initializer_list<value_type> xs);
  // indexing operator
  value_type& operator[](unsigned long i);
  // size of the array
  unsigned long size() const;
};
The class sparray
provides two constructors. The first one takes in
the size of the array (set to 0 by default) and allocates an
uninitialized array of the specified size (nullptr
if size is 0). The
second constructor takes in a list specified by curly braces and
allocates an array with the same size. Since the argument to this
constructor must be specified explicitly in the program, its size is
constant by definition.
The cost model guaranteed by our implementation of parallel array is as follows:
-
Constructors/Allocation: The work and span of simply allocating an array on the heap, without initialization, is constant. The second constructor performs initialization, based on constant-size lists, and thus also has constant work and span.
-
Array indexing: Each array-indexing operation, that is the operation which accesses an individual cell, requires constant work and constant span.
-
Size operation: The work and the span of accessing the size of the array is constant.
-
Destructors/Deallocation: Not shown, the class includes a destructor that frees the array. Combined with the "move assignment operator" that C++ allows us to define, destructors can be used to deallocate arrays when they are out of scope. The destructor takes constant time because the contents of the array are just bits that do not need to be destructed individually.
-
Move assignment operator: Not shown, the class includes a move-assignment operator that is invoked when an array is assigned to a variable (for example, when a function returns an array or when std::move is used). This operator moves the contents of the right-hand side of the assignment into the left-hand side. This operation takes constant time. A sketch of how the destructor and the move-assignment operator might look appears just after this list.
-
Copy constructor: The copy constructor of
sparray
is disabled. This prohibits copying an array unintentionally, for example, by passing the array by value to a function.
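To make the constant-time claims above concrete, here is an illustrative sketch of how the destructor and the move-assignment operator of a class like sparray could be written. The internal field names and layout are assumptions; the actual PASL implementation may differ.

// Hypothetical internal layout: a pointer to heap-allocated items plus
// a size field. value_type is long, as in the sparray interface above.
class sparray_sketch {
public:
  ~sparray_sketch() {
    delete [] items;              // constant time: longs need no per-item destruction
  }
  sparray_sketch& operator=(sparray_sketch&& other) {
    if (this != &other) {
      delete [] items;            // release the current contents
      items = other.items;        // steal the pointer: constant time
      nb_items = other.nb_items;
      other.items = nullptr;      // leave the source array empty
      other.nb_items = 0;
    }
    return *this;
  }
  sparray_sketch(const sparray_sketch&) = delete;  // copying is disabled, as in sparray
private:
  value_type* items = nullptr;
  unsigned long nb_items = 0;
};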
Note
|
The constructors of our array class do not perform initializations that involve non-constant work. If desired, the programmer can write an initializer that performs linear work and logarithmic span (if the values used for initialization have non-constant time cost, these bounds may need to be scaled accordingly). |
The program below shows a basic use of sparray's. The first line
allocates and initializes the contents of the array to be three
numbers. The second uses the familiar indexing operator to access the
item at the second position in the array. The third line extracts the
size of the array. The fourth line assigns the value 5 to the cell at
index 2. The fifth prints the contents of that cell.
sparray xs = { 1, 2, 3 };
std::cout << "xs[1] = " << xs[1] << std::endl;
std::cout << "xs.size() = " << xs.size() << std::endl;
xs[2] = 5;
std::cout << "xs[2] = " << xs[2] << std::endl;
Output:
xs[1] = 2
xs.size() = 3
xs[2] = 5
9.2. Allocation and deallocation
Arrays can be allocated by specifying the size of the array.
sparray zero_length = sparray();
sparray another_zero_length = sparray(0);
sparray yet_another_zero_length;
sparray length_five = sparray(5);  // contents uninitialized

std::cout << "|zero_length| = " << zero_length.size() << std::endl;
std::cout << "|another_zero_length| = " << another_zero_length.size() << std::endl;
std::cout << "|yet_another_zero_length| = " << yet_another_zero_length.size() << std::endl;
std::cout << "|length_five| = " << length_five.size() << std::endl;
Output:
|zero_length| = 0
|another_zero_length| = 0
|yet_another_zero_length| = 0
|length_five| = 5
Just after creation, the array contents consist of uninitialized
bits. We use this convention because the programmer needs flexibility
to decide the parallelization strategy to initialize the contents.
Internally, the sparray
class consists of a size field and a pointer
to the first item in the array. The contents of the array are heap
allocated (automatically) by the constructor of the sparray
class.
Deallocation occurs when the array’s destructor is called. The
destructor can be called by the programmer or by run-time system (of
C++) if an object storing the array is destructed. Since
C++ destructs (stack allocated) variables that go out of
scope when a function returns, we can combine the stack discipline
with heap-allocated arrays to manage the deallocation of arrays mostly
automatically. We give several examples of this automatic
deallocation scheme below.
In the function below, the sparray
object that is allocated on the
frame of foo
is deallocated just before foo
returns, because the
variable xs
containing it goes out of scope.
void foo() {
  sparray xs = sparray(10);
  // array deallocated just before foo() returns
}
Care must be taken when managing arrays, because nothing prevents the programmer from returning a dangling pointer.
value_type* foo() {
  sparray xs = sparray(10);
  ...
  return &xs[0];  // array deallocated just before foo() returns
}

...

std::cout << "contents of deallocated memory: " << *foo() << std::endl;
Output:
contents of deallocated memory: .... (undefined)
It is safe to take a pointer to a cell in the array, when the array itself is still in scope. For example, in the code below, the contents of the array are used strictly when the array is in scope.
void foo() {
  sparray xs = sparray(10);
  xs[0] = 34;
  bar(&xs[0]);
  ...
  // array deallocated just before foo() returns
}

void bar(value_type* p) {
  std::cout << "xs[0] = " << *p << std::endl;
}
Output:
xs[0] = 34
We are going to see that we can rely on cleaner conventions for passing references to arrays to functions.
9.3. Passing to and returning from functions
If you are familiar with C++ container libraries, such as STL, this aspect of our array implementation, namely the calling conventions, may be unfamiliar: our arrays cannot be passed by value. We forbid passing by value because passing by value implies creating a fresh copy for each array being passed to or returned by a function. Of course, sometimes we really need to copy an array. In this case, we choose to copy the array explicitly, so that it is obvious where in our code we are paying a linear-time cost for copying out the contents. We will return to the issue of copying later.
What then happens if the program tries to pass an array to a function?
The program will be rejected by the compiler. The code below does
not compile, because we have disabled the copy constructor of our
sparray
class.
value_type foo(sparray xs) {
  return xs[0];
}

void bar() {
  sparray xs = { 1, 2 };
  foo(xs);
}
The following code does compile, because in this case we pass
the array xs
to foo
by reference.
value_type foo(const sparray& xs) {
  return xs[0];
}

void bar() {
  sparray xs = { 1, 2 };
  foo(xs);
}
Returning an array is straightforward: we take advantage of a feature
of modern C++11 which automatically detects when it is
safe to move a structure by a constant-time pointer swap. Code of the
following form is perfectly legal, even though we disabled the copy
constructor of sparray
, because the compiler is able to transfer
ownership of the array to the caller of the function. Moreover, the
transfer is guaranteed to take constant work, not the linear work that
a copy would take. The return is fast because, internally, all that
happens is that a couple of words are exchanged. Such "move on return"
is achieved by the move-assignment operator of the sparray class.
sparray fill_seq(long n, value_type x) {
  sparray tmp = sparray(n);
  for (long i = 0; i < n; i++)
    tmp[i] = x;
  return tmp;
}

void bar() {
  sparray xs = fill_seq(4, 1234);
  std::cout << "xs = " << xs << std::endl;
}
Output after calling bar()
:
xs = { 1234, 1234, 1234, 1234 }
Although it is perfectly fine to assign to an array variable the contents of a given array, what happens may be surprising to those who know the usual conventions of C++11 container libraries. Consider the following program.
sparray xs = fill_seq(4, 1234);
sparray ys = fill_seq(3, 333);
ys = std::move(xs);
std::cout << "xs = " << xs << std::endl;
std::cout << "ys = " << ys << std::endl;
The assignment from xs to ys simultaneously destroys the current
contents of ys, namely the array { 333, 333, 333 } (its destructor
deallocates it), moves the contents of xs into ys, and empties out
xs. This behavior is defined as part of the move-assignment operator
of sparray. The result is the following.
xs = { }
ys = { 1234, 1234, 1234, 1234 }
The reason we use this semantics for assignment is that the assignment takes constant time. Later, we are going to see that we can efficiently copy items out of an array. But for reasons we already discussed, the copy operation is going to be explicit.
Exercise: duplicating items in parallel
9.4. Tabulation
A tabulation is a parallel operation which creates a new array of a
given size and initializes the contents according to a given
"generator function". The call tabulate(g, n)
allocates an array of
length n
and assigns to each valid index in the array i
the value
returned by g(i)
.
template <class Generator> sparray tabulate(Generator g, long n);
Tabulations can be used to generate sequences according to a specified formula.
sparray evens = tabulate([&] (long i) { return 2*i; }, 5);
std::cout << "evens = " << evens << std::endl;
Output:
evens = { 0, 2, 4, 6, 8 }
Copying an array can be expressed as a tabulation.
sparray mycopy(const sparray& xs) {
  return tabulate([&] (long i) { return xs[i]; }, xs.size());
}
duplicate and ktimes exercises:

sparray ktimes(const sparray& xs, long k) {
  long m = xs.size() * k;
  return tabulate([&] (long i) { return xs[i/k]; }, m);
}

sparray duplicate(const sparray& xs) {
  return ktimes(xs, 2);
}
The implementation of tabulate is a straightforward application of the parallel-for loop.
loop_controller_type tabulate_contr("tabulate");

template <class Generator>
sparray tabulate(Generator g, long n) {
  sparray tmp = sparray(n);
  parallel_for(tabulate_contr, (long)0, n, [&] (long i) {
    tmp[i] = g(i);
  });
  return tmp;
}
Note that the work and span of a tabulation depend on the generator function passed as an argument to it. Let us first analyze the simple case, where the generator function takes constant work (and hence, constant span). In this case, it should be clear that a tabulation should take work linear in the size of the array. The reason is that the only work performed by the body of the loop is performed by the constant-time generator function. Since the loop itself performs as many iterations as positions in the array, the work cost is indeed linear in the size of the array. The span cost of the tabulation is the sum of two quantities: the span taken by the loop and the maximum value of the spans taken by the applications of the generator function. Recall that we saw before that the span cost of a parallel-for loop with $n$ iterations is $\log n$. The maximum of the spans of the generator applications is a constant. Therefore, we can conclude that, in this case, the span cost is logarithmic in the size of the array.
The story is only a little more complicated when we generalize to consider non-constant time generator functions. Let $W(\mathtt{g}(i))$ denote the work performed by an application of the generator function to a specified value $i$. Similarly, let $S(\mathtt{g}(i))$ denote the span. Then the tabulation takes work
$O(n + \sum_{i=0}^{n-1} W(\mathtt{g}(i)))$
and span
$O(\log n + \max_{0 \le i < n} S(\mathtt{g}(i)))$.
9.5. Higher-order granularity controllers
We just observed that each application of our tabulate operation can
have different work and span cost depending on the selection of the
generator function. Pause for a moment and consider how this
observation could impact our granularity-control scheme. Consider, in
particular, the way that the tabulate function uses its
granularity-controller object, tabulate_contr
. This one controller
object is shared by every call site of tabulate()
.
The problem is that all of the profiling data that the granularity
controller collects at run time is lumped together, even though each
generator function that is passed to the tabulate function can have
completely different performance characteristics. The threshold that
is best for one generator function is not necessarily good for another
generator function. For this reason, there must be one distinct
granularity-control object for each generator function that is passed
to tabulate
. For this reason, we refine our solution from the one
above to the one below, which relies on C++ template programming to
effectively key accesses to the granularity controller object by the
type of the generator function.
template <class Generator>
class tabulate_controller {
public:
  static loop_controller_type contr;
};

template <class Generator>
loop_controller_type tabulate_controller<Generator>::contr(
  "tabulate" + std::string(typeid(Generator).name()));

template <class Generator>
sparray tabulate(Generator g, long n) {
  sparray tmp = sparray(n);
  parallel_for(tabulate_controller<Generator>::contr, (long)0, n, [&] (long i) {
    tmp[i] = g(i);
  });
  return tmp;
}
To be more precise, the use of the template parameter in the class
tabulate_controller
ensures that each generator function in the
program that is passed to tabulate()
gets its own unique instance of
the controller object. The rules of the template system regarding
static class members that appear in templated classes ensure this
behavior. Although it is not essential for our purposes to have a
precise understanding of the template system, it is useful to know
that the template system provides us with the exact mechanism that we
need to properly separate granularity controllers of distinct
instances of higher-order functions, such as tabulation.
Note
|
Above, we still assume constant-work generator functions. |
9.6. Reduction
A reduction is an operation which combines a given set of values according to a specified identity element and a specified associative combining operator. Let $S$ denote a set. Recall from algebra that an associative combining operator is any binary operator $\oplus$ such that, for any three items $x,y,z \in S$, the following holds: $(x \oplus y) \oplus z = x \oplus (y \oplus z)$.
An element $\mathbf{I} \in S$ is an identity element if for any $x \in S$ the following holds: $x \oplus \mathbf{I} = \mathbf{I} \oplus x = x$.
This algebraic structure consisting of $(S, \oplus, \mathbf{I})$ is called a monoid and is particularly worth knowing because this structure is a common pattern in parallel computing.
-
$S$ = the set of all 64-bit unsigned integers; $\oplus$ = addition modulo $2^{64}$; $\mathbf{I}$ = 0
-
$S$ = the set of all 64-bit unsigned integers; $\oplus$ = multiplication modulo $2^{64}$; $\mathbf{I}$ = 1
-
$S$ = the set of all 64-bit unsigned integers; $\oplus$ = max function; $\mathbf{I}$ = 0
The identity element is important because we are working with sequences: having a base element is essential for dealing with empty sequences. For example, what should the sum of the empty sequence be? More interestingly, what should be the maximum (or minimum) element of an empty sequence? The identity element specifies this behavior.
What about the associativity of $\oplus$? Why does associativity matter? Suppose we are given the sequence $ [ a_0, a_1, \ldots, a_n ] $. The serial reduction of this sequence always computes the expression $(a_0 \oplus a_1 \oplus \ldots \oplus a_n)$. However, when the reduction is performed in parallel, the expression computed by the reduction could be $( (a_0 \oplus a_1 \oplus a_2 \oplus a_3) \oplus (a_4 \oplus a_5) \oplus \ldots \oplus (a_{n-1} \oplus a_n) )$ or $( (a_0 \oplus a_1 \oplus a_2) \oplus (a_3 \oplus a_4 \oplus a_5) \oplus \ldots \oplus (a_{n-1} \oplus a_n) )$. In general, the exact placement of the parentheses in the parallel computation depends on the way that the parallel algorithm decomposes the problem. Associativity gives the parallel algorithm the flexibility to choose an efficient order of evaluation and still get the same result in the end. The flexibility to choose the decomposition of the problem is exploited by efficient parallel algorithms, for reasons that should be clear by now. In summary, associativity is a key building block to the solution of many problems in parallel algorithms.
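To see concretely why associativity matters, consider an operator that is not associative, such as subtraction: re-parenthesizing, which a parallel reduction is free to do, changes the result. The following plain C++ check, independent of PASL, makes the point.

#include <iostream>

int main() {
  long a = 8, b = 4, c = 2;
  long left_grouping  = (a - b) - c;   // serial, left-to-right grouping: 2
  long right_grouping = a - (b - c);   // an alternative grouping: 6
  std::cout << left_grouping << " versus " << right_grouping << std::endl;
  return 0;
}

Floating-point addition is a practically important example of the same issue: it is only approximately associative, so parallel reductions over floating-point values may produce slightly different results from run to run.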
Now that we have monoids for describing a generic method for combining
two items, we can consider a generic method for combining many items
in parallel. Once we have this ability, we will see that we can solve
the remaining problems from last homework by simply plugging the
appropriate monoids into our generic operator, reduce
. The interface
of this operator in our framework is specified below. The first
parameter corresponds to $\oplus$, the second to the
identity element, and the third to the sequence to be processed.
template <class Assoc_binop> value_type reduce(Assoc_binop b, value_type id, const sparray& xs);
We can solve our first problem by plugging integer plus as $\oplus$ and 0 as $\mathbf{I}$.
auto plus_fct = [&] (value_type x, value_type y) { return x+y; };
sparray xs = { 1, 2, 3 };
std::cout << "reduce(plus_fct, 0, xs) = " << reduce(plus_fct, 0, xs) << std::endl;
Output:
reduce(plus_fct, 0, xs) = 6
We can solve our second problem in a similar fashion. Note that in this case, since we know that the input sequence is nonempty, we can pass the first item of the sequence as the identity element. What could we do if we instead wanted a solution that can deal with zero-length sequences? What identity element might make sense in that case? Why?
Let us start by solving a special case: the one where the input sequence is nonempty.
auto max_fct = [&] (value_type x, value_type y) { return std::max(x, y); };
sparray xs = { -3, 1, 634, 2, 3 };
std::cout << "reduce(max_fct, xs[0], xs) = " << reduce(max_fct, xs[0], xs) << std::endl;
Output:
reduce(max_fct, xs[0], xs) = 634
Observe that in order to seed the reduction we selected the
provisional maximum value to be the item at the first position of the
input sequence. Now let us handle the general case by seeding with the
smallest possible value of type long
.
long max(const sparray& xs) { return reduce(max_fct, LONG_MIN, xs); }
The value of LONG_MIN
is defined by <limits.h>
.
Like the tabulate function, reduce is a higher-order function. Just like any other higher-order function, the work and span costs have to account for the cost of the client-supplied function, which is in this case, the associative combining operator.
9.7. Scan
A scan is an iterated reduction that is typically expressed in one of two forms: inclusive and exclusive. The inclusive form maps a given sequence $ [ x_0, x_1, x_2, \ldots, x_{n-1} ] $ to $ [ x_0, x_0 \oplus x_1, x_0 \oplus x_1 \oplus x_2, \ldots, x_0 \oplus x_1 \oplus \ldots \oplus x_{n-1} ] $.
template <class Assoc_binop> sparray scan_incl(Assoc_binop b, value_type id, const sparray& xs);
scan_incl(b, 0, sparray({ 2, 1, 8, 3 }))
  = { reduce(b, 0, { 2 }),
      reduce(b, 0, { 2, 1 }),
      reduce(b, 0, { 2, 1, 8 }),
      reduce(b, 0, { 2, 1, 8, 3 }) }
  = { 0+2, 0+2+1, 0+2+1+8, 0+2+1+8+3 }
  = { 2, 3, 11, 14 }
The exclusive form maps a given sequence $ [ x_0, x_1, x_2, \ldots, x_{n-1} ] $ to $ [ \mathbf{I}, x_0, x_0 \oplus x_1, x_0 \oplus x_1 \oplus x_2, \ldots, x_0 \oplus x_1 \oplus \ldots \oplus x_{n-2} ] $. For convenience, we extend the result of the exclusive form with the total $ x_0 \oplus \ldots \oplus x_{n-1} $.
class scan_excl_result {
public:
  sparray partials;
  value_type total;
};

template <class Assoc_binop>
scan_excl_result scan_excl(Assoc_binop b, value_type id, const sparray& xs);
The example below represents the logical behavior of scan
, but
actually says nothing about the way scan is implemented.
scan_excl(b, 0, { 2, 1, 8, 3 }).partials
  = { reduce(b, 0, { }),
      reduce(b, 0, { 2 }),
      reduce(b, 0, { 2, 1 }),
      reduce(b, 0, { 2, 1, 8 }) }
  = { 0, 0+2, 0+2+1, 0+2+1+8 }
  = { 0, 2, 3, 11 }

scan_excl(b, 0, { 2, 1, 8, 3 }).total
  = reduce(b, 0, { 2, 1, 8, 3 })
  = 0+2+1+8+3
  = 14
Scan has applications in many parallel algorithms. To name just a few, scan has been used to implement radix sort, search for regular expressions, dynamically allocate processors, evaluate polynomials, etc. Suffice to say, scan is important and worth knowing about because scan is a key component of so many efficient parallel algorithms. In this course, we are going to study a few more applications not in this list.
The expansions shown above suggest the following sequential algorithm.
template <class Assoc_binop>
scan_excl_result scan_excl_seq(Assoc_binop b, value_type id, const sparray& xs) {
  long n = xs.size();
  sparray r = sparray(n);
  value_type x = id;
  for (long i = 0; i < n; i++) {
    r[i] = x;
    x = b(x, xs[i]);
  }
  return make_scan_result(r, x);
}
If we just blindly follow the specification above, we might be tempted to try the solution below.
loop_controller_type scan_contr("scan");

template <class Assoc_binop>
sparray scan_excl(Assoc_binop b, value_type id, const sparray& xs) {
  long n = xs.size();
  sparray result = sparray(n);
  result[0] = id;
  parallel_for(scan_contr, 1l, n, [&] (long i) {
    result[i] = reduce(b, id, slice(xs, 0, i));
  });
  return result;
}
Consider that our sequential algorithm takes linear time in the size of the input array. As such, finding a work-efficient parallel solution means finding a solution that also takes linear work in the size of the input array. The problem is that our parallel algorithm takes quadratic work: iteration $i$ of the loop performs $\Theta(i)$ work in its call to reduce, and $\sum_{i=1}^{n-1} \Theta(i) = \Theta(n^2)$. The algorithm is therefore not even asymptotically work efficient. Even worse, it performs a lot of redundant work.
Can we do better? Yes, in fact, there exist solutions that take, in the size of the input, both linear time and logarithmic span, assuming that the given associative operator takes constant time. It might be worth pausing for a moment to consider this fact, because the specification of scan may at first look like it would resist a solution that is both highly parallel and work efficient.
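To make this claim plausible, here is a sketch of one such solution, based on contraction. It reuses the sparray, tabulate, scan_excl_seq, and make_scan_result facilities from this chapter; the function name scan_excl_contract is ours, not PASL's. Each level halves the problem size and performs linear work, so the total work is linear. Because each level uses a logarithmic-span tabulate, this particular sketch has $O(\log^2 n)$ span; more refined implementations achieve the logarithmic span mentioned above.

template <class Assoc_binop>
scan_excl_result scan_excl_contract(Assoc_binop b, value_type id, const sparray& xs) {
  long n = xs.size();
  if (n <= 1)
    return scan_excl_seq(b, id, xs);   // the sequential base case shown earlier
  long m = (n + 1) / 2;
  // contract: combine adjacent pairs of items
  sparray sums = tabulate([&] (long i) {
    return (2*i + 1 < n) ? b(xs[2*i], xs[2*i + 1]) : xs[2*i];
  }, m);
  // recursively scan the contracted array
  scan_excl_result partial = scan_excl_contract(b, id, sums);
  // expand: even positions take the recursive result directly;
  // odd positions additionally fold in the preceding input item
  sparray result = tabulate([&] (long i) {
    value_type p = partial.partials[i / 2];
    return (i % 2 == 0) ? p : b(p, xs[i - 1]);
  }, n);
  return make_scan_result(result, partial.total);
}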
9.8. Derived operations
The remaining operations that we are going to consider are useful for writing more succinct code and for expressing special cases where certain optimizations are possible. All of the operations that are presented in this section are derived forms of tabulate, reduce, and scan.
9.8.1. Map
The map(f, xs)
operation applies f
to each item in xs
returning
the array of results. It is straightforward to implement as a kind of
tabulation, as we have at our disposal efficient indexing.
template <class Func>
sparray map(Func f, const sparray& xs) {
  return tabulate([&] (long i) { return f(xs[i]); }, xs.size());
}
The array-increment operation that we defined on the first day of
lecture is simple to express via map
.
sparray map_incr(const sparray& xs) {
  return map([&] (value_type x) { return x+1; }, xs);
}
The work and span costs of map are similar to those of
tabulate. Granularity control is handled similarly as well. It is not
obvious, however, that the granularity-controller object corresponding
to map is instantiated properly. It turns out that, with no extra
effort, the behavior that we want is indeed preserved: each distinct
function that is passed to map is assigned a distinct granularity
controller. Although it is outside the scope of this course, the
reason that this scheme works in our current design owes to specifics
of the C++ template system.
9.8.2. Fill
The call fill(v, n)
creates an array that is initialized with a
specified number of items of the same value. Although just another
special case for tabulation, this function is worth having around
because internally the fill
operation can take advantage of special
hardware optimizations, such as SIMD instructions, that increase
parallelism.
sparray fill(value_type v, long n);
sparray threes = fill(3, 5);
std::cout << "threes = " << threes << std::endl;
Output:
threes = { 3, 3, 3, 3, 3 }
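Setting aside the hardware optimizations just mentioned, a minimal version of fill can indeed be written as a one-line tabulation. The sketch below is for illustration only and is not necessarily how the library implements it.

// Minimal sketch: fill expressed as a tabulation.
sparray fill(value_type v, long n) {
  return tabulate([&] (long) { return v; }, n);
}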
9.8.3. Copy
Just like fill
, the copy
operation can take advantage of special
hardware optimizations that accelerate memory traffic. For the same
reason, the copy
operation is a good choice when a full copy is
needed.
sparray copy(const sparray& xs);
sparray xs = { 3, 2, 1 };
sparray ys = copy(xs);
std::cout << "xs = " << xs << std::endl;
std::cout << "ys = " << ys << std::endl;
Output:
xs = { 3, 2, 1 }
ys = { 3, 2, 1 }
9.8.4. Slice
We now consider a slight generalization on the copy operator: with the
slice
operation we can copy out a range of positions from a given
array rather than the entire array.
sparray slice(const sparray& xs, long lo, long hi);
The slice
operation takes a source array and a range to copy out and
returns a fresh array that contains copies of the items in the given
range.
sparray xs = { 1, 2, 3, 4, 5 };
std::cout << "slice(xs, 1, 3) = " << slice(xs, 1, 3) << std::endl;
std::cout << "slice(xs, 0, 4) = " << slice(xs, 0, 4) << std::endl;
Output:
slice(xs, 1, 3) = { 2, 3 }
slice(xs, 0, 4) = { 1, 2, 3, 4 }
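Like copy (see the mycopy example above), slice can be expressed as a tabulation over the requested range. Again, the following is a sketch rather than the library's definition.

// Minimal sketch: slice expressed as a tabulation over [lo, hi).
sparray slice(const sparray& xs, long lo, long hi) {
  return tabulate([&] (long i) { return xs[lo + i]; }, hi - lo);
}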
9.8.5. Concat
In contrast to slice
, the concat
operation lets us "copy in" to a
fresh array.
sparray concat(const sparray& xs, const sparray& ys);
sparray xs = { 1, 2, 3 };
sparray ys = { 4, 5 };
std::cout << "concat(xs, ys) = " << concat(xs, ys) << std::endl;
Output:
concat(xs, ys) = { 1, 2, 3, 4, 5 }
9.8.6. Prefix sums
The prefix sums problem is a special case of the scan problem. We have defined two solutions for two variants of the problem: one for the exclusive prefix sums and one for the inclusive case.
sparray prefix_sums_incl(const sparray& xs); scan_excl_result prefix_sums_excl(const sparray& xs);
sparray xs = { 2, 1, 8, 3 };
sparray incl = prefix_sums_incl(xs);
scan_excl_result excl = prefix_sums_excl(xs);
std::cout << "incl = " << incl << std::endl;
std::cout << "excl.partials = " << excl.partials << "; excl.total = " << excl.total << std::endl;
Output:
incl = { 2, 3, 11, 14 }
excl.partials = { 0, 2, 3, 11 }; excl.total = 14
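Since prefix sums are, as noted above, a special case of scan, both operations can be written as thin wrappers around scan instantiated with addition and identity 0. The following is a possible sketch, not necessarily the library's code.

// Sketch: prefix sums as scans over integer addition.
auto plus_op = [] (value_type x, value_type y) { return x + y; };

sparray prefix_sums_incl(const sparray& xs) {
  return scan_incl(plus_op, 0, xs);
}

scan_excl_result prefix_sums_excl(const sparray& xs) {
  return scan_excl(plus_op, 0, xs);
}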
9.8.7. Filter
The last data-parallel operation that we are going to consider is the operation that copies out items from a given array based on a given predicate function.
template <class Predicate> sparray filter(Predicate pred, const sparray& xs);
For our purposes, a predicate function is any function that takes a
value of type long
(i.e., value_type
) and returns a value of type
bool
.
The following function copies out the even numbers it receives in the array of its argument.
bool is_even(value_type x) {
  return (x%2) == 0;
}

sparray extract_evens(const sparray& xs) {
  return filter([&] (value_type x) { return is_even(x); }, xs);
}

sparray xs = { 3, 5, 8, 12, 2, 13, 0 };
std::cout << "extract_evens(xs) = " << extract_evens(xs) << std::endl;
Output:
extract_evens(xs) = { 8, 12, 2, 0 }
The particular instance of the filter problem that we are considering is a little tricky because we are working with fixed-size arrays. In particular, what requires care is the method that we use to copy the selected items out of the input array to the output array. We need to first run a pass over the input array, applying the predicate function to the items, to determine which items are to be written to the output array. Furthermore, we need to track how many items are to be written so that we know how much space to allocate for the output array.
template <class Predicate> sparray filter(Predicate pred, const sparray& xs) { long n = xs.size(); long m = 0; sparray flags = array(n); for (long i = 0; i < n; i++) if (pred(xs[i])) { flags[i] = true; m++; } sparray result = array(m); long k = 0; for (long i = 0; i < n; i++) if (flags[i]) result[k++] = xs[i]; return result; }
9.8.8. Parallel-filter problem
The starting point for our solution is the following code.
template <class Predicate> sparray filter(Predicate p, const sparray& xs) { sparray flags = map(p, xs); return pack(flags, xs); }
The challenge of this exercise is to solve the following problem: given two arrays of the same size, the first consisting of boolean valued fields and the second containing the values, return the array that contains (in the same relative order as the items from the input) the values selected by the flags. Your solution should take linear work and logarithmic span in the size of the input.
sparray pack(const sparray& flags, const sparray& xs);
sparray flags = { true, false, false, true, false, true, true }; sparray xs = { 34, 13, 5, 1, 41, 11, 10 }; std::cout << "pack(flags, xs) = " << pack(flags, xs) << std::endl;
Output:
pack(flags, xs) = { 34, 1, 11, 10 }
Tip
|
You can use scans to implement pack . |
Note
|
Even though our arrays can store only 64-bit values of type
long , we can nevertheless store values of type bool , as we have
done just above with the flags array. The compiler automatically
promotes boolean values to long values without causing us any
problems, at least with respect to the correctness of our
solutions. However, if we want to be more space efficient, we need to
use arrays that are capable of packing values of type bool more
efficiently, e.g., into single- or eight-bit fields. It should easy to
convince yourself that achieving such specialized arrays is not
difficult, especially given that the template system makes it easy to
write polymorphic containers. |
9.9. Summary of operations
9.9.1. Tabulate
template <class Generator> sparray tabulate(Generator g, long n);
The call tabulate(g, n)
returns the length-n
array where the i
th element is given by g(i)
.
Let $W(\mathtt{g}(i))$ denote the work performed by an application of the generator function to a specified value $i$. Similarly, let $S(\mathtt{g}(i))$ denote the span. Then the tabulation takes work
and span
9.9.2. Reduce
template <class Assoc_binop> value_type reduce(Assoc_binop b, value_type id, const sparray& xs);
The call reduce(b, id, xs)
is logically equal to id
if xs.size()
== 0
, xs[0]
if xs.size() == 1
, and
b(reduce(b, id, slice(xs, 0, n/2)), reduce(b, id, slice(xs, n/2, n)))
otherwise where n == xs.size()
.
The work and span cost are $O(n)$ and $O(\log
n)$ respectively, where $n$ denotes the size of the input
sequence xs
. This cost assumes that the work and span of b
are
constant. If it’s not the case, then refer directly to the
implementation of reduce
.
9.9.3. Scan
template <class Assoc_binop> sparray scan_incl(Assoc_binop b, value_type id, const sparray& xs);
For an associative function b
and corresponding identity id
, the
return result of the call scan_incl(b, id, xs)
is logically
equivalent to
tabulate([&] (long i) { return reduce(b, id, slice(xs, 0, i+1)); }, xs.size())
class scan_excl_result { public: sparray partials; value_type total; }; template <class Assoc_binop> scan_excl_result scan_excl(Assoc_binop b, value_type id, const sparray& xs);
For an associative function b
and corresponding identity id
, the
call scan_excl(b, id, xs)
returns the object res
, such that
res.partials
is logically equivalent to
tabulate([&] (long i) { return reduce(b, id, slice(xs, 0, i)); }, xs.size())
and res.total
is logically equivalent to
reduce(b, id, xs)
The work and span cost are $O(n)$ and $O(\log
n)$ respectively, where $n$ denotes the size of the input
sequence xs
. This cost assumes that the work and span of b
are
constant. If it’s not the case, then refer directly to the
implementation of scan_incl
and scan_excl
.
9.9.4. Map
template <class Func> sparray map(Func f, sparray xs) { return tabulate([&] (long i) { return f(xs[i]); }, xs.size()); }
Let $W(\mathtt{f}(x))$ denote the work performed by an
application of the function f
to a specified value
$x$. Similarly, let $S(\mathtt{f}(x))$ denote
the span. Then the map takes work
and span
9.9.5. Fill
sparray fill(value_type v, long n);
Returns a length-n
array with all cells initialized to v
.
Work and span are linear and logarithmic, respectively.
9.9.6. Copy
sparray copy(const sparray& xs);
Returns a fresh copy of xs
.
Work and span are linear and logarithmic, respectively.
9.9.7. Slice
sparray slice(const sparray& xs, long lo, long hi);
The call slice(xs, lo, hi)
returns the array {xs[lo], xs[lo+1],
..., xs[hi-1]}
.
Work and span are linear and logarithmic, respectively.
9.9.8. Concat
sparray concat(const sparray& xs, const sparray& ys);
Concatenate the two sequences.
Work and span are linear and logarithmic, respectively.
9.9.9. Prefix sums
sparray prefix_sums_incl(const sparray& xs);
The result of the call prefix_sums_incl(xs)
is logically equivalent
to the result of the call
scan_incl([&] (value_type x, value_type y) { return x+y; }, 0, xs)
scan_excl_result prefix_sums_excl(const sparray& xs);
The result of the call prefix_sums_excl(xs)
is logically equivalent
to the result of the call
scan_excl([&] (value_type x, value_type y) { return x+y; }, 0, xs)
Work and span are linear and logarithmic, respectively.
9.9.10. Filter
template <class Predicate> sparray filter(Predicate p, const sparray& xs);
The call filter(p, xs)
returns the subsequence of xs
which
contains each xs[i]
for which p(xs[i])
returns true
.
Warning
|
Depending on the implementation, the predicate function p
may be called multiple times by the filter function. |
Work and span are linear and logarithmic, respectively.
10. Chapter: Parallel Sorting
In this chapter, we are going to study parallel implementations of quicksort and mergesort.
10.1. Quicksort
The quicksort algorithm for sorting an array (sequence) of elements is known to be a very efficient sequential sorting algorithm. A natural question thus is whether quicksort is similarly effective as a parallel algorithm?
Let us first convince ourselves, at least informally, that quicksort is actually a good parallel algorithm. But first, what do we mean by "parallel quicksort." Chances are that you think of quicksort as an algorithm that, given an array, starts by reorganizing the elements in the array around a randomly selected pivot by using an in-place partitioning algorithm, and then sorts the two parts of the array to the left and the right of the array recursively.
While this implementation of the quicksort algorithm is not
immediately parallel, it can be parallelized. Note that the recursive
calls are naturally independent. So we really ought to focus on the
partitioning algorithm. There is a rather simple way to do such a
partition in parallel by performing three filter
calls on the input
array, one for picking the elements less that the pivot, one for
picking the elements equal to the pivot, and another one for picking
the elements greater than the pivot. This algorithm can be described
as follows.
Now that we have a parallel algorithm, we can check whether it is a good algorithm or not. Recall that a good parallel algorithm is one that has the following three characteristics
-
It is asymptotically work efficient
-
It is observably work efficient
-
It is highly parallel, i.e., has low span.
Let us first convince ourselves that Quicksort is a highly parallel algorithm. Observe that
-
the dividing process is highly parallel because no dependencies exist among the intermediate steps involved in creating the three-way partition,
-
two recursive calls are parallel, and
-
concatenations are themselves highly parallel.
10.1.1. Asymptotic Work Efficiency and Parallelism
Let us now turn our attention to asymptotic and observed work efficiency. Recall first that quicksort can exhibit a quadratic-work worst-case behavior on certain inputs if we select the pivot deterministically. To avoid this, we can pick a random element as a pivot by using a random-number generator, but then we need a parallel random number generator. Here, we are going to side-step this issue by assuming that the input is randomly permuted in advance. Under this assumption, we can simply pick the pivot to be the first item of the sequence. With this assumption, our algorithm performs asymptotically the same work as sequential quicksort implementations that perform $\Theta(n\log{n})$ in expectation.
For the analysis, let’s assume a version of quicksort that compares the pivot $p$ to each key in the input once (instead of 3 times. The figure below illustrates the structure of an execution of quicksort by using a tree. Each node corresponds to a call to the quicksort function and is labeled with the key at that call. Note that the tree is a binary search tree.
Let’s observe some properties of quicksort.
-
In quicksort, a comparison always involves a pivot and another key. Since, the pivot is never sent to a recursive call, a key is selected as a pivot exactly once, and is not involved in further comparisons (after is becomes a pivot). Before a key is selected as a pivot, it may be compared to other pivots, once per pivot, and thus two keys are never compared more than once.
-
When the algorithm selects a key $y$ as a pivot and if $y$ is between two other keys $x, z$ such that $x < y < z$, it sends the two keys latexmath:[$y, z$ to two separate subtrees. The two keys $x$ and $z$ separated in this way are never compared again.
-
We can sum up the two observations: a key is compared with all its ancestors in the call tree and all its descendants in the call tree, and with no other keys.
Since a pair of key are never compared more than once, the total number of comparisons performed by quicksort can be expressed as the sum over all pairs of keys.
Let $X_n$ be the random variable denoting the total number of comparisons performed in an execution of quicksort with a randomly permuted input of size latexmath:[$n$.
We want to bound the expectation of $X_n$, $E \lbrack X_n \rbrack$.
For the analysis, let’s consider the final sorted order of the keys $T$, which corresponds to the output. Consider two positions $i, j \in \{1, \dots, n\}$ in the sequence $T$. We define following random variable:
We can write $X_n$ by summing over all $A_{ij}$'s:
By linearity of expectation, we have
Furthermore, since each $A_{ij}$ is an indicator random variable, $E \lbrack A_{ij} \rbrack = P(A_{ij}) = 1$. Our task therefore comes down to computing the probability that $T_i$ and $T_j$ are compared, i.e., $P(A_{ij} = 1)$, and working out the sum.
To compute this probability, note that each call takes the pivot $p$ and splits the sequence into two parts, one with keys larger than $p$ and the other with keys smaller than $p$. For any one call to quicksort there are three possibilities as illustrated in the figure above.
-
The pivot is (equal to) either $T_i$ or $T_j$, in which case $T_i$ and $T_j$ are compared and $A_{ij} = 1$. Since there are $j-i+1$ keys in the interval $T_i \ldots T_j$ and since each one is equally likely to be the first in the randomly permuted input, the $P(A_{ij} = 1) = \frac{2}{j-i+1}$.
-
The pivot is a key between $T_i$ and $T_j$, and $T_i$ and $T_j$ will never be compared; thus $A_{ij} = 0$.
-
The pivot is less than $T_i$ or greater than $T_j$. Then latexmath:[$T_i$ and $T_j$ are sent to the same recursive call. Whether $T_i$ and $T_j$ are compared will be determined in some later call.
The last step follows by the fact that $H_n = \ln{n} + O(1)$.
Having completed work, let’s analyze the span of quicksort. Recall that each call to quicksort partitions the input sequence of length $n$ into three subsequences $L$, $E$, and $R$, consisting of the elements less than, equal to, and greater than the pivot.
Let’s first bound the size of the left sequence $L$.
By symmetry, $E \lbrack |R| \rbrack \le \frac{n}{2}$. This reasoning applies at any level of the quicksort call tree. In other words, the size of the input decreases by $1/2$ in expectation at each level in the call tree.
Since pivot choice at each call is independent of the other calls. The expected size of the input at level $i$ is $E \lbrack Y_i \rbrack = \frac{n}{2^i}$.
Let’s calculate the expected size of the input at depth $i = 5\lg{n}$. By basic arithmetic, we obtain
Since $Y_i$'s are always non-negative, we can use Markov’s inequality, to turn expectations into probabilities as follows.
In other words, the probability that a given path in the quicksort call tree has a depth that exceeds $5 \lg{n} $ is tiny.
From this bound, we can calculute a bound on the depth of the whole tree. Note first that the tree has exactly $n+1$ leaves because it has $n$ internal nodes. Thus we have at most $n+1$ to consider. The probability that a given path expeeds a depth of $5\lg{n}$ is $n^{-4}$. Thus, by union bound the probability that any one of the paths exceed the depeth of $5\lg{n}$ is $(n+1) \cdot n^{-4} \le n^{-2}$. Thus the probability that the depth of the tree is greater than $5\lg{n}$ is $n^{-2}$.
By using Total Expectation Theorem or the Law of total expectation, we can now calculate expected span by dividing the sample space into mutually exclusive and exhaustive space as follows.
In thes bound, we used the fact that each call to quicksort has a span
of $O(\lg{n})$ because this is the span of filter
.
Here is an alternative analysis.
Let $M_n = \max\{|L|, |R|\}$, which is the size of larger
subsequence. The span of quicksort is determined by the sizes of these
larger subsequences. For ease of analysis, we will assume that
$|E| = 0$, as more equal elements will only decrease the
span. As the partition step uses filter
we have the following
recurrence for span:
To develop some intuition for the span analysis, let’s consider the probability that we split the input sequence more or less evenly. If we select a pivot that is greater than $T_{n/4}$ and less than $T_{3n/4}$] then $M_n$ is at most $3n/4$. Since all keys are equally likely to be selected as a pivot this probability is $\frac{3n/4 - n/4}{n} = 1/2$. The figure below illustrates this.
This observations implies that at each level of the call tree (every time a new pivot is selected), the size of the input to both calls decrease by a constant fraction (of $4/3$). At every two levels, the probability that the input size decreases by latexmath:[$4/3$ is the probability that it decreases at either step, which is at least $3/4$, etc. Thus at a small costant number of steps, the probability that we observe a $4/3$ factor decrease in the size of the input approches $1$ quickly. This suggest that at some after $c\log{n}$ levels quicksort should complete. We now make this intuition more precise.
For the analysis, we use the conditioning technique for computing expectations as suggested by the total expectation theorem. Let $X$ be a random variable and let $A_i$ be disjoint events that form a a partition of the sample space such that $P(A_i) > 0$. The Total Expectation Theorem or the Law of total expectation states that
Note first that $P(X_n \leq 3n/4) = 1/2$, since half of the randomly chosen pivots results in the larger partition to be at most $3n/4$ elements: any pivot in the range $T_{n/4}$ to $T_{3n/4}$ will do, where $T$ is the sorted input sequence.
By conditioning $S_n$ on the random variable $M_n$, we write,
We can re-write this
The rest is algebra
This is a recursion in $E \lbrack S(\cdot) \rbrack $ and solves easily to $E \lbrack S(n) \rbrack = O(\log^2 n)$.
10.1.2. Observable Work Efficency and Scalability
For an implementation to be observably work efficient, we know that we
must control granularity by switching to a fast sequential sorting
algorithm when the input is small. This is easy to achieve using our
granularity control technique by using seqsort()
, a fast sequential
algorithm provided in the code base; seqsort()
is really a call to
STL’s sort function. Of course, we have to assess observable work
efficiency experimentally after specifying the implementation.
The code for quicksort is shown below. Note that we use our array
class sparray
to store the input and output. To partition the
input, we use our parallel filter
function from the previous lecture
to parallelize the partitioning phase. Similarly, we use our parallel
concatenation function to constructed the sorted output.
controller_type quicksort_contr("quicksort"); sparray quicksort(const sparray& xs) { long n = xs.size(); sparray result = { }; cstmt(quicksort_contr, [&] { return n * std::log2(n); }, [&] { if (n == 0) { result = { }; } else if (n == 1) { result = { xs[0] }; } else { value_type p = xs[0]; sparray less = filter([&] (value_type x) { return x < p; }, xs); sparray equal = filter([&] (value_type x) { return x == p; }, xs); sparray greater = filter([&] (value_type x) { return x > p; }, xs); sparray left = { }; sparray right = { }; fork2([&] { left = quicksort(less); }, [&] { right = quicksort(greater); }); result = concat(left, equal, right); } }, [&] { result = seqsort(xs); }); return result; }
By using randomized-analysis techniques, it is possible to analyze the work and span of this algorithm. The techniques needed to do so are beyond the scope of this book. The interested reader can find more details in another book.
One consequence of the work and span bounds that we have stated above is that our quicksort algorithm is highly parallel: its average parallelism is $\frac{O(n \log n)}{O(\log^2 n)} = \frac{n}{\log n}$. When the input is large, there should be ample parallelism to keep many processors well fed with work. For instance, when $n = 100$ million items, the average parallelism is $\frac{10^8}{\log 10^8} \approx \frac{10^8}{40} \approx 3.7$ million. Since 3.7 million is much larger than the number of processors in our machine, that is, forty, we have a ample parallelism.
Unfortunately, the code that we wrote leaves much to be desired in terms of observable work efficiency. Consider the following benchmarking runs that we performed on our 40-processor machine.
$ prun speedup -baseline "bench.baseline" -parallel "bench.opt -proc 1,10,20,30,40" -bench quicksort -n 100000000
The first two runs show that, on a single processor, our parallel algorithm is roughly 6x slower than the sequential algorithm that we are using as baseline! In other words, our quicksort appears to have "6-observed work efficiency". That means we need at least six processors working on the problem to see even a small improvement compared to a good sequential baseline.
[1/6]
bench.baseline -bench quicksort -n 100000000
exectime 12.518
[2/6]
bench.opt -bench quicksort -n 100000000 -proc 1
exectime 78.960
The rest of the results confirm that it takes about ten processors to see a little improvement and forty processors to see approximately a 2.5x speedup. This plot shows the speedup plot for this program. Clearly, it does not look good.
[3/6]
bench.opt -bench quicksort -n 100000000 -proc 10
exectime 9.807
[4/6]
bench.opt -bench quicksort -n 100000000 -proc 20
exectime 6.546
[5/6]
bench.opt -bench quicksort -n 100000000 -proc 30
exectime 5.531
[6/6]
bench.opt -bench quicksort -n 100000000 -proc 40
exectime 4.761
Our analysis suggests that we have a good parallel algorithm for quicksort, yet our observations suggest that, at least on our test machine, our implementation is rather slow relative to our baseline program. In particular, we noticed that our parallel quicksort started out being 6x slower than the baseline algorithm. What could be to blame?
In fact, there are many implementation details that could be to blame. The problem we face is that identifying those causes experimentally could take a lot of time and effort. Fortunately, our quicksort code contains a few clues that will guide us in a good direction.
It should be clear that our quicksort is copying a lot of data and, moreover, that much of the copying could be avoided. The copying operations that could be avoided, in particular, are the array copies that are performed by each of the the three calls to filter and the one call to concat. Each of these operations has to touch each item in the input array.
Let us now consider a (mostly) in-place version of quicksort. This code is mostly in place because the algorithm copies out the input array in the beginning, but otherwise sorts in place on the result array. The code for this algorithm appears just below.
sparray in_place_quicksort(const sparray& xs) { sparray result = copy(xs); long n = xs.size(); if (n == 0) { return result; } in_place_quicksort_rec(&result[0], n); return result; }
controller_type in_place_quicksort_contr("in_place_quicksort"); void in_place_quicksort_rec(value_type* A, long n) { if (n < 2) { return; } cstmt(in_place_quicksort_contr, [&] { return nlogn(n); }, [&] { value_type p = A[0]; value_type* L = A; // below L are less than pivot value_type* M = A; // between L and M are equal to pivot value_type* R = A+n-1; // above R are greater than pivot while (true) { while (! (p < *M)) { if (*M < p) std::swap(*M,*(L++)); if (M >= R) break; M++; } while (p < *R) R--; if (M >= R) break; std::swap(*M,*R--); if (*M < p) std::swap(*M,*(L++)); M++; } fork2([&] { in_place_quicksort_rec(A, L-A); }, [&] { in_place_quicksort_rec(M, A+n-M); // Exclude all elts that equal pivot }); }, [&] { std::sort(A, A+n); }); }
We have good reason to believe that this code is, at least, going to be more work efficient than our original solution. First, it avoids the allocation and copying of intermediate arrays. And, second, it performs the partitioning phase in a single pass. There is a catch, however: in order to work mostly in place, our second quicksort code sacrified on parallelism. In specific, observe that the partitioning phase is now sequential. The span of this second quicksort is therefore linear in the size of the input and its average parallelism is therefore logarithmic in the size of the input.
So, we expect that the second quicksort is more work efficient but should scale poorly. To test the first hypothesis, let us run the second quicksort on a single processor.
$ bench.opt -bench in_place_quicksort -n 100000000 -proc 1
Indeed, the running time of this code is essentially same as what we observed for our baseline program.
exectime 12.500
total_idle_time 0.000
utilization 1.0000
result 1048575
Now, let us see how well the second quicksort scales by performing another speedup experiment.
$ prun speedup -baseline "bench.baseline" -parallel "bench.opt -proc 1,20,30,40" -bench quicksort,in_place_quicksort -n 100000000
[1/10]
bench.baseline -bench quicksort -n 100000000
exectime 12.031
[2/10]
bench.opt -bench quicksort -n 100000000 -proc 1
exectime 68.998
[3/10]
bench.opt -bench quicksort -n 100000000 -proc 20
exectime 5.968
[4/10]
bench.opt -bench quicksort -n 100000000 -proc 30
exectime 5.115
[5/10]
bench.opt -bench quicksort -n 100000000 -proc 40
exectime 4.871
[6/10]
bench.baseline -bench in_place_quicksort -n 100000000
exectime 12.028
[7/10]
bench.opt -bench in_place_quicksort -n 100000000 -proc 1
exectime 12.578
[8/10]
bench.opt -bench in_place_quicksort -n 100000000 -proc 20
exectime 1.731
[9/10]
bench.opt -bench in_place_quicksort -n 100000000 -proc 30
exectime 1.697
[10/10]
bench.opt -bench in_place_quicksort -n 100000000 -proc 40
exectime 1.661
Benchmark successful.
Results written to results.txt.
$ pplot speedup -series bench
The plot below shows one speedup curve for each of our two quicksort implementations. The in-place quicksort is always faster. However, the in-place quicksort starts slowing down a lot at 20 cores and stops after 30 cores. So, we have one solution that is observably not work efficient and one that is, and another that is the opposite. The question now is whether we can find a happy middle ground.
Tip
|
Eliminate unecessary copying and array allocations. |
Tip
|
Eliminate redundant work by building the partition in one pass instead of three. |
Tip
|
Find a solution that has the same span as our first quicksort code. |
We encourage students to look for improvements to quicksort independently. For now, we are going to consider parallel mergesort. This time, we are going to focus more on achieving better speedups.
10.2. Mergesort
As a divide-and-conquer algorithm, the mergesort algorithm, is a good candidate for parallelization, because the two recursive calls for sorting the two halves of the input can be independent. The final merge operation, however, is typically performed sequentially. It turns out to be not too difficult to parallelize the merge operation to obtain good work and span bounds for parallel mergesort. The resulting algorithm turns out to be a good parallel algorithm, delivering asymptotic, and observably work efficiency, as well as low span.
This process requires a "merge" routine which merges the contents of two specified subranges of a given array. The merge routine assumes that the two given subarrays are in ascending order. The result is the combined contents of the items of the subranges, in ascending order.
The precise signature of the merge routine appears below and its
description follows. In mergesort, every pair of ranges that are
merged are adjacent in memory. This observation enables us to write
the following function. The function merges two ranges of source array
xs
: [lo, mid)
and [mid, hi)
. A temporary array tmp
is used as
scratch space by the merge operation. The function writes the result
from the temporary array back into the original range of the source
array: [lo, hi)
.
void merge(sparray& xs, sparray& tmp, long lo, long mid, long hi);
sparray xs = { // first range: [0, 4) 5, 10, 13, 14, // second range: [4, 9) 1, 8, 10, 100, 101 }; merge(xs, sparray(xs.size()), (long)0, 4, 9); std::cout << "xs = " << xs << std::endl;
Output:
xs = { 1, 5, 8, 10, 10, 13, 14, 100, 101 }
To see why sequential merging does not work, let us implement the merge
function by using one provided by STL: std::merge()
. This merge
implementation performs linear work and span in the number of items
being merged (i.e., hi - lo
). In our code, we use this STL
implementation underneath the merge()
interface that we described
just above.
Now, we can assess our parallel mergesort with a sequential merge, as implemented by the code below. The code uses the traditional divide-and-conquer approach that we have seen several times already.
The code is asymptotically work efficient, because nothing significant has changed between this parallel code and the serial code: just erase the parallel annotations and we have a textbook sequential mergesort!
sparray mergesort(const sparray& xs) { long n = xs.size(); sparray result = copy(xs); mergesort_rec(result, sparray(n), (long)0, n); return result; } controller_type mergesort_contr("mergesort"); void mergesort_rec(sparray& xs, sparray& tmp, long lo, long hi) { long n = hi - lo; cstmt(mergesort_contr, [&] { return n * std::log2(n); }, [&] { if (n == 0) { // nothing to do } else if (n == 1) { tmp[lo] = xs[lo]; } else { long mid = (lo + hi) / 2; fork2([&] { mergesort_rec(xs, tmp, lo, mid); }, [&] { mergesort_rec(xs, tmp, mid, hi); }); merge(xs, tmp, lo, mid, hi); } }, [&] { if (hi-lo < 2) return; std::sort(&xs[lo], &xs[hi-1]+1); }); }
Unfortunately, this implementation has a large span: it is linear, owing to the sequential merge operations after each pair of parallel calls. More precisely, we can write the work and span of this implementation as follows:
It is not difficult to show that these recursive equation solve to $W(n) = \Theta (n\log{n})$ and $S(n) = \Theta (n)$.
With these work and span costs, the average parallelism of our solution is $\frac{cn \log n}{2cn} = \frac{\log n}{2}$. Consider the implication: if $n = 2^{30}$, then the average parallelism is $\frac{\log 2^{30}}{2} = 15$. That is terrible, because it means that the greatest speedup we can ever hope to achieve is 15x!
The analysis above suggests that, with sequential merging, our
parallel mergesort does not expose ample parallelism. Let us put that
prediction to the test. The following experiment considers this
algorithm on our 40-processor test machine. We are going to sort a
random sequence of 100 million items. The baseline sorting algorithm
is the same sequential sorting algorithm that we used for our
quicksort experiments: std::sort()
.
$ prun speedup -baseline "bench.baseline" -parallel "bench.opt -proc 1,10,20,30,40" -bench mergesort_seqmerge -n 100000000
The first two runs suggest that our mergesort has better observable work efficiency than our quicksort. The single-processor run of parallel mergesort is roughly 50% slower than that of the sequential baseline algorithm. Compare that to the 6x-slower running time for single-processor parallel quicksort! We have a good start.
[1/6]
bench.baseline -bench mergesort_seqmerge -n 100000000
exectime 12.483
[2/6]
bench.opt -bench mergesort_seqmerge -n 100000000 -proc 1
exectime 19.407
The parallel runs are encouraging: we get 5x speedup with 40 processors.
[3/6]
bench.opt -bench mergesort_seqmerge -n 100000000 -proc 10
exectime 3.627
[4/6]
bench.opt -bench mergesort_seqmerge -n 100000000 -proc 20
exectime 2.840
[5/6]
bench.opt -bench mergesort_seqmerge -n 100000000 -proc 30
exectime 2.587
[6/6]
bench.opt -bench mergesort_seqmerge -n 100000000 -proc 40
exectime 2.436
But we can do better by using a parallel merge instead of a sequential
one: the speedup plot in [mergesort-speedups] shows three
speedup curves, one for each of three mergesort algorithms. The
mergesort()
algorithm is the same mergesort routine that we have
seen here, except that we have replaced the sequential merge step by
our own parallel merge algorithm. The cilksort()
algorithm is the
carefully optimized algorithm taken from the Cilk benchmark
suite. What this plot shows is, first, that the parallel merge
significantly improves performance, by at least a factor of two. The
second thing we can see is that the optimized Cilk algorithm is just a
little faster than the one we presented here. That’s pretty good,
considering the simplicity of the code that we had to write.
It turns out that we can do better by simply changing some of the variables in our experiment. The plot shown in [better-speedups] shows the speedup plot that we get when we change two variables: the input size and the sizes of the items. In particular, we are selecting a larger number of items, namely 250 million instead of 100 million, in order to increase the amount of parallelism. And, we are selecting a smaller type for the items, namely 32 bits instead of 64 bits per item. The speedups in this new plot get closer to linear, topping out at approximately 20x.
Practically speaking, the mergesort algorithm is memory bound because
the amount of memory used by mergesort and the amount of work
performed by mergesort are both approximately roughly linear. It is an
unfortunate reality of current multicore machines that the main
limiting factor for memory-bound algorithms is amount of parallelism
that can be achieved by the memory bus. The memory bus in our test
machine simply lacks the parallelism needed to match the parallelism
of the cores. The effect is clear after just a little experimentation
with mergesort. You can see this effect yourself, if you are
interested to change in the source code the type aliased by
value_type
. For a sufficiently large input array, you should observe
a significant performance improvement by changing just the
representation of value_type
from 64 to 32 bits, owing to the fact
that with 32-bit items is a greater amount of computation relative to
the number of memory transfers.
11. Chapter: Parallelism and Concurrency
We distinguish between parallelism and concurrency.
We use the term parallel to refer to an algorithm or application that performs multiple computations at the same time for the purposes of improving the completion or run time of the algorithm or application. The input-output behavior or the extrinsic semantics of a parallel computation can be defined purely sequentially, even though the computation itself executes in parallel.
We use the term concurrent to refer to a computation that involves independent agents, which can be implemented with processes or threads, that communicate and coordinate to accomplish the intended result. The input-output behavior or the extrinsic semantics of a concurrent application cannot be defined purely sequentially; we must consider the interaction between different threads.
In other words, parallelism is a property of the platform (software and/or hardware) on which the computation takes place, whereas concurrency is a property of the application. A concurrent program can be executed serially or in parallel.
For example, the sorting algorithms that we covered earlier are parallel but they are not concurrent, because the algorithms do not involve communication between different, independent agents; an operating system on the other hand is a concurrent application, where many processes (agents) may run at the same time and communicate to accomplish a task. For example, when you request these notes to be downloaded to your laptop, the kernel, the file system, and your browser communicate to accomplish the task. Web browsers are typically written as concurrent applications because they may include multiple processes to communicate with the operating system, the network, and the user, all of which coordinate as needed.
Parallelism and concurrency are orthogonal dimensions in the space of all applications. Some applications are concurrent, some are not. Many concurrent applications can benefit from parallelism. For example, a browser, which is a concurrent application itself as it may use a parallel algorithm to perform certain tasks. On the other hand, there is often no need to add concurrency to a parallel application, because this unnecessarily complicates software. It can, however, lead to improvements in efficiency.
The following quote from Dijkstra suggest pursuing the approach of making parallelism just a matter of execution (not one of semantics), which is the goal of the much of the work on the development of programming languages for parallel programs today.
— Edsger W. Dijkstra
11.1. Race conditions and concurrency
Even though concurrent behavior is not necessary in parallel computing, it can arise as a result of shared state. When shared variables are used for communication, for example by reading and writing the variable, between threads, we may have to treat the computation as a concurrent program to understand its behavior.
Consider a simple example function, which increments a counter in
parallel for a specified number of times $n$. The code
for such a function is shown below. The function concurrent_counter
takes a number n
and uses a parallel-for loop to increment the
variable counter
. When called with a large n
, this function will
create many threads that increment counter
in parallel.
loop_controller_type concurrent_counter_contr ("parallel for"); void concurrent_counter(long n) { long counter = 0; auto incr = [&] () { counter = counter + 1; }; par::parallel_for(concurrent_counter_contr, 0l, n, [&] (long i) { incr(); }); std::cout << "Concurrent-counter: " << "n = " << n << " result = " << counter << std::endl; }
Since it is implemented by using our parallelism constructs, we may
feel that the function concurrent_counter
is parallel. To define
its behavior, we may thus conclude that we can ignore the parallelism
constructs, because such constructs only impact the way that the
function is executed but not its semantics. So we may describe the
semantics (input-output behavior) of concurrent_counter
as taking a
value n
, incrementing it n
times, and since counter
starts out
at 0
, printing the value n
in the end. But this is hardly the
case, as shown below by several runs of the function.
$ make example.opt
...
// problem size = 1K, counter = 1K
$ ./example.opt -example concurrent_counter -n 1000 -proc 16
Concurrent-counter:n = 1000 result = 1000
exectime 0.000
total_idle_time 0.000
utilization 0.9790
// problem size = 100M, counter < 10M.
$ ./example.opt -example concurrent_counter -n 100000000 -proc 16
Concurrent-counter:n = 100000000 result = 6552587
exectime 0.016
total_idle_time 0.046
utilization 0.8236
// problem size = 100M, counter < 10M.
$ ./example.opt -example concurrent_counter -n 100000000 -proc 24
Concurrent-counter:n = 100000000 result = 5711624
exectime 0.015
total_idle_time 0.046
utilization 0.8743
// problem size = 100M, counter ~ 10M.
$ ./example.opt -example concurrent_counter -n 100000000 -proc 24
Concurrent-counter:n = 100000000 result = 10785626
exectime 0.020
total_idle_time 0.128
utilization 0.7338
What has gone wrong is that the function concurrent_counter
is in
fact concurrent—not parallel—because the counter is a variable that
is shared by many parallel threads, each of which read the value and
update it by using multiple instructions. In fact, as we will see, the
parallel updates to the counter lead to a race condition. More
generally, a program written with our parallelism constructs is only
truly parallel if and only if parallel computations are
disentangled, i.e., they don’t interact by reading and writing
memory, as for example in purely functional programs. Note that this
restriction holds only for parts of the computation that execute in
parallel, those that are not parallel can read and write from memory.
To understand more specifically what goes wrong, we need to rewrite
our example to reflect the code, shown below, as executed by our
machines. The only difference from the previous example is the incr
function, which now uses separate operations to read the counter, to
calculate the next value, and to write into it. The point is that the
computation of reading-a-value-and-updating-it is not an atomic
operation that executes in one step but a composite operation that
takes place in three steps:
-
read
counter
into local, non-shared variablecurrent
, -
increment
current
, and -
write the updated value of
current
into shared variablecounter
.
loop_controller_type concurrent_counter_contr ("parallel for"); void concurrent_counter(long n) { long counter = 0; auto incr = [&] () { long current = counter; current = current + 1; counter = current; }; par::parallel_for(concurrent_counter_contr, 0l, n, [&] (long i) { incr(); }); std::cout << "Concurrent-counter: " << "n = " << n << " result = " << counter << std::endl; }
Since in a parallel execution, multiple threads can execute the
function incr
in parallel, the shared variable counter
may not be
updated as expected. To see this, consider just two parallel
executions of incr
that both start out the current value of
counter
let`s say i
, locally calculate i+1
, and write it back.
The resulting value of counter
is thus i+1
even though incr
has
been applied twice. This is the reason for missed increments
illustrated by the runs above.
11.2. Eliminating race conditions by synchronization
As discussed in an earlier chapter, we can
use synchronization operations provided by modern hardware to
eliminate race codnitions. In particular, we can use compare-and-swap
in our example. Since compare-and-swap allows us to update
the value stored at a location atomically, we can use it to implement
our increment function as shown below. To this end, we declare
counter
as a atomic long type; the syntax std::atomic<long>
declares an atomic long
. In the function incr
, we load the value
of the counter into a local variable current
and use
compare-and-swap to replace the value of the counter
with
current+1
by calling compare_exchange_strong
, which implements the
compare-and-swap operation. Since compare-and-swap can fail, because
of another thread succeeding to update at the same time, function
incr
tries this update cycle until the operation succeeds.
loop_controller_type concurrent_counter_atomic_contr ("parallel for"); void concurrent_counter_atomic(long n) { std::atomic<long> counter; counter.store(0); auto incr = [&] () { while (true) { long current = counter.load(); if (counter.compare_exchange_strong (current,current+1)) { break; } } }; par::parallel_for(concurrent_counter_atomic_contr, 0l, n, [&] (long i) { incr(); }); std::cout << "Concurrent-counter-atomic: " << "n = " << n << " result = " << counter << std::endl; }
Here are some example runs of this code with varying number of processors. Note that the counter is always updated correctly, but the execution takes significantly longer time (approximately 10 times slower, accounting for the loss of parallelism due to the atomic operation) than the incorrect implementation with the race condition. This shows how expensive synchronization operations can be. Intuitively, synchronization operations are expensive, because they can require atomic access to memory and because they disable certain compiler optimizations. The reality is significantly more complex, because modern multicore computers have memory models that are only weakly consistent and because they rely on several levels of caches, which are nevertheless kept consistent by using "cache coherency protocols".
$ ./example.opt -example concurrent_counter_atomic -n 1000 -proc 16
Concurrent-counter-atomic:n = 1000 result = 1000
exectime 0.000
total_idle_time 0.000
utilization 0.9796
$ ./example.opt -example concurrent_counter_atomic -n 100000000 -proc 16
Concurrent-counter-atomic:n = 100000000 result = 100000000
exectime 14.718
total_idle_time 0.082
utilization 0.9997
$ ./example.opt -example concurrent_counter_atomic -n 100000000 -proc 24
Concurrent-counter-atomic:n = 100000000 result = 100000000
exectime 7.570
total_idle_time 0.388
utilization 0.9979
12. Chapter: Graphs
In just the past few years, a great deal of interest has grown for frameworks that can process very large graphs. Interest comes from a diverse collection of fields. To name a few: physicists use graph frameworks to simulate emergent properties from large networks of particles; companies such as Google mine the web for the purpose of web search; social scientists test theories regarding the origins of social trends.
In response, many graph-processing frameworks have been implemented both in academia and in the industry. Such frameworks offer to client programs a particular application programming interface. The purpose of the interface is to give the client programmer a high-level view of the basic operations of graph processing. Internally, at a lower level of abstraction, the framework provides key algorithms to perform basic functions, such as one or more functions that "drive" the traversal of a given graph.
The exact interface and the underlying algorithms vary from one graph-processing framework to another. One commonality among the frameworks is that it is crucial to harness parallelism, because interesting graphs are often huge, making it practically infeasible to perform sequentially interesting computations.
12.1. Graph representation
We will use an adjacency lists representation based on compressed arrays to represent directed graphs. In this representation, a graph is stored as a compact array containing the neighbors of each vertex. Each vertex in the graph $G = (V, E)$ is assigned an integer identifier $v \in \{ 0, \ldots, n-1 \}$, where $n = |V|$ is the number of vertices in the graph. The representation then consists of two array.
-
The edge array contains the adjacency lists of all vertices ordered by the vertex ids.
-
The vertex array stores an index for each vertex that indicates the starting position of the adjacency list for that vertex in the edge array. This array implements the
consisting of the vertex and the edge arrays. The sentinel value "-1" is used to indicate a non-vertex id.
The compressed-array representation requires a total of $n + m$ vertex-id cells in memory, where $n = |V|$ and $m = |E|$. Furthermore, it involves very little indirection, making it possible to perform many interesting graph operations efficiently. For example, we can determine the out-neighbors af a given vertex with constant work. Similarly, we can determine the out-degree of a given vertex with constant work.
Space use is a major concern because graphs can have tens of billions of edges or more. The Facebook social network graph (including just the network and no metadata) uses 100 billion edges, for example, and as such could fit snugly into a machine with 2TB of memory. Such a large graph is a greater than the capacity of the RAM available on current personal computers. But it is not that far off, and there are many other interesting graphs that easily fit into just a few gigabytes. For simplicity, we always use 64 bits to represent vertex identifiers; for small graphs 32-bit representation can work just as well.
We implemented the adjacency-list representation based on compressed
arrays with a class called adjlist
.
using vtxid_type = value_type; using neighbor_list = const value_type*; class adjlist { public: long get_nb_vertices() const; long get_nb_edges() const; long get_out_degree_of(vtxid_type v) const; neighbor_list get_out_edges_of(vtxid_type v) const; };
Sometimes it is useful for testing and debugging purposes to create a graph from a handwritten example. For this purpose, we define a type to express an edge. The type is a pair type where the first component of the pair represents the source and the second the destination vertex, respectively.
using edge_type = std::pair<vtxid_type, vtxid_type>;
In order to create an edge, we use the following function, which takes a source and a destination vertex and returns the corresponding edge.
edge_type mk_edge(vtxid_type source, vtxid_type dest) { return std::make_pair(source, dest); }
Now, specifying a (small) graph in textual format is as easy as
specifying an edge list. Moreover, getting a textual representation of
the graph is as easy as printing the graph by cout
.
adjlist graph = { mk_edge(0, 1), mk_edge(0, 3), mk_edge(5, 1), mk_edge(3, 0) }; std::cout << graph << std::endl;
Output:
digraph {
0 -> 1;
0 -> 3;
3 -> 0;
5 -> 1;
}
Note
|
The output above is an instance of the "dot" format. This format is used by a well-known graph-visualization tool called graphviz. The diagram below shows the visualization of our example graph that is output by the graphviz tool. You can easily generate such visualizations for your graphs by using online tools, such is Click this one. |
adjlist graph = { mk_edge(0, 1), mk_edge(0, 3), mk_edge(5, 1), mk_edge(3, 0), mk_edge(3, 5), mk_edge(3, 2), mk_edge(5, 3) }; std::cout << "nb_vertices = " << graph.get_nb_vertices() << std::endl; std::cout << "nb_edges = " << graph.get_nb_edges() << std::endl; std::cout << "neighbors of vertex 3:" << std::endl; neighbor_list neighbors_of_3 = graph.get_out_edges_of(3); for (long i = 0; i < graph.get_out_degree_of(3); i++) std::cout << " " << neighbors_of_3[i]; std::cout << std::endl;
Output:
nb_vertices = 6
nb_edges = 7
neighbors of vertex 3:
0 5 2
Next, we are going to study a version of breadth-first search that is useful for searching in large in-memory graphs in parallel. After seeing the basic pattern of BFS, we are going to generalize a little to consider general-purpose graph-traversal techniques that are useful for implementing a large class of parallel graph algorithms.
12.2. Breadth-first search
The breadth-first algorithm is a particular graph-search algorithm that can be applied to solve a variety of problems such as finding all the vertices reachable from a given vertex, finding if an undirected graph is connected, finding (in an unweighted graph) the shortest path from a given vertex to all other vertices, determining if a graph is bipartite, bounding the diameter of an undirected graph, partitioning graphs, and as a subroutine for finding the maximum flow in a flow network (using Ford-Fulkerson’s algorithm). As with the other graph searches, BFS can be applied to both directed and undirected graphs.
The idea of breadth first search, or BFS for short, is to start at a source vertex $s$ and explore the graph outward in all directions level by level, first visiting all vertices that are the (out-)neighbors of $s$ (i.e. have distance 1 from $s$), then vertices that have distance two from $s$, then distance three, etc. More precisely, suppose that we are given a graph $G$ and a source $s$. We define the level of a vertex $v$ as the shortest distance from $s$ to $v$, that is the number of edges on the shortest path connecting $s$ to $v$.
A graph, where each vertex is labeled with its level.
At a high level, BFS algorithm maintains a set of vertices called
visited
, which contain the vertices that have been visited, and
a set of vertices called frontier
, which contain the vertices
that are not visited but that are adjacent to a visited vertex. It
then visits a vertex in the frontier and adds its out-neighbors to the
frontier.
12.2.1. Sequential BFS
Many variations of BFS have been proposed over the years. The one that
may be most widely known is the classic sequential BFS that uses a
FIFO queue to represent the frontier
. The visited
set can be
represented as some array data structure, or can be represented
implicitly by keeping a flat at each vertex that indicating whether
the vertex is visited or not.
void sequential_bfs (G = (V,E), s)
frontier = <s>
visited = {}
while frontier is not <> do
let <v,frontier_n> be frontier
if v is not in visited then
visit v
foreach out-neighbor u of v do
frontier_n = append frontier_n <u>
frontier = frontier_n
12.2.2. Parallel BFS
Our goal is to design and implement a parallel algorithm for BFS that is observably work efficient and has plenty of parallelism. There is natural parallelism in BFS because the vertices in each level can actually be visited in parallel, as shown in the pseudo-code below.
frontier = {source}
visited = {}
level = 0
while frontier is not {} do
next = {}
let {v_1, ..., v_m} be frontier
parallel for i = 1 to m do
visit v_i
visited = visited set-union frontier
next = out_neighbors(v_1) set-union ... set-union out_neighbors(v_m)
frontier = next set-difference visited
level = level + 1
Note that we can also compute the next set (frontier) in parallel by performing a reduce with the set-union operation, and then by taking a set-difference operation.
Our goal is to implement an observably work-efficient version of this
algorithm on a hardware-shared memory parallel machine such as a
modern multicore computer. The key challenge is implementing the set
operations on the visited
, frontier
, and next
sets. Apart from
maintaining a visited map to prevent a vertex from being visited more
than once, the serial algorithm does not have to perform these
operations.
To achieve work efficiently, we will use atomic read-modify-write operations, specifically compare-and-swap, to mark visited vertices and use an array representation for the frontier. To achieve observable work efficiency, we will change the notion of the frontier slightly. Instead of holding the vertices that we are will visit next, the frontier will hold the vertices we just visited. At each level, we will visit the neighbors of the vertices in the frontier, but only if they have not yet been visited. This guard is necessary, because two vertices in the frontier can both have the same vertex $v$ as their neighbor, causing $v$ to be visited multiple times. After we visit all the neighbors of the vertices in the frontier at this level, we assign the frontier to be the vertices visited. The pseudocode for this algorithm is shown below.
level = 0
parallel for i = 1 to n do
visited[i] = false
visit source
visited[source] = true
frontier = {source}
while frontier is not {} do
level = level + 1
let {v_1, ..., v_m} be frontier
parallel for i = 1 to m do
let {u_1, ..., u_l} be out-neighbors(v_i)
parallel for j = 1 to l do
if compare_and_swap (&visited[u_j], false, true) succeeds then
visit u_j
frontier = vertices visited at this level
As show in the pseudocode, there are at least two clear opportunities for parallelism. The first is the parallel loop that processes the frontier and the second the parallel for loop that processes the neighbors of a vertex. These two loops should expose a lot of parallelism, at least for certain classes of graphs. The outer loop exposes parallelism when the frontier gets to be large. The inner loop exposes parallelism when the traversal reaches a vertex that has a high out degree.
To keep track of the visited vertices, the pseudo-code use an array of
booleans visited[v]
of size $n$ that is keyed by the
vertex identifier. If visited[v] == true
, then vertex v
has been
visited already and has not otherwise. We used an atomic
compare-and-swap operation to update the visited array because
otherwise vertices can be visited multiple times. To see this suppose
now that two processors, namely $A$ and $B$,
concurrently attempt to visit the same vertex v
(via two different
neighbors of v
) but without the use of atomic operations. If
$A$ and $B$ both read visited[v]
at the same
time, then both consider that they can visit v
. Both processors then
mark v
as visited and then proceed to visit the neighbors of v
. As
such, v
will be visited twice and subsequently have its outgoing
neighbors processed twice.
In the rest of this section, we describe more precisely how to implement the parallel BFS algorithm in C++.
In C++, we can we can use lightweight atomic memory, as described in this chapter to eliminate race conditions. The basic idea is to guard each cell in our "visited" array by an atomic type.
Access to the contents of any given cell is achieved by the load()
and store()
methods.
const long n = 3; std::atomic<bool> visited[n]; long v = 2; visited[v].store(false); std::cout << visited[v].load() << std::endl; visited[v].store(true); std::cout << visited[v].load() << std::endl;
Output:
0
1
The key operation that enables us to eliminate the race condition is the compare and exchange operation. This operation performs the following steps, atomically:
-
Read the contents of the target cell in the visited array.
-
If the contents is false (i.e., equals the contents of
orig
), then writetrue
into the cell and returntrue
. -
Otherwise, just return
false
.
const long n = 3; std::atomic<bool> visited[n]; long v = 2; visited[v].store(false); bool orig = false; bool was_successful = visited[v].compare_exchange_strong(orig, true); std::cout << "was_successful = " << was_successful << "; visited[v] = " << visited[v].load() << std::endl; bool orig2 = false; bool was_successful2 = visited[v].compare_exchange_strong(orig2, true); std::cout << "was_successful2 = " << was_successful2 << "; visited[v] = " << visited[v].load() << std::endl;
Output:
was_successful = 1; visited[v] = 1
was_successful2 = 0; visited[v] = 1
So far, we have seen pseudocode that describes at a high level the idea behind the parallel BFS. We have seen that special care is required to eliminate problematic race conditions. Let’s now put these ideas together to complete and implementation. The following function signature is the signature for our parallel BFS implementation. The function takes as parameters a graph and the identifier of a source vertex and returns an array of boolean flags.
sparray bfs(const adjlist& graph, vtxid_type source);
The flags array is a length $|V|$ array that specifies the
set of vertices in the graph which are reachable from the source
vertex: a vertex with identifier v
is reachable from the
given source vertex if and only if there is a true
value in the
$v^{\mathrm{th}}$ position of the flags array that is
returned by bfs
.
adjlist graph = { mk_edge(0, 1), mk_edge(0, 3), mk_edge(5, 1), mk_edge(3, 0), mk_edge(3, 5), mk_edge(3, 2), mk_edge(5, 3), mk_edge(4, 6), mk_edge(6, 2) }; std::cout << graph << std::endl; sparray reachable_from_0 = bfs(graph, 0); std::cout << "reachable from 0: " << reachable_from_0 << std::endl; sparray reachable_from_4 = bfs(graph, 4); std::cout << "reachable from 4: " << reachable_from_4 << std::endl;
The following diagram shows the structure represented by graph
.
Output:
digraph {
0 -> 1;
0 -> 3;
3 -> 0;
3 -> 5;
3 -> 2;
4 -> 6;
5 -> 1;
5 -> 3;
6 -> 2;
}
reachable from 0: { 1, 1, 1, 1, 0, 1, 0 }
reachable from 4: { 0, 0, 1, 0, 1, 0, 1 }
The next challenge is the implementation of the frontiers. To obtain an observably work-efficient algorithm, we shall represent frontiers as arrays. Let’s assume that we have a function, called edge_map, that implements the "edge map" operation with the following signature. This operation takes as parameters a graph, an array of atomic flag values, and a frontier, and returns a new frontier.
sparray edge_map(const adjlist& graph, std::atomic<bool>* visited, const sparray& in_frontier);
Using this function, we can implement the main loop of BFS as shown below. The algorithm uses edge_map to advance level by level through the graph. The traversal stops when the frontier is empty.
loop_controller_type bfs_init_contr("bfs_init");

sparray bfs(const adjlist& graph, vtxid_type source) {
  long n = graph.get_nb_vertices();
  std::atomic<bool>* visited = my_malloc<std::atomic<bool>>(n);
  // Mark all vertices as unvisited, in parallel.
  parallel_for(bfs_init_contr, 0l, n, [&] (long i) {
    visited[i].store(false);
  });
  visited[source].store(true);
  sparray cur_frontier = { source };
  // Advance level by level until the frontier becomes empty.
  while (cur_frontier.size() > 0)
    cur_frontier = edge_map(graph, visited, cur_frontier);
  // Copy the atomic flags out into an ordinary sparray.
  sparray result = tabulate([&] (value_type i) { return visited[i].load(); }, n);
  free(visited);
  return result;
}
One minor technical complication relates to the result value: our algorithm performs extra work to copy out the values from the visited array. Although it could be avoided, we choose to copy out the values because it is more convenient for us to program with ordinary sparray's. Here is an example describing the behavior of the edge_map function.
adjlist graph = // same graph as shown in the previous example
const long n = graph.get_nb_vertices();
std::atomic<bool> visited[n];
for (long i = 0; i < n; i++)
  visited[i] = false;
visited[0].store(true);
visited[1].store(true);
visited[3].store(true);
sparray in_frontier = { 3 };
sparray out_frontier = edge_map(graph, visited, in_frontier);
std::cout << out_frontier << std::endl;
sparray out_frontier2 = edge_map(graph, visited, out_frontier);
std::cout << out_frontier2 << std::endl;
Output:
{ 5, 2 }
{ }
There are several ways to implement edge_map. One way is to allocate an array that is large enough to hold the next frontier and then allow the next frontier to be computed in parallel. Since we don’t know the exact size of the next frontier ahead of time, we will bound it by the total number of out-going edges originating at the vertices in the frontier. To mark unused positions in this array, we can use a sentinel value.
const vtxid_type not_a_vertexid = -1l;
The following array represents a set of three valid vertex identifiers, with two positions in the array being empty.
{ 3, not_a_vertexid, 0, 1, not_a_vertexid }
Let us define two helper functions. The first one takes an array of vertex identifiers and copies out the valid vertex identifiers.
sparray just_vertexids(const sparray& vs) {
  return filter([&] (vtxid_type v) { return v != not_a_vertexid; }, vs);
}
The other function takes a graph and an array of vertex identifiers and returns the array of out-degrees of the given vertices.
sparray get_out_degrees_of(const adjlist& graph, const sparray& vs) {
  return map([&] (vtxid_type v) { return graph.get_out_degree_of(v); }, vs);
}
We can implement edge_map as follows. We first allocate the array for the next frontier by upper bounding its size; each element of the array is initially set to not_a_vertexid. We then, in parallel, visit each vertex in the frontier and attempt, for each neighbor, to insert the neighbor into the next frontier by using an atomic compare-and-swap operation. If we succeed, then we write the vertex into the next frontier. If not, we skip it. Once all neighbors are visited, we pack the next frontier by removing non-vertices. This packed array becomes our next frontier.
loop_controller_type process_out_edges_contr("process_out_edges");
loop_controller_type edge_map_contr("edge_map");

sparray edge_map(const adjlist& graph, std::atomic<bool>* visited,
                 const sparray& in_frontier) {
  // temporarily removed.
}
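Since the body of edge_map is elided above, the following is a rough sketch of how it could be written, directly following the description just given; it is a stand-in, not the library's implementation. The neighbor accessor get_out_edge_of, the complexity-function overload of parallel_for, and the use of std::vector for the offsets are assumptions made purely for illustration.
// Hypothetical sketch, not the library's implementation.
sparray edge_map(const adjlist& graph, std::atomic<bool>* visited,
                 const sparray& in_frontier) {
  long m = in_frontier.size();
  // Exclusive prefix sum of out-degrees: offsets[i] is where the neighbors
  // of the i-th frontier vertex go in the next-frontier array. Done
  // sequentially here for brevity; a parallel scan would preserve the span.
  std::vector<long> offsets(m);
  long total = 0;
  for (long i = 0; i < m; i++) {
    offsets[i] = total;
    total += graph.get_out_degree_of(in_frontier[i]);
  }
  // Next frontier, upper bounded by the total number of out-going edges;
  // unused slots hold the sentinel not_a_vertexid.
  sparray next = tabulate([&] (value_type) { return not_a_vertexid; }, total);
  // Each frontier vertex is weighted by its out-degree, as discussed below
  // (the complexity-function overload of parallel_for is assumed here).
  auto complexity = [&] (long lo, long hi) {
    return ((hi == m) ? total : offsets[hi]) - offsets[lo];
  };
  parallel_for(edge_map_contr, complexity, 0l, m, [&] (long i) {
    vtxid_type v = in_frontier[i];
    long off = offsets[i];
    long deg = graph.get_out_degree_of(v);
    for (long j = 0; j < deg; j++) {
      vtxid_type u = graph.get_out_edge_of(v, j);  // assumed accessor
      bool orig = false;
      // Only the processor that wins the compare-and-swap writes u,
      // so u appears in the next frontier at most once.
      if (visited[u].compare_exchange_strong(orig, true))
        next[off + j] = u;
    }
  });
  // Pack out the sentinels to obtain the actual next frontier.
  return just_vertexids(next);
}
Note that the loop over the out-edges of each vertex is sequential in this sketch; the declaration of process_out_edges_contr above suggests that, in the real implementation, that inner loop is itself parallel.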
The complexity function used by the outer loop of the edge map is interesting because it treats the vertices in the frontier as weighted items. In particular, each vertex is weighted by its out-degree in the graph. The reason we use such a weighting is that the amount of work involved in processing a vertex is proportional to its out-degree. We cannot treat the out-degree as a constant, unfortunately, because the out-degree of any given vertex is unbounded in general. As such, it should be clear why we need to account for the out-degrees explicitly in the complexity function of the outer loop.
12.2.3. Performance analysis
Our parallel BFS is asymptotically work efficient: it performs $O(n + m)$ work. To establish this bound, we need to assume that the compare-and-exchange operation takes constant time. After that, confirming the bound is only a matter of inspecting the code line by line. The span, on the other hand, is more interesting.
Tip: In order to analyze the span, we first need to know about the graph diameter. The diameter of a graph is the length of the shortest path between the two most distant vertices. It should be clear that the number of iterations performed by the while loop of the BFS is at most the diameter.
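As a rough sketch of the span argument (hedged: it assumes, without establishing it here, that one round of edge_map, including the packing step, has span logarithmic in the size of the graph):
% d denotes the diameter of the graph; the while loop runs at most d rounds.
% Assuming each round (one call to edge_map, including packing) has span
% O(\log (n+m)), the total span is
\[
  \mathrm{span}(\mathit{bfs}) \;=\; O\bigl(d \cdot \log (n + m)\bigr).
\]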
12.2.4. Direction Optimization
If the graph has reverse edges for each vertex, we can use an alternative "bottom-up" implementation for edge_map instead of the "top-down" approach described above: instead of scanning the vertices in the frontier and visiting their neighbors, we can check, for any unvisited vertex, whether that vertex has a parent that is in the current frontier. If so, then we add the vertex to the next frontier. If the frontier is large, this approach can reduce the total number of edges visited, because the vertices in the next frontier will quickly find a parent. Furthermore, we don’t need to use compare-and-swap operations. The disadvantage of the bottom-up approach is that it requires work linear in the number of vertices. However, if the frontier is already large, the top-down approach requires linear work too. Thus, if the frontier is large, e.g., a constant fraction of the vertices, then the bottom-up approach can be beneficial. On current multicore machines, it turns out that if the frontier is larger than 5% of the total number of vertices, then the bottom-up approach tends to outperform the top-down approach.
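To make the bottom-up idea concrete, here is a hedged sketch; it is not code from the library. It assumes that the graph exposes in-edges via accessors get_in_degree_of and get_in_edge_of (made up for illustration), and that the current frontier is available as an array of membership flags rather than as an array of vertex identifiers.
loop_controller_type bottom_up_contr("edge_map_bottom_up");

// Hypothetical sketch of a bottom-up edge map.
sparray edge_map_bottom_up(const adjlist& graph,
                           std::atomic<bool>* visited,
                           const bool* in_frontier_flags) {
  long n = graph.get_nb_vertices();
  // One candidate slot per vertex; unused slots keep the sentinel.
  sparray next = tabulate([&] (value_type) { return not_a_vertexid; }, n);
  parallel_for(bottom_up_contr, 0l, n, [&] (long u) {
    if (visited[u].load())
      return;                                      // u was already reached
    long deg = graph.get_in_degree_of(u);          // assumed accessor
    for (long j = 0; j < deg; j++) {
      vtxid_type p = graph.get_in_edge_of(u, j);   // assumed accessor
      if (in_frontier_flags[p]) {
        // u is written only by the processor responsible for u, so a plain
        // store suffices: no compare-and-swap is needed.
        visited[u].store(true);
        next[u] = u;
        break;                                     // first parent is enough
      }
    }
  });
  return just_vertexids(next);
}
In this variant, the scan over in-edges stops at the first parent found in the frontier, which is another source of savings when the frontier is large.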
An optimized "hybrid" implementation of PBFS can select between the top-down and the bottom-up approaches based on the size of the current frontier. This is called "direction-optimizing" BFS.
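A minimal sketch of the selection logic, assuming the two edge_map variants above and a 5% threshold; it also assumes that whichever frontier representation the chosen variant needs (an array of identifiers or an array of membership flags) is available, and all names here are made up for illustration.
// Hypothetical sketch of direction-optimizing frontier advancement.
sparray edge_map_hybrid(const adjlist& graph, std::atomic<bool>* visited,
                        const sparray& in_frontier,
                        const bool* in_frontier_flags) {
  long n = graph.get_nb_vertices();
  // Switch to the bottom-up variant once the frontier exceeds ~5% of the
  // vertices; otherwise stay with the top-down variant.
  if (in_frontier.size() * 20 > n)
    return edge_map_bottom_up(graph, visited, in_frontier_flags);
  else
    return edge_map(graph, visited, in_frontier);
}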