v1.0 2016-01
Author: Umut A. Acar
1. Chapter: Multithreading, Parallelism, and Concurrency
The term multithreading refers to computing with multiple threads of control, where all threads share the same memory. Once created, a thread performs a computation by executing a sequence of instructions, as specified by the program, until it terminates. A multithreaded computation starts by executing a main thread, the thread at which the execution starts. A thread can create, or spawn, another thread and can synchronize with other threads by using a variety of synchronization constructs such as locks, mutexes, synchronization variables, and semaphores.
1.1. DAG Representation
A multithreaded computation can be represented by a directed acyclic graph, or dag for short, of vertices. The figure below shows an example multithreaded computation and its dag. Each vertex represents the execution of an instruction, such as an addition, a multiplication, a memory operation, a (thread) spawn operation, or a synchronization operation. A vertex representing a spawn operation has outdegree two. A synchronization operation waits for an operation belonging to another thread to complete, and thus a vertex representing a synchronization operation has indegree at least two. Recall that a dag represents a partial order. Thus the dag of the computation represents the partial ordering of the dependencies between the instructions in the computation. Perhaps the simplest multithreaded program is a sequential program, which can be represented as a chain of (totally ordered) vertices.
Note
Multithreaded computing can be viewed as a natural generalization of sequential computing in the following sense: in sequential computing, a computation can be defined as a totally ordered set of instructions, where the ordering corresponds to the sequential execution order. In contrast, in multithreaded computing, a computation can be viewed as a partially ordered set of instructions (as specified by the dag), where the instructions may be executed in any order compatible with the specified partial order.
Throughout this book, we make two assumptions about the structure of the dag:
-
Each vertex has outdegree at most two.
-
The dag has exactly one root vertex with indegree zero and one final vertex with outdegree zero. The root is the first instruction of the root thread.
The outdegree assumption follows naturally from the fact that each vertex represents an instruction, which can create at most one thread.
1.2. Cost Model: Work and Span
For analyzing the efficiency and performance of multithreaded programs, we use several cost measures, the most important of which are work and span. We define the work of a computation as the number of vertices in its dag, and the span as the length of the longest path in the dag. In the example dag above, the work is $15$ and the span is $9$.
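As a concrete illustration, here is a small C++ sketch (the adjacency-list dag representation and vertex numbering are our own choices, not anything prescribed by the text) that computes work and span for a dag whose vertices are numbered in topological order:

#include <algorithm>
#include <cstdio>
#include <vector>

// Work is the number of vertices; span is the number of vertices on the
// longest root-to-final path. children[u] lists the successors of vertex u,
// and vertices are assumed to be numbered in topological order.
long work(const std::vector<std::vector<int>>& children) {
  return (long)children.size();
}

long span(const std::vector<std::vector<int>>& children) {
  int n = (int)children.size();
  std::vector<long> longest(n, 1);      // longest path ending at each vertex
  for (int u = 0; u < n; u++)           // relies on topological numbering
    for (int v : children[u])
      longest[v] = std::max(longest[v], longest[u] + 1);
  return *std::max_element(longest.begin(), longest.end());
}

int main() {
  // A tiny fork-join shaped dag: 0 -> {1,2}, 1 -> 3, 2 -> 3.
  std::vector<std::vector<int>> dag = {{1, 2}, {3}, {3}, {}};
  printf("work = %ld, span = %ld\n", work(dag), span(dag));   // 4 and 3
  return 0;
}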
1.3. Execution and Scheduling
The execution of a multithreaded computation executes the vertices in the dag of the computation in some order that is consistent with the partial order specified by the dag; that is, if vertices $u,v$ are ordered by the dag, then the execution orders them in the same way.
Multithreaded programs are executed by using a scheduler that assigns vertices of the dag to processes.
A schedule must observe the dependencies in the dag. Specifically, for each arc $(u,v)$ in the dag, the vertex $u$ is executed before the vertex $v$.
For any step in the execution, we call a vertex ready if all the ancestors of the vertex in the dag are executed prior to that step. Similarly, we say that a thread is ready if it contains a ready vertex. Note that a thread can contain only one ready vertex at any time.
An example schedule with 3 processes. The length of this schedule is $10$.

Time Step | Process 1 | Process 2 | Process 3 |
---|---|---|---|
1 | M1 | | |
2 | M2 | | |
3 | M3 | A1 | |
4 | A2 | | |
5 | B1 | A3 | |
6 | B2 | A4 | |
7 | B2 | M4 | |
8 | A5 | M5 | |
9 | A6 | | |
10 | M6 | | |
1.4. Scheduling Lower Bounds
Consider a computation with work $W$ and span $S$, executed by a schedule with $P$ processes. The first lower bound follows from the simple observation that a schedule can only execute $P$ instructions at a time. Since all $W$ vertices must be executed, the schedule has length at least $\frac{W}{P}$. The second lower bound follows from the observation that the schedule cannot execute a vertex before its ancestors, and thus the length of the schedule is at least as long as any path in the dag, which can be as large as the span $S$.
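Combining the two observations gives the following lower bound on the length $T_P$ of any schedule with $P$ processes. For the example dag above, with $W = 15$, $S = 9$, and $P = 3$, it evaluates to $9$, so the 3-process schedule of length $10$ shown earlier is within one step of this lower bound:

$$ T_P \;\ge\; \max\left(\frac{W}{P},\, S\right) \;=\; \max\left(\frac{15}{3},\, 9\right) \;=\; 9. $$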
1.5. Offline Scheduling
Having established a lower bound, we now move on to establish an upper bound for the offline scheduling problem, where we are given a dag and wish to find an execution schedule that minimizes the run time. It is known that the related decision problem is NP-complete, but that a 2-approximation is relatively easy. We shall consider two distinct schedulers: the level-by-level scheduler and the greedy scheduler.
A level-by-level schedule is a schedule that executes the vertices of a given dag in level order, where the level of a vertex is the longest distance from the root of the dag to that vertex. More specifically, the vertices in level 0 are executed first, followed by the vertices in level 1, and so on.
This theorem, called Brent's theorem, was proved by Brent in 1974. It shows that the lower bound can be matched to within a factor of $2$.
Brent's theorem has since been generalized to all greedy schedules. A greedy schedule is a schedule that never leaves a process idle unless there are no ready vertices. In other words, greedy schedules keep processes as busy as possible by greedily assigning ready vertices.
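One common statement of the resulting bound (a sketch; the precise constants vary by formulation) is that any greedy schedule, including a level-by-level schedule, has length

$$ T_P \;\le\; \frac{W}{P} + S \;\le\; 2\,\max\left(\frac{W}{P},\, S\right), $$

which is within a factor of $2$ of the lower bound $\max(W/P, S)$ established above.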
1.6. Online Scheduling
In offline scheduling, we are given a dag and are interested in finding a schedule with minimal length. When executing a multithreaded program, however, we don't have full knowledge of the dag. Instead, the dag unfolds as we run the program. Furthermore, we are interested not only in minimizing the length of the schedule but also in minimizing the work and time it takes to compute the schedule. These two additional conditions define the online scheduling problem.
An online scheduler, or simply a scheduler, is an algorithm that solves the online scheduling problem by mapping threads to available processes. For example, if only one processor is available, a scheduler can map all threads to that one processor. If two processors are available, then the scheduler can divide the threads between the two processors as evenly as possible in an attempt to keep both processors as busy as possible by load balancing, which involves migrating pieces of work between processors so as to minimize idle time.
There are many different online-scheduling algorithms but these algorithms all operate similarly. We can outline a typical scheduling algorithm as follows.
For a given schedule generated by an online scheduling algorithm, we can define a tree of vertices, which tells us, for each vertex, the vertex that enabled it.
We can give a simple greedy scheduler by using a queue of threads. At the start of the execution, the scheduler places the root thread into the queue and then repeats the following step until the queue becomes empty: for each idle process, take the thread at the front of the queue and assign it to that process; let each process run for one step; if, at the end of the step, there are new ready threads, then insert them onto the tail of the queue.
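The following runnable C++ sketch simulates this centralized scheduler on a dag of vertices rather than threads, which is enough to see the structure of the algorithm; the dag representation and function names are our own, not from the text.

#include <cstdio>
#include <queue>
#include <vector>

// Simulates a centralized greedy scheduler with a single global queue of
// ready vertices; children[u] lists the successors of vertex u.
int greedySchedule(const std::vector<std::vector<int>>& children, int P) {
  int n = (int)children.size();
  std::vector<int> indegree(n, 0);
  for (int u = 0; u < n; u++)
    for (int v : children[u]) indegree[v]++;
  std::queue<int> ready;
  for (int u = 0; u < n; u++)
    if (indegree[u] == 0) ready.push(u);   // the root vertex
  int rounds = 0;
  while (!ready.empty()) {
    rounds++;
    // Each of the P processes takes one ready vertex from the front, if any.
    std::vector<int> executed;
    for (int p = 0; p < P && !ready.empty(); p++) {
      executed.push_back(ready.front());
      ready.pop();
    }
    // Executing a vertex may enable children; insert them at the tail.
    for (int u : executed)
      for (int v : children[u])
        if (--indegree[v] == 0) ready.push(v);
  }
  return rounds;   // length of the schedule
}

int main() {
  // A tiny fork-join shaped dag: 0 -> {1,2}, 1 -> 3, 2 -> 3.
  std::vector<std::vector<int>> dag = {{1, 2}, {3}, {3}, {}};
  printf("schedule length with P = 2: %d\n", greedySchedule(dag, 2));
  return 0;
}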
The centralized scheduler with the global thread queue is a greedy scheduler that generates a greedy schedule under the assumption that the queue operations take zero time and that the dag is given. This algorithm, however, does not work well for online scheduling, because the operations on the queue take time. In fact, since the thread queue is global, the algorithm can only insert and remove one thread at a time. For this reason, centralized schedulers do not scale beyond a handful of processors.
There has been much research on the problem of reducing friction in scheduling. This research shows that distributed scheduling algorithms can work quite well. In a distributed algorithm, each processor has its own queue and primarily operates on its own queue. A load-balancing technique is then used to balance the load among the processors by redistributing threads, usually on an as-needed basis. This strategy ensures that processors can operate in parallel to obtain work from their queues.
A specific kind of distributed scheduling technique that can lead to schedules that are close to optimal is work stealing. In a work-stealing scheduler, processors work on their own queues as long as there is work in them, and, if not, go "steal" work from other processors by removing the thread at the tail end of another processor's queue. It has been proven that randomized work-stealing algorithms, where idle processors randomly select processors to steal from, deliver schedules that are close to optimal in expectation (in fact, with high probability) and furthermore incur minimal friction. Randomized work-stealing schedulers can also be implemented efficiently in practice. PASL uses a scheduling algorithm that is based on work stealing. We consider work stealing in greater detail in a later chapter.
1.7. Writing Multithreaded Programs
Multithreaded programs can be written using a variety of language abstractions and interfaces. In this section, we briefly outline one of the most widely used interfaces, the POSIX Threads, or Pthreads, interface, which specifies a standardized C programming interface in the IEEE POSIX 1003.1c standard. Pthreads provides a rich interface that enables the programmer to create multiple threads of control that can synchronize by using nearly the whole range of the synchronization facilities mentioned above. There are many other threading libraries, all of which provide similar facilities.
An example Pthreads program is shown below. The main thread (executing function main) creates 8 child threads and terminates. Each child in turn runs the function helloWorld and immediately terminates. Since the main thread does not wait for the children to terminate, it may terminate before the children do, depending on how threads are scheduled on the available processors.
#include <iostream>
#include <cstdlib>
#include <pthread.h>

using namespace std;

#define NTHREADS 8

void *helloWorld(void *threadid) {
  long tid = (long)threadid;
  cout << "Hello world! It is me, 00" << tid << endl;
  pthread_exit(NULL);
}

int main () {
  pthread_t threads[NTHREADS];
  for (long i = 0; i < NTHREADS; i++) {
    cout << "main: creating thread 00" << i << endl;
    int error = pthread_create(&threads[i], NULL, helloWorld, (void *) i);
    if (error) {
      cout << "Error: unable to create thread," << error << endl;
      exit(-1);
    }
  }
  pthread_exit(NULL);
}
When executed, this program may print the following.
main: creating thread 000
main: creating thread 001
main: creating thread 002
main: creating thread 003
main: creating thread 004
main: creating thread 005
main: creating thread 006
main: creating thread 007
Hello world! It is me, 000
Hello world! It is me, 001
Hello world! It is me, 002
Hello world! It is me, 003
Hello world! It is me, 004
Hello world! It is me, 005
Hello world! It is me, 006
Hello world! It is me, 007
But that would be unlikely; a more likely output would look like this:
main: creating thread 000
main: creating thread 001
main: creating thread 002
main: creating thread 003
main: creating thread 004
main: creating thread 005
main: creating thread 006
main: creating thread 007
Hello world! It is me, 000
Hello world! It is me, 001
Hello world! It is me, 006
Hello world! It is me, 003
Hello world! It is me, 002
Hello world! It is me, 005
Hello world! It is me, 004
Hello world! It is me, 007
Or it may look like this:
main: creating thread 000
main: creating thread 001
main: creating thread 002
main: creating thread 003
Hello world! It is me, 000
Hello world! It is me, 001
Hello world! It is me, 003
Hello world! It is me, 002
main: creating thread 004
main: creating thread 005
main: creating thread 006
main: creating thread 007
Hello world! It is me, 006
Hello world! It is me, 005
Hello world! It is me, 004
Hello world! It is me, 007
The Pthreads library provides a rich interface for synchronizing between threads (a short example follows this list), e.g.,
-
a thread $t_1$ can join with another thread $t_2$ by blocking its execution until $t_2$ terminates,
-
threads can synchronize via mutex variables, e.g., a thread can lock a mutex, which, if already locked, causes the thread to block,
-
threads can synchronize via condition variables, which are closely related to locks.
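As a small illustration of the first two facilities, the following sketch (our own example, not from the original text) creates NTHREADS children that each increment a shared counter inside a critical section guarded by a mutex; the main thread joins with all of them before printing, so it cannot terminate early.

#include <pthread.h>
#include <iostream>

#define NTHREADS 8

long counter = 0;
pthread_mutex_t counterLock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg) {
  (void)arg;                             // thread id, unused here
  pthread_mutex_lock(&counterLock);      // enter critical section
  counter++;
  pthread_mutex_unlock(&counterLock);    // leave critical section
  pthread_exit(NULL);
}

int main () {
  pthread_t threads[NTHREADS];
  for (long i = 0; i < NTHREADS; i++)
    pthread_create(&threads[i], NULL, increment, (void *) i);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(threads[i], NULL);      // block until thread i terminates
  std::cout << "counter = " << counter << std::endl;
  return 0;
}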
2. Critical Sections and Mutual Exclusion
In a multithreaded program, a critical section is a part of the program that may not be executed by more than one thread at the same time. Critical sections typically contain code that alters shared objects, such as shared (e.g., global) variables. This means that a critical section requires mutual exclusion: only one thread can be inside the critical section at any time.
Since only one thread can be inside a critical section at a time, threads must coordinate to make sure that they don’t enter the critical section at the same time. If threads do not coordinate and multiple threads enter the critical section at the same time, we say that a race condition occurs, because the outcome of the program depends on the relative timing of the threads, and thus can vary from one execution to another. Race conditions are sometimes benign but usually not so, because they can lead to incorrect behavior. Spectacular examples of race conditions' effects include the "Northeast Blackout" of 2003, which affected 45 million people in the US and 10 million people in Canada.
It can be extremely difficult to find a race condition, because of the non-determinacy of execution. A race condition may lead to an incorrect behavior only a tiny fraction of the time, making it extremely difficult to observe and reproduce. For example, the software fault that led to the Northeast blackout took software engineers "weeks of poring through millions of lines of code and data to find it" according to one of the companies involved.
The problem of designing algorithms or protocols for ensuring mutual exclusion is called the mutual exclusion problem or the critical section problem. There are many ways of solving instances of the mutual exclusion problem, but broadly we can distinguish two categories: spin locks and blocking locks. The idea in spin locks is to busy wait until the critical section is clear of other threads. Solutions based on blocking locks are similar, except that instead of busy waiting, a thread simply blocks, that is, stops executing; when the critical section is clear, the blocked thread receives a signal that allows it to proceed. The term mutex, short for "mutual exclusion", is sometimes used to refer to a lock.
Mutual exclusions problems have been studied extensively in the context of several areas of computer science.
-
In operating systems research, processes, which, like threads, are independent threads of control, usually but not always belonging to different programs, can share certain system resources. To enable such sharing safely and efficiently, researchers have proposed various forms of locks such as semaphores, which admit both busy-waiting and blocking semantics. Another class of synchronization primitives, called condition variables, enables blocking synchronization conditioned on the value of a variable.
-
In the area of concurrency, researchers have designed and implemented special concurrent data structures that ensure mutual exclusion without requiring locks or blocking by using special synchronization operations, called read-modify-write operations, provided by the hardware.
In these notes, we will mostly focus on the second approach.
2.1. Synchronization Hardware
Since mutual exclusion is a common problem in computer science, many hardware systems provide specific synchronization operations that can help solve instances of the problem. These operations may allow, for example, testing the contents of a (machine) word and then modifying it, perhaps by swapping it with another word. Such operations are sometimes called atomic read-modify-write, or RMW for short, operations.
A handful of different RMW operations have been proposed. They include operations such as load-link/store-conditional, fetch-and-add, and compare-and-swap. They typically take a memory location x and a value v, and replace the value stored at x with f(x,v). For example, the fetch-and-add operation takes the location x and an increment amount, and atomically increments the value at that location by the specified amount, i.e., f(x,v) = *x + v.
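In C++, for example, fetch-and-add is available on atomic integers as the fetch_add operation; a minimal sketch:

#include <atomic>
#include <iostream>

int main() {
  // fetch_add atomically reads the old value and stores old + v in one
  // indivisible step, returning the old value.
  std::atomic<int> x(10);
  int old = x.fetch_add(5);   // x becomes 15; 'old' receives 10
  std::cout << "old = " << old << "; x = " << x.load() << std::endl;
  return 0;
}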
The compare-and-swap operation takes the location x and a pair of values (a,b), and stores b into x if the value in x is a, i.e., f(x,(a,b)) = if *x = a then b else *x; the operation returns a Boolean indicating whether it successfully stored a new value in x. The compare-and-swap operation is a reasonably powerful synchronization operation: it can be used by arbitrarily many threads to agree (reach consensus) on a value. This instruction is therefore frequently provided by modern parallel architectures such as Intel's X86.
In C++, the atomic class can be used to perform synchronization. Objects of this type are guaranteed to be free of race conditions; in fact, in C++, they are the only objects that are guaranteed to be free from race conditions. The contents of an atomic object can be accessed by load operations, updated by store operations, and also updated by compare_exchange_weak and compare_exchange_strong operations, the latter of which implements the compare-and-swap operation.
Access to the contents of any given cell is achieved by the load() and store() methods.
std::atomic<bool> flag;

flag.store(false);
std::cout << flag.load() << std::endl;
flag.store(true);
std::cout << flag.load() << std::endl;
Output:
0
1
The key operation that helps with race conditions is the compare-and-exchange operation.
std::atomic<bool> flag;

flag.store(false);
bool expected = false;
bool was_successful = flag.compare_exchange_strong(expected, true);
std::cout << "was_successful = " << was_successful
          << "; flag = " << flag.load()
          << "; expected = " << expected << std::endl;

bool expected2 = false;
bool was_successful2 = flag.compare_exchange_strong(expected2, true);
std::cout << "was_successful2 = " << was_successful2
          << "; flag = " << flag.load()
          << "; expected2 = " << expected2 << std::endl;
Output:
was_successful = 1; flag = 1; expected = 0
was_successful2 = 0; flag = 1; expected2 = 1
We can implement a spin lock using compare-and-exchange. Spin locks can ensure exclusive access to a shared resource such as a memory cell or a file. For example, suppose that multiple threads want to print to the Standard IO. We could write such a program as follows.
#include <iostream>
#include <cstdlib>
#include <pthread.h>
#include <atomic>

using namespace std;

#define NTHREADS 8

std::atomic<bool> BigLock;   // declared here, but not yet used in this version

struct args {
  int tid;
};

void *hello(void *a) {
  args* myargs = (args*) a;
  int tid = myargs->tid;
  cout << "Hello world! It is me, Thread " << tid << endl;
  pthread_exit(NULL);
}

int main () {
  pthread_t threads[NTHREADS];
  for (int i = 0; i < NTHREADS; i++) {
    args* a = new args;
    a->tid = i;
    cout << "main: creating thread 00" << i << endl;
    int error = pthread_create(&threads[i], NULL, hello, (void *) a);
    if (error) {
      cout << "Error: unable to create thread," << error << endl;
      exit(-1);
    }
  }
  pthread_exit(NULL);
}
Since the Standard IO is a shared resource, multiple threads can write to it at the same time, resulting in garbled output. The example run below uses just two processors.
bash-3.2$ g++ -std=c++11 biglock-driver.cpp
bash-3.2$ ./a.out
main: creating thread 000
main: creating thread 001
main: creating thread 002
Hello world! It imsHHa eeimllnell:,oo cTwwrhooerrraelltaddid!!n g0II
ttt hiirsse ammdee ,,0 0TT3hh
rreeaadmdH a e1i2l
n
l:o cwroeraltdi!n gI tt hirse amde ,0 0T4h
read 3
Hmealilno: wcorreladt!i nIgt tihsr emaed, 0T0h5r
ead 4
Hello world! It is me, Thread 5
main: creating thread 006
main: creating thread 007
Hello world! It is me, Thread 6
Hello world! It is me, Thread 7
bash-3.2$
To ensure that each thread gets exclusive access to the standard IO, we can use a spin lock, where each thread waits for other threads to exit the critical section, where they access the Standard IO. To wait, a thread simply "spins" while checking the lock. Before entering its critical section, each thread takes the lock and upon exiting it releases the lock.
std::atomic<bool> BigLock;

/* Take BigLock */
void takeBigLock () {
  while (1) {
    bool flag = false;
    if (BigLock.compare_exchange_strong(flag, true)) {
      return;
    }
  }
}

/* Release BigLock */
void releaseBigLock () {
  while (1) {
    bool flag = true;
    if (BigLock.compare_exchange_strong(flag, false)) {
      return;
    }
  }
}

void *hello(void *a) {
  args* myargs = (args*) a;
  int tid = myargs->tid;
  takeBigLock ();
  cout << "Hello world! It is me, Thread " << tid << endl;
  releaseBigLock ();
  pthread_exit(NULL);
}

int main () {
  pthread_t threads[NTHREADS];
  BigLock.store(false);
  for (int i = 0; i < NTHREADS; i++) {
    args* a = new args;
    a->tid = i;
    takeBigLock ();
    cout << "main: creating thread 00" << i << endl;
    releaseBigLock ();
    int error = pthread_create(&threads[i], NULL, hello, (void *) a);
    if (error) {
      cout << "Error: unable to create thread," << error << endl;
      exit(-1);
    }
  }
  pthread_exit(NULL);
}
Here is an example run of the program.
bash-3.2$ g++ -std=c++11 biglock-driver.cpp
bash-3.2$ ./a.out
main: creating thread 000
main: creating thread 001
main: creating thread 002
Hello world! It is me, Thread 0
main: creating thread 003
Hello world! It is me, Thread 1
main: creating thread 004
Hello world! It is me, Thread 3
Hello world! It is me, Thread 2
main: creating thread 005
Hello world! It is me, Thread 4
main: creating thread 006
Hello world! It is me, Thread 5
main: creating thread 007
Hello world! It is me, Thread 6
Hello world! It is me, Thread 7
Note that the threads do not necessarily run in order even on two cores, but their messages are printed intact, without being garbled.
2.2. Nonblocking Concurrent Data Structures
In some cases, we have multiple threads operating on a shared data structure such that the updates of one thread become visible to others. We refer to such data structures as concurrent data structures. For example, multiple threads may use a stack or a queue data structure to send and receive items between each other.
The crux of the problem of designing a concurrent data structure is ensuring correctness without penalizing performance. Indeed, if we don't care about performance, it is trivial to implement a concurrent data structure by guarding it with a single lock: any thread that wants to use the data structure takes the lock, uses the data structure, and then releases the lock. In some cases, this is all we can do; e.g., the accesses to the Standard IO in our prior example cannot be handled differently without using a different, perhaps finer-grain, interface for IO. But in many cases, there is a lot that can be done to ensure good performance.
One of the most commonly used techniques is to use non-blocking atomic read-modify-write operations such as compare-and-exchange to implement concurrent data structures. In this approach, threads never block when using locks, so a thread cannot prevent another thread from making progress by taking a lock and blocking on it, hence the name "non-blocking". A non-blocking data structure is called lock-free if system-wide progress can be guaranteed, even as threads are suspended. A non-blocking data structure is called wait-free if per-thread progress can also be guaranteed.
2.2.1. A non-blocking stack data structure
Suppose that we wish to implement a concurrent stack data structure that can be shared by multiple threads. The code below shows a standard serial "integer" stack that can hold integer values in its nodes.
#include <iostream>
#include <cstdlib>

using namespace std;

class Node {
public:
  int value;
  Node* next;

  Node (int v) { value = v; next = NULL; }
  Node (int v, Node* u) { value = v; next = u; }
};

class Stack {
public:
  Node* top;

  Stack () { top = NULL; };
  int pop ();
  void push (int v);
};

int Stack::pop () {
  if (top == NULL) {
    cout << "Stack is Empty" << endl;
    return -12;
  }
  else {
    int oldTopValue = top->value;
    top = top->next;
    return oldTopValue;
  }
}

void Stack::push (int value) {
  top = new Node (value, top);
  return;
}
Allowing such a data structure to be used by multiple threads leads to race conditions, and thus accesses must be protected by enforcing mutual exclusion. An example where multiple threads operate on a shared stack is shown below.
#include <iostream>
#include <cstdlib>
#include <pthread.h>
#include <unistd.h>
#include <atomic>
#include "nb-stack.h"

using namespace std;

#define NTHREADS 8
#define NPUSHPOP 2

struct args {
  int tid;
  Stack* s;
};

void *pushPop(void *a) {
  args* myargs = (args*) a;
  int tid = myargs->tid;
  Stack* sharedStack = myargs->s;
  cout << "Hello world! It is me, 00" << tid << endl;
  for (int i = 0; i < NPUSHPOP; ++i) {
    int j = NPUSHPOP * tid + i;
    sharedStack->push (j);
    sleep (1);
    int k = sharedStack->pop ();
    sleep (1);
  }
  pthread_exit(NULL);
}

int main () {
  pthread_t threads[NTHREADS];
  Stack* sharedStack = new Stack ();
  for (int i = 0; i < NTHREADS; i++) {
    args* a = new args;
    a->tid = i;
    a->s = sharedStack;
    int error = pthread_create(&threads[i], NULL, pushPop, (void *) a);
    if (error) {
      cout << "Error: unable to create thread," << error << endl;
      exit(-1);
    }
  }
  pthread_exit(NULL);
}
The easiest way to ensure mutual exclusion is to use a lock to guard accesses to the shared data structure, as in our Standard IO example discussed earlier. Such a lock, however, serializes all accesses to the data structure, turning the data structure into a bottleneck. Instead, we would like to allow the threads to operate concurrently but without breaking the correctness of the stack.
We can use the compare-and-exchange operation that we saw earlier to achieve this. The idea is to identify the parts of the code that require mutually exclusive access to certain data and make sure that such data are updated atomically. The code below shows an implementation of stacks that attempts to guarantee mutual exclusion. In function push, we first set up the new node, u, that we want to add by using the current value of top. We then update top, only if it has not changed since we read it, by using compare_exchange_strong. Similarly, in function pop, we read the current value of top into a variable oldTop and then create the new value for top, next. We then replace top with next, using compare_exchange_strong, only if top has not changed since we last read it.
#include <iostream>
#include <cstdlib>
#include <atomic>

using namespace std;

class Node {
public:
  int value;
  Node* next;

  Node (int v) { value = v; next = NULL; }
};

class Stack {
public:
  std::atomic<Node*> top;

  Stack () { top = NULL; };
  int pop ();
  void push (int v);
};

int Stack::pop () {
  while (1) {
    /* take picture */
    Node* oldTop = top.load();
    if (oldTop == NULL) {
      cout << "Stack is Empty" << endl;
      return -12;
    }
    int oldTopValue = oldTop->value;
    /* local change */
    Node* next = oldTop->next;
    /* compare and commit */
    if (top.compare_exchange_strong(oldTop, next)) {
      return oldTopValue;
    }
  }
}

void Stack::push(const int value) {
  Node *u = new Node(value);
  while (1) {
    u->next = top.load();
    if (top.compare_exchange_strong(u->next, u)) {
      break;
    }
  }
}
The example above illustrates a typical use of the compare-and-swap operation. Can we prove our code to be correct? It might appear so, but this is actually not the case: the data structure is incorrect due to a problem called the "ABA problem."
2.2.2. ABA problem
While reasonably powerful, the compare-and-swap operation suffers from the so-called ABA problem. To see this, consider the following scenario where a shared variable top is updated by multiple threads in parallel.
-
Thread $T_1$ reads top and stores its current value, say node M, in oldTop. It then continues to compute next as the next node in the stack, say N. Before $T_1$ can complete, however, another thread steps in.
-
Immediately after this, another thread $T_2$ reads top and performs a pop operation to remove node M, and then pushes a new node O onto the stack. At this point, the stack looks like O::N, with the top of the stack pointing to node O, which points to node N. Thread $T_2$ then pushes M back onto the stack; the stack now looks like M::O::N.
-
Thread $T_1$ now steps in to continue its pop operation by attempting to run compare_exchange_strong. The operation succeeds because node M is at the top of the stack, which matches oldTop of $T_1$. The pop operation thus completes, setting top to next, which is N. This is, however, incorrect because node O has disappeared from the stack.
In summary, compare-and-swap was not able to detect the change in the stack structure, because it relies only on a simple "shallow" notion of equality. The fact that top and oldTop have the same node as their values does not mean that the stack has not been modified!
This problem is called the ABA problem, because it involves cycling the atomic memory cell through the values $A$, $B$, and again $A$. The ABA problem is an important limitation of compare-and-swap. It shows that the operation itself is not atomic in the true sense of the word but tries to imitate atomicity by relying on a shallow equality test. If it can be ensured that the shallow equality test of the subject memory cell suffices for correctness, the imitation can be useful; if not, then the operation cannot be considered atomic.
The ABA problem can lead to seemingly correct implementations that are in fact incorrect. To reduce the chances of bugs due to the ABA problem, memory objects subject to compare-and-swap are usually tagged with an additional field that counts the number of updates. This solves the basic problem, but only up to a point, because the counter itself can also wrap around. The load-link/store-conditional (LL/SC) operation solves this problem in principle by performing the write only if the memory location has not been updated since the last read (load). Unfortunately, practical implementations of LL/SC that actually meet its specification are hard to come by.
Here is a refinement of our non-blocking stack code that addresses the ABA problem. The main difference between the two versions is that the refined version keeps a counter along with the top of the stack and counts the number of times the stack is popped. This makes it possible to determine whether the stack has been modified since it was read by a thread. The approach, however, does not fully solve the ABA problem, because the counter may overflow and wrap around. This is, however, considered highly unlikely, and this solution is used quite frequently in practice.
#include <iostream>
#include <cstdlib>
#include <atomic>

using namespace std;

class Node {
public:
  int value;
  Node* next;

  Node (int v) { value = v; next = NULL; }
  Node (int v, Node* u) { value = v; next = u; }
};

class Stack {
public:
  struct state_t {
    Node* top;
    int64_t nPops;
  };

  std::atomic<state_t> state;

  Stack () {
    state_t s;
    s.top = NULL;
    s.nPops = 0;
    state.store (s);
  };

  int pop ();
  void push (int v);
};

int Stack::pop () {
  state_t oldState = state.load();
  // ALTERNATE: loading oldState only here (outside the loop) should also
  // work, because oldState is updated by compare_exchange_strong.
  while (1) {
    // This load is not needed in the ALTERNATE version.
    oldState = state.load ();
    Node* oldTop = oldState.top;
    if (oldTop == NULL) {
      cout << "Stack is Empty" << endl;
      return -12;
    }
    int oldTopValue = oldTop->value;
    Node* next = oldTop->next;
    /* new state */
    state_t newState;
    newState.top = next;
    newState.nPops = oldState.nPops + 1;
    if (state.compare_exchange_strong(oldState, newState)) {
      return oldTopValue;
    }
  }
}

void Stack::push(const int value) {
  Node *u = new Node(value);
  while (1) {
    state_t oldState = state.load ();
    u->next = oldState.top;
    /* new state */
    state_t newState;
    newState.top = u;
    newState.nPops = oldState.nPops;
    if (state.compare_exchange_strong(oldState, newState)) {
      break;
    }
  }
}
2.3. Memory Management
Another difficulty in concurrent data structures, and more generally in multithreaded programs, is ensuring correct usage of dynamically allocated memory. The challenge is to make sure that there are no "dangling references" to objects that are manually freed. This can be difficult, because in a multithreaded program there can be many different pointers to an object, making it difficult to find the "safe points" at which an object may be freed.
As a concrete example, imagine a shared stack data structure and suppose that a thread is in the middle of a pop operation and has read the top node M but has not completed the operation yet. Now some other thread comes along, pops the node M, and returns its value after freeing the node M. At this point, there is a dangling reference to the freed node M, because the first thread still holds a pointer to it in its variable oldTop.
3. Chapter: Structured Multithreading
Multithreading interfaces such as Pthreads enable the programmer to create a wide variety of multithreaded computations that can be structured in many different ways. Multithreaded programs, however, can be extremely difficult to reason about, get right (correct), and make efficient, because they could require reasoning about exponentially many interleavings of code.
Consider a toy multithreaded program consisting of 16 threads, each of which runs a 100-line program. To reason about the correctness of the program, we might have to reason about at least $16^{100}$ different interleavings. This is a huge number, larger than the number of atoms in the known universe. All evidence suggests that human beings are not able to reason about such a large number of possibilities. Perhaps for this reason, multithreaded programs are known to be notoriously difficult to get right. Common problems include
-
reasoning about the correctness of shared state and the absence of race conditions,
-
reasoning about efficiency of concurrent data structures, and
-
reasoning about correctness of memory management in non-GC’ed programming languages.
For example, over the past two decades researchers have developed many concurrent data structures, ranging from simple counters to more complex structures such as concurrent queues and stacks. Such data structures have proved very difficult to establish as correct; in fact, many were found to be incorrect after publication. The asymptotic efficiency of realistic concurrent data structures has also turned out to be difficult to establish, with relatively few tight bounds.
Fortunately, large classes of interesting multithreaded computations can be written using a more structured approach, where threads are restricted in the way that they synchronize with other threads. One such interesting class of computations is fork-join computations, where a thread can spawn or "fork" another thread, or "join" with another existing thread. Joining a thread is the only mechanism through which threads synchronize. The figure below illustrates a fork-join computation. The main thread forks thread A, which then spawns thread B. Thread B then joins thread A, which then joins thread M.
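As a small sketch of this structure, using the fork2 primitive that appears later in these notes (the array-summing function itself is hypothetical), the two branches below are forked to run in parallel, and the addition happens only after both branches have joined:

// Sums a[lo..hi) by forking the two halves; fork2 returns after both
// branches complete, which is the join.
long sum(const long* a, long lo, long hi) {
  if (hi - lo <= 1)
    return (hi == lo) ? 0 : a[lo];
  long mid = (lo + hi) / 2;
  long left, right;
  fork2([&] { left  = sum(a, lo, mid); },    // forked branch
        [&] { right = sum(a, mid, hi); });   // runs in parallel
  return left + right;                       // after the implicit join
}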
In addition to fork-join, there are other interfaces for structured multithreading such as async-finish, and futures. As in fork-join, these approaches provide the programmer with a simple interface for expressing concurrency by restricting the manner in which threads can synchronize.
One way to provide language support for structured multithreading is to simply use a more general threading library such as Pthreads. Although correct, this approach would be grossly inefficient, because structured multithreading can be implemented more efficiently. Indeed, structured multithreading has been adopted and implemented natively in many programming languages: the Cilk language is primarily based on fork-join but also has some limited support for async-finish; the X10 language is primarily based on async-finish but also supports futures; Fork Join Java and Habanero Java extend the Java language with structured multithreading; the Haskell language provides support for fork-join and futures as well as others; and the Parallel ML language, as implemented by the Manticore project and by the Steel project at CMU, is primarily based on fork-join parallelism. Such languages are sometimes called implicitly parallel languages.
The class of computations that can be expressed as fork-join and async-finish programs is sometimes called nested parallel. The term "nested" refers to the fact that a parallel computation can be nested within another parallel computation. This is in contrast to flat parallelism, where a parallel computation can only perform sequential computations in parallel. Flat parallelism used to be a common technique but is becoming increasingly less prominent.
The class of computations that can be expressed with futures is sometimes called pipelined parallelism. In this course, we will not discuss futures further.
3.1. Parallelism versus concurrency
Structured multithreading offers important benefits both in terms of efficiency and expressiveness. Using programming constructs such as fork-join and futures, it is possible to write parallel programs such that the program admits a "sequential semantics" but executes in parallel. The sequential semantics enables the programmer to treat the program as a serial program for the purposes of correctness. A run-time system then creates threads as necessary to execute the program in parallel. Since threads synchronize only in specific ways, the run-time system can optimize the creation and scheduling of threads so as to maximize efficiency. Structured multithreading thus offers in some ways the best of both worlds: the programmer can reason about correctness sequentially, but the program executes in parallel.
More precisely, consider a purely functional sequential language such as the untyped (pure) lambda calculus and its sequential dynamic semantics specified as a strict, small-step transition relation. We can extend this language with structured multithreading by enriching its syntax with "fork-join" and "futures" constructs. We can then extend the dynamic semantics of the language in two ways: 1) trivially ignore these constructs and execute serially as usual, or 2) execute in parallel by creating parallel threads. With some care, we can show these two semantics to be identical, i.e., they produce the same value for the same expressions. In other words, we can extend a rich programming language with fork-join and futures and still give the language a sequential semantics. This shows that structured multithreading is nothing but an efficiency and performance concern; it can be ignored from the perspective of correctness.
We use the term parallelism to refer to the idea of computing in parallel by using such structured multithreading constructs. As we shall see, we can write parallel algorithms for many interesting problems. Although parallel algorithms or applications constitute a large class, they don't cover all applications. Specifically, applications that can be expressed by using richer forms of multithreading, such as that offered by Pthreads, do not always admit a sequential semantics. In such concurrent applications, threads can communicate and coordinate in complex ways to accomplish the intended result. One such example is the multithreaded program that we considered for using a shared non-blocking stack. Another example, which is a classic concurrency example, is the "producer-consumer problem", where a consumer and a producer thread coordinate by using a fixed-size buffer of items. The producer fills the buffer with items and the consumer removes items from the buffer, and they coordinate to make sure that the buffer is never filled beyond its capacity. Operating-system level processes sometimes communicate similarly in some concurrent applications.
In summary, parallelism is a property of the hardware or the software platform where the computation takes place, whereas concurrency is a property of the application. Pure parallelism can be ignored for the purposes of correctness; concurrency cannot be ignored for understanding the behavior of the program.
Parallelism and concurrency are orthogonal dimensions in the space of all applications. Some applications are concurrent, some are not. Many concurrent applications can benefit from parallelism. For example, a browser, which is itself a concurrent application, may use a parallel algorithm to perform certain tasks. On the other hand, there is usually no need to add concurrency to a parallel application, because this unnecessarily complicates software. It can, however, lead to improvements in efficiency.
The following quote from Dijkstra suggests pursuing the approach of making parallelism just a matter of execution (not one of semantics), which is the goal of much of the work on the development of programming languages today. Note that in this particular quote, Dijkstra does not mention that parallel algorithm design requires thinking carefully about work and span, as opposed to just work as in sequential computing.
— Edsger W. Dijkstra
3.2. Parallelism and Mutual Exclusion
In parallel programming, mutual exclusion problems do not have to arise. For example, if we program in a purely functional language extended with structured multithreading primitives such as fork-join and futures, programs remain purely functional and mutual-exclusion problems, and hence race conditions, do not arise. If we program in an imperative language, however, where memory is always a shared resource, even when it is not intended to be so, threads can easily share memory objects, even unintentionally, leading to race conditions.
In the code below, both branches of fork2 are writing into b. What should the output of this program be?
long b = 0;

fork2([&] {
  b = 1;
}, [&] {
  b = 2;
});

std::cout << "b = " << b << std::endl;
At the time of the print, the contents of b are determined by the last write. Thus, depending on which of the two branches performs the write last, we can see both possibilities:
Output:
b = 1
Output:
b = 2
Consider the following alternative implementation of the Fibonacci function. By "inlining" the plus operation in both branches, the programmer got rid of the addition operation after the fork2.
long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    },[&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}
This code is not correct because it has a race condition.
As the example shows, separate threads are updating the value result, but it might look like this is not a race condition, because the update consists of an addition operation, which reads the value and then writes to it. The race condition might be easier to see if we expand out the applications of the += operator.
long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      long a1 = fib_par_racy(n-1);
      long a2 = result;
      result = a1 + a2;
    },[&] {
      long b1 = fib_par_racy(n-2);
      long b2 = result;
      result = b1 + b2;
    });
  }
  return result;
}
When written in this way, it is clear that the two parallel threads are not independent: they both read result and write to result. Thus the outcome depends on the order in which these reads and writes are performed, as shown in the next example.
The following table takes us through one possible execution trace of the call fib_par_racy(2). The number to the left of each instruction describes the time at which the instruction is executed. Note that since this is a multithreaded program, multiple instructions can be executed at the same time. The particular execution that we have in this example gives us a bogus result: the result is 0, not 1 as it should be.
Time step | Thread 1 | Thread 2 |
---|---|---|
1 | a1 = fib_par_racy(1) | b1 = fib_par_racy(0) |
2 | a2 = result | b2 = result |
3 | result = a1 + a2 | _ |
4 | _ | result = b1 + b2 |
The reason we get a bogus result is that both threads read the initial value of result at the same time and thus do not see each other's writes. In this example, the second thread "wins the race" and writes into result. The value 1 written by the first thread is effectively lost, overwritten by the second thread.
In the example, the race condition arises because of concurrent writes to the result variable. We can eliminate this kind of race condition by using different memory locations, or by using an atomic class and a compare_exchange_strong operation.
The following implementation of Fibonacci is not safe because the variable result is shared and updated by multiple threads.

long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    },[&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}
We can solve this problem by declaring result to be an atomic type and using a standard busy-waiting protocol based on compare-and-swap.

long fib_par_atomic(long n) {
  atomic<long> result(0);
  if (n < 2) {
    result.store(n);
  } else {
    fork2([&] {
      long r = fib_par_atomic(n-1);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp, exp+r);
        if (flag) {break;}
      }
    },[&] {
      long r = fib_par_atomic(n-2);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp, exp+r);
        if (flag) {break;}
      }
    });
  }
  return result;
}
The idea behind the solution is to load the current value of result and atomically update result only if it has not been modified (by another thread) since it was loaded. This guarantees that result is always updated (read and modified) correctly without missing an update from another thread.
4. Chapter: Scheduling Multithreaded Programs with Work Stealing
The work-stealing algorithm is a solution to the online scheduling problem, where we are given $P$ workers (a.k.a. processors or processes) and a computation dag that unfolds dynamically during execution, and are asked to construct an execution schedule with minimal length, while spending as little work and time on scheduling itself as possible.
We consider multithreaded computations represented as dags as described in an earlier chapter. To streamline the analysis, we assume without loss of generality that the root vertex has a single child. This assumption eliminates a few corner cases from the analysis.
4.1. Work-Stealing Algorithm
In work stealing, each worker maintains a deque (double-ended queue) of vertices. Each worker tries to work on its local deque as much as possible. Execution starts with the root vertex in the deque of one of the workers. It ends when the final vertex is executed.
A work-stealing scheduler operates as described by our generic scheduling algorithm, but instead of operating on threads, it operates on vertices of the dag. To obtain work, a worker pops the vertex at the bottom of its deque and executes it. We refer to the vertex being executed by a worker as the assigned vertex. When executed, the ready vertex can make other vertices ready, which are then pushed onto the bottom end of the deque in an arbitrary order. If a worker finds its deque empty, then the worker becomes a thief. A thief picks a victim worker at random and attempts to steal a thread from it by popping a thread off the top of the victim's deque. The thief performs steal attempts until it successfully steals a thread, at which point the thief goes back to work and the stolen thread becomes its assigned thread.
The pseudo-code for the algorithm is shown below. All deques are initially empty and the root vertex is assigned to worker zero. The algorithm operates in rounds. In each round, a worker executes the assigned vertex if there is one, pushes the newly enabled vertices onto its deque, and obtains a new assigned vertex from its deque. If the round starts with no assigned vertex, then the worker becomes a thief and performs a steal attempt. Note that a steal attempt starts and completes in the same round.
Such a steal attempt can fail if
-
the victim’s deque is empty, or
-
contention between workers occurs and the vertex targeted by the thief is executed by the worker that owns the deque or is stolen by another thief.
For the analysis of the algorithm, we shall assume that each instruction and each deque operation executes in a single step. As a result, each iteration of the loop, a round, completes in a constant number of steps.
// Assign root to worker zero.
assignedVertex = NULL
if (self == WorkerZero) {
  assignedVertex = rootVertex
}

// Run scheduling loop.
while (computationDone == false) {

  // Execute assigned vertex.
  if (assignedVertex <> NULL) {
    (nChildren, child1, child2) = execute (assignedVertex)
    if (nChildren == 1) {
      self.pushBottom child1
    } else if (nChildren == 2) {
      self.pushBottom child1
      self.pushBottom child2
    }
    assignedVertex = self.popBottom ()
  } else {
    // Make steal attempt.
    victim = randomWorker ()
    assignedVertex = victim.popTop ()
  }
}
Note
Note that when a vertex enables two vertices, they are both pushed onto the bottom of the deque in an order that is unspecified; the analysis holds for any such order. Realistic implementations, however, operate at the granularity of threads, as defined for example in an earlier chapter and by the generic scheduling algorithm. To make this algorithm consistent with such implementations, we would push the vertex that is next in the thread last, so that it is executed next. Pushing and popping that vertex is of course unnecessary and should be avoided in an implementation. For the purposes of the analysis, this adds a small constant factor that we do not care to account for.
4.1.1. Deque Specification
The deque supports three methods:
-
pushBottom, which pushes a vertex onto the bottom end of the deque.
-
popBottom, which returns the vertex at the bottom end of the deque if there is one, or returns NULL otherwise.
-
popTop, which returns the vertex at the top end of the deque if there is one, or returns NULL if the deque is empty.
For the analysis, we assume that these operations take place atomically. We can think of them as starting by taking a lock for the deque, performing the desired operation, and releasing the lock.
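Here is a lock-based C++ sketch of this interface, written only to make that atomicity assumption concrete; the Vertex type is assumed, and realistic work-stealing deques use more efficient non-blocking implementations.

#include <deque>
#include <mutex>

struct Vertex;   // a vertex of the computation dag (assumed)

class WorkDeque {
public:
  void pushBottom(Vertex* v) {
    std::lock_guard<std::mutex> guard(lock);
    items.push_back(v);
  }
  Vertex* popBottom() {
    std::lock_guard<std::mutex> guard(lock);
    if (items.empty()) return nullptr;
    Vertex* v = items.back(); items.pop_back();
    return v;
  }
  Vertex* popTop() {
    std::lock_guard<std::mutex> guard(lock);
    if (items.empty()) return nullptr;
    Vertex* v = items.front(); items.pop_front();
    return v;
  }
private:
  std::deque<Vertex*> items;   // bottom = back, top = front
  std::mutex lock;             // makes each operation atomic
};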
For the analysis, we shall assume that all these operations take constant time and in fact complete in one step. This assumption is somewhat unrealistic, because it is not known whether popTop can be implemented in constant time. But a relaxed version of popTop, which is allowed to return NULL if another concurrent operation removes the top vertex from the deque, admits a constant-time implementation. This relaxed version suffices for our purposes.
4.1.2. Work sequence of a worker
Consider the execution of the work stealing algorithm and let $q$ be any worker. We define the work sequence of $q$ as the sequence of vertices defined by the assigned vertex of $q$ followed by the vertices in its deque ordered from bottom to top. If a vertex is in the work sequence of a worker, then we say that it belongs to that worker.
Consider the deque for a worker shown below, along with the assigned vertex. The work sequence for this worker is $\langle v_0, v_1, v_2, v_3 \rangle$.
Now, if the worker completes executing $v_0$, which enables no new children, then the work sequence consists of the vertices in the deque, i.e., $\langle v_1, v_2, v_3 \rangle$. If the worker later removes $v_1$ from its deque and starts working on it, i.e., $v_1$ becomes the assigned vertex, then the work sequence remains the same, i.e., $\langle v_1, v_2, v_3 \rangle$.
4.1.3. Enabling Tree
At any step in an execution, we say that a vertex is ready if all the ancestors of the vertex in the dag have been executed prior to that step. If the execution of a vertex $u$ makes another vertex $v$ ready, then we say that $u$ enables $v$, and call the edge $(u,v)$ an enabling edge. We call $u$ the designated parent of $v$. Every vertex except for the root has a designated parent. Therefore the subgraph of the dag consisting of the enabling edges forms a rooted tree, called the enabling tree. Note that each execution can have a different enabling tree.
4.1.4. An Example Run with Work Stealing
Consider the following computation dag.
The following drawings illustrate a 2-worker execution of this dag using work stealing. Time flows left to right and top to bottom.
The enabling tree for this execution is shown below.
4.1.5. Structural Lemma
The proof of the structural lemma is by induction on the number of rounds of the execution.
Consider the first round. At initialization, before the beginning of the first round, all deques are empty and the root vertex is assigned to worker zero but has not been executed. The root vertex is then executed. By assumption, the root vertex has a single child $v$, which becomes enabled, is pushed onto the deque, and is popped again, becoming the assigned vertex at the beginning of the second round. At any point in time after the execution of the root, worker zero's work sequence consists of $v$. The designated parent of $v$ is the root vertex and the lemma holds trivially.
For the inductive case, assume that the lemma holds up to the beginning of some later round. We will show that it holds at any point during the round and also after the completion of the round.
Consider any worker and its deque. We have two cases to consider.
Case 1: There is an assigned vertex, $v_0$, which is executed.
By the definition of work sequences, we know that $v_1, \ldots, v_k$ are the vertices in the deque. Let $u_1, \ldots, u_k$ be their designated parents. By induction, we know that $u_i$ is an ancestor of $u_{i-1}$ in the enabling tree and the ancestor relationship is proper except for $i = 1$, where it is possible that $u_0 = u_1$. Immediately after the execution of the assigned node, the work sequence of the worker consists of all the vertices in the deque and the lemma follows.
After the execution of the assigned vertex $v_0$, we have several sub-cases to consider.
Case 1.1: execution of $v_0$ enables no children.
Since the deque remains the same, the lemma holds trivially.
Case 1.2: execution of $v_0$ enables one child $x$, which is pushed onto the bottom of the deque. In this case, $v_0$ becomes the parent of $x$. The lemma holds.
Case 1.3: execution of $v_0$ enables two children $x,y$, which are pushed onto the bottom of the deque in an arbitrary order.
In this case, $v_0$ becomes the designated parent of $x$ and $y$. We need to show that $v_0 \neq u_1$. This holds because $u_1$ is an ancestor of $u_0$, and $u_0$, being the designated parent of $v_0$, is a proper ancestor of $v_0$; thus $u_1$ is a proper ancestor of $v_0$. The lemma holds.
After the execution of the assigned vertex completes and the children are pushed, the worker pops the vertex at the bottom of the deque. There are two cases to consider.
-
If the deque is empty, then the worker finds no vertex in its deque and there is no assigned vertex at the end of the round, thus the work sequence is empty and the lemma holds trivially.
-
If the deque is not empty, then the vertex at the bottom of the deque becomes the new assigned vertex. The lemma holds trivially because making the bottom vertex the assigned vertex has no impact on the work sequence of the worker and thus the correctness of the lemma.
Case 2: A successful steal takes place and removes the top vertex from the deque. In this case, the victim worker loses its top vertex, which becomes the assigned vertex of the thief. The work sequence of the victim loses its rightmost element. It is easy to see that the lemma continues to hold for the victim. When the stolen vertex is assigned, the work sequence of the thief consists of just the stolen vertex, and the lemma holds for the thief.
4.2. Analysis
The strategy of the analysis is to assign a potential to each vertex and show that the potential decreases geometrically as the execution proceeds.
4.2.1. Weights
If $d(u)$ is the depth of a vertex $u$ in the enabling tree, then we define the weight of $u$, written $w(u)$, as $w(u) = S - d(u)$. The root has weight $S$. Intuitively, the weight is equal to the distance of a vertex from completion.
A crucial corollary of the structural lemma is that the weights of the vertices decrease from top to bottom.
4.2.2. Balls and Bins Game
One crucial fact behind the analysis is a probabilistic lemma, called the Balls and Bins lemma. This lemma proves something relatively intuitive: if you throw as many balls as there are bins, chances are good that you will have a ball in at least a constant fraction of the bins, because the chance of all balls landing in a small number of bins is low.
Let's calculate the probability for $\beta = 1/2$. By the lemma, we know that if $P$ balls are thrown into $P$ bins, then the probability that the total weight of the bins that have a ball in them is at least half the total weight is $\Pr \lbrack X \ge \frac{W}{2} \rbrack \ge 1 - \frac{1}{0.5e}$. Since $e > 2.71$, this quantity is at least $0.25$. We thus conclude that, using the Balls and Bins lemma, we can "collect" at least half of the weight with probability at least $0.25$.
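Written out, the calculation is

$$ \Pr \left\lbrack X \ge \frac{W}{2} \right\rbrack \;\ge\; 1 - \frac{1}{0.5\,e} \;=\; 1 - \frac{2}{e} \;>\; 1 - \frac{2}{2.71} \;>\; 0.25. $$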
4.2.3. Bound in terms of Work and Steal Attempts
Note
The proof assumes that each instruction, including each deque operation, takes a (fixed) constant number of steps, because it assumes that each round contributes to the work or to the steal bucket. If this assumption is not valid, then we might need to change the notion of rounds so that they are large enough for steals to complete.
4.2.4. Bounding the Number of Steal Attempts
Our analysis will use a potential-function based method. We shall divide the computation into phases each of which decreases the potential by a constant factor.
We start with a few definitions. Consider some round $i$.
-
Let $R_i$ denote the set of ready vertices in the beginning of that round.
-
A vertex in $R_i$ is either assigned to a worker or is in a deque. Let $R_i(q)$ denote the set of ready vertices belonging to a worker $q$ at the beginning of round $i$; these are exactly the vertices in the work sequence of that worker.
-
For each vertex $v \in R_i$, we define its potential as
-
$\phi_i(v) = 3^{2w(v) - 1}$, if $v$ is assigned, or
-
$\phi_i(v) = 3^{2w(v)}$, otherwise.
-
Note that the potential of a vertex is always a natural number.
-
The potential of worker $q$ is $\Phi_i(q) = \sum_{v \in R_i(q)}{\phi_i(v)}.$
-
We write $H_i$ (mnemonic for "Hood") for the set of workers whose deques are empty at the beginning of round $i$. We write $\Phi_i(H_i)$ for the total potential of the workers $H_i$, $\Phi_i(H_i) = \sum_{q \in H_i}{\Phi_i(q)}$.
-
We write $D_i$ for the set of other workers. We write $\Phi_i(D_i)$ for the total potential of the workers $D_i$, $\Phi_i(D_i) = \sum_{q \in D_i}{\Phi_i(q)}$.
-
Define the total potential at round $i$, written $\Phi_i$ as $\Phi_i = \sum_{v \in R_i}{\phi_i(v)}$. We can write the total potential in round $i$ as follows $\Phi_i = \Phi_i(H_i) + \Phi_i(D_i)$.
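To illustrate how execution decreases the potential (a sketch of the standard calculation, stated under the definitions above), suppose an assigned vertex $v$ with potential $\phi_i(v) = 3^{2w(v)-1}$ is executed and enables two children $x$ and $y$, both of weight $w(v) - 1$; one becomes the new assigned vertex and the other is pushed onto the deque, so their combined potential is

$$ 3^{2(w(v)-1)-1} + 3^{2(w(v)-1)} \;=\; \left(\frac{1}{9} + \frac{1}{3}\right) 3^{2w(v)-1} \;=\; \frac{4}{9}\,\phi_i(v), $$

a decrease of $\frac{5}{9}\,\phi_i(v)$; the other cases (fewer children enabled, or a vertex stolen and assigned at the thief) can be checked similarly to decrease the potential by at least as large a fraction.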
We have thus established that the potential decreases, but this by itself does not suffice. We also need to show that it decreases by some significant amount. This is our next step in the analysis: we show that after $P$ steal attempts the total potential decreases with constant probability.
Important
For this lemma to hold, it is crucial that a steal attempt does not fail unless the deque is empty or the vertex being targeted at the time is popped from the deque in some other way. This is why we required the popTop operation called by a worker to fail only if the top vertex is removed from the deque by another worker.
4.3. Chapter Notes.
The material presented here is adapted from the paper:
In this chapter, we consider dedicated environments, and simplify and streamline the proof for this case. The original paper considers multiprogrammed environments.