Making Parallel Programming Easy and Portable
For parallel computing to succeed in the mainstream, parallel
programming needs to be made as easy as sequential programming, or at
least almost as easy. Currently, many problems that can be solved in a
couple dozen lines of sequential code require hundreds or sometimes
thousands of lines of code to be solved efficiently in parallel.
Furthermore, the parallel code is typically much harder to understand,
modify, and debug than its sequential counterpart. This has limited
parallel programming to experts, and to applications in which
performance is absolutely critical. Our contention, along with that of
many others, is that programming parallel algorithms and applications is
not inherently difficult, but rather is difficult at present for the
following reasons:
- The community initially concentrated on the short-term goal of getting
"peak" performance rather than the long-term goal of reducing software
development costs. This emphasis is changing, and we are likely to see
much better parallel tools and languages in the future.
- The early parallel machines had poor support for communication
among processors (e.g., high latency, low bandwidth, and awkward
protocols), so much of the programming and compiler effort went
into low-level details of how to reduce communication costs rather
than high-level ideas on how to simplify applications and algorithms.
With the advent of machines with better support for communication, and
of software support that hides many of the communication details, this
emphasis is changing.
- For pragmatic reasons there has been an emphasis on making simple
modifications to sequential languages, such as adding libraries, rather
than thinking about parallelism from scratch. This has
resulted in languages in which the parallelism interacts badly with
other features of the language. It has also limited the types of
extensions that have been considered.
Quicksort: A Motivating Example
To appreciate that parallelism is not inherently difficult, consider
the Quicksort algorithm. Here is pseudocode for the algorithm from
Aho, Hopcroft, and Ullman's textbook ``The Design and Analysis of
Computer Algorithms''.
procedure QUICKSORT(S):
    if S contains at most one element then return S
    else
    begin
        choose an element a randomly from S;
        let S_1, S_2, and S_3 be the sequences of elements in S
            less than, equal to, and greater than a, respectively;
        return (QUICKSORT(S_1) followed by S_2 followed by QUICKSORT(S_3))
    end
The authors presented this as a sequential algorithm, and it can be
expressed concisely in most sequential languages.
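For example, a direct transcription of the pseudocode into C++ stays
short (a sketch; the names and the random-pivot choice simply follow
the pseudocode above):

#include <random>
#include <vector>

// Sequential transcription of the pseudocode: split S around a random
// pivot a into less-than (S1), equal (S2), and greater (S3) pieces,
// recurse on the outer two pieces, and concatenate the results.
std::vector<int> quicksort(const std::vector<int>& S) {
    if (S.size() <= 1) return S;
    static std::mt19937 gen{std::random_device{}()};
    int a = S[std::uniform_int_distribution<int>(0, (int)S.size() - 1)(gen)];
    std::vector<int> S1, S2, S3;
    for (int e : S) (e < a ? S1 : e == a ? S2 : S3).push_back(e);
    std::vector<int> R = quicksort(S1), R3 = quicksort(S3);
    R.insert(R.end(), S2.begin(), S2.end());
    R.insert(R.end(), R3.begin(), R3.end());
    return R;
}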
The interesting aspect, however, is that the algorithm is in fact
highly parallel in nature. There are two places in which parallelism
appears in the algorithm:
- The two recursive calls to QUICKSORT can be made in parallel. We
could fork these off as two parallel tasks, which in turn could each
fork their own recursive calls off as two parallel tasks (see the
sketch after this list).
- We can subselect the elements S_1 from S in parallel (similarly for
S_2 and S_3). If each processor has some subset of the input S, then
clearly we can broadcast the pivot a, and each processor can
independently select the elements in its subset that are less than the
pivot.
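As a concrete illustration of the first form, here is a sketch in C++
that forks the recursive calls with std::async (illustrative only: a
real implementation would fall back to sequential code below some grain
size rather than spawn a task per call):

#include <future>
#include <vector>

// Fork the two recursive calls as parallel tasks.  Note that the
// partitioning loop here is still sequential; the second form of
// parallelism would distribute that work as well.
std::vector<int> par_quicksort(const std::vector<int>& S) {
    if (S.size() <= 1) return S;
    int a = S[S.size() / 2];  // pivot choice simplified for the sketch
    std::vector<int> S1, S2, S3;
    for (int e : S) (e < a ? S1 : e == a ? S2 : S3).push_back(e);
    auto left  = std::async(std::launch::async, par_quicksort, S1);
    auto right = std::async(std::launch::async, par_quicksort, S3);
    std::vector<int> R = left.get(), R3 = right.get();
    R.insert(R.end(), S2.begin(), S2.end());
    R.insert(R.end(), R3.begin(), R3.end());
    return R;
}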
Although Quicksort can be considered a "toy" example, the algorithm is
actually quite involved relative to many other parallel applications
because it is highly dynamic (one does not know the sizes of the
recursive calls ahead of time), and has multiple sources of
parallelism (the recursive calls and the subselection for generating
S_1).
Expressing Quicksort in Parallel
If Quicksort is indeed inherently parallel, it seems that we should be
able to express it in a parallel language as concisely as in a
sequential language. Much of the earlier work on parallelizing
Quicksort considered only the parallelism from the recursive calls
[Hal85], which leads to very little parallelism: in such a version just
partitioning the elements still requires O(n) time, so against the
O(n lg n) work of a sequential implementation the best speedup we could
hope for is O(lg n). Here we are interested in a version that expresses
both forms of parallelism.
To appreciate the difficulty of programming even a "toy" example such as
Quicksort, we coded a parallel version of the algorithm that takes
advantage of both forms of parallelism in C + MPI (MPI is a message
passing interface for programming parallel machines that can be used as
a library with C or Fortran). This required 1700 lines of code, including
explicit load balancing: hardly a toy. Here is the code for Quicksort in
NESL:
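function QUICKSORT(S) =
if (#S <= 1) then S
else
    let a = S[rand(#S)];                % pick a random pivot %
        S_1 = {e in S | e < a};
        S_2 = {e in S | e == a};
        S_3 = {e in S | e > a};
        R = {QUICKSORT(v) : v in [S_1, S_3]};
    in R[0] ++ S_2 ++ R[1];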
In NESL parallelism is expressed with curly brackets. For example, in
the above code the expression {e in S | e < a} can be read as "in
parallel, for each e in S, select all the e's that are less than a".
Similarly, {QUICKSORT(v) : v in [S_1, S_3]} can be read as "in
parallel, for each v in the sequence [S_1, S_3], QUICKSORT(v)".
Of course, having concise parallel code does not imply an efficient
implementation, and a major effort in our project has gone into
compiler techniques.
Our overall goal can therefore be summarized as being able to
write parallel code as concisely and clearly as sequential code, while
achieving close to peak performance.