Making Parallel Programming Easy and Portable
For parallel computing to succeed in the mainstream, parallel
programming needs to be made as easy as sequential programming, or at
least almost as easy. Currently, many problems that can be solved in a
couple dozen lines of sequential code require hundreds or sometimes
thousands of lines of code to be solved efficiently in parallel.
Furthermore, the parallel code is typically much harder to understand,
modify, and debug than its sequential counterpart. This has limited
parallel programming to experts, and to applications in which
performance is absolutely critical. Our contention, along with that of
many others, is that programming parallel algorithms and applications is
not inherently difficult, but rather is difficult at present for the
following reasons:
- The community initially concentrated on the short-term goal of getting
"peak" performance rather than the long-term goal of reducing software
development costs. This emphasis is changing, and we are likely to see
much better parallel tools and languages in the future.
- The early parallel machines had poor support for communication
among processors (e.g., high latency, low bandwidth, and awkward
protocols), so much of the programming and compiler effort went
into low-level details of how to reduce communication costs rather
than high-level ideas on how to simplify applications and algorithms.
With the advent of machines with better support for communication, and
of software support that hides many of the communication details, this
emphasis is changing.
- For pragmatic reasons there has been an emphasis on making simple
modifications to sequential languages, such as adding libraries, rather
than thinking about parallelism from scratch. This has
resulted in languages in which the parallelism interacts badly with
other features of the language. It has also limited the types of
extensions that have been considered.
Quicksort: A Motivating Example
To appreciate that parallelism is not inherently difficult, consider
the Quicksort algorithm. Here is pseudocode for the algorithm from
Aho, Hopcroft, and Ullman's textbook ``The Design and Analysis of
Computer Algorithms''.
procedure QUICKSORT(S):
    if S contains at most one element then return S
    else
    begin
        choose an element a randomly from S;
        let S_1, S_2, and S_3 be the sequences of elements in S
            less than, equal to, and greater than a, respectively;
        return (QUICKSORT(S_1) followed by S_2 followed by QUICKSORT(S_3))
    end
The authors presented this as a sequential algorithm, and it can be
expressed concisely in most sequential languages.
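For example, a direct transcription of the pseudocode into C++ stays
short (a sketch; the names and the random-pivot choice simply follow
the pseudocode above):

#include <random>
#include <vector>

// Sequential transcription of the pseudocode: split S around a random
// pivot a into less-than (S1), equal (S2), and greater (S3) pieces,
// recurse on the outer two pieces, and concatenate the results.
std::vector<int> quicksort(const std::vector<int>& S) {
    if (S.size() <= 1) return S;
    static std::mt19937 gen{std::random_device{}()};
    int a = S[std::uniform_int_distribution<int>(0, (int)S.size() - 1)(gen)];
    std::vector<int> S1, S2, S3;
    for (int e : S) (e < a ? S1 : e == a ? S2 : S3).push_back(e);
    std::vector<int> R = quicksort(S1), R3 = quicksort(S3);
    R.insert(R.end(), S2.begin(), S2.end());
    R.insert(R.end(), R3.begin(), R3.end());
    return R;
}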
The interesting aspect, however, is that the algorithm is in fact
highly parallel in nature. There are two places in which parallelism
appears in the algorithm:
- The two recursive calls to QUICKSORT can be made in parallel. We
could fork these off as two parallel tasks, which in turn could each
fork their own recursive calls off as two parallel tasks (see the
sketch after this list).
- We can subselect the elements S_1 from S in parallel (similarly for
S_2 and S_3). If each processor has some subset of the input S, then
clearly we can broadcast the pivot a, and each processor can
independently select the elements in its subset that are less than the
pivot.
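As a concrete illustration of the first form, here is a sketch in C++
that forks the recursive calls with std::async (illustrative only: a
real implementation would fall back to sequential code below some grain
size rather than spawn a task per call):

#include <future>
#include <vector>

// Fork the two recursive calls as parallel tasks.  Note that the
// partitioning loop here is still sequential; the second form of
// parallelism would distribute that work as well.
std::vector<int> par_quicksort(const std::vector<int>& S) {
    if (S.size() <= 1) return S;
    int a = S[S.size() / 2];  // pivot choice simplified for the sketch
    std::vector<int> S1, S2, S3;
    for (int e : S) (e < a ? S1 : e == a ? S2 : S3).push_back(e);
    auto left  = std::async(std::launch::async, par_quicksort, S1);
    auto right = std::async(std::launch::async, par_quicksort, S3);
    std::vector<int> R = left.get(), R3 = right.get();
    R.insert(R.end(), S2.begin(), S2.end());
    R.insert(R.end(), R3.begin(), R3.end());
    return R;
}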
Although Quicksort can be considered a "toy" example, the algorithm is
actually quite involved relative to many other parallel applications
because it is highly dynamic (one does not know the sizes of the
recursive calls ahead of time), and has multiple sources of
parallelism (the recursive calls and the subselection for generating
S_1).
Expressing Quicksort in Parallel
If Quicksort is indeed inherently parallel, it seems that we should be
able to express it in a parallel language as concisely as in a
sequential language. Much of the earlier work on parallelizing
Quicksort considered only the parallelism from the recursive calls
[Hal85], which leads to very little parallelism: in such a version just
partitioning the elements still requires O(n) time, so against the
O(n lg n) work of a sequential implementation the best speedup we could
hope for is O(lg n). Here we are interested in a version that expresses
both forms of parallelism.
To appreciate the difficulty of programming even a "toy" example such as
Quicksort, we coded a parallel version of the algorithm that takes
advantage of both forms of parallelism in C + MPI (MPI is a message
passing interface for programming parallel machines that can be used as
a library with C or Fortran). This required 1700 lines of code, including
explicit load balancing: hardly a toy. Here is the code for Quicksort in
NESL:
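function QUICKSORT(S) =
if (#S <= 1) then S
else
    let a = S[rand(#S)];                % pick a random pivot %
        S_1 = {e in S | e < a};
        S_2 = {e in S | e == a};
        S_3 = {e in S | e > a};
        R = {QUICKSORT(v) : v in [S_1, S_3]};
    in R[0] ++ S_2 ++ R[1];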
In NESL parallelism is expressed with curly brackets. For example, in
the above code the expression {e in S | e < a} can be read as "in
parallel, for each e in S, select all the e's that are less than a".
Similarly, {QUICKSORT(v) : v in [S_1, S_3]} can be read as "in
parallel, for each v in the sequence [S_1, S_3], QUICKSORT(v)".
Of course, having concise parallel code does not imply an efficient
implementation, and a major effort in our project has gone into
compiler techniques.
Our overall goal can therefore be summarized as being able to
write parallel code as concisely and clearly as sequential code, while
achieving close to peak performance.