This report describes and defines the data-parallel language NESL and discusses the goals that motivated its design.
NESL is a strongly typed, strict, first-order functional (applicative) language. It runs within an interactive environment and is loosely based on the ML language [34]. The language uses sequences as a primitive parallel data type, and parallelism is achieved exclusively through operations on these sequences [13]. The set of sequence functions supplied by NESL was chosen based both on their usefulness for a broad variety of algorithms and on their efficiency when implemented on parallel machines. To promote the use of parallelism, NESL supplies no serial looping constructs (although serial looping can be simulated with recursion). NESL has been used for three years now for teaching parallel algorithms [10], and many applications and algorithms have been written in the language [22, 4, 5].
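The flavor of this style can be sketched in Python (an illustration only; NESL's own syntax appears in Section 2): every construct operates on a whole sequence at once, so the expressed work is implicitly parallel.

```python
# Whole-sequence operations replace explicit serial loops.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Elementwise addition over two sequences -- in a data-parallel
# language each position can be computed independently.
elementwise = [x + y for x, y in zip(a, b)]   # [11, 22, 33, 44]

# A reduction (summing a sequence) is another sequence operation
# with a well-known parallel implementation.
total = sum(a)                                # 10
```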
NESL is the first data-parallel language whose implementation supports nested parallelism. Nested parallelism is the ability to take a parallel function and apply it over multiple instances in parallel, for example, taking a parallel sorting routine and using it to sort several sequences concurrently. The data-parallel languages C* [38], *Lisp [31], and Fortran 90 [1] (with array extensions) support no form of nested parallelism. The parallel collections in these languages can only contain scalars or fixed-size records, and there is no means in these languages to apply a user-defined function over each element of a collection. This prohibits the expression of any form of nested parallelism. The languages Connection Machine Lisp [45] and Paralation Lisp [39] both supply nested parallel constructs, but no implementation ever supported the parallel execution of these constructs. Blelloch and Sabot implemented an experimental compiler that supported nested parallelism for a small subset of Paralation Lisp [13], but it was deemed nearly impossible to extend it to the full language.
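The sorting example can be sketched in Python (illustrative only, using threads rather than NESL's mechanisms): an already-parallel routine is mapped over several inputs at once, giving an outer level of parallelism on top of the inner one.

```python
from concurrent.futures import ThreadPoolExecutor

seqs = [[3, 1, 2], [9, 7], [5, 4, 6, 8]]

# Outer level of parallelism: sort the sequences concurrently.
# (In a nested data-parallel language the inner sorts could
# themselves also run in parallel.)
with ThreadPoolExecutor() as pool:
    sorted_seqs = list(pool.map(sorted, seqs))

# sorted_seqs == [[1, 2, 3], [7, 9], [4, 5, 6, 8]]
```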
A common complaint about high-level data-parallel languages and, more generally, about the class of languages based on operations over collections [42], such as SETL [40] and APL [29], is that it can be hard or impossible to determine approximate running times by looking at the code. As an example, the general communication primitive in CM-Lisp is powerful enough that seemingly similar pieces of code can take very different amounts of time, depending on details of the implementation of the operation and of the data structures. A similar complaint is often made about SETL, a language with sets as a primitive data structure: the time taken by the set operations in SETL is strongly affected by how the set is represented, and this representation is chosen by the compiler.
For this reason, NESL was designed so that the built-in functions are quite simple and so that the asymptotic complexity can be derived from the code. To derive the complexity, each function in NESL has two complexity measures associated with it: the work and depth complexities [13]. The work complexity represents the serial work executed by a program, that is, the running time if executed on a serial RAM. The depth complexity represents the deepest path taken by the function, that is, the running time if executed with an unbounded number of processors. Simple composition rules can be used to combine the two complexities across expressions and, based on Brent's scheduling principle [14], the two complexities place an upper bound on the asymptotic running time on the parallel random access machine (PRAM) [19].
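As a concrete illustration of these measures, consider a balanced parallel reduction that sums n numbers: it performs n - 1 additions in total (work) across about log2(n) levels (depth). The helper functions below are our own sketch, not part of NESL.

```python
import math

def sum_work_depth(n):
    """Work and depth of a balanced parallel reduction over n elements."""
    work = n - 1 if n > 0 else 0
    depth = math.ceil(math.log2(n)) if n > 1 else 0
    return work, depth

def brent_time_bound(work, depth, p):
    """Brent's scheduling principle: on p processors the running
    time is bounded by O(work / p + depth)."""
    return work / p + depth

w, d = sum_work_depth(1024)         # w == 1023, d == 10
bound = brent_time_bound(w, d, 64)  # 1023/64 + 10
```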
The current compiler translates NESL to VCODE [9], a portable intermediate language. The compiler uses a technique called flattening nested parallelism [13] to translate NESL into the simpler flat data-parallel model supplied by VCODE. VCODE is a small stack-based language with about 100 functions, all of which operate on sequences of atomic values (scalars are implemented as sequences of length 1). A VCODE interpreter has been implemented for running VCODE on the Cray C90 and J90, the Connection Machine CM-5, or any serial machine with a C compiler [8]. We also have an MPI [20] version of VCODE [25], which runs on machines that support MPI, such as the IBM SP-2, the Intel Paragon, or clusters of workstations. The sequence functions in this interpreter have been highly optimized [13, 17] and, for large sequences, the interpretive overhead becomes relatively small, yielding high efficiencies.
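The idea behind flattening can be sketched as follows (a serial Python simulation assuming a segment-descriptor representation; the names are ours, not VCODE's): a nested sequence is stored as one flat data vector plus the subsequence lengths, so a nested operation becomes a single flat segmented operation.

```python
# A nested sequence is represented by a flat data vector plus a
# segment descriptor giving the length of each subsequence.
nested  = [[1, 2, 3], [4], [5, 6]]
data    = [x for seg in nested for x in seg]   # [1, 2, 3, 4, 5, 6]
lengths = [len(seg) for seg in nested]         # [3, 1, 2]

def segmented_sum(data, lengths):
    """Serial simulation of a segmented sum: one pass over the flat
    vector computes the sum of every segment."""
    sums, i = [], 0
    for n in lengths:
        sums.append(sum(data[i:i + n]))
        i += n
    return sums

sums = segmented_sum(data, lengths)   # [6, 4, 11]
```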
The interactive NESL environment runs within Common Lisp and can be used to run VCODE on remote machines. This allows the user to run the environment, including the compiler, on a local workstation while executing interactive calls to NESL programs on the remote parallel machines. As in the Standard ML of New Jersey compiler [2], all interactive invocations are first compiled (in our case into VCODE), and then executed.
Control-parallel languages that have features similar to those of NESL include ID [35, 3], Sisal [32], and Proteus [33]. ID and Sisal are both side-effect free and supply operations on collections of values.
The remainder of this section discusses the use of sequences and nested parallelism in NESL, and how complexity can be derived from NESL code. Section 2 shows several examples of code, and Section 3, along with Appendix A and Appendix B, defines the language. Shortcomings of NESL include the limitation to first-order functions (there is no ability to pass functions as arguments). We are currently working on a follow-up to NESL, which will be based on a more rigorous type system and will include some support for higher-order functions.