Abstract: We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature.
Our computational experiments show that the sample sort algorithm, which is a theoretically efficient "randomized" algorithm, is the fastest of the three algorithms on large data sets. On a 64K-processor CM-2, our sample sort implementation can sort 32 x 10^6 64-bit keys in 5.1 seconds, which is over 10 times faster than the CM-2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. This paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations.
@inproceedings{cmsort, author = "Guy~E. Blelloch and Charles~E. Leiserson and Bruce~M. Maggs and C.~Gregory Plaxton and Stephen~J. Smith and Marco Zagha", title = "A Comparison of Sorting Algorithms for the {Connection Machine CM-2}", booktitle = "Proceedings Symposium on Parallel Algorithms and Architectures", pages = "3--16", address = "Hilton Head, SC", month = jul, year = 1991 }