Abstract: We have implemented an efficient fully vectorized and parallelized radix sort algorithm on the Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided from Fortran. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64-bit keys per second.
This paper describes the parallel radix-sort algorithm, introduces some techniques for mapping parallel algorithms onto vector multiprocessors, discusses our implementation of the radix-sort on the Y-MP, and derives equations to predict and optimize the performance of the sort. The most interesting elements of the implementation are a vectorized and parallelized histogram operation, a vectorized and parallelized scan (all-prefix-sums) operation, and a memory access pattern designed to avoid memory bank conflicts. The algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2.
@inproceedings{cray-sort, author = "Marco Zagha and Guy~E. Blelloch", title = "Radix Sort for Vector Multiprocessors", booktitle = "Proceedings Supercomputing~'91", pages = "712--721", month = nov, year = 1991 }