Suffix Arrays (SA):
Given a string, generate its suffix array (the starting positions of
all suffixes of the input, listed in sorted order of the suffixes).
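To make the definition concrete, here is a minimal sketch that builds a suffix array by directly sorting the suffix start positions. The function name `suffix_array` is a hypothetical choice for illustration, and this naive approach is far too slow for the benchmark inputs; it is not the method any benchmark implementation is expected to use.

```python
def suffix_array(s: str) -> list[int]:
    """Return the 0-based start positions of all suffixes of s,
    ordered lexicographically by the suffix each position starts."""
    # Naive construction: compare whole suffixes while sorting.
    # Quadratic or worse in the string length; illustration only.
    return sorted(range(len(s)), key=lambda i: s[i:])


# Suffixes of "banana": a(5), ana(3), anana(1), banana(0), na(4), nana(2)
print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

For ASCII input, Python's string comparison orders characters by code point, which matches byte-wise lexicographic order.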
Input and Output File Formats
The input is an ASCII string and the output is an integer sequence
in the sequence format. The integers
in the output represent locations in the input (0-based) and must
be in sorted order with respect to the lexicographic ordering of the
suffixes they point to.
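The output requirement can be sketched as a small in-memory check. This is an assumption-laden illustration: the name `is_valid_suffix_array` is hypothetical, the check works on Python values rather than on files in the sequence format, and it assumes the output must contain every input position exactly once.

```python
def is_valid_suffix_array(text: str, sa: list[int]) -> bool:
    """Check that sa lists every 0-based position of text exactly once
    and that the suffixes they point to are in lexicographic order."""
    n = len(text)
    # Assumed requirement: the output is a permutation of 0..n-1.
    if sorted(sa) != list(range(n)):
        return False
    # Suffixes of one string are pairwise distinct, so consecutive
    # entries must be strictly increasing.
    return all(text[sa[k]:] < text[sa[k + 1]:] for k in range(n - 1))


print(is_valid_suffix_array("banana", [5, 3, 1, 0, 4, 2]))  # True
print(is_valid_suffix_array("banana", [0, 1, 2, 3, 4, 5]))  # False
```

Comparing full suffixes costs time proportional to their length, so this sketch suits small test strings only, not inputs of the sizes listed below.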
Default Input Distributions
One of the inputs is synthetic and the other three are taken from real
sources. The weight given to each distribution (shown in parentheses)
differs because the inputs differ in length.
- (20) A trigram string of length n=10,000,000.
  trigramString <n> <filename>
- (6) chr22.dna is a DNA sequence. It consists only of the
  characters A, C, G, T, N and has about 34 million characters.
- (1) etext99 is text from Project Gutenberg. It has
  about 105 million characters.
- (1) wikisamp.xml is a sample from Wikipedia's XML source files. It has
  exactly 100 million characters.
This project has been funded by the following sources:
Intel Labs Academic Research Office for the Parallel Algorithms for Non-Numeric Computing Program,
National Science Foundation, and
IBM Research.