In this subsection, we evaluate the performance of our algorithm on several applications that contain indirect array references. Two of these cases (CG and SPARSPAK) are sparse-matrix codes, where the indirect references occur while mapping between the sparse matrix and the dense storage array. IS is an integer sort program that uses the ``bucket sort'' algorithm; in this case, the indirect references occur while computing the number of occurrences of each integer in a dense histogram-style array. Finally, MP3D is a rarefied-air particle simulator, where the indirect references occur when the (x,y,z) coordinates of a particle are used to look up the matching ``space cell'' in a 3-dimensional matrix.
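As a concrete illustration of the access pattern these codes share, the sketch below shows an indirect (gather-style) reference as it might appear in a sparse matrix-vector inner product; the function and variable names are illustrative, not taken from any of the benchmarks.

```c
#include <stddef.h>

/* Illustrative sketch of the indirect-reference pattern: a dense value
 * array is paired with an index array, and the index array is used to
 * gather from a second array. values[i] is an affine reference;
 * x[col_idx[i]] is the indirect reference. */
double sparse_dot(const double *values, const int *col_idx,
                  const double *x, size_t nnz)
{
    double sum = 0.0;
    for (size_t i = 0; i < nnz; i++) {
        sum += values[i] * x[col_idx[i]];  /* indirect load of x */
    }
    return sum;
}
```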
Figure shows the results of our experiment. For each
application, we show three performance bars: (i) the original case without
prefetching (N), (ii) the case where we only attempt to prefetch
dense (i.e. affine) references (D), and (iii) the case where we
prefetch both dense and indirect references (B).
We see in Figure that prefetching significantly improves the
performance of each application, with over two thirds of memory stall time
eliminated, and speedups ranging from 10% to 100%. The interesting question,
however, is how much of this gain came from prefetching indirect references.
The answer to this varies across the applications. In the case of IS, roughly
half of the benefit was from prefetching the indirect references. At the other
extreme, SPARSPAK saw no additional benefit from prefetching indirect
references. This may seem a bit surprising, given that SPARSPAK is a
sparse-matrix application. However, in SPARSPAK the elements of the sparse
matrix are actually stored in a dense matrix, and the indirect references
are to pointers into the appropriate rows or columns of the matrix. Therefore
the data set size of the indirect references is relatively small, and tends to
fit into the primary cache.
Another interesting issue with prefetching indirect references is
instruction overhead. In Table , we
isolate the average instruction overhead for prefetching only the indirect
references. Notice that these numbers are generally larger than the numbers
we saw earlier in Table . The reason for this is
that prefetching an affine reference can take as little as a single
instruction (the prefetch instruction itself), since the prefetch address
can often be generated by simply changing the constant offset with respect
to another load or store. However, computing the prefetch address for an
indirect reference requires a minimum of four instructions: (1) a load, to load the index value; (2) a shift, to convert the index to
the number of bytes in an array element; (3) an add, to add the byte
offset to the starting address of the array; and (4) the prefetch
instruction itself. In the case of SPARSPAK, our compiler achieved this
bare minimum. For IS and CG, the numbers are also respectable. The case
that stands out is MP3D, where generating an indirection prefetch address
requires an average of 25 instructions. The reason for this is that MP3D
must dereference three index arrays (x, y, and z)
to compute each indirect reference, and in addition these index values must
be converted from floating-point to integer values.
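The four-step address computation described above can be sketched in C using GCC/Clang's `__builtin_prefetch`; this is an illustration, not the paper's compiler output, and the prefetch distance `PREFETCH_DIST` and all names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DIST 16  /* assumed software-pipelining distance */

/* Sketch of the minimum four-instruction sequence for prefetching the
 * indirect reference a[idx[i]], PREFETCH_DIST iterations ahead. */
void scale_indirect(double *a, const int *idx, size_t n, double c)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n) {
            int k = idx[i + PREFETCH_DIST];         /* (1) load the index value   */
            uintptr_t off = (uintptr_t)k << 3;      /* (2) shift: 8-byte doubles  */
            const char *p = (const char *)a + off;  /* (3) add the array base     */
            __builtin_prefetch(p, 1);               /* (4) the prefetch itself    */
        }
        a[idx[i]] *= c;  /* the indirect reference being prefetched */
    }
}
```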
In theory, it may be possible to reduce the prefetching instruction overheads further by saving the indirection address in a register between the time it is computed for the prefetch and the time it is used for the load or store. However, there are two complications with this optimization. First, the compiler must be absolutely certain that the index value will not change during this interval. Otherwise, not only will the wrong location be prefetched (which only hurts performance), but more importantly the wrong location will be used by the load or store, and therefore the program will produce incorrect results. Second, storing the address values between the prefetch and the load or store will consume additional registers, which may result in the large overheads of register spilling. To avoid both of these problems, we chose to simply re-compute the indirect addresses each time they are needed in our compiler implementation.
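The rejected optimization can be sketched as follows. This is an illustrative C sketch (not the paper's implementation), modeling the saved registers with a small circular buffer; the distance `DIST` and all names are assumptions, and the comment marks the correctness hazard described above.

```c
#include <stddef.h>

#define DIST 8  /* assumed prefetch distance */

/* Sketch of saving each indirect address between the prefetch and the
 * store, rather than recomputing it at the store. */
void scale_saved(double *a, const int *idx, size_t n, double c)
{
    double *saved[DIST];
    /* prologue: compute and prefetch the first DIST addresses */
    for (size_t i = 0; i < DIST && i < n; i++) {
        saved[i] = &a[idx[i]];
        __builtin_prefetch(saved[i], 1);
    }
    for (size_t i = 0; i < n; i++) {
        double *q = saved[i % DIST];        /* address saved DIST iterations ago */
        if (i + DIST < n) {
            double *p = &a[idx[i + DIST]];  /* computed once, for the prefetch */
            __builtin_prefetch(p, 1);
            saved[i % DIST] = p;            /* saved rather than recomputed */
        }
        /* HAZARD: correct only if idx[i] has not changed since its address
         * was saved; otherwise this store corrupts the wrong element. */
        *q *= c;
    }
}
```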
Overall, the gains in reduced memory stalls were large enough to outweigh
the increased instruction count in all cases. Although the performance
gains for MP3D are small in the uniprocessor case, we will see later in
Section that the gains are large in the
multiprocessor case. The bottom line is that through a simple extension of
the core algorithm, our compiler is now able to prefetch both affine and
indirect array references, which covers a large fraction of array-based
scientific and engineering applications.