As in the uniprocessor prefetching algorithm, the goal of locality analysis
is to determine which references will hit in the cache so as to reduce
prefetching overhead. Locality analysis differs slightly for
multiprocessors because it takes explicit synchronization into account when
computing the localized iteration space, as we described earlier. To
evaluate the effectiveness of locality analysis, we perform an experiment
similar to our earlier uniprocessor experiment, comparing indiscriminate
prefetching (where array references are prefetched all the time) with
selective prefetching (which uses locality analysis). The results of this
experiment are presented in two figures: one comparing overall performance,
and one comparing prefetching statistics.
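To make the contrast between the two policies concrete, the sketch below
(our own illustration, not the compiler's actual output) shows how each
would treat a simple loop nest in which b[] fits in the cache but a[][]
does not. GCC's __builtin_prefetch stands in for a compiler-inserted
prefetch instruction, and an 8-double cache line is assumed.

    #include <stddef.h>

    #define N 1024
    double a[N][N], b[N];

    /* Indiscriminate: every reference is prefetched on every iteration,
       even b[j], which stays in the cache after the first outer iteration,
       so those prefetches are pure overhead.  (Bounds checks on prefetch
       addresses are omitted for brevity; prefetches do not fault.) */
    void scale_indiscriminate(void)
    {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                __builtin_prefetch(&a[i][j + 8]);
                __builtin_prefetch(&b[j + 8]);
                a[i][j] *= b[j];
            }
    }

    /* Selective: locality analysis predicts that b[] has temporal locality
       (it fits in the cache across outer iterations) and that a[][] has
       only spatial locality, so a[][] alone is prefetched, once per cache
       line (a real compiler would unroll rather than test j % 8). */
    void scale_selective(void)
    {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                if (j % 8 == 0)
                    __builtin_prefetch(&a[i][j + 8]);
                a[i][j] *= b[j];
            }
    }

Under the selective policy, the prefetches of b[] and the redundant
per-element prefetches of a[][] disappear; this is precisely the overhead
reduction measured below.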
The performance results show that prefetching selectively
rather than indiscriminately offers speedups ranging from 23% to 39%. As
we saw earlier in the uniprocessor results, the reduction in memory stall
time is comparable between the two schemes, but the real difference is that
selective prefetching has significantly less overhead than indiscriminate
prefetching. In two cases (CHOLESKY and LOCUS), prefetching selectively
makes the difference between a performance loss and a performance gain.
The second figure presents two useful metrics for
evaluating locality analysis: the percentage of unnecessary
prefetches, and the coverage factor. Comparing both of these metrics
between the two schemes, we see that once again selective prefetching
eliminates a significant fraction of unnecessary prefetches, while giving
up very little in terms of coverage factor. The reduction in total
prefetches ranges from a factor of 2.2 to a factor of 7.7.
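For reference, the two metrics can be written as follows (our formulation
of the definitions used informally in the text):

\[
\textrm{coverage factor} \;=\;
  \frac{\textrm{original cache misses covered by a prefetch}}
       {\textrm{total original cache misses}},
\qquad
\textrm{unnecessary prefetches} \;=\;
  \frac{\textrm{prefetches of data already in the cache}}
       {\textrm{total prefetches issued}}.
\]

An ideal scheme would drive the unnecessary-prefetch fraction to zero
while keeping the coverage factor at 100%.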
Looking at the magnitude of unnecessary prefetches remaining after
selective prefetching, we see that
three applications are below 25% (OCEAN, MP3D, and CHOLESKY), while the
other two are above 45% (LU and LOCUS). The problem in both of these
latter cases is that they contain references to important data structures
which tend to fit in the cache, but which are occasionally modified by
other processors, thus resulting in coherence misses. While it is
difficult to predict which dynamic references to these structures will
suffer misses, it is even more difficult to isolate those miss
instances; that is, techniques such as loop splitting are not applicable in
these cases. Since our algorithm is conservative about coherence misses,
it prefetches these references all the time, thus resulting in a
relatively large fraction of prefetches being unnecessary. We now
describe these two cases in more detail.
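Before doing so, it helps to see concretely why loop splitting handles a
predictable first-use miss but cannot isolate a coherence miss. The sketch
below is our own illustration, again using __builtin_prefetch as a
stand-in for a compiler-inserted prefetch.

    #include <stddef.h>

    #define T 100
    double x[1024];

    /* Predictable first-use miss: the traversals after the first one hit
       in the cache, so the loop can be split (peeled) into a prefetching
       first pass and a prefetch-free steady state. */
    double sum_repeatedly(size_t n)
    {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++) {      /* first traversal: prefetch */
            __builtin_prefetch(&x[j + 8]);
            sum += x[j];
        }
        for (int t = 1; t < T; t++)           /* later traversals: x[] hits */
            for (size_t j = 0; j < n; j++)
                sum += x[j];
        return sum;
    }

    /* With coherence misses there is no such boundary: another processor
       may invalidate arbitrary lines of x[] before any of the T traversals,
       so a conservative compiler must keep the prefetches in every one. */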
In LU, the important data structure is the pivot column. A pivot column is
produced by a single processor, and is then used repeatedly by the other
processors to modify their local columns. This process of applying a pivot
column to a local column occurs inside a procedure, which is called once
for each local column. The first time a given pivot column is applied to a
local column, the pivot column will miss in the cache, since it still
resides in the cache of the processor that produced the pivot column.
However, on subsequent calls of the same procedure with the same pivot
column, the column will hit since it is retained in the cache. Since our
prefetching algorithm does not look across procedure boundaries, it cannot
recognize the temporal locality of the pivot column, and therefore
prefetches it each time the procedure is called. Overcoming this limitation
statically would require not only interprocedural analysis, but also the
ability to either inline or clone [15] the
procedure body so that prefetches can be scheduled for only the first use
of a pivot column. Another possibility is to use dynamic information
in the form of hardware miss counters, thus allowing the procedure to adapt
dynamically to whether the pivot column is in the cache. As we will see
later, this latter method is effective at eliminating unnecessary
prefetches in LU. However, even without any of these improvements, LU
achieves the greatest speedup of all the applications (more than twofold)
despite its unnecessary prefetching overhead, since the benefit of
prefetching the pivot columns when they do miss is quite large.
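The shape of the code, reconstructed in simplified form below, makes the
limitation concrete; the identifiers and the daxpy-like kernel are our own
illustration rather than the benchmark's actual source.

    #include <stddef.h>

    /* Apply one pivot column to one local column. */
    void apply_pivot(const double *pivot, double *local, size_t n,
                     double mult)
    {
        for (size_t i = 0; i < n; i++) {
            /* The compiler analyzes this procedure in isolation, so it
               cannot tell whether the caller has already applied this
               pivot column; pivot[] is therefore prefetched on every
               call ... */
            if (i % 8 == 0)
                __builtin_prefetch(&pivot[i + 8]);
            local[i] -= mult * pivot[i];
        }
    }

    /* ... even though only the first of these calls actually misses on
       pivot[]; subsequent calls find it still in the cache. */
    void update_local_columns(const double *pivot, double **cols,
                              const double *mults, size_t ncols, size_t n)
    {
        for (size_t c = 0; c < ncols; c++)
            apply_pivot(pivot, cols[c], n, mults[c]);
    }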
In the case of LOCUS, the important data structure is the cost array, which represents the current placement of wires. Each processor's portion of the cost array tends to fit in its cache, but parts of it are occasionally invalidated as other processors rip up wires or put down new ones. Since these modifications occur erratically, the compiler cannot predict when they will occur, and simply prefetches the cost array each time it is traversed. As a result, a significant fraction of the prefetches are unnecessary, but unconditional prefetching is the only way to cover these coherence misses, and it accounts for the bulk of the reduction in memory stalls.
LOCUS has the distinction of not only having the highest fraction of
unnecessary prefetches, but also the lowest coverage factor
(75%). The reason why
the coverage factor is not higher (which is also true for CHOLESKY) is that
many of the important array references are inside while loops rather
than for loops. Since a while loop has no induction variable
(either explicit or implicit), neither locality analysis nor software
pipelining is applicable.
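The obstacle is structural, as the following generic sketch (ours, not the
benchmark's code) illustrates:

    #include <stddef.h>

    struct node { double cost; struct node *next; };

    /* A for loop has an induction variable, so software pipelining can
       issue the prefetch for a future iteration while executing the
       current one. */
    double total_for(const double *cost, size_t n)
    {
        double t = 0.0;
        for (size_t i = 0; i < n; i++) {
            __builtin_prefetch(&cost[i + 8]);  /* address known ahead of time */
            t += cost[i];
        }
        return t;
    }

    /* A while loop over a linked structure has no induction variable: the
       address needed several iterations ahead (p->next->next->...) is not
       known until the intervening loads complete, so neither locality
       analysis nor software pipelining applies. */
    double total_while(const struct node *p)
    {
        double t = 0.0;
        while (p != NULL) {
            t += p->cost;
            p = p->next;
        }
        return t;
    }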
Finally, we observe that locality analysis is successful at reducing
unnecessary prefetches while maintaining a respectable coverage factor
despite our conservative assumption that shared data does not remain in the
cache across synchronization statements. In LU and LOCUS, the remaining
unnecessary prefetches were not caused by adjusting the localized
iteration space in response to explicit synchronization (as described
earlier). The problem in LU is procedure boundaries,
and there is no synchronization in the loops of LOCUS, since the cost
array is not protected by locks. In fact, when we recompiled each
application and did not take explicit synchronization into account,
we got precisely the same performance for each application. In other words,
explicit synchronization never changed our locality predictions in
significant ways. This occurred for two reasons. First, programmers realize
that synchronization is costly and therefore avoid putting it in inner
loops. Instead, synchronization tends to happen in outer loops and between
loop nests, which makes it less likely to affect loop-level data locality.
Second, in some cases the programmers decide that the performance advantage
of not locking data structures is worth whatever distortion it causes in
the result, as is the case in both MP3D and LOCUS. Therefore, although
communication takes place, there is no explicit synchronization for our
algorithm to use as a hint. However, even if the compiler did recognize that
coherence misses were occurring in these cases, its only option would be
to prefetch all the time, since the misses occur erratically. Thus the
desired effect was achieved in our experiments despite the lack of
sophisticated analysis of communication patterns.
Overall, prefetching selectively through the use of locality analysis appears to be quite successful for both uniprocessors and multiprocessors. Once locality analysis has predicted which references to prefetch, the next step in our algorithm is to schedule those prefetches so that they are effective at hiding latency.