Although incorporating dynamic feedback information into the compilation process gives the compiler more information to reason with, it has a few shortcomings. First, the feedback process itself is somewhat cumbersome, since it requires that the program be compiled twice. Also, depending on the level of detail of the information being collected, executing the code to collect that information may be quite slow. Another difficulty is ensuring that the captured dynamic profile is representative of all input data sets. This is a concern because optimizations tailored to one particular input data set may actually hurt performance on other data sets. In some cases it may not be possible to find a single ``representative'' data set, especially if the cache behavior depends critically upon whether or not the data set fits in the cache. This is particularly problematic when the problem size is determined at run-time, in which case it may be impossible to generate a single static piece of code that is appropriate for all input data sets. A similar problem arises in procedures such as library routines, where it is impossible to analyze the calling arguments in all situations.
Rather than generating a single static piece of code that contains prefetching, another possibility is to generate code that adapts dynamically at run-time. This involves ``specialization'' of the prefetching code. The good news is that there are typically only a small number of different cases to specialize for: usually the data either should or should not be prefetched. Therefore, when the compiler is uncertain about which case to generate, it can generate both cases and choose the appropriate one to execute at run-time. The decision of which version to execute might be based on several different factors: (1) the problem size (e.g., loop bounds defined at run-time), (2) data alignment, in order to detect pathological cache conflict cases, or (3) hardware miss counters.
For example, Figure shows a case where
checking the problem size is useful. The temporal reuse of the A[j]
reference in Figure
(a) may or may not
result in temporal locality, depending on how large n is relative
to the cache size. Figure
(b) illustrates
how the value of n can be checked at run-time to select the
appropriate strategy for prefetching A[j].
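As a rough illustration of such a run-time version check (this is only a sketch, not the code from the figure; the compute() routine, the CACHE_SIZE and PF_DIST constants, and the prefetch() macro are all assumed names), the compiler could emit something along the following lines:
\begin{verbatim}
#include <stddef.h>

#define CACHE_SIZE (8 * 1024)    /* assumed cache capacity (bytes)           */
#define PF_DIST    16            /* assumed prefetch distance (iterations)   */
#define prefetch(p) __builtin_prefetch(p)  /* non-faulting hint; overshooting
                                              the array end is harmless      */

void compute(double *A, double *B, int m, int n)
{
    int i, j;

    if ((size_t)n * sizeof(double) <= CACHE_SIZE) {
        /* A[0..n-1] fits in the cache: prefetch it only during the
         * first outer iteration; later iterations reuse it.         */
        for (j = 0; j < n; j++) {
            prefetch(&A[j + PF_DIST]);
            B[0] += A[j];
        }
        for (i = 1; i < m; i++)
            for (j = 0; j < n; j++)
                B[i] += A[j];
    } else {
        /* A does not fit, so the temporal reuse yields no locality:
         * prefetch A[j] on every outer iteration.                   */
        for (i = 0; i < m; i++)
            for (j = 0; j < n; j++) {
                prefetch(&A[j + PF_DIST]);
                B[i] += A[j];
            }
    }
}
\end{verbatim}
The test itself is executed only once per invocation, so its cost is negligible compared with the prefetch instructions it saves when A fits in the cache.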
Another problem that can potentially be detected dynamically is
pathological cache interference between array references. When this occurs,
the associated prefetches may also be ineffective, so it may be best to
avoid the instruction overhead of issuing these useless prefetches. This
situation can occur when two array references have similar affine access
functions, such as references A[i] and B[i] in Figure
(a), and when the difference in base
addresses is a multiple of the cache size.
Figure
(b) illustrates code that adapts to
this situation at run-time.
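A sketch of such an alignment check is shown below (again, the names and constants are our own assumptions, not the figure's code); the base addresses are compared modulo the cache size, and the prefetches are skipped when the two streams map to the same cache lines:
\begin{verbatim}
#include <stdint.h>

#define CACHE_SIZE (8 * 1024)    /* assumed direct-mapped cache size (bytes) */
#define LINE_SIZE  32            /* assumed cache line size (bytes)          */
#define PF_DIST    16
#define prefetch(p) __builtin_prefetch(p)

double sum_two(double *A, double *B, int n)
{
    /* If the base addresses differ by (nearly) a multiple of the cache
     * size, A[i] and B[i] keep evicting each other, so prefetching
     * either stream is useless.                                       */
    uintptr_t a = (uintptr_t)A, b = (uintptr_t)B;
    uintptr_t delta = (a > b ? a - b : b - a) % CACHE_SIZE;
    int conflict = (delta < LINE_SIZE) || (delta > CACHE_SIZE - LINE_SIZE);

    double sum = 0.0;
    int i;

    if (conflict) {
        /* Pathological interference: skip the useless prefetches.     */
        for (i = 0; i < n; i++)
            sum += A[i] + B[i];
    } else {
        /* Normal case: prefetch both streams ahead of their use.      */
        for (i = 0; i < n; i++) {
            prefetch(&A[i + PF_DIST]);
            prefetch(&B[i + PF_DIST]);
            sum += A[i] + B[i];
        }
    }
    return sum;
}
\end{verbatim}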
Finally, Figure illustrates how hardware
miss counters can be used to dynamically detect whether data is already in
the cache. These miss counters keep a running total of primary cache
misses. It may be useful to have four miss counters, with each one devoted
to counting one of the following types of misses: (i) loads, (ii) stores,
(iii) prefetches, and (iv) exclusive-mode prefetches. Keeping separate
counts is useful because it gives the software more precise information. In particular, loads and stores hitting in the cache is a good thing, whereas prefetches hitting in the cache is a bad thing, since it means those prefetches were unnecessary and their instruction overhead was wasted. Miss counters should not be too expensive to
implement since they only consume a small amount of chip area and should
not affect cycle time.
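Although we do not prescribe a particular software interface to these counters, a hypothetical one (all of the names below are assumptions, introduced only so that the later sketches have something concrete to call) might look like the following:
\begin{verbatim}
/* Hypothetical counter interface -- names are assumptions only.      */
typedef enum {
    MISS_LOAD,            /* primary-cache load misses                 */
    MISS_STORE,           /* primary-cache store misses                */
    MISS_PREFETCH,        /* prefetches that miss in the cache         */
    MISS_EXCL_PREFETCH    /* exclusive-mode prefetches that miss       */
} miss_counter_t;

unsigned long read_miss_counter(miss_counter_t which);   /* running total */
void          reset_miss_counter(miss_counter_t which);  /* clear to zero */
\end{verbatim}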
As we see in Figure , the run-time
overhead of using miss counters can be minimized as follows. First, reset the counter before executing a loop. Then, after executing some number of loop iterations, stop and check the counter to see whether the data has been hitting in the cache so far. If so, do not prefetch the data for the remaining loop
iterations. Otherwise, continue prefetching. Assuming there are a
reasonable number of loop iterations (the software can also test for this),
the relative overhead of this check should be quite small. One way to
detect misses is to start off the first several iterations without
prefetching, and then check to see how many load or store misses occur. The
drawback of this approach is that it does not eliminate the miss latency of
those first several iterations. Perhaps a better approach, as illustrated
in Figure
(b), is to start off issuing
prefetches for the first several iterations, and then test to see whether
the prefetches have been hitting in the cache. If they have, assume
the data is already in the cache and do not issue further prefetches.
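Using the hypothetical counter interface sketched above, this second scheme might be coded roughly as follows (CHECK_ITERS, PF_DIST, and the miss threshold are assumed values, and the routine itself is only an illustration):
\begin{verbatim}
#define CHECK_ITERS 16            /* assumed length of the test prologue */
#define PF_DIST     16
#define prefetch(p) __builtin_prefetch(p)
/* read_miss_counter()/reset_miss_counter(): hypothetical interface above */

double sum_array(double *A, int n)
{
    double sum = 0.0;
    int i, split = (n > CHECK_ITERS) ? CHECK_ITERS : n;

    reset_miss_counter(MISS_PREFETCH);

    /* Prologue: issue prefetches during the first few iterations.     */
    for (i = 0; i < split; i++) {
        prefetch(&A[i + PF_DIST]);
        sum += A[i];
    }

    /* If those prefetches mostly hit in the cache, the data is
     * already resident, so stop prefetching for the remainder.        */
    if (read_miss_counter(MISS_PREFETCH) <= 1) {
        for (; i < n; i++)
            sum += A[i];
    } else {
        for (; i < n; i++) {
            prefetch(&A[i + PF_DIST]);
            sum += A[i];
        }
    }
    return sum;
}
\end{verbatim}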
While Figure shows how miss counters can
be used within a single-level loop nest, their run-time overhead can be
reduced even further by using them across outer loop iterations in
multi-level loop nests. For example,
Figure
demonstrates how miss counters
can be used to adapt to temporal locality along an outer loop.
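For instance, rather than splitting the inner loop, the counter can be consulted just once per outer-loop iteration; a sketch of this (again using the hypothetical counter interface, with an assumed miss threshold) is:
\begin{verbatim}
#define PF_DIST 16
#define prefetch(p) __builtin_prefetch(p)
/* read_miss_counter()/reset_miss_counter(): hypothetical interface above */

void sweep(double *A, double *B, int m, int n)
{
    int i, j;
    int do_prefetch = 1;     /* always prefetch on the first outer pass */

    for (i = 0; i < m; i++) {
        if (do_prefetch) {
            reset_miss_counter(MISS_PREFETCH);
            for (j = 0; j < n; j++) {
                prefetch(&A[j + PF_DIST]);
                B[i] += A[j];
            }
            /* If the prefetches mostly hit, A[] is staying in the
             * cache across outer iterations: stop prefetching it.
             * (n/16 is an assumed threshold, well below the misses
             * an out-of-cache pass would produce.)                    */
            if (read_miss_counter(MISS_PREFETCH) < n / 16)
                do_prefetch = 0;
        } else {
            for (j = 0; j < n; j++)
                B[i] += A[j];
        }
    }
}
\end{verbatim}
Here the check is amortized over an entire pass of the inner loop, so its relative overhead is even smaller than in the single-level case.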
Having described how hardware miss counters can be used, we will now show
experimental results for two different benchmarks: BCOPY and LU. We start
with BCOPY, which is a ``block copy'' library routine for copying a block
of data from one location to another. Although BCOPY is a very simple
routine, it is interesting for two reasons. First, since it is a library
routine, the compiler cannot make any assumptions about the input
parameters and cannot analyze the call sites for locality. Second, since
BCOPY is frequently called by the operating system to move data around,
improving BCOPY may have a significant impact on system performance. We
rewrote BCOPY by hand to take advantage of hardware miss counters, similar
to the code in Figure . We added hardware
miss counters to our simulator, and made them visible to the software
through special function calls. We used a simple workload to drive BCOPY,
since we were mainly interested in testing the ends of the spectrum. The
workload consisted of a loop which repeatedly called BCOPY with two
distinct arrays and a given block size as arguments. Since identical block
copies are performed on subsequent iterations, there can potentially be a
temporal locality benefit if the block can remain in the cache. Figure
shows the performance of BCOPY using various
block sizes and loop iteration counts.
As we see in the ``500x10'' case (where BCOPY is called ten times with
500-byte blocks) in Figure , the original code
without prefetching suffers a significant amount of miss latency. These
misses occur the first time the routine is called, since the data remains
in the cache for subsequent calls. As we see from the middle bar, the code
which statically prefetches the blocks all the time actually performs worse
than the original case, due to the large instruction overhead of the many
unnecessary prefetches. The dynamically adaptive code (shown as the
righthand bar) performs the best, since it recognizes that the data should
be prefetched the first time, but not on subsequent calls. The other two
cases in Figure
show larger block sizes that do
not fit in the cache. In these cases it is best to prefetch all the time,
so stopping to check the miss counters is pure overhead. However, we see
that this overhead is small enough that it has no noticeable impact on
performance. Therefore the adaptive code provides the benefit of
eliminating unnecessary prefetch overhead without introducing any
significant cost.
The other application we modified to exploit hardware miss counters was LU,
which performs parallel LU-decomposition on dense matrices. In LU, the same
procedure is called repeatedly to apply a pivot column to other columns. If
the columns are large relative to the cache size, then it is best to
prefetch the pivot column each time it is referenced. On the other hand, if
a column can fit in the cache, then the pivot column will only suffer
misses the first time it is referenced. However, since this code is in a
separate procedure, and since our compiler does not perform procedure cloning [15], the only static option is to prefetch the
column all the time. Once again, we modified this procedure by hand,
similar to BCOPY, to use hardware miss counters. The results of our
experiments with LU are shown in Figure .
We ran LU with both large and small problem-size-to-cache-size ratios. The
lefthand side of Figure shows the small ratio,
meaning that columns tend to fit in the cache. In this case, the static
approach of prefetching the columns all the time suffers from a significant
amount of instruction overhead. In contrast, the adaptive code eliminates
much of this unnecessary overhead while still hiding much of the memory
latency, thereby resulting in the best overall performance. Looking at the
righthand side of Figure
, where the data size is
large relative to the cache, we see the interesting result that the
adaptive code still offers the best performance. This is because LU
performs a triangular matrix solve, and therefore the last several columns
always tend to fit in the cache. The savings of not issuing unnecessary
prefetches for these last several columns more than offsets the instruction
overhead of the dynamic test in this case. Therefore the adaptive code
is clearly the best choice for LU.