Dynamic information can be exploited at compile time through feedback from earlier runs. The two types of feedback that may be useful for prefetching are control-flow feedback and memory behavior feedback, which we briefly discuss in turn.
Control-flow feedback (also known as ``branch profiling'') records how frequently each control path in the program is executed, thereby measuring branch outcome frequencies. Such information is often used in aggressive instruction-scheduling algorithms where it is important to schedule code beyond branches [75]. Control-flow feedback would be useful in our prefetching compiler algorithm (described in Chapter ) when computing the localized iteration space for loop nests that have symbolic loop bounds. In these cases, the number of loop iterations is not a compile-time constant, so it is difficult to predict whether the volume of data referenced by a loop exceeds the effective cache size. With control-flow feedback, the compiler can take the average number of loop iterations observed in previous runs into account when making this decision. Control-flow feedback might also be useful for loops that contain conditional statements. For example, the number of iterations to software-pipeline prefetches ahead depends upon the expected number of instructions in a single loop iteration (see Section ). If the loop contains a conditional statement, we currently schedule for the worst case (i.e. the shortest path). However, if the branch outcome almost always favors a much longer path, the compiler may want to schedule for this longer path instead. Control-flow information is relatively inexpensive to collect, since it simply involves instrumenting the code to count basic block executions.
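To make the scheduling decision concrete, the prefetch distance is the miss latency divided by the estimated time per loop iteration, rounded up. The following sketch shows how a profile-weighted path length could replace the shortest-path estimate; the function names and the cycle-based model are illustrative assumptions, not the actual implementation.

    #include <math.h>

    /* How many iterations ahead to software-pipeline prefetches.
       Assumes cycles_per_iter > 0; the latency and cycle estimates
       are illustrative assumptions. */
    int prefetch_distance(double miss_latency_cycles, double cycles_per_iter)
    {
        /* Issue prefetches early enough that they complete before use. */
        return (int)ceil(miss_latency_cycles / cycles_per_iter);
    }

    /* Without feedback we must assume the worst case (the shortest path);
       with feedback we can weight each path through the loop body by its
       profiled frequency. */
    double expected_cycles_per_iter(const double *path_cycles,
                                    const double *path_freq, int npaths)
    {
        double sum = 0.0;
        for (int i = 0; i < npaths; i++)
            sum += path_freq[i] * path_cycles[i];
        return sum;
    }

For instance, with a 100-cycle miss latency, scheduling for a 10-cycle shortest path means prefetching 10 iterations ahead, whereas a dominant 25-cycle path would require only 4.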
Feedback on memory behavior helps identify which references in the code suffer the most from memory latency. This information can be collected at various levels of granularity. For example, the Mtool utility [28] collects information at the granularity of a single loop nest, which is useful for identifying which loop nests suffer the most from latency and which are insignificant. At the more fine-grained end of the spectrum, the feedback might consist of precise miss rates for each reference in the program. One challenge in collecting any type of memory behavior information is that the behavior of the memory subsystem is usually hidden from the software. Mtool collects its information either through fine-grain timers or by sampling the program counter; because of the distortion introduced by the large overheads of these techniques, it can only collect information at the level of a loop nest. To collect individual miss rates in a program, it is necessary either to simulate the memory system (which is rather costly) or to make use of user-visible hardware miss counters. However, few commercial microprocessors currently provide such miss counters, and therefore collecting individual miss rates remains a non-trivial task.
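If user-visible miss counters were available, per-region miss counts could in principle be gathered with little perturbation. The sketch below assumes a hypothetical read_miss_counter() primitive; no real processor's interface is implied.

    /* read_miss_counter() is a hypothetical primitive: it assumes the
       hardware exposes a user-readable count of primary data-cache
       misses, which few commercial microprocessors provide. */
    extern unsigned long read_miss_counter(void);

    unsigned long misses_in_region(void (*region)(void))
    {
        unsigned long before = read_miss_counter();
        region();                  /* execute the instrumented loop nest */
        return read_miss_counter() - before;
    }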
Memory behavior feedback may be useful in a number of situations. First, it may help when the localized iteration space (LIS) concept applies but the LIS has been chosen improperly, perhaps because the loop bounds are unknown or because the effective cache size is incorrect; in this case, the LIS can be adjusted to better match the observed behavior. Second, memory behavior feedback can detect cases where cache conflicts result in more misses than locality analysis predicts. Similarly, in a multiprocessor environment, it can detect cases where coherence activity causes additional misses beyond those predicted by locality analysis. Finally, it is useful for deciding whether to prefetch references outside the scope of locality analysis, such as indirect references.
To experiment with feedback, we used our simulator to collect both control-flow and fine-grain memory behavior feedback. This information was incorporated into the compiler algorithm as follows. The control-flow information was used to compute the average number of iterations for each loop (also known as the ``trip count''). As described earlier in this subsection, the compiler used this count to compute the localized iteration space whenever the loop bounds were unknown at compile time.
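Concretely, the average trip count falls out of two basic-block counts, as in the following sketch (the block names are assumptions about how the profile is organized):

    /* Average trip count from basic-block execution counts. */
    double avg_trip_count(unsigned long body_count,   /* loop-body block  */
                          unsigned long entry_count)  /* times loop entered */
    {
        /* e.g. a loop entered 100 times whose body executed 400 times
           has an average trip count of 4 */
        return entry_count ? (double)body_count / entry_count : 0.0;
    }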
The memory feedback information was used to annotate each load and store with both the number of times the reference was executed and the number of times it missed in the primary data cache, hence giving the precise miss rate. After performing locality analysis, we compare the predicted miss rate with the observed miss rate. If they agree within a certain acceptable margin, we schedule the code as usual. If they disagree, we first try to find an explanation for the miss rate that is consistent with the intrinsic data reuse. This is important because, to schedule the prefetches properly, we need to know when the misses occur.
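In outline, the check might look like the following sketch; the 5% margin and the routine names are illustrative assumptions rather than the actual values used.

    #include <math.h>

    extern void schedule_prefetches_normally(int ref_id);  /* hypothetical */
    extern void reinterpret_miss_rate(int ref_id);         /* hypothetical */

    #define MISS_RATE_MARGIN 0.05     /* acceptable margin (assumption) */

    void check_reference(int ref_id, double predicted, double observed)
    {
        if (fabs(predicted - observed) <= MISS_RATE_MARGIN) {
            schedule_prefetches_normally(ref_id);  /* prediction confirmed */
        } else {
            /* look for an explanation consistent with the intrinsic
               reuse: adjust the LIS or repartition equivalence classes */
            reinterpret_miss_rate(ref_id);
        }
    }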
For example, consider the two loop nests in Figure , and assume that from memory feedback we know the miss rate for loading A[j][k] to be 25% in both cases.
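Since the figure itself is not reproduced here, the following sketch reconstructs the two nests from the surrounding discussion; the bounds, element type, and loop bodies are assumptions, chosen only to be consistent with the reuse patterns and miss rates quoted in the text.

    #define N 512               /* illustrative bound (assumption) */
    float A[N][N];              /* four 4-byte elements per 16-byte line */
    float sum;

    /* (a) k is loop-invariant, so A[j][k] has temporal reuse along the
       i loop: with four i iterations, a 25% miss rate means each load
       misses on the first i iteration and hits on the remaining three. */
    void nest_a(int k)
    {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < N; j++)
                sum += A[j][k];          /* hypothetical body */
    }

    /* (b) k is the inner loop index, so A[j][k] has spatial reuse: with
       four elements per cache line, a 25% miss rate means a miss on
       every fourth k iteration, at cache-line boundaries. */
    void nest_b(void)
    {
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                sum += A[j][k];          /* hypothetical body */
    }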
Since the intrinsic data reuse for A[j][k] is quite different between the two cases, the interpretation of the miss rate also differs. For the code in Figure (a), where A[j][k] has temporal reuse along the i loop (k is loop-invariant), a 25% miss rate corresponds to each load of A[j][k] missing on the first iteration of i and hitting on the remaining three iterations. Therefore the compiler would peel the i loop to schedule prefetches as normal for temporal locality. In contrast, the A[j][k] reference in Figure (b) has spatial rather than temporal reuse, and therefore we would not expect the same dynamic miss behavior. Assuming there are four A[j][k] elements per cache line, the 25% miss rate would correspond to missing on every fourth iteration of the inner loop (k) as cache line boundaries are crossed. So rather than peeling the outer loop, in this case the compiler would unroll the inner loop by a factor of four to schedule the prefetches.
This example illustrates that the miss rate alone does not necessarily provide enough information to schedule prefetches properly, since fractional miss rates (i.e. between 0% and 100%) are ambiguous in terms of which dynamic references are hits and which are misses. This ambiguity arises because the information relating individual misses to when they occur is lost in the course of summarizing them as a single miss rate. Even after examining the data reuse within a loop nest, the dynamic misses may still be unclear. For example, the 25% miss rate for loading A[j][k] in Figure (a) may correspond to at least two different miss patterns. First, if the A[j][k] references are not already in the cache when the loop is executed, then all of the misses would occur during the first iteration of the i loop, as we described earlier. However, if the A[j][k] references were already in the cache (perhaps because they were referenced in an earlier loop nest), then the misses may have occurred sporadically across all of the i loop iterations, perhaps due to occasional cache conflicts with other references. Prefetches scheduled for the former case would not cover the misses of the latter. Therefore the compiler is faced with choosing the most likely explanation for the observed miss rates.
The approach we take is to first look for explanations consistent with locality analysis by adjusting the LIS and repartitioning equivalence classes as necessary. For example, if the actual miss rate is lower than expected, the compiler checks whether increasing the size of the LIS (thereby increasing the expected locality) would explain the miss rate. Similarly, it considers decreasing the LIS to explain higher-than-expected miss rates. This is particularly useful when the size of the LIS is unclear from static analysis alone. For example, given an 8 Kbyte primary data cache with 16-byte lines, it would be difficult to predict whether the LIS in Figure (a) included both surrounding loops, since the amount of data referenced in a single iteration of the outer loop is very close to the cache capacity. With only static information, the compiler might normally predict that only the inner loop was within the LIS, and hence the expected miss rate of loading A[j][k] would be 100%. However, given the observed miss rate of 25% from feedback, the compiler would increase the LIS to include both loops. The A[j][k] reference would then be predicted to have temporal locality, and its predicted miss rate would match the observed rate of 25%.
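To see how close the margin can be, note that in our reconstruction of Figure (a) sketched earlier, a single iteration of the i loop touches A[0][k] through A[N-1][k]: with N = 512 and one element per 16-byte line, that is 512 × 16 bytes = 8 Kbytes, exactly the capacity of the cache (these numbers are our illustrative assumptions). Whether the data survives until the next i iteration then depends entirely on conflicts that static analysis cannot predict, whereas the observed 25% miss rate settles the question.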
The compiler adjusts the partitioning of equivalence classes (sets of references predicted to have group locality) based on feedback as follows. First the compiler checks that the leading reference (i.e. the reference predicted to actually suffer the misses) of each equivalence class is the only one with a significant miss rate. If multiple references in the same equivalence class have significant miss rates, the class is divided accordingly. If none of the references in the equivalence class has a significant miss rate (including the leading reference), then that class is marked as one that should not be prefetched. To determine which references should not be prefetched, the compiler ranks all references in descending order of their contribution to total misses. A reference is not prefetched if it has a small miss rate and is ranked lower than the references accounting for the first 95% of the misses. This is a better approach than simply using a miss rate threshold, since it takes the frequency of reference into account. In other words, if a reference has a low miss rate but is executed so often that it accounts for most of the total misses, we still want to prefetch it. Conversely, a reference with a higher miss rate that is rarely executed, and therefore makes a negligible contribution to total misses, can safely be ignored.
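The ranking heuristic can be sketched as follows; the data structure and the small-miss-rate threshold are assumptions, not the compiler's actual representation.

    #include <stdlib.h>

    typedef struct {
        unsigned long execs;     /* times the reference was executed     */
        unsigned long misses;    /* times it missed in the primary cache */
        int           prefetch;  /* output: should this ref be prefetched? */
    } Ref;

    static int by_misses_desc(const void *a, const void *b)
    {
        unsigned long ma = ((const Ref *)a)->misses;
        unsigned long mb = ((const Ref *)b)->misses;
        return (ma < mb) - (ma > mb);    /* descending by miss count */
    }

    void mark_prefetchable(Ref *refs, int n, double small_miss_rate)
    {
        unsigned long total = 0, covered = 0;

        for (int i = 0; i < n; i++)
            total += refs[i].misses;
        qsort(refs, n, sizeof(Ref), by_misses_desc);

        for (int i = 0; i < n; i++) {
            double rate = refs[i].execs
                          ? (double)refs[i].misses / refs[i].execs : 0.0;
            /* Skip a reference only if it lies beyond the references
               covering the first 95% of all misses AND its own miss
               rate is small; frequent low-rate references still get
               prefetched. */
            int beyond_cutoff = ((double)covered >= 0.95 * (double)total);
            refs[i].prefetch = !(beyond_cutoff && rate < small_miss_rate);
            covered += refs[i].misses;
        }
    }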
To evaluate the benefit of using feedback, we modified our compiler to
automatically make use of control-flow and fine-grain memory feedback
information. We simulated each application once without prefetching, using
our simulator to automatically collect the feedback information. We then
compiled each application a second time, this time using the feedback
information. Figure shows the performance of
the one case that improved significantly using feedback: OCEAN.
The difficulty for static analysis in OCEAN is that the critical two-level loop nest is split across separate files: the outer loop is in one file, and the inner loop is inside a procedure in another file. This loop performs a nearest-neighbor SOR computation, so there is a significant amount of group locality among the references. However, since our compiler does not perform interprocedural analysis across separate files, the prefetching algorithm does not recognize the group locality due to the outer loop, and therefore issues too many prefetches. Once feedback information is available, the compiler immediately recognizes the group locality, thus avoiding the unnecessary prefetches. Interestingly enough, eliminating prefetches actually reduces the memory stall time in this case by eliminating register spilling, since the spilled references were often conflicting with other data references. OCEAN illustrates how feedback can help the compiler compensate for analyses, such as interprocedural analysis, that may be too expensive to implement in the compiler.
The main reason why feedback did not improve the other benchmarks is that static analysis alone was already quite successful for these types of codes. Profiling feedback is likely to be more useful in codes that are more difficult to analyze, such as ones containing pointers and recursive data structures. For the scientific codes we studied, however, the benefits of feedback do not appear to be large.