On a normal load miss, the data line is brought into all levels of the cache hierarchy, including the primary cache, which makes sense because the processor stalls waiting for the data. On a prefetch miss, however, it is less obvious that the data should be placed in the primary cache, for two reasons. First, the prefetched data may displace other data that will be referenced before the prefetched data, thereby causing an additional miss rather than eliminating one. Second, prefetch fills add a new source of contention, since the processor cannot execute loads or stores at the same time that the primary cache is being filled with prefetched data. In our experiments so far, we modeled this effect by assuming that four cycles of exclusive access are needed for each prefetch fill; therefore a load or store may be delayed by up to four cycles.
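The following sketch (illustrative only, not the actual simulator code) captures this contention assumption; the four-cycle fill window is the only parameter taken from the text above.

\begin{verbatim}
/* Minimal sketch (not the actual simulator code) of the contention
 * assumption described above: a prefetch fill holds the primary-cache
 * tags for FILL_CYCLES cycles, and a load or store issued during that
 * window is delayed until the fill completes, i.e. by at most
 * FILL_CYCLES cycles. */

#define FILL_CYCLES 4   /* exclusive-access time per prefetch fill */

/* Cycles that a load/store issued at access_cycle stalls, given a
 * prefetch fill that began at fill_start. */
static long fill_stall(long access_cycle, long fill_start)
{
    long fill_end = fill_start + FILL_CYCLES;
    if (access_cycle >= fill_start && access_cycle < fill_end)
        return fill_end - access_cycle;   /* 1 .. FILL_CYCLES cycles */
    return 0;                             /* no overlap, no delay    */
}
\end{verbatim}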
Although we have just discussed two potential downsides of prefetching into the primary cache, the upside, of course, is that the latency of a primary-to-secondary miss can be hidden. Note that these downsides are not as applicable to prefetching into the secondary cache, since (i) the secondary cache is quite large (and therefore the chance of displacing important data is small), and (ii) the processor does not actively contend for the secondary cache tags. Therefore there is little doubt that prefetched data should be placed in the secondary cache. Consequently, the key question is whether or not to also place the data in the primary cache.
To answer this question, we will use both analytical and experimental approaches. Let us begin with an analytical approach of comparing expected costs and benefits. The benefit of prefetching into the primary cache is that for each successful prefetch ($N_{\mathrm{success}}$), the latency of a primary-to-secondary miss ($L_{1\rightarrow2}$) is eliminated:

\[
  \mathrm{Benefit} = N_{\mathrm{success}} \times L_{1\rightarrow2}
  \tag{1}
\]

This includes both cases where the prefetched data is found in the secondary cache (in which case $L_{1\rightarrow2}$ represents the entire miss latency), and cases where the prefetched data is found in memory (in which case $L_{1\rightarrow2}$ is only a fraction of the total primary-to-memory miss latency).
The cost of prefetching into the primary cache includes the effects of both increased conflicts and contention:

\[
  \mathrm{Cost} = N_{\mathrm{conflict}} \times L_{1\rightarrow2} + N_{\mathrm{fill}} \times T_{\mathrm{fill}}
  \tag{2}
\]

Additional conflicts occur whenever prefetches displace useful data, where $N_{\mathrm{conflict}}$ is the number of such conflicts, and each conflict results in an extra primary-to-secondary miss ($L_{1\rightarrow2}$) to later fetch the displaced data. Additional primary cache tag contention can occur during prefetch fills, where $N_{\mathrm{fill}}$ is the number of prefetch fills, and $T_{\mathrm{fill}}$ is the average number of cycles that a prefetch fill stalls the processor. In practice, $T_{\mathrm{fill}}$ should be less than the total fill time, since the processor only stalls if it attempts to execute loads or stores during the fill.
To simplify the algebra in order to directly compare the costs and benefits, we will make the following substitutions into equations (1) and (2). First, we will express the number of successful prefetches ($N_{\mathrm{success}}$) in equation (1) as the difference between the cases where the prefetched data is brought into the cache early enough ($N_{\mathrm{early}}$) and the cases where that data is displaced before it can be referenced ($N_{\mathrm{displaced}}$), hence $N_{\mathrm{success}} = N_{\mathrm{early}} - N_{\mathrm{displaced}}$. Next, we will conservatively represent the number of cases where useful data is displaced from the cache ($N_{\mathrm{conflict}}$) in equation (2) as the number of cases where prefetches are displaced ($N_{\mathrm{displaced}}$), since in the worst case, each displaced prefetch will also displace useful data. Finally, we replace the number of prefetch fills ($N_{\mathrm{fill}}$) in equation (2) with $N_{\mathrm{early}}$, the total number of prefetches that arrive early enough in the cache. After making these substitutions, we note that prefetching into the primary cache is worthwhile whenever the following inequality holds:

\[
  (N_{\mathrm{early}} - N_{\mathrm{displaced}}) \times L_{1\rightarrow2} \;>\; N_{\mathrm{displaced}} \times L_{1\rightarrow2} + N_{\mathrm{early}} \times T_{\mathrm{fill}}
  \tag{3}
\]
By rewriting equation (3), we can express the maximum value of $T_{\mathrm{fill}}$ (the average processor stall on a prefetch fill) at which prefetching into the primary cache is advantageous:

\[
  T_{\mathrm{fill}} < L_{1\rightarrow2} \times \left( 1 - \frac{2 \, N_{\mathrm{displaced}}}{N_{\mathrm{early}}} \right)
  \tag{4}
\]

Equation (4) shows that in the absence of primary cache conflicts due to prefetches (i.e. $N_{\mathrm{displaced}} = 0$), it is worthwhile to prefetch into the primary cache as long as the average stall time due to primary cache tag contention ($T_{\mathrm{fill}}$) is less than the primary-to-secondary miss latency ($L_{1\rightarrow2}$). This condition should normally be satisfied, since the time to fill the primary cache is almost always a subset of the time it takes to service a primary-to-secondary miss. As the number of primary cache conflicts induced by prefetching ($N_{\mathrm{displaced}}$) increases, the maximum acceptable value of $T_{\mathrm{fill}}$ will decrease.
Since primary cache conflicts are an important concern, we rewrite equation (3) once again, this time solving for $N_{\mathrm{displaced}}/N_{\mathrm{early}}$:

\[
  \frac{N_{\mathrm{displaced}}}{N_{\mathrm{early}}} < \frac{1}{2} \left( 1 - \frac{T_{\mathrm{fill}}}{L_{1\rightarrow2}} \right)
  \tag{5}
\]

Notice from equation (5) that in the absence of primary cache tag contention (i.e. $T_{\mathrm{fill}} = 0$), the break-even point occurs when half of the prefetches arriving in the cache early enough ($N_{\mathrm{early}}$) cause cache conflicts. In this case, half of the prefetches would be successful (thereby eliminating a cache miss), while the other half would displace useful data (thereby adding a new cache miss), and both effects would exactly cancel each other out. As the amount of contention ($T_{\mathrm{fill}}$) increases, it decreases the acceptable number of conflicts ($N_{\mathrm{displaced}}$).
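Before plugging in concrete parameter values, here is a back-of-envelope check of the model (a sketch of the arithmetic only, not from the original text; the variable names mirror the symbols defined above):

\begin{verbatim}
#include <stdio.h>

/* Back-of-envelope check of the cost/benefit model above.
 *   n_early     : prefetches that arrive in the primary cache early enough
 *   n_displaced : prefetched lines displaced before they are referenced
 *   t_fill      : average processor stall per prefetch fill (cycles)
 *   l_12        : primary-to-secondary miss latency (cycles)            */
static int primary_prefetch_pays_off(double n_early, double n_displaced,
                                     double t_fill, double l_12)
{
    /* Equation (1) with N_success = N_early - N_displaced. */
    double benefit = (n_early - n_displaced) * l_12;
    /* Equation (2) after substituting N_conflict = N_displaced
     * and N_fill = N_early. */
    double cost = n_displaced * l_12 + n_early * t_fill;
    return benefit > cost;                /* inequality (3) */
}

int main(void)
{
    /* E.g., with l_12 = 12 cycles and t_fill = 4 cycles (the values used
     * below), equation (5) puts the break-even displacement fraction at
     * (1 - 4/12) / 2 = 1/3, so a 30% displacement rate still pays off. */
    printf("%d\n", primary_prefetch_pays_off(100.0, 30.0, 4.0, 12.0));
    return 0;
}
\end{verbatim}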
To predict whether prefetching into the primary cache is worthwhile for the uniprocessor architecture used in Chapter , we can solve equation (5) given the parameter values of that architecture: $L_{1\rightarrow2} = 12$ cycles, and $T_{\mathrm{fill}} \le 4$ cycles. (Note that this is an upper bound on $T_{\mathrm{fill}}$, since the processor does not necessarily stall for the entire duration of a primary prefetch fill, which is 4 cycles in this case.) Substituting these values into equation (5), we get the following:

\[
  \frac{N_{\mathrm{displaced}}}{N_{\mathrm{early}}} < \frac{1}{2} \left( 1 - \frac{4}{12} \right) = \frac{1}{3}
  \tag{6}
\]
Therefore we would expect prefetching into the primary cache to be worthwhile as long as fewer than a third of the prefetches suffer primary cache conflicts. The actual values of $T_{\mathrm{fill}}$ and $N_{\mathrm{displaced}}/N_{\mathrm{early}}$ are shown for each application in Table .
We see that the values of $T_{\mathrm{fill}}$ range from 1.2 to 2.5 cycles, which is about half of the full primary fill time of four cycles. This makes sense, since loads and stores typically occur at most once every other instruction. We also see from Table  that the only application where cache conflicts potentially occur too often is CHOLSKY, where 37% of the prefetches placed in the primary cache are displaced before they can be referenced. However, since CHOLSKY also happens to suffer less than average from primary cache contention, the number of displaced prefetches is below the maximum for breaking even (which happens to be 44% for this case). Also, the number of additional misses caused by conflicting prefetches is overstated in this case, since only about half of the displaced prefetches were displaced by data references (the other half were displaced by other prefetches, and therefore did not incur additional misses); i.e., $N_{\mathrm{conflict}} \approx N_{\mathrm{displaced}}/2$ for CHOLSKY.
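As a consistency check (derived by rearranging equation (5), not taken from the table itself), a break-even fraction of 44% corresponds to an average fill stall of

\[
  T_{\mathrm{fill}} = L_{1\rightarrow2} \times (1 - 2 \times 0.44) = 12 \times 0.12 \approx 1.4 \ \mathrm{cycles},
\]

which lies toward the low end of the 1.2 to 2.5 cycle range quoted above, consistent with CHOLSKY suffering less than average from primary cache tag contention.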
Hence the analytical model predicts that for the given architecture, prefetching into the primary cache should be a performance win for each of these applications.
To evaluate whether this is true in practice, we simulated the performance of each uniprocessor application, this time prefetching only into the secondary cache. The results of that experiment are shown in Figure . As we see in this figure, prefetching into the primary cache is in fact a performance win in all cases. In seven of the thirteen cases, there is a speedup of at least 19% due to prefetching into the primary cache rather than just the secondary cache.
The marginal improvement from prefetching into the primary cache depends on the relative frequency and cost of primary misses that are satisfied by the secondary cache versus misses that go all the way to memory. If the data tends to fit in the secondary cache, then the entire benefit of prefetching will come from prefetching into the primary cache. At the other extreme, if none of the primary misses are found in the secondary cache, then the marginal improvement will be roughly the ratio of the primary-to-secondary and primary-to-memory miss latencies (which would be 12/75 = 16% for this architecture).
To help characterize where the data is being found, Table  presents a breakdown of exactly where the data is found both by the prefetch and by the subsequent reference. When the value of the ``$C_2 \rightarrow C_1$'' category (i.e. cases where the data are successfully prefetched from the secondary cache ($C_2$) into the primary cache ($C_1$)) is large relative to the sum of the ``$M \rightarrow C_1$'' and ``$M \rightarrow C_2$'' categories (i.e. cases where data are prefetched from memory ($M$) and are then found in either the primary or the secondary cache, respectively), we would expect prefetching into the primary cache to account for most of the gains. This agrees with the data in Figure . For example, GMTRY has the largest value in the ``$C_2 \rightarrow C_1$'' column (73.5%), and also shows the largest marginal speedup (45%). In contrast, the prefetched misses in TOMCATV are twice as likely to be found in memory as in the secondary cache, and therefore the marginal speedup for TOMCATV is considerably smaller (4%).
For the experiments in Figure , we did not
recompile the code to take into account the larger size of the secondary
cache relative to the primary cache (256K vs. 8K). Compiling for the
larger cache size might help reduce instruction overhead by eliminating
prefetches of data that reside in the secondary but not the primary cache.
However, notice that there are only a few cases in Figure
where the instruction overhead could potentially
offset the marginal difference in reduced memory stalls. To see whether
this would make a difference, we recompiled each program, this time
increasing the effective cache size parameter to 256 Kbytes. The
results of this experiment are shown in Figure
.
As we see from Figure , the results of
recompiling the code to take into account the larger cache size were mixed.
In five of the cases (MXM, EMIT, VPENTA, OCEAN, and EP) the performance
improved, while in six of the other cases (CFFT2D, CHOLSKY, BTRIX, TOMCATV,
CG, and MG) the performance was worse. Of the cases that did improve, the
benefit was typically small (never more than 5%). In six of the cases the
instruction overhead actually increased (due to the overheads of peeling
loops to account for temporal locality), despite the fact that fewer
prefetches were issued. Of the four cases with significant potential to reduce instruction overhead (BTRIX, VPENTA, TOMCATV, and MG), only one improved: VPENTA. However, even in this case, much of the gain from reduced instruction overhead was offset by increased memory stall time, since a number of the prefetches that were eliminated were in fact useful for reducing latency. In the case of BTRIX, we see that compiling for a larger cache size does not reduce the instruction overhead (it merely eliminates some useful prefetches), and given that these prefetches are being issued, it is clearly preferable that they do something useful (i.e. prefetch data into the primary cache).
Part of the reason for these mixed results is that it is more difficult to analyze locality in a very large cache, since the approximations inherent in using a fully-associative cache model in the locality analysis become more exaggerated. Also, in many cases the loop bounds are symbolic, and in those cases it does not matter how large the finite cache parameter is; all that matters is the policy for dealing with unknown loop bounds. Since it is unlikely that the compiler will be able to avoid prefetching data that is already in the secondary cache, we may as well get some performance benefit from these prefetches by moving the data into the primary cache. As secondary caches continue to grow, this should become even more true. Ultimately, however, it is the difference in memory stall reduction, not the instruction overhead, that shows that prefetching into the primary cache is the right thing to do.
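To make the symbolic-loop-bound point concrete, consider the following hypothetical kernel (illustrative only, not taken from any of the benchmark codes):

\begin{verbatim}
/* Hypothetical kernel (illustrative only).  Because n is symbolic, the
 * locality analysis cannot tell whether b[0..n-1] still resides in the
 * cache when it is reused on the next iteration of the j loop, so the
 * value of the finite-cache-size parameter (8 Kbytes vs. 256 Kbytes)
 * has no effect here; only the policy for unknown loop bounds matters. */
void sweep(double *a, const double *b, long n)
{
    for (long j = 0; j < n; j++)          /* n unknown at compile time */
        for (long i = 0; i < n; i++)
            a[j * n + i] += b[i];         /* b[i] reused across j      */
}
\end{verbatim}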
Although the results presented in this section are specific to this architecture, we would expect the conclusions to remain applicable in the future for the following reasons. First, the average stall due to primary cache tag contention ($T_{\mathrm{fill}}$) should remain smaller than the primary-to-secondary miss latency ($L_{1\rightarrow2}$). Also, as cache sizes continue to increase, there may be fewer primary cache conflicts ($N_{\mathrm{displaced}}$), and more of the data set may be captured by the secondary cache, making it more important to prefetch primary-to-secondary cache misses. Therefore it is likely that prefetching into the primary cache will remain a performance win.