On a normal load miss, the data line is brought into all levels of the cache hierarchy, including the primary cache, which makes sense because the processor stalls waiting for the data. On a prefetch miss, however, it is less obvious that the data should be placed in the primary cache, for two reasons. First, the prefetched data may displace other data that will be referenced before the prefetched data, thereby causing an additional miss rather than eliminating one. Second, a new source of contention is added, since the processor cannot execute loads or stores at the same time that the primary cache is being filled with prefetched data. In our experiments so far, we modeled this effect by assuming that four cycles of exclusive access are needed for each prefetch fill; a load or store may therefore be delayed by up to four cycles.
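To make this contention model concrete, the following sketch (our own illustration, not the simulator used in these experiments) computes how many cycles loads and stores stall when they arrive during a prefetch fill, under the four-cycle exclusive-access assumption described above; all names in it are hypothetical:

\begin{verbatim}
# Illustrative sketch only: models the assumption that each prefetch fill
# holds the primary cache for 4 cycles of exclusive access, delaying any
# load or store issued during that window (by at most 4 cycles).

FILL_CYCLES = 4  # exclusive-access window per prefetch fill (assumed above)

def fill_stall_cycles(mem_op_cycles, fill_start_cycles):
    """Total cycles that loads/stores stall behind prefetch fills."""
    total = 0
    for op in mem_op_cycles:
        for start in fill_start_cycles:
            if start <= op < start + FILL_CYCLES:
                # The operation waits until the fill releases the cache.
                total += (start + FILL_CYCLES) - op
                break
    return total

# A load issued one cycle into a fill waits the remaining 3 cycles:
print(fill_stall_cycles([11], [10]))  # -> 3
\end{verbatim}

(For simplicity, this sketch ignores the fact that a stalled operation would itself shift the timing of later operations.)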
Although we have just discussed two potential downsides of prefetching into the primary cache, the upside, of course, is that the latency of a primary-to-secondary miss can be hidden. Note that these downsides are far less applicable to prefetching into the secondary cache, since (i) the secondary cache is quite large (and therefore the chance of displacing important data is small), and (ii) the processor does not actively contend for the secondary cache tags. Therefore there is little question that prefetched data should be placed in the secondary cache. Consequently, the key question is whether or not to also place the data in the primary cache.
To answer this question, we will use both analytical and experimental approaches. Let us begin with an analytical approach of comparing expected costs and benefits. The benefit of prefetching into the primary cache is that for each successful prefetch ($pf_{success}$), the latency of a primary-to-secondary miss ($L$) is eliminated:
\begin{equation}
Benefit = pf_{success} \times L \tag{1}
\end{equation}
This includes both cases where the prefetched data is found in the secondary cache (in which case $L$ represents the entire miss latency), and cases where the prefetched data is found in memory (in which case $L$ is only a fraction of the total primary-to-memory miss latency).
The cost of prefetching into the primary cache includes the effects of both increased conflicts and contention:
\begin{equation}
Cost = C \times L + F \times s \tag{2}
\end{equation}
Additional conflicts occur whenever prefetches displace useful data, where $C$ is the number of such conflicts, and each conflict results in an extra primary-to-secondary miss ($L$) to later fetch the displaced data. Additional primary cache tag contention can occur during prefetch fills, where $F$ is the number of prefetch fills, and $s$ is the average number of cycles that a prefetch fill stalls the processor. In practice, $s$ should be less than the total fill time, since the processor only stalls if it attempts to execute loads or stores during the fill.
To simplify the algebra in order to directly compare the costs and benefits, we will make the following substitutions into equations (1) and (2). First, we will express the number of successful prefetches ($pf_{success}$) in equation (1) as the difference between the cases where the prefetched data is brought into the cache early enough ($pf_{early}$) and the cases where that data is displaced before it can be referenced ($pf_{disp}$), hence $pf_{success} = pf_{early} - pf_{disp}$. Next, we will conservatively represent the number of cases where useful data is displaced from the cache ($C$) in equation (2) as the number of cases where prefetches are displaced ($pf_{disp}$), since in the worst case, each displaced prefetch will also displace useful data. Finally, we replace the number of prefetch fills ($F$) in equation (2) with $pf_{early}$, the total number of prefetches that arrive early enough in the cache. After making these substitutions, we note that prefetching into the primary cache is worthwhile whenever the following inequality holds:
\begin{equation}
(pf_{early} - pf_{disp}) \times L > pf_{disp} \times L + pf_{early} \times s \tag{3}
\end{equation}
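Before manipulating this inequality analytically, note that it is straightforward to evaluate mechanically. The sketch below is our own illustration; the argument names simply mirror the symbols $pf_{early}$, $pf_{disp}$, $L$, and $s$ defined above:

\begin{verbatim}
# A minimal sketch of the cost/benefit test in equation (3).

def worth_prefetching_into_L1(pf_early, pf_disp, L, s):
    """True when (pf_early - pf_disp)*L > pf_disp*L + pf_early*s."""
    benefit = (pf_early - pf_disp) * L  # misses eliminated, equation (1)
    cost = pf_disp * L + pf_early * s   # conflicts + contention, equation (2)
    return benefit > cost

# Example: 1000 timely prefetches, 200 displaced, L = 12 cycles, s = 2 cycles:
print(worth_prefetching_into_L1(1000, 200, 12, 2))  # -> True
\end{verbatim}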
By rewriting equation (3), we can express the maximum value of $s$ (the average processor stall on a prefetch fill) at which prefetching into the primary cache is advantageous:
\begin{equation}
s < L \times \frac{pf_{early} - 2 \cdot pf_{disp}}{pf_{early}} \tag{4}
\end{equation}
Equation (4) shows that in the absence of primary cache conflicts due to prefetches (i.e. $pf_{disp} = 0$), it is worthwhile to prefetch into the primary cache as long as the average stall time due to primary cache tag contention ($s$) is less than the primary-to-secondary miss latency ($L$). This condition should normally be satisfied, since the time to fill the primary cache is almost always a subset of the time it takes to service a primary-to-secondary miss. As the number of primary cache conflicts induced by prefetching ($pf_{disp}$) increases, the maximum acceptable value of $s$ decreases.
Since primary cache conflicts are an important concern, we rewrite equation (3) once again, this time solving for the fraction of displaced prefetches:
\begin{equation}
\frac{pf_{disp}}{pf_{early}} < \frac{L - s}{2 \times L} \tag{5}
\end{equation}
Notice from equation (5) that in the absence of primary cache tag contention (i.e. $s = 0$), the break-even point occurs when half of the prefetches arriving in the cache early enough ($pf_{early}$) cause cache conflicts. In this case, half of the prefetches would be successful (thereby eliminating a cache miss), while the other half would displace useful data (thereby adding a new cache miss), and the two effects would exactly cancel each other out. As the amount of contention ($s$) increases, the acceptable fraction of conflicts ($pf_{disp}/pf_{early}$) decreases.
To predict whether prefetching into the primary cache is worthwhile for the uniprocessor architecture used in Chapter , we can solve equation (5) given the parameter values of that architecture: $L = 12$ cycles and $s = 4$ cycles. (Note that this is an upper bound on $s$, since the processor does not necessarily stall for the entire duration of a primary prefetch fill, which is 4 cycles in this case.) Substituting these values into equation (5), we get the following:
\begin{equation}
\frac{pf_{disp}}{pf_{early}} < \frac{12 - 4}{2 \times 12} = \frac{1}{3} \tag{6}
\end{equation}
Therefore we would expect prefetching into the primary cache to be worthwhile as long as fewer than a third of the prefetches suffer primary cache conflicts. The actual values of $pf_{disp}/pf_{early}$ and $s$ are shown for each application in Table . We see that the value of $s$ ranges from 1.2 to 2.5 cycles, which is about half of the full primary fill time of four cycles. This makes sense, since loads and stores typically occur at most once every other instruction. We also see from Table that the only application where cache conflicts potentially occur too often is CHOLSKY, where 37% of prefetches placed in the primary cache are displaced before they can be referenced. However, since CHOLSKY also happens to suffer less than average from primary cache contention ($s \approx 1.4$ cycles), the number of displaced prefetches is below the maximum for breaking even (which happens to be 44% for this case). Also, the number of additional misses caused by conflicting prefetches is overstated in this case, since only about half of the displaced prefetches were displaced by data references (the other half were displaced by other prefetches, and therefore did not incur additional misses); i.e. $C \approx 0.5 \times pf_{disp}$ for CHOLSKY. Hence the analytical model predicts that for the given architecture, prefetching into the primary cache should be a performance win for each of these applications.
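As a quick sanity check on the CHOLSKY discussion, the following sketch evaluates the break-even bound of equation (5); the value $s = 1.4$ cycles is an assumption on our part, inferred from the 44% break-even figure quoted above rather than taken directly from the table:

\begin{verbatim}
# Sanity check on the CHOLSKY case using equation (5).
L = 12    # primary-to-secondary miss latency (cycles)
s = 1.4   # assumed average fill stall for CHOLSKY (cycles), inferred above

f_max = (L - s) / (2 * L)  # maximum tolerable fraction of displaced prefetches
print(f"break-even at {f_max:.0%}")  # -> 44%, above CHOLSKY's observed 37%
\end{verbatim}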
To evaluate whether this is true in practice, we simulated the performance of each uniprocessor application, this time only prefetching into the secondary cache. The results of that experiment are shown in Figure . As we see in this figure, prefetching into the primary cache is in fact a performance win in all cases. In seven of the thirteen cases, there is a speedup of at least 19% due to prefetching into the primary cache rather than just the secondary cache.
The marginal improvement from prefetching into the primary cache depends on the relative frequency and cost of primary misses that are satisfied by the secondary cache versus misses that go all the way to memory. If the data tends to fit in the secondary cache, then the entire benefit of prefetching will come from prefetching into the primary cache. At the other extreme, if none of the primary misses are found in the secondary cache, then the marginal improvement will be roughly the ratio of the primary-to-secondary and primary-to-memory miss latencies (which would be 12/75 = 16% for this architecture).
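This latter bound follows directly from the latency ratio; as a simple illustration (using the 12-cycle and 75-cycle latencies of this architecture, as quoted above):

\begin{verbatim}
# Illustration only: when every primary miss would otherwise go to memory,
# the marginal gain from also filling the primary cache is roughly the
# ratio of the two miss latencies.
L1_TO_L2_CYCLES = 12   # primary-to-secondary miss latency
L1_TO_MEM_CYCLES = 75  # primary-to-memory miss latency
print(f"{L1_TO_L2_CYCLES / L1_TO_MEM_CYCLES:.0%}")  # -> 16%
\end{verbatim}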
To help characterize where the data is being found, Table presents a breakdown of exactly where the data is found both by the prefetch and by the subsequent reference. When the value of the ``$C_2{\rightarrow}C_1$'' category (i.e. cases where the data are successfully prefetched from the secondary into the primary cache) is large relative to the sum of the ``$M{\rightarrow}C_1$'' and ``$M{\rightarrow}C_2$'' categories (i.e. cases where data are prefetched from memory, and are found by the reference in either the primary or secondary cache, respectively), we would expect prefetching into the primary cache to account for most of the gains. This agrees with the data in Figure . For example, GMTRY has the largest value in the ``$C_2{\rightarrow}C_1$'' column (73.5%), and also shows the largest marginal speedup (45%). In contrast, the prefetched misses in TOMCATV are twice as likely to be found in memory as in the secondary cache, and therefore the marginal speedup for TOMCATV is considerably smaller (4%).
For the experiments in Figure , we did not recompile the code to take into account the larger size of the secondary cache relative to the primary cache (256K vs. 8K). Compiling for the larger cache size might help reduce instruction overhead by eliminating prefetches of data that reside in the secondary but not the primary cache. However, notice that there are only a few cases in Figure where the instruction overhead could potentially offset the marginal difference in reduced memory stalls. To see whether this would make a difference, we recompiled each program, this time increasing the effective cache size parameter to 256 Kbytes. The results of this experiment are shown in Figure .
As we see from Figure , the results of recompiling the code to take into account the larger cache size were mixed. In five of the cases (MXM, EMIT, VPENTA, OCEAN, and EP) the performance improved, while in six of the other cases (CFFT2D, CHOLSKY, BTRIX, TOMCATV, CG, and MG) the performance was worse. Of the cases that did improve, the benefit was typically small (never more than 5%). In six of the cases the instruction overhead actually increased (due to the overheads of peeling loops to account for temporal locality), despite the fact that fewer prefetches were issued. Of the four cases where there was a significant potential to reduce instruction overhead (BTRIX, VPENTA, TOMCATV and MG), only one case improved: VPENTA. However, even in this case, much of the gain in reduced instruction overhead was offset by larger memory stall times, since a number of the prefetches that were eliminated were in fact useful for reducing latency. In the case of BTRIX, we see that compiling for a larger cache size does not reduce the instruction overhead (it merely eliminates some useful prefetches), and given that these prefetches are being issued, it is clearly preferable that they do something useful (i.e. prefetch data into the primary cache).
Part of the reason for these mixed results is that it is more difficult to analyze locality in a very large cache, since the approximations inherent in using a fully-associative cache model in the locality analysis become more exaggerated. Also, in many cases the loop bounds are symbolic, and in these cases it does not matter how large the finite cache parameter is; all that matters is the policy for dealing with unknown loop bounds. Since it is unlikely that the compiler will be able to avoid prefetching data already in the secondary cache, we may as well get some performance benefit from these prefetches by moving the data into the primary cache. As secondary caches continue to grow, this should become even more true. Ultimately, however, it is the difference in memory stall reduction, not instruction overhead, that shows that prefetching into the primary cache is the right thing to do.
Although the results presented in this section are specific to this architecture, we would expect the conclusions to remain applicable in the future for the following reasons. First, the average stall due to primary cache tag contention ($s$) should remain smaller than the primary-to-secondary miss latency ($L$). Also, as cache sizes continue to increase, there may be fewer primary cache conflicts ($pf_{disp}$), and more of the data set may be captured by the secondary cache, making it more important to prefetch primary-to-secondary cache misses. Therefore it is likely that prefetching into the primary cache will remain a performance win.