In this section, we study the effects of varying the cache size. Our motivation is twofold. First, we would like to build additional confidence in our results, given our cache scaling methodology. Recall from Section that in order to achieve realistic problem-size to cache-size ratios, the results presented so far in this chapter are based on an 8K/64K cache hierarchy, which is scaled down from the full 64K/256K caches in DASH. By varying the cache size, we can explore the effects of cache scaling on performance. Second, by running the same problem on varying cache sizes, we effectively change the mixture of misses due to replacements (because of the finite cache size) versus misses due to coherence (because of the communication of shared data). This allows us to evaluate how well our compiler algorithm prefetches both types of misses.
The results of our experiments are presented in Figure , where the performance of each application is shown on three different sets of cache sizes. The middle pair of bars is for the scaled cache sizes of 8K/64K that we used throughout this chapter. The rightmost pair of bars is for the full-sized 64K/256K caches in DASH. The leftmost pair of bars is for a smaller cache hierarchy of 2K/8K. We begin by focusing on how the code without prefetching performs across the different cache sizes, and then compare this with the behavior of the code with prefetching.
If we focus our attention first on the code without prefetching, we see a variety of behaviors in Figure . At one extreme is MP3D, where varying the cache size has virtually no impact on performance. This is because MP3D sweeps through a very large data set on each time step, and even the 64K/256K caches are not large enough to retain the data (each processor's particles occupy 200 Kbytes, and the space cell array is 288 Kbytes). This regime, where the data set is much larger than the cache, is expected to be the case in real-life runs of MP3D, so our results are indicative of the performance on real machines. At the other extreme are applications like LU, CHOLESKY, and LOCUS, where there is a noticeable knee in performance once a key data structure fits in the cache. For LU, this knee occurs between the 8K/64K and 64K/256K cache sizes, and corresponds to the point where the columns owned by a processor (roughly 20 Kbytes) fit in the primary cache. For both CHOLESKY and LOCUS, the knee occurs between the 2K/8K and 8K/64K cache sizes. In the case of CHOLESKY, this corresponds to an entire panel fitting in the primary cache. For LOCUS, it corresponds to the portion of the cost array where a wire is being routed fitting in the cache. Therefore we see dramatic performance swings (more than 80% improvement) as the knee is crossed in three of the five cases without prefetching.
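To make the working-set arithmetic behind these knees concrete, the following sketch checks each structure against the primary cache sizes studied. The structure sizes are the ones quoted above; the simple fit test (size no larger than the cache, ignoring associativity and conflict misses) is an assumption for illustration only.

```c
#include <stdio.h>

/* Key working-set sizes quoted in the text (bytes). */
#define LU_COLUMNS     (20 * 1024)   /* columns owned by one processor */
#define MP3D_PARTICLES (200 * 1024)  /* one processor's particles      */
#define MP3D_CELLS     (288 * 1024)  /* space cell array               */

int main(void) {
    /* Primary cache sizes for the 2K/8K, 8K/64K, and 64K/256K runs. */
    const int primary[] = { 2 * 1024, 8 * 1024, 64 * 1024 };

    for (int i = 0; i < 3; i++) {
        printf("%2dK primary: LU columns %-4s  MP3D particles %-4s  cells %s\n",
               primary[i] / 1024,
               LU_COLUMNS     <= primary[i] ? "fit" : "miss",
               MP3D_PARTICLES <= primary[i] ? "fit" : "miss",
               MP3D_CELLS     <= primary[i] ? "fit" : "miss");
    }
    return 0;
}
```

Under this simplified test, the LU columns fit only in the 64K primary cache, which is exactly where LU's knee appears, while neither MP3D structure fits in any of the configurations.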
In contrast, the performance of the code with prefetching varies more gradually over the range of cache sizes. For example, while the performance of LU without prefetching improves by 82% when the cache sizes increase from 8K/64K to 64K/256K, the performance with prefetching changes by only 7%. Similarly, the performance improvement as the cache sizes are increased from 2K/8K to 8K/64K is 156% without prefetching vs. 14% with prefetching for CHOLESKY, and 85% without prefetching vs. 37% with prefetching for LOCUS. Why is the prefetching performance less sensitive to cache size? As the cache size is reduced, we see more replacement misses in the code without prefetching. However, since our algorithm is effective at prefetching replacement misses (as demonstrated already in Chapter ), the resulting latency is hidden, making the code with prefetching less sensitive to cache size. In fact, in three of the five cases (OCEAN, LU, and MP3D), the performance of the code with prefetching on 2K/8K caches was better than that of the code without prefetching on 64K/256K caches.
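As an illustration of how prefetching hides replacement-miss latency, the following sketch shows the general style of compiler-inserted prefetching on a simple loop. It uses the GCC/Clang __builtin_prefetch intrinsic as a stand-in for the prefetch instructions the compiler emits, and the prefetch distance of 8 iterations is an assumed value; the actual algorithm derives the distance from the miss latency and loop execution time, and would also unroll the loop so that only one prefetch is issued per cache line.

```c
#include <stddef.h>

/* Illustrative prefetch distance, in loop iterations; assumed here,
 * computed from latency and loop cost in the real algorithm. */
#define PREFETCH_AHEAD 8

/* Sum a large array, issuing prefetches ahead of use so that
 * replacement misses overlap with computation regardless of
 * whether the array fits in the cache. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0 /* read */, 1);
        s += a[i];
    }
    return s;
}
```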
We now consider how cache size variations affect the relative performance improvement offered by prefetching. In general, Figure shows that prefetching improves performance more with smaller cache sizes. The obvious reason for this is that smaller caches have more latency to hide due to replacement misses, and thus there is more room for improvement. For example, prefetching offers only modest speedups for CHOLESKY and LOCUS on 8K/64K caches, but provides dramatic speedups on 2K/8K caches (139% and 48%, respectively). With these smaller caches, the prefetches for the panels in CHOLESKY and the cost array in LOCUS are almost always useful; with larger caches, these structures only miss during communication, and therefore the prefetches are not as beneficial. As evidence of this reduced benefit, the instruction overhead offsets much of the reduction in memory stalls for both CHOLESKY and LOCUS with 64K/256K caches.
Can these instruction overheads with large caches be reduced by prefetching only the coherence misses? For CHOLESKY, the answer is yes, but only if special techniques for handling procedures are used. The difficulty in CHOLESKY is that each access to a panel occurs as a separate call to the same procedure. To isolate only the coherence misses, we would like to prefetch a panel only the first time it is accessed by a given processor. The compiler cannot normally distinguish this first access across procedure boundaries, but it could with either inlining or procedure cloning [15]. (Note that a very similar situation occurs in LU.) For LOCUS, however, the answer is no: the coherence misses cannot be isolated. The reason is that coherence misses in LOCUS only occur when another processor routes a wire in the same portion of the cost array as the given processor. Since these misses are unpredictable and occur sporadically, they cannot be isolated, and therefore we must conservatively prefetch the cost array all the time.
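The following sketch shows how procedure cloning could achieve this for the CHOLESKY case: two clones of the panel-update routine, one with prefetches and one without, with the caller dispatching on whether this is the processor's first access to the panel. The routine names, the dispatch flag, and the 16-double cache line are hypothetical, not taken from the actual code.

```c
/* Hypothetical sketch of procedure cloning for CHOLESKY-style code. */

static void update_no_pf(double *panel, int words) {
    for (int i = 0; i < words; i++)      /* stand-in for the real  */
        panel[i] *= 2.0;                 /* numeric panel update   */
}

static void update_with_pf(double *panel, int words) {
    for (int i = 0; i < words; i += 16)  /* assume 16 doubles/line */
        __builtin_prefetch(&panel[i], 1 /* write */, 1);
    update_no_pf(panel, words);
}

void process_panel(double *panel, int words, int first_access) {
    if (first_access)
        update_with_pf(panel, words);    /* coherence miss likely  */
    else
        update_no_pf(panel, words);      /* panel already cached   */
}
```

With this structure, prefetches are issued only on the first visit, so the instruction overhead on subsequent (cache-hit) visits disappears.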
In summary, we have seen that the performance advantages of prefetching often remain significant even as the cache size is varied. This robustness occurs despite the fact that we conservatively assume that data does not remain in the cache across synchronization statements (see Section ). Since our algorithm effectively handles replacement misses, the performance with prefetching is relatively insensitive to the cache size, unlike code without prefetching, where performance can suffer greatly once key data structures no longer fit in the cache. Once the caches are large enough that little latency remains, it becomes increasingly important to minimize instruction overhead by prefetching only the coherence misses. However, in some cases it is impossible to isolate the coherence misses, since they cannot be predicted, and therefore the compiler is forced to be conservative. In such cases, profiling feedback may be useful, as we will discuss later in Section .