Many commercial RISC microprocessors have direct-mapped primary caches. This is because direct-mapped caches have faster access times than set-associative caches, and the difference in speed often outweighs the difference in miss rate for general-purpose codes. General-purpose codes (e.g., the C compiler) have relatively few mapping conflicts on direct-mapped caches because their access patterns tend to be randomly distributed throughout the cache. In contrast, scientific codes that stride through matrices (particularly ones dimensioned to powers of two) can have pathologically bad mapping conflicts. In four of the applications we studied (MXM, CFFT2D, VPENTA, and TOMCATV), the cache conflicts in the original code were so bad that we manually realigned the data to help alleviate these problems.
The cache conflict problem can be addressed either in hardware or in software. One software-based approach is to place the burden of avoiding conflicts on the programmer. Once the programmer is aware that conflicts are a problem, the solution to the problem is often quite obvious. In the applications where we fixed conflicts by hand, we simply added thirteen (an arbitrary prime number) to the size of each matrix dimension that was a power of two. This was enough to prevent elements with similar access functions in adjacent rows or matrices from mapping to the same cache entries. Although this is fairly straightforward, ideally the programmer should not have to worry about this. A more appealing software-based approach would be for the compiler to automatically realign data to fix mapping problems. However, the difficulty here is that once the data are moved, every possible reference to the data must also be adjusted appropriately. This may be particularly difficult given explicit address arithmetic, pointers, aliasing, or compilation across separate files. So although eliminating the problem in software may sound like the ideal approach, it is unclear whether this is a practical solution.
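For concreteness, this padding fix can be illustrated with a short sketch (in C rather than the Fortran of the applications themselves; the 256-element dimension and 8-byte double-precision elements are illustrative assumptions, while the 8 Kbyte cache matches our architecture):

    /* Padding a power-of-two leading dimension by a prime (13).
     * With the power-of-two dimension, rows are exactly 2 Kbytes
     * apart, so in an 8 Kbyte direct-mapped cache every fourth row
     * maps to the same lines; padding staggers the rows instead. */
    #include <stdio.h>

    #define N           256
    #define PAD         13    /* arbitrary prime added to the dimension */
    #define CACHE_BYTES 8192  /* 8 Kbyte direct-mapped primary cache    */

    static double a[N][N];        /* conflicting: 2048-byte row stride */
    static double b[N][N + PAD];  /* padded:      2152-byte row stride */

    int main(void) {
        for (int i = 0; i < 8; i++)
            printf("row %d: a -> offset %5zu   b -> offset %5zu\n", i,
                   (size_t)((char *)&a[i][0] - (char *)&a[0][0]) % CACHE_BYTES,
                   (size_t)((char *)&b[i][0] - (char *)&b[0][0]) % CACHE_BYTES);
        return 0;
    }

Rows of a repeat the same cache offsets with period four (0, 2048, 4096, 6144, 0, ...), whereas rows of b land at distinct offsets, which is why adding thirteen was sufficient to separate elements with similar access functions.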
Cache conflicts can be addressed in the hardware through associativity of some form. While associativity has the advantage of reducing conflicts by allowing locations to map to multiple cache entries, it has the disadvantage of slowing down cache access rates due to added complexity. Therefore minimizing the degree of associativity is an important concern. One option is to have a set-associative primary cache, where addresses are mapped to N-way associative sets. Another approach is to keep the cache direct-mapped but also add a ``victim cache'' [40] off to the side. A victim cache is a small buffer containing the last several lines replaced from the cache. While set-associative caches are the most common approach for adding associativity, the victim cache approach is appealing because it is tailored to cases where data are reused shortly after being displaced, which is precisely what happens with prefetching conflicts.
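To make the mechanism concrete, here is a minimal model of a direct-mapped cache backed by a victim buffer (a sketch only: the four-entry size, the FIFO replacement, and all names are our own simplifications, not the design of [40]):

    /* Direct-mapped cache with a small fully-associative victim
     * buffer that is searched on a primary miss.  Sizes and the
     * FIFO policy are illustrative assumptions. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINE    32             /* 32-byte cache lines             */
    #define LINES   (8192 / LINE)  /* 8 Kbyte direct-mapped cache     */
    #define VICTIMS 4              /* victim-buffer entries (assumed) */

    static uint32_t cache[LINES];    /* tag per line; 0 doubles as "empty" */
    static uint32_t victim[VICTIMS]; /* FIFO of recently displaced tags    */
    static int      vhead;

    static bool access_line(uint32_t addr) {
        uint32_t tag   = addr / LINE;
        uint32_t index = tag % LINES;

        if (cache[index] == tag)
            return true;                     /* primary-cache hit */

        for (int i = 0; i < VICTIMS; i++) {
            if (victim[i] == tag) {          /* victim hit: swap lines */
                victim[i]    = cache[index];
                cache[index] = tag;
                return true;
            }
        }

        victim[vhead] = cache[index];        /* displaced line -> victim */
        vhead         = (vhead + 1) % VICTIMS;
        cache[index]  = tag;
        return false;                        /* miss everywhere */
    }

    int main(void) {
        /* Two addresses that share a cache index: after the first
         * round of misses, the victim buffer captures the ping-pong. */
        for (int i = 0; i < 6; i++) {
            uint32_t addr = (i % 2) ? 0x12000 : 0x10000;
            printf("0x%05x -> %s\n", addr,
                   access_line(addr) ? "hit" : "miss");
        }
        return 0;
    }

On a victim hit the two lines are swapped, so data reused shortly after displacement migrates back into the primary cache; this is what makes the structure a good match for prefetching conflicts.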
We evaluate both of these techniques in this subsection by incorporating them into our uniprocessor architecture. The applications we consider are the ones that suffered most noticeably from conflicts, including two cases where prefetches were often ineffective due to conflicts (CHOLSKY and TOMCATV, as discussed earlier) and the four original codes before we manually realigned data (MXM, CFFT2D, VPENTA, and TOMCATV). The results of these experiments are shown in the corresponding figures for each case.
Each graph shows curves both with and without prefetching (PF and NOPF, respectively) for victim caches ranging from one to 256 entries, and for set-associative caches ranging from 2-way to 8-way associativity. Two curves are shown for the set-associative cases: one with random replacement (Random), and one with a least-recently-used (LRU) replacement policy. The LRU policy is slightly more effective, but is typically more expensive to implement.
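The behavioral difference between the two policies can be seen in a small model of an N-way set (again a sketch with assumed sizes; real hardware maintains the LRU state more cheaply than full timestamps):

    /* N-way set-associative lookup with LRU or random replacement.
     * Sizes are illustrative; LRU is modeled with last-use stamps. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define LINE 32
    #define WAYS 2
    #define SETS (8192 / LINE / WAYS)  /* 8 Kbyte total capacity */

    static uint32_t tags[SETS][WAYS];
    static uint64_t stamp[SETS][WAYS]; /* last-use time, for LRU */
    static uint64_t now;

    static bool access_line(uint32_t addr, bool use_lru) {
        uint32_t tag = addr / LINE, set = tag % SETS;
        int w, oldest = 0;

        now++;
        for (w = 0; w < WAYS; w++) {
            if (tags[set][w] == tag) {         /* hit: refresh LRU stamp */
                stamp[set][w] = now;
                return true;
            }
            if (stamp[set][w] < stamp[set][oldest])
                oldest = w;
        }
        w = use_lru ? oldest : rand() % WAYS;  /* choose a way to evict */
        tags[set][w]  = tag;
        stamp[set][w] = now;
        return false;
    }

    int main(void) {
        /* Three addresses competing for one 2-way set: LRU misses on
         * every access of this cyclic pattern, while random replacement
         * would hit occasionally. */
        uint32_t addrs[] = { 0x10000, 0x12000, 0x14000 };
        for (int i = 0; i < 9; i++)
            printf("0x%05x -> %s\n", addrs[i % 3],
                   access_line(addrs[i % 3], true) ? "hit" : "miss");
        return 0;
    }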
The success of victim caching and set-associativity varied across the applications, and we discuss each case individually below.
To insert prefetches in CHOLSKY, the compiler unrolls loop L (shown in the corresponding figure) four times for the spatial locality of the A reference, and prefetches are software-pipelined three iterations ahead. Therefore the B references are prefetched twelve references ahead, which spans 384 bytes (i.e., twelve separate 32-byte cache lines). This region is large enough that the B references can overlap when they fall close enough together. When that happens, the prefetch of B(I,L,K+JJ) displaces the prefetch of B(I,L,K), and the load of B(I,L,K) displaces the prefetch of B(I,L,K+JJ), thereby rendering both prefetches ineffective. These conflicts only occur for certain values of JJ, and they happen about 25% of the time.
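The distances involved are simple to check (the unroll factor, pipelining depth, and line size below are taken from the discussion above; the conflict condition in the final comment is a simplification):

    /* Back-of-the-envelope check of the prefetch distances above. */
    #include <stdio.h>

    int main(void) {
        int unroll   = 4;   /* loop unrolled 4x for A's spatial locality */
        int pipeline = 3;   /* prefetches pipelined 3 iterations ahead   */
        int line     = 32;  /* cache line size in bytes                  */

        int distance = unroll * pipeline;  /* 12 references ahead        */
        int span     = distance * line;    /* 384 bytes in flight        */
        printf("prefetch distance: %d lines (%d bytes)\n", distance, span);

        /* Two B streams whose addresses fall within this span and share
         * a direct-mapped cache index (which depends on the runtime
         * value of JJ) will keep displacing each other. */
        return 0;
    }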
As we see in the corresponding figure, a 2-way set-associative cache (with LRU replacement) is enough to eliminate this conflict problem. A victim cache only eliminates the problem when it is 32 entries deep: since we prefetch each reference twelve iterations ahead, up to 24 conflicting misses (twelve from each of the two interfering streams) must be captured. This type of conflict pattern is better suited to associativity.
For TOMCATV, we begin by focusing on the original code, as shown in the corresponding figure. For each of the cases where we show original code before the arrays are manually realigned (i.e., TOMCATV, MXM, CFFT2D, and VPENTA), we also indicate the performance of the code after realignment using dashed and dotted horizontal lines, which correspond to the cases with and without prefetching, respectively, on the original direct-mapped architecture. We observe that either eight victim cache entries or an 8-way set-associative cache is needed to match the performance of simply realigning the arrays. Once the arrays are realigned, a 2-way set-associative cache is quite helpful, and there is steady improvement from increasing the victim cache size up to sixteen entries. Large degrees of both victim caching and set-associativity fare well in the case of TOMCATV, but realigning the arrays is clearly the most important optimization.
For MXM, victim caching, unlike associativity, performs very well, as we see in the corresponding figure. A single victim entry allows all five references to remain in the cache, thereby dramatically improving the NOPF case. The prefetching case, which fetches data two iterations ahead, does somewhat worse than NOPF with a single victim entry because it replaces the crucial victim entry with data displaced by prefetching. With eight or more victim cache entries, we finally see the full prefetching benefit, since both the current and the prefetched data sets can be captured. We also see that by restructuring the code by hand, cache conflicts can be avoided altogether: the restructured code matches the NOPF performance with a single victim cache entry, and its performance with prefetching is exceeded only with eight or more victim entries, where occasional conflicts with other references can also be avoided.
In CFFT2D, the second loop is unrolled by a factor of two in the prefetching case, and we prefetch three iterations ahead. When the references line up, the prefetches also interfere with each other. Consequently the prefetch of X(K,IM) was often replaced by the prefetch and reference of X(K,II) before it could be used. However, the more important conflict cases with prefetching are in the first loop nest (part (a) of the corresponding figure). Here we prefetch six iterations ahead along the outer dimension. Since each of these accesses is separated by 2 Kbytes, they map into only four entries in the 8 Kbyte direct-mapped cache, so each reference ends up displacing its own prefetches. Six or more victim cache entries are enough to capture these conflicts, and we see a noticeable performance improvement with eight entries. An eight-entry victim cache does better than 8-way set-associativity in this case because all of the prefetches map into the same set, and the replacement policy is not perfect. The manually realigned code with a direct-mapped cache performs better than an 8-way set-associative cache and as well as a victim cache with eight or more entries.
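The mapping arithmetic here is easy to verify (the 2 Kbyte stride and 8 Kbyte cache size come from the discussion above; the base address is arbitrary):

    /* Accesses 2 Kbytes apart wrap around an 8 Kbyte direct-mapped
     * cache after only four lines, so strided prefetches displace
     * one another. */
    #include <stdio.h>

    int main(void) {
        unsigned cache = 8192, stride = 2048, base = 0x40000;

        for (int k = 0; k < 8; k++)  /* eight strided accesses */
            printf("access %d -> cache offset %u\n",
                   k, (base + k * stride) % cache);
        /* Offsets cycle through 0, 2048, 4096, 6144 and then repeat,
         * so each new access displaces a line still in flight. */
        return 0;
    }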
To summarize these results, we have seen a mixture of effects. In cases like CHOLSKY, where a small number of references conflict only because their prefetched data sets overlap, a small amount of set-associativity is the best approach. Victim caches do not perform as well in these cases because the depth of software pipelining is large and the replaced data must be held across that many iterations (i.e., there is a multiplicative effect: the required victim-cache depth is roughly the degree of conflict times the pipelining distance). In other cases, such as MXM, where the conflicting data map into the same sets of a set-associative cache, victim caching is better suited to the problem. In still other cases, the same degree of associativity (either N-way set-associativity or an N-entry victim cache) works equally well.
In a few cases we noticed two dips in the performance curves. The first dip corresponds to holding the ``active'' working set in either the primary cache or the victim cache (e.g., two victim entries in MXM and CFFT2D), and the second dip occurs when the prefetched active set can also be held at the primary cache level (e.g., eight victim entries in MXM and CFFT2D). When conflicts already occur (e.g., in the original VPENTA code), prefetching may be doomed, since the prefetched data will also conflict. Prefetching may also cause new conflicts of its own, especially when striding through arrays that are only slightly separated in the cache.
Because prefetch-specific conflicts occur within a certain window of time, the victim cache approach sounds appealing, since it is tailored to capturing recently displaced data. However, with latencies of over a hundred cycles, more than sixteen victim cache entries may be needed, which is probably too large for an associative lookup.
In all cases, the software-based restructuring approach did quite well. Our restructuring algorithm was simply to add a prime number (thirteen) to the size of array dimensions that were a power of two. While care must be taken with this approach, it worked quite well and has the added benefit of reducing conflicts in the secondary cache. In one case (VPENTA), no reasonable amount of hardware could solve the cache conflict problems.