As in the uniprocessor study, the effectiveness of the scheduling algorithm
can be evaluated by how often prefetched misses become primary
cache hits. Figure breaks down what happened to the
original primary cache misses, similar to Figure
in
the uniprocessor study. However, since there are more possible explanations for
ineffective prefetches in the multiprocessor architecture, we have broken
the misses down into several more specific categories.
The topmost section of the bars in Figure (labeled
nopf-miss) still represents the cases that are not prefetched and obviously
remain cache misses. These are the misses that are not part of the coverage
factor, which was discussed in the previous subsection. The bottom
section of the bars (labeled pf-hit) includes the original misses
that are prefetched and subsequently become primary hits, which corresponds
to effective scheduling. The remaining four cases in the middle of the bars
(labeled pf-miss) are the ones where prefetches are not effective. We
will discuss each of these cases in detail.
The ``pf-miss: too late'' category includes cases where the
prefetches were not issued early enough to hide the latency, and therefore
the data item had not returned to the cache before it was referenced. This
case was only noticeable in one application: LU. The problem in LU is that
prefetches are delayed due to ``hot-spotting'' in the network. The
hot-spotting occurs because several processors are often waiting for the
same column to be produced by a single processor. Once that processor
signals that the column is ready, the other processors simultaneously send
numerous prefetches to that processor, requesting the data. The prefetches
subsequently exceed the bandwidth capabilities of that particular network
node, resulting in queueing delays. This problem is not remedied by
simply scheduling the software pipeline to hide a larger latency, since the
prefetches cannot be issued before the synchronization point, and after
that point the request rate simply exceeds the bandwidth capacity of the
network node. In fact, when LU was recompiled for twice the normal target
memory latency (i.e. 600 rather than 300 cycles), the performance was
considerably worse. So to some extent these delays are
unavoidable.
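To make the scheduling trade-off concrete, the following is a minimal sketch
of a software-pipelined prefetch loop in C. It assumes GCC's
__builtin_prefetch intrinsic; the loop body and cycle estimates are
illustrative and are not taken from the compiler evaluated here (only the
300-cycle target latency comes from the text):

```c
#include <stddef.h>

/* Prefetch distance: how many iterations ahead to issue a prefetch so
 * that the data returns before it is referenced.  Compiling for a
 * larger target latency simply increases this distance. */
#define TARGET_LATENCY  300  /* cycles of latency to hide (from the text) */
#define CYCLES_PER_ITER  30  /* estimated cost of one iteration (assumed) */
#define PF_DIST ((TARGET_LATENCY + CYCLES_PER_ITER - 1) / CYCLES_PER_ITER)

void scale_column(double *col, double pivot, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&col[i + PF_DIST], 0 /* read */, 3);
        col[i] *= pivot;
    }
}
```

Note that a larger PF_DIST only moves each prefetch earlier within the loop;
it cannot move any prefetch before the synchronization point that releases
the column, which is consistent with the observation above that compiling LU
for a 600-cycle latency made performance considerably worse.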
The ``pf-miss: invalidated'' category includes the cases where the
prefetched data item is brought into the cache but is invalidated before it
can be referenced. This situation arises because we use non-binding
prefetches, and therefore sometimes the data must be invalidated to
preserve correctness. This category was most prominent in MP3D, where it
accounted for roughly half of the pf-miss cases. In MP3D, some of the
objects that are prefetched (the ``space cells'') are very actively shared
and modified amongst the processors. Therefore in many cases another
processor wants to modify a location between the time it is prefetched and
referenced. This effect is magnified to some extent by the fact that we use
exclusive-mode prefetches, which means that a processor can
invalidate other copies of the line when it expects to write the line in
the near future, not only when it is actually modifying the line. The
misses in the ``pf-miss: invalidated'' category tend to be
expensive, since the data is typically found far from the processor (i.e.
rather than being in the home memory, it is usually in a remote processor's
cache). We see that this is true in Table ,
where the average read miss latency for MP3D is only reduced from 90.1 to
83.0 cycles, which is a meager reduction compared to the other
applications. Although these invalidations are an interesting effect, the
prefetches that are successful still manage to eliminate 82% of the miss
stall time in MP3D. Allowing such invalidations is precisely what makes the
prefetches non-binding, and that benefit far outweighs the cost of the
prefetches that occasionally fail due to invalidations.
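The read/exclusive distinction can be illustrated with GCC's
__builtin_prefetch, whose second argument selects a read (shared) or write
(exclusive-mode) prefetch. The structure and function below are hypothetical
stand-ins for MP3D's space-cell update, not its actual code:

```c
typedef struct {            /* hypothetical stand-in for an MP3D space cell */
    double velocity_sum[3];
    int    particle_count;
} cell_t;

void update_cell(cell_t *cell)
{
    /* Exclusive-mode prefetch: rw = 1 requests ownership, so other
     * processors' cached copies are invalidated when the prefetch is
     * issued, not when the write actually occurs.  Because the
     * prefetch is non-binding, a write by another processor during
     * the window below can invalidate our copy before we use it. */
    __builtin_prefetch(cell, 1 /* write */, 3);

    /* ... intervening work: this window is where the
     *     ``pf-miss: invalidated'' cases arise ... */

    cell->particle_count++;   /* the actual modification */
}
```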
The ``pf-miss: replaced'' category includes cases where the prefetched data is brought into the primary cache but is subsequently replaced from both the primary and secondary caches before it can be referenced. This case did not occur often for any of the applications.
Finally, the ``pf-miss: in-scache'' category includes cases where prefetched data is replaced from the primary cache before it can be referenced, but the data is still found in the secondary cache. Fortunately this is the dominant case for all the applications, as it was in the uniprocessor study. This is the least harmful of the pf-miss categories, since the latency of a miss to the secondary cache is small relative to other types of misses, and in many cases an expensive miss has been reduced to a relatively cheap one (i.e. it represents partial latency-hiding). For example, the majority of misses in this category for MP3D were formerly ``dirty remote'' misses, which have latencies of at least 132 cycles. Therefore the latency is reduced by roughly an order of magnitude when the data is found in the secondary cache.
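As a rough check of this order-of-magnitude claim (the secondary-cache hit latency is an assumption on our part; this section quotes only the 132-cycle dirty-remote figure):

\[
\frac{132\ \text{cycles (dirty remote)}}{\sim\!10\text{--}15\ \text{cycles (secondary-cache hit, assumed)}} \;\approx\; 9\text{--}13\times
\]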
Although the ``pf-miss: in-scache'' category is the least harmful of the pf-miss categories, it is still obviously less desirable than the pf-hit category. In the case of CHOLESKY, only 10% of the prefetched primary misses became primary hits; the other 90% are found in the secondary cache. Unfortunately, in this case nearly all of those latter misses were in the secondary cache to begin with, so there is no partial latency-hiding benefit. On the other hand, most of the misses that became primary hits were originally expensive remote misses. All of these prefetches are for the same structures: the panels. Once a panel is ``ready'', it is repeatedly applied to other panels through a procedure call. When a panel is first used, the prefetches within the procedure succeed in fetching it from a remote location into the primary cache. However, on subsequent calls to this procedure, the prefetches for that same panel fail to bring it from the secondary to the primary cache due to conflicts with other references. These conflicts are with other panel references in the key loop, where eight separate columns within the panel are referenced on each iteration (a schematic version of this loop is sketched below). To some extent, the conflict problem is also aggravated because we schedule for the worst-case latency: although this allows prefetching to work for the expensive misses (which is important), it causes the data to arrive in the primary cache unnecessarily early when it is found in the secondary cache. Treating these cases differently is difficult, since they are simply different calls to the same procedure. It is unfortunate that scheduling is not more effective for CHOLESKY, considering how well locality analysis performs: only 15% of the prefetches are unnecessary and the coverage factor is 85%. The performance of LOCUS is also hindered by a similar conflict problem, but to a lesser extent.
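The conflict pattern might look roughly like the following schematic C loop; the names and layout are illustrative, and only the ``eight columns per iteration'' structure is taken from the text. Eight concurrent column streams can map to overlapping sets in a small direct-mapped primary cache, so each stream's prefetched lines may be evicted by the others before they are referenced.

```c
#include <stddef.h>

/* Schematic panel update: eight separate columns of the panel are
 * referenced on every iteration.  In a small direct-mapped primary
 * cache these eight streams can conflict, evicting prefetched lines
 * back to the secondary cache, especially when worst-case latency
 * scheduling brings them in long before they are needed. */
void apply_panel(double *dst, double *const col[8],
                 const double mult[8], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < 8; j++)     /* eight column references */
            sum += mult[j] * col[j][i];
        dst[i] -= sum;
    }
}
```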
To summarize, we have seen that the scheduling algorithm is often quite
successful, which results in substantial speedups for OCEAN, LU, and MP3D.
However, there is room for further improvement in CHOLESKY and LOCUS by
reducing cache conflicts, which we will address later in
Section . Despite these conflicts, which
displace data from the primary cache into the secondary cache, the benefit of
hiding remote latencies resulted in the elimination of at least 30% of memory
stall cycles in all cases.