Finally, let us consider the combined effect of multithreading and prefetching. The main benefit of combining the two, of course, is that each scheme can compensate for the other's weaknesses. For example, prefetching can increase the hit rate, thus increasing the run lengths and ensuring that a small number of contexts suffice. Similarly, multithreading can ensure that the processor does not remain idle on misses where prefetching was not effective. However, the two schemes can also interact negatively. First, both prefetching and multithreading add overhead. So if the latency of a reference could be totally hidden by one scheme alone, the second one only contributes overhead. Second, the two techniques may interfere with each other. For example, when multiple contexts are used, the time between the issue and the use of a prefetch may increase substantially, thus increasing the chance of the prefetched data being invalidated or replaced from the cache before being referenced. Depending on the relative magnitudes of the above effects, the performance of an application may increase or decrease when both schemes are used.
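The overhead argument above can be made concrete with a small back-of-envelope cost model. This is an illustrative sketch only, not the model or simulator used in this study; the miss latency, switch penalty, and prefetch-issue overhead are hypothetical round numbers chosen for clarity.

```python
# Illustrative per-miss cost model (hypothetical parameters, not the
# paper's measured values).
MISS_LATENCY = 100      # cycles a miss would stall a single-context processor
SWITCH_PENALTY = 4      # cycles to switch to another ready context
PREFETCH_OVERHEAD = 2   # cycles to issue one prefetch instruction

def miss_cost(contexts, prefetch_covers):
    """Estimated cycles lost per miss beyond useful work.

    contexts        -- number of hardware contexts (1 = no multithreading)
    prefetch_covers -- fraction of the miss latency the prefetch hides
    """
    exposed = MISS_LATENCY * (1.0 - prefetch_covers)
    overhead = PREFETCH_OVERHEAD if prefetch_covers > 0 else 0
    if contexts > 1 and exposed > SWITCH_PENALTY:
        # Switch away and let another context run during the exposed latency.
        exposed = SWITCH_PENALTY
    return exposed + overhead

# Multithreading alone caps the per-miss cost at the switch penalty:
miss_cost(contexts=4, prefetch_covers=0.0)   # -> 4
# A fully effective prefetch alone costs only its issue overhead:
miss_cost(contexts=1, prefetch_covers=1.0)   # -> 2.0
# A half-effective prefetch on a multithreaded processor is pure overhead,
# since the switch would have capped the exposed latency anyway:
miss_cost(contexts=4, prefetch_covers=0.5)   # -> 6
```

In this toy model, combining the schemes is worse than multithreading alone whenever the prefetch cannot reduce the exposed latency below the switch penalty, which mirrors the negative interaction described above.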
The figure shows the performance of multithreading, both with and without prefetching (all of these results are for the four-cycle switch penalty). As we see in this figure, the results are mixed. In some cases the best overall performance is achieved with four contexts and no prefetching (LU, MP3D, LOCUS, and BARNES), in other cases with prefetching and a single context (OCEAN, PTHOR), and in one case with the combination of prefetching and two contexts (CHOLESKY). With four contexts, the negative effects of combining prefetching and multithreading appear to dominate. When only two contexts are used, the addition of prefetching nearly always improves performance. In fact, with only two contexts, the best overall performance in four of the seven cases (LU, MP3D, CHOLESKY, and LOCUS) is achieved by combining prefetching and multithreading.
In our study, when we added prefetching to the applications, we did not have multithreading in mind. In some cases this may have had a negative impact on the results for combining prefetching and multithreading. For example, on a single-context processor it is reasonable to be quite aggressive and add prefetches even in situations where we expect only a small portion of the latency to be hidden. This occurred most frequently in PTHOR and BARNES, where control dependencies make it difficult to move prefetches back far enough. On a multiple-context processor, however, this may be a bad decision: if the multithreaded processor would have hidden the latency anyway, the prefetch adds overhead without providing any benefit.
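One way to make this decision rule explicit is a selection heuristic that is less aggressive about inserting prefetches when the target processor is multithreaded. The following is a hypothetical sketch, not the insertion policy actually used in this study; the parameter values are illustrative round numbers.

```python
# Hypothetical prefetch-insertion heuristic: only insert a prefetch if its
# expected saving exceeds its issue overhead, given the context count.
def should_prefetch(hidable_fraction, miss_latency=100, contexts=1,
                    switch_penalty=4, prefetch_overhead=2):
    """Return True if inserting the prefetch is expected to pay off."""
    # Without the prefetch, a single context stalls for the full latency,
    # while a multithreaded processor would just switch contexts.
    exposed_without = miss_latency if contexts == 1 else switch_penalty
    # With the prefetch, only the uncovered portion of the latency remains
    # exposed (and a context switch can still cap it).
    exposed_with = (1.0 - hidable_fraction) * miss_latency
    if contexts > 1:
        exposed_with = min(exposed_with, switch_penalty)
    return exposed_with + prefetch_overhead < exposed_without

# A prefetch hiding half the latency helps a single-context processor ...
should_prefetch(0.5, contexts=1)    # True  (52 cycles exposed vs. 100)
# ... but hurts a four-context one, which would have switched anyway:
should_prefetch(0.5, contexts=4)    # False (6 cycles vs. 4)
# Only a nearly complete prefetch pays off under multithreading:
should_prefetch(0.99, contexts=4)   # True  (~3 cycles vs. 4)
```

Under these toy parameters, a partially effective prefetch that is clearly worthwhile for one context is rejected for four, matching the observation that aggressive insertion tuned for a single-context processor can backfire when combined with multithreading.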