We begin our investigation by evaluating multithreading in its own right. Later we will examine the benefits of combining multithreading with prefetching.
Figure shows performance results for 1-, 2-, and 4-context processors with context-switch penalties of 4 and 16 cycles. Each bar in the graphs is broken down into the following components: time spent executing instructions, time spent switching between contexts, and time when the processor is idle. The idle time is divided further into all-idle time, during which every context is idle waiting for a reference to complete, and no-switch time, during which the current context is idle but is not switched out. Most of the latter arises because the processor is locked out of the primary cache while fill operations of other contexts complete.
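To make the breakdown concrete, the following sketch is a toy model of a blocked, switch-on-miss multithreaded processor that reports the same busy/switch/idle decomposition. It is our own illustration, not the simulator behind these results: the function and parameter names (simulate, run_length_fn, miss_latency, switch_cost) are ours, and the model omits the primary-cache lockout that produces the no-switch component.

```python
def simulate(num_ctx, miss_latency, switch_cost, run_length_fn, n_misses=10_000):
    """Toy model of a blocked (switch-on-miss) multithreaded processor.

    Each context executes run_length_fn() cycles, then takes a cache miss
    that stalls it for miss_latency cycles; resuming a different context
    costs switch_cost cycles. Returns the fractions of time spent
    executing, switching, and all-idle (every context stalled).
    """
    now = busy = switching = idle = 0
    ready_at = [0] * num_ctx          # cycle at which each context's miss resolves
    cur = 0
    for _ in range(n_misses):
        # Resume the earliest-ready context (ties broken by index).
        nxt = min(range(num_ctx), key=lambda c: ready_at[c])
        if ready_at[nxt] > now:       # every context is stalled: all-idle time
            idle += ready_at[nxt] - now
            now = ready_at[nxt]
        if nxt != cur:                # pay the context-switch penalty
            switching += switch_cost
            now += switch_cost
            cur = nxt
        run = run_length_fn()         # execute until the next cache miss
        busy += run
        now += run
        ready_at[cur] = now + miss_latency
    total = busy + switching + idle
    return busy / total, switching / total, idle / total
```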
Most of the applications benefited from multithreading. The noteworthy exceptions are CHOLESKY and PTHOR, where performance is worse with four contexts than with a single context. The reason is that these two applications do not scale well to 64 processes, so the processes spend too much time spinning while waiting for work. This extra spinning time appears as the increase in the instruction category in Figure .
To provide some insight into these results, Table shows the median run length and average primary miss latency for each application. A rough estimate of the number of contexts necessary to hide memory latency is the miss latency divided by the run length. For example, MP3D has one of the more favorable ratios (roughly two-to-one), which helps explain why two contexts eliminate a large fraction of the idle time. In contrast, OCEAN has a ratio of more than three-to-one, which helps explain why two contexts eliminate only part of the idle time.
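In symbols (our notation, not the paper's): with run length R and miss latency L, this estimate is

\[
N \;\approx\; \left\lceil \frac{L}{R} \right\rceil,
\]

so MP3D's roughly two-to-one ratio gives N ≈ 2, consistent with two contexts removing most of the idle time, while OCEAN's more-than-three-to-one ratio implies that two contexts cannot cover the latency.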
However, a favorable latency-to-run-length ratio does not guarantee good performance. For example, in BARNES this ratio suggests that two contexts should suffice to hide the latency, yet only about half of the all-idle time is eliminated. The reason is the clustering of cache misses: when several misses arrive close together, there may be no ready context left to switch to. In addition, cache miss rates can deteriorate as the contexts compete for the same cache; we observe this effect in LOCUS, where the primary data miss rate more than doubles, from 14% to 30%, as we go from one to four contexts.
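The clustering effect shows up even in the toy model above. The comparison below uses a uniform miss pattern and a bursty pattern with the same average run length; the cycle counts (R = 16, L = 48, C = 4) are illustrative choices of ours, not measurements from this study.

```python
import random

R, L, C = 16, 48, 4                   # illustrative cycle counts, not measured values
uniform = lambda: R                   # a miss every R cycles
def clustered():                      # same mean run length, but misses arrive in bursts
    return 2 if random.random() < 0.5 else 2 * R - 2

for name, fn in [("uniform", uniform), ("clustered", clustered)]:
    busy, switch, idle = simulate(4, L, C, fn)
    print(f"{name:9s} busy={busy:.2f} switch={switch:.2f} idle={idle:.2f}")
```

With these numbers the uniform pattern keeps four contexts saturated, since the three other contexts supply 3 × (16 + 4) = 60 cycles of work, more than the 48-cycle latency; the bursty pattern, in contrast, periodically exhausts all four contexts on short runs and leaves all-idle time behind.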
The importance of minimizing the context-switch latency depends on whether another ready-to-run context is usually available when a switch occurs. On the one hand, when some of the applications are run with only two contexts (e.g., OCEAN, LU, and PTHOR), there typically is no ready-to-run context at the time of a switch, and reducing the switch penalty from 16 to 4 cycles therefore has little impact on performance. On the other hand, the switch penalty does affect performance significantly in most cases with four contexts, and even in some cases with two contexts (e.g., CHOLESKY and LOCUS). Thus, provided there are enough contexts to hide the latency, it is important to minimize the context-switch latency.
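This behavior is captured by a standard back-of-the-envelope model of blocked multithreading (the notation is ours, not the paper's). With N contexts, run length R, miss latency L, and switch cost C, processor efficiency is approximately

\[
\varepsilon \;\approx\;
\begin{cases}
\dfrac{N R}{R + C + L}, & (N - 1)(R + C) < L \quad \text{(unsaturated)},\\[1ex]
\dfrac{R}{R + C}, & \text{otherwise (saturated)}.
\end{cases}
\]

In the unsaturated regime the denominator is dominated by L, so shaving the switch cost buys little; once there are enough contexts to cover the latency, every miss costs R + C cycles of processor time, and cutting C from 16 to 4 removes most of the switching overhead.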
To summarize, multithreading can improve performance significantly when the latency-to-run-length ratio is favorable, but enough parallelism must be available in the application to keep the additional contexts busy. We further observe that destructive interference among the contexts in the processor cache can undo the gains achieved. Such interference is more of a problem for multithreading than for prefetching because multiple working sets must coexist in the same cache. Finally, the smaller the context-switch cost, the lower the total overhead of multithreading: a 16-cycle switch penalty introduces significant overhead, whereas the overhead is much more reasonable with a 4-cycle penalty.