List of Figures
- Speed of commercial microprocessors and commodity DRAM over the
past decade.
- Data locality example.
- Example of a more complicated access pattern that can be
handled by reuse analysis.
- Example of non-uniformly generated references.
- Example of uniformly generated references that do not have
reuse.
- Example of references that access the same cache lines
despite never accessing the same data items.
- Example of how loop iteration counts and cache size affect
locality.
- Algorithm for computing the localized iteration space.
- Example of algorithm for estimating volume of data accessed by
each loop.
- Example of how symbolic values can be useful when computing
volume of data accessed by each loop.
- Example of references with group reuse but not group locality.
- Example of how prefetch predicates are constructed.
- Generic schema for peeling a loop.
- Generic schema for unrolling a loop.
- Generic schema for strip-mining a loop.
- Algorithm for computing the shortest path through a loop body.
- Example of how software pipelining is used to schedule
prefetches the proper amount of time in advance. For this example,
assume that 5 iterations are enough to hide memory latency.
- Example of selective prefetching algorithm.
- Overall performance of the selective prefetching algorithm (N =
no prefetching, and S = selective prefetching).
- Overall performance comparison between the indiscriminate and
selective prefetching algorithms (I =
indiscriminate prefetching, and S = selective prefetching).
- Statistics for evaluating locality analysis for the uniprocessor
applications (I = indiscriminate prefetching, and S = selective
prefetching). Note that the unnecessary prefetch percentages are
computed with respect to the number of prefetches issued, which
changes between the two cases.
- Loop splitting effectiveness (N = no prefetching, C =
selective prefetching with conditional statements,
and S = selective prefetching with loop splitting).
- Breakdown of the impact of prefetching on the original primary
cache misses for the uniprocessor applications.
- Sensitivity of results to compile-time parameters (N = no prefetching,
S = selective prefetching variations).
- Results with locality optimization (N = no prefetching,
I = indiscriminate prefetching, and S = selective prefetching).
- Example of an indirect array reference.
- Example of how software pipelining is used to prefetch indirect
references. For this example, assume that 5 iterations are enough to
hide memory latency.
- Example of how prefetching multiple levels of indirection may
result in invalid addresses and possibly a load exception. Assume that
5 iterations are sufficient to hide memory latency.
- Prefetching indirect references in the uniprocessor
applications (N = no prefetching, D =
dense-only prefetching, and B = both dense and indirect prefetching).
- Example of when a binding prefetch would be illegal.
- Example of how coherence activity can cause cache misses.
- Example containing explicit synchronization.
- Illustration of how exclusive-mode prefetching improves performance.
- Architecture and processor environment.
- Performance of multiprocessor applications without prefetching.
- Overall performance of the selective prefetching algorithm
for the multiprocessor applications (N =
no prefetching, and S = selective prefetching).
- Overall performance comparison between the indiscriminate and
selective prefetching algorithms for the multiprocessor applications (I =
indiscriminate prefetching, and S = selective prefetching).
- Statistics for evaluating locality analysis for the
multiprocessor applications
(I = indiscriminate prefetching, and S = selective
prefetching).
- Breakdown of the impact of prefetching on the original primary
cache misses for the multiprocessor applications.
- Prefetching indirect references in the multiprocessor version of MP3D
(N = no prefetching, D = dense-only prefetching, and
B = both dense and indirect prefetching).
- Performance with and without exclusive-mode prefetching given
sequential consistency rather than release consistency
(S = shared-mode prefetching only, X = exclusive-mode
prefetching available). Performance is normalized to release
consistency without prefetching.
- Performance of multiprocessor applications with varying
cache sizes (N = no prefetching, and S = selective
prefetching).
- Comparison of compiler-based and hand-inserted prefetching for
the cases where the compiler succeeded (N = no prefetching,
A = prefetches inserted automatically by compiler, H
= hand-inserted prefetching).
- Cases where the compiler failed to improve prefetching (N
= no prefetching, H = hand-inserted prefetching).
- Format of prefetch instructions, using ``base-plus-offset''
addressing mode.
- Example of how prefetches can reuse load/store base registers
(in this case r7).
- Possible encoding of prefetch instruction.
- Dropping vs. stalling on a full prefetch issue buffer (D = drop,
S = stall).
- Performance when prefetches do not check either cache
before going to memory (M = go straight to memory, C =
check caches).
- Performance when prefetching into the primary cache (1)
versus prefetching only into the secondary cache (2).
- Performance when prefetching into the secondary cache, both
when compiled for the primary cache size (O), and when recompiled
for the secondary cache size (R).
- Prefetch issue buffer in the uniprocessor architecture. Note
that for prefetches, the fetched data goes directly into the cache,
rather than being held in an MSHR. Therefore an MSHR is simply a
resource for controlling an outstanding miss under this model.
- Distribution of previously outstanding prefetch misses upon
each prefetch miss for the original uniprocessor architecture (which
supports up to seventeen outstanding misses).
- Performance when the number of MSHRs is varied.
- Performance when the prefetch issue buffer size is
varied between four, eight, and sixteen entries, given four MSHRs.
- Example of why intrinsic data reuse matters when scheduling
prefetches even when miss rates are precisely known.
- Results using feedback (N = no prefetching, S =
prefetching with static analysis only, F = prefetching with
feedback).
- Example of adapting at runtime by checking problem size.
- Example of adapting at runtime by checking for data alignment
conflicts.
- Example of adapting at runtime by checking hardware miss counters.
- Example of adapting at runtime to temporal locality along an
outer loop by checking hardware miss counters.
- Results with adaptive version of BCOPY (N = no
prefetching, S = statically prefetch all the time,
D = adapt prefetching dynamically). ``BxT'' means the
same B-byte block is copied to the same destination T times.
Performance is renormalized for each case.
- Results with adaptive version of LU (N = no
prefetching, S = statically prefetch all the time,
D = adapt prefetching dynamically). Performance
is renormalized for each cache size.
- Performance of CHOLSKY with victim caches and set-associative
primary caches.
- Loop that suffers cache conflicts in CHOLSKY.
- Performance of the original TOMCATV code with victim caches and set-associative
primary caches. The performance of TOMCATV on the original direct-mapped architecture
after arrays are manually realigned is shown by the dotted (no prefetching) and
dashed (with prefetching) horizontal lines.
- Performance of the realigned version of TOMCATV with victim caches and set-associative
primary caches. Note that performance is normalized to the speed of this realigned code.
- Performance of the original MXM code with victim caches and set-associative
primary caches. The performance of MXM on the original direct-mapped architecture
after arrays are manually realigned is shown by the dotted (no prefetching) and
dashed (with prefetching) horizontal lines.
- Loop that suffers cache conflicts in MXM.
- Performance of the original CFFT2D code with victim caches and set-associative
primary caches. The performance of CFFT2D on the original direct-mapped architecture
after arrays are manually realigned is shown by the dotted (no prefetching) and
dashed (with prefetching) horizontal lines.
- Loops that suffer cache conflicts in CFFT2D.
- Performance of the original VPENTA code with victim caches and set-associative
primary caches. The performance of VPENTA on the original direct-mapped architecture
after arrays are manually realigned is shown by the dotted (no prefetching) and
dashed (with prefetching) horizontal lines.
- Loop that suffers cache conflicts in VPENTA.
- Example where it is not clear whether to use uncached prefetches.
- Example of how instruction overhead can be eliminated by
issuing large block prefetches outside the main loop.
- Example where imperfect branch prediction makes it
difficult to look far enough ahead in the instruction stream.
- Performance with sequential consistency (SC) versus
release consistency (RC), normalized to RC without
prefetching (N = no prefetching, S = selective
prefetching).
- Performance of multithreading with 1, 2 and 4 contexts and
switch latencies of 4 and 16 cycles.
- Effect of combining multithreading with prefetching
(multithreading schemes have a 4-cycle switch latency).