List of Figures
- Speed of commercial microprocessors and commodity DRAM over the
past decade.
- Data locality example.
- Example of a more complicated access pattern that can be
handled by reuse analysis.
- Example of non-uniformly generated references.
- Example of uniformly generated references that do not have
reuse.
- Example of references that access the same cache lines
despite never accessing the same data items.
- Example of how loop iteration counts and cache size affect
locality.
- Algorithm for computing the localized iteration space.
- Example of algorithm for estimating volume of data accessed by
each loop.
- Example of how symbolic values can be useful when computing
volume of data accessed by each loop.
- Example of references with group reuse but not group locality.
- Example of how prefetch predicates are constructed.
- Generic schema for peeling a loop.
- Generic schema for unrolling a loop.
- Generic schema for strip-mining a loop.
- Algorithm for computing the shortest path through a loop body.
- Example of how software pipelining is used to schedule
prefetches the proper amount of time in advance. For this example,
assume that 5 iterations are enough to hide memory latency.
- Example of selective prefetching algorithm.
- Overall performance of the selective prefetching algorithm (N =
no prefetching, and S = selective prefetching).
- Overall performance comparison between the indiscriminate and
selective prefetching algorithms (I =
indiscriminate prefetching, and S = selective prefetching).
- Statistics for evaluating locality analysis for the uniprocessor
applications (I = indiscriminate prefetching, and S = selective
prefetching). Note that the unnecessary prefetch percentages are
computed with respect to the number of prefetches issued, which
changes between the two cases.
- Loop splitting effectiveness (N = no prefetching, C =
selective prefetching with conditional statements,
and S = selective prefetching with loop splitting).
- Breakdown of the impact of prefetching on the original primary
cache misses for the uniprocessor applications.
- Sensitivity of results to compile-time parameters (N = no prefetching,
S = selective prefetching variations).
- Results with locality optimization (N = no prefetching,
I = indiscriminate prefetching, and S = selective prefetching).
- Example of an indirect array reference.
- Example of how software pipelining is used to prefetch indirect
references. For this example, assume that 5 iterations are enough to
hide memory latency.
- Example of how prefetching multiple levels of indirection may
result in invalid addresses and possibly a load exception. Assume that
5 iterations are sufficient to hide memory latency.
- Prefetching indirect references in the uniprocessor
applications (N = no prefetching, D =
dense-only prefetching, and B = both dense and indirect prefetching).
- Example of when a binding prefetch would be illegal.
- Example of how coherence activity can cause cache misses.
- Example containing explicit synchronization.
- Illustration of how exclusive-mode prefetching improves performance.
- Architecture and processor environment.
- Performance of multiprocessor applications without prefetching.
- Overall performance of the selective prefetching algorithm
for the multiprocessor applications (N =
no prefetching, and S = selective prefetching).
- Overall performance comparison between the indiscriminate and
selective prefetching algorithms for the multiprocessor applications (I =
indiscriminate prefetching, and S = selective prefetching).
- Statistics for evaluating locality analysis for the
multiprocessor applications
(I = indiscriminate prefetching, and S = selective
prefetching).
- Breakdown of the impact of prefetching on the original primary
cache misses for the multiprocessor applications.
- Prefetching indirect references in the multiprocessor version of MP3D
(N = no prefetching, D = dense-only prefetching, and
B = both dense and indirect prefetching).
- Performance with and without exclusive-mode prefetching given
sequential consistency rather than release consistency
(S = shared-mode prefetching only, X = exclusive-mode
prefetching available). Performance is normalized to release
consistency without prefetching.
- Performance of multiprocessor applications with varying
cache sizes (N = no prefetching, and S = selective
prefetching).
- Comparison of compiler-based and hand-inserted prefetching for
the cases where the compiler succeeded (N = no prefetching,
A = prefetches inserted automatically by compiler, H
= hand-inserted prefetching).
- Cases where the compiler failed to improve prefetching (N
= no prefetching, H = hand-inserted prefetching).
- Format of prefetch instructions, using ``base-plus-offset''
addressing mode.
- Example of how prefetches can reuse load/store base registers
(in this case r7).
- Possible encoding of prefetch instruction.
- Dropping vs. stalling on a full prefetch issue buffer (D = drop,
S = stall).
- Performance when prefetches do not check either cache
before going to memory (M = go straight to memory, C =
check caches).
- Performance when prefetching into the primary cache (1)
versus prefetching only into the secondary cache (2).
- Performance when prefetching into the secondary cache, both
when compiled for the primary cache size (O), and when recompiled
for the secondary cache size (R).
- Prefetch issue buffer in the uniprocessor architecture. Note
that for prefetches, the fetched data goes directly into the cache,
rather than being held in an MSHR. Therefore an MSHR is simply a
resource for controlling an outstanding miss under this model.
- Distribution of previously outstanding prefetch misses upon
each prefetch miss for the original uniprocessor architecture (which
supports up to seventeen outstanding misses).
- Performance when the number of MSHRs is varied.
- Performance when the prefetch issue buffer size is
varied between four, eight, and sixteen entries, given four MSHRs.
- Example of why intrinsic data reuse matters when scheduling
prefetches even when miss rates are precisely known.
- Results using feedback (N = no prefetching, S =
prefetching with static analysis only, F = prefetching with
feedback).
- Example of adapting at runtime by checking problem size.
- Example of adapting at runtime by checking for data alignment
conflicts.
- Example of adapting at runtime by checking hardware miss counters.
- Example of adapting at runtime to temporal locality along an
outer loop by checking hardware miss counters.
- Results with adaptive version of BCOPY (N = no
prefetching, S = statically prefetch all the time,
D = adapt prefetching dynamically). ``BxT'' means the
same B-byte block is copied to the same destination T times.
Performance is renormalized for each case.
- Results with adaptive version of LU (N = no
prefetching, S = statically prefetch all the time,
D = adapt prefetching dynamically). Performance
is renormalized for each cache size.
- Performance of CHOLSKY with victim caches and set-associative
primary caches.
- Loop that suffers cache conflicts in CHOLSKY.
- Performance of the original TOMCATV code with victim caches and set-associative
primary caches. The performance of TOMCATV on the original direct-mapped architecture
after arrays are manually realigned is shown by the dotted (no prefetching) and
dashed (with prefetching) horizontal lines.
- Performance of the realigned version of TOMCATV with victim caches and set-associative
primary caches. Note that performance is normalized to the speed of this realigned code.
- Performance of the original MXM code with victim caches and set-associative
primary caches. The performance of MXM on the original direct-mapped architecture
after arrays are manually realigned is shown by the dotted (no prefetching) and
dashed (with prefetching) horizontal lines.
- Loop that suffers cache conflicts in MXM.
- Performance of the original CFFT2D code with victim caches and set-associative
primary caches. The performance of CFFT2D on the original direct-mapped architecture
after arrays are manually realigned is shown by the dotted (no prefetching) and
dashed (with prefetching) horizontal lines.
- Loops that suffer cache conflicts in CFFT2D.
- Performance of the original VPENTA code with victim caches and set-associative
primary caches. The performance of VPENTA on the original direct-mapped architecture
after arrays are manually realigned is shown by the dotted (no prefetching) and
dashed (with prefetching) horizontal lines.
- Loop that suffers cache conflicts in VPENTA.
- Example where it is not clear whether to use uncached prefetches.
- Example of how instruction overhead can be eliminated by
issuing large block prefetches outside the main loop.
- Example where imperfect branch prediction makes it
difficult to look far enough ahead in the instruction stream.
- Performance with sequential consistency (SC) versus
release consistency (RC), normalized to RC without
prefetching (N = no prefetching, S = selective
prefetching).
- Performance of multithreading with 1, 2 and 4 contexts and
switch latencies of 4 and 16 cycles.
- Effect of combining multithreading with prefetching
(multithreading schemes have a 4-cycle switch latency).