To illustrate the need for improving the cache performance of microprocessor-based uniprocessor and multiprocessor systems, we present results in this subsection for a set of scientific and engineering applications. We begin with the uniprocessor architecture. For the sake of concreteness, we pattern our memory subsystem after a typical MIPS R4000-based workstation. The architecture consists of a single-issue processor running at a 100 MHz internal clock. The processor has an on-chip primary data cache of 8 Kbytes and a secondary cache of 256 Kbytes. Both caches are direct-mapped and use 32-byte lines. The penalty of a primary cache miss that hits in the secondary cache is 12 cycles, and the total penalty of a miss that goes all the way to main memory is 75 cycles. In this simple model, we assume that all instructions execute in a single cycle and that all instructions hit in the primary instruction cache. The performance of the benchmarks was simulated by instrumenting the MIPS object code using pixie [74] and piping the resulting trace into our detailed cache simulator.
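To make the methodology concrete, the following minimal sketch shows how such a trace-driven, two-level direct-mapped cache simulator can be structured, using the parameters above (8-Kbyte and 256-Kbyte direct-mapped caches, 32-byte lines, and 12- and 75-cycle miss penalties). This is an illustration only, not the simulator used in this study; in particular, the trace format (one hexadecimal data address per line on standard input) is a hypothetical stand-in for the pixie-generated trace.

/* Sketch of a trace-driven two-level direct-mapped cache simulator.
 * Parameters follow the text: 8 KB / 256 KB caches, 32-byte lines,
 * 12-cycle L1-miss penalty, 75-cycle total penalty to main memory. */
#include <stdio.h>
#include <string.h>

#define LINE_SIZE   32
#define L1_LINES    (8 * 1024 / LINE_SIZE)     /* 256 lines  */
#define L2_LINES    (256 * 1024 / LINE_SIZE)   /* 8192 lines */
#define L1_PENALTY  12                         /* miss in L1, hit in L2 */
#define L2_PENALTY  75                         /* total, to main memory */

int main(void)
{
    /* Each entry holds the full line address currently resident;
     * direct mapping means line address modulo cache size picks the slot. */
    static unsigned long l1[L1_LINES], l2[L2_LINES];
    unsigned long addr, refs = 0, stall = 0;
    memset(l1, 0xff, sizeof l1);               /* start with invalid tags */
    memset(l2, 0xff, sizeof l2);

    while (scanf("%lx", &addr) == 1) {         /* one data address per line */
        unsigned long line = addr / LINE_SIZE;
        refs++;
        if (l1[line % L1_LINES] == line)
            continue;                          /* primary-cache hit: no stall */
        l1[line % L1_LINES] = line;
        if (l2[line % L2_LINES] == line) {
            stall += L1_PENALTY;               /* satisfied by secondary cache */
        } else {
            l2[line % L2_LINES] = line;
            stall += L2_PENALTY;               /* goes all the way to memory */
        }
    }
    printf("%lu refs, %lu stall cycles (%.2f cycles/ref)\n",
           refs, stall, refs ? (double)stall / refs : 0.0);
    return 0;
}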
The accompanying figure breaks down the total program execution time into instruction execution and stalls due to memory accesses for 13 uniprocessor programs taken from the SPEC [77], SPLASH [72], and NAS Parallel [8] benchmark suites. Many of the programs spend a significant fraction of their time on memory accesses; in fact, 8 of the 13 programs spend more than half of their time stalled for memory accesses.
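To see how quickly these penalties accumulate, consider a back-of-the-envelope calculation with hypothetical rates chosen only to illustrate the arithmetic (they are not measurements from these benchmarks). Suppose 30% of instructions reference data, 10% of those references miss the primary cache, and half of the primary misses also miss the secondary cache. The average memory stall per instruction is then
\[
0.30 \times 0.10 \times (0.5 \times 12 + 0.5 \times 75) \approx 1.3 \mbox{ cycles},
\]
so with single-cycle instructions the processor spends roughly $1.3/(1+1.3) \approx 57\%$ of its time stalled for memory, which is exactly the kind of behavior these results exhibit.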
We conducted a similar experiment to evaluate the impact of memory latency
on large-scale shared-memory multiprocessors by simulating the entire
SPLASH [72] parallel application suite on an architecture
resembling the Stanford DASH multiprocessor [54]. The
architecture we model includes 16 R3000 processors running at 33 MHz, two-level cache hierarchies (64-Kbyte primary and 256-Kbyte secondary caches), and a low-latency interconnection network. Load miss latencies range from 15 cycles for a hit in the secondary cache to over 130 cycles for ``remote dirty'' lines. Further details on this architecture and the parallel applications are presented in a later chapter. The accompanying figure shows the results. Execution time is now broken into
three categories: time spent executing instructions, time spent stalled for
memory, and time spent stalled for synchronization (such as locks and
barriers). Once again, memory stalls are significant, with 7 of the 8
applications spending more than 35% of their time stalled waiting for
memory accesses to complete.
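As a rough illustration of why remote misses dominate these stall times, the sketch below encodes the kind of load-latency classification that a DASH-like, directory-based machine implies. Only the two endpoints (15 cycles for a secondary-cache hit and over 130 cycles for a remote dirty line) come from the text; the intermediate values for local memory and remote clean lines are hypothetical placeholders, not measured DASH latencies.

#include <stdio.h>

/* Where a load that misses the primary cache is ultimately satisfied. */
enum where_found {
    LOCAL_L2,      /* hit in the processor's secondary cache      */
    LOCAL_MEMORY,  /* clean line in this cluster's local memory   */
    REMOTE_CLEAN,  /* fetched from the home cluster's memory      */
    REMOTE_DIRTY   /* dirty in another cluster's cache            */
};

static int load_latency_cycles(enum where_found w)
{
    switch (w) {
    case LOCAL_L2:     return 15;   /* from the text                   */
    case LOCAL_MEMORY: return 30;   /* hypothetical intermediate value */
    case REMOTE_CLEAN: return 100;  /* hypothetical intermediate value */
    case REMOTE_DIRTY: return 132;  /* "over 130 cycles" in the text   */
    }
    return 0;
}

int main(void)
{
    static const char *name[] = {
        "local L2", "local memory", "remote clean", "remote dirty"
    };
    for (int w = LOCAL_L2; w <= REMOTE_DIRTY; w++)
        printf("%-12s : %3d cycles\n",
               name[w], load_latency_cycles((enum where_found)w));
    return 0;
}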