- ...order.
- Hardware to support a coherency mechanism is also needed for multiprocessors.
- ...order.
- Lockup-free caches permit both buffering and pipelining.
For buffering alone, write buffers are sufficient.
- ...order.
- Explicit synchronization must be identified for
multiprocessors.
- ...order.
- The replicated thread state may include replicated
register files.
- ...subsystem,
- If the primary cache is checked before prefetches proceed further through the memory hierarchy (we will discuss this in more detail later in Section ), then the additional memory subsystem loading caused by unnecessary prefetches will be limited to increased primary cache tag contention.
- ...line.
- ``Memory line'' refers to a cache-line-sized block of contiguous memory; memory lines are the unique elements that can be mapped into cache entries.
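A minimal C sketch of this definition, with an assumed line size (the actual line size of the simulated machines is not implied here):

```c
#include <stdint.h>

#define LINE_SIZE 32   /* assumed, illustrative line size (a power of two) */

/* The memory line containing an address: align the address down to a
 * line-sized boundary.  Every byte in [line, line + LINE_SIZE) belongs
 * to the same memory line. */
static uintptr_t memory_line(uintptr_t addr) {
    return addr & ~(uintptr_t)(LINE_SIZE - 1);
}
```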
- ...accessed.
- If the data are stored in column-major order, then accesses to the same cache line can only occur when the same column is accessed.
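To make the pattern concrete, here is a small C sketch that emulates column-major storage with explicit indexing (the array and function names are ours, purely for illustration):

```c
#define N 256

/* Column-major emulation in C: element (i, j) lives at a[j*N + i], so
 * consecutive row indices i within a single column j are contiguous. */
static double a[N * N];

/* Walking down one column touches consecutive addresses, so each
 * cache line is reused by several accesses. */
static double sum_column(int j) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += a[j * N + i];
    return sum;
}

/* Walking across one row strides by N doubles, so each access
 * typically touches a different cache line. */
static double sum_row(int i) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        sum += a[j * N + i];
    return sum;
}
```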
- ...size.
- In reality, the alignment of the two references with respect to cache line boundaries may also be a concern. In the worst case, adjacent references may always straddle cache line boundaries and thus never reuse the same cache lines. In our implementation, we account for this probabilistically by assuming that two references within half a cache line width of each other fall within the same cache line.
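A minimal sketch of this heuristic, with an assumed line size and function name:

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 32   /* assumed line size in bytes */

/* The probabilistic assumption described above: two references whose
 * addresses differ by less than half a cache line are treated as
 * falling within the same line. */
static bool probably_same_line(uintptr_t a, uintptr_t b) {
    uintptr_t diff = (a > b) ? (a - b) : (b - a);
    return diff < LINE_SIZE / 2;
}
```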
- ...boundary.
- In
the general case, 0 would be replaced by the lower bound of the loop, and a
constant offset would be added to the modulo prefetch predicate for spatial
locality in Table .
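A hedged sketch of the generalized predicate; the parameter names are ours, with iters_per_line standing for the number of loop iterations that fall within one cache line:

```c
/* Generalized spatial-locality prefetch predicate: prefetch on the
 * iterations where (i - lower) is congruent to a constant offset
 * modulo the number of iterations per cache line.  With lower == 0
 * and offset == 0 this reduces to the simple predicate
 * i mod iters_per_line == 0. */
static int should_prefetch(int i, int lower, int iters_per_line, int offset) {
    return (i - lower) % iters_per_line == offset;
}
```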
- ...again.
- Since scalar optimization is the
most time-consuming part of compilation, we assume that running full scalar
optimization twice is not a viable option.
- ...miss.
- Multiple store misses cannot be outstanding because the processor
stalls on a store miss.
- ...R0.
- R0 is
a reserved register that returns the value 0 when used as a source operand.
A load to R0 is essentially a NOP, and therefore this code
sequence is not normally generated by the compiler.
- ...experiments
- We chose explicitly-parallelized
applications because we wanted to use highly-optimized, ``real''
applications for our study, and therefore did not want to be constrained by
the limitations of today's automatic parallelizing compiler technology. The
explicitly-parallelized cases are also interesting because the compiler has
the least amount of information to work with, which makes getting good
performance more challenging for the prefetching algorithm.
- ...unavoidable.
- One technique that may help in this case is a ``producer prefetch'', as implemented in the DASH prototype. With a producer prefetch, the processor that generates the column would send it out to the processors waiting for it. This technique has better hot-spotting performance than the ``consumer prefetch'' we normally use, since only a single message is sent out from the hotspot, rather than large numbers of request messages backing up at the hotspot before the responses are sent out.
- ...inserted.
- We observed this in practice. At first, our scalar optimizer did not take advantage of the constant offset difference, and in many cases the performance impact was devastating once register spilling occurred.
- ...dropped.
- If TLB refills (i.e., address translation)
are handled by hardware, then the prefetch can potentially hide the latency
of both the TLB refill and the memory access. On the other hand, if TLB
refills are handled through software exceptions (as in the MIPS
architecture), then only the memory latency can be hidden.
- ...data.
- The two cases that are not shown are TOMCATV and OCEAN. These two applications suffered more from memory latency when the caches were not checked than any of the other applications, so much so that the simulations never ran to completion.
- ...data.
- Note that this is a conservative approximation because a prefetch may have been displaced by another prefetch, in which case the displaced prefetch has no benefit, but it also does not increase the number of misses by displacing data that would not otherwise have missed.
- ...cache.
- In an architecture that prefetches only into the
secondary cache, prefetches would have a cost and no benefit in cases
where the data is found in the secondary cache.
- ....
- Note that there is even more to
providing memory subsystem bandwidth than having a lockup-free cache. In
addition, main memory must have sufficient bandwidth to service these multiple
outstanding misses.
- ...stores
- Stores stall the processor
since the R4000 (the processor on which we base our uniprocessor
architecture) has a write-back cache and no write buffer.
- ...prefetch.
- If the prefetch issue buffer is already full, then the processor must either stall until an entry becomes available or else drop the prefetch. The tradeoff between these two choices was described earlier in Section , and for this study we will stall until an entry becomes available.
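A small C model of the two policies, under assumed names (the buffer depth and helper functions are illustrative, not taken from the actual design):

```c
#include <stdbool.h>

typedef struct {
    int entries;    /* prefetches currently buffered    */
    int capacity;   /* buffer depth (assumed parameter) */
} issue_buffer;

/* Hypothetical simulator hook: advance time until the memory system
 * retires one outstanding entry, freeing a buffer slot. */
static void wait_for_retire(issue_buffer *buf) {
    if (buf->entries > 0)
        buf->entries--;
}

/* Returns true if the prefetch was buffered.  With stall_policy set
 * (the choice used in this study), a full buffer stalls the processor
 * until a slot frees up; otherwise the prefetch is simply dropped. */
static bool issue_prefetch(issue_buffer *buf, bool stall_policy) {
    if (buf->entries == buf->capacity) {
        if (!stall_policy)
            return false;       /* drop policy: discard the prefetch */
        wait_for_retire(buf);   /* stall policy: wait for a free slot */
    }
    buf->entries++;             /* enqueue the prefetch */
    return true;
}
```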
- ...prefetch.
- The one
possible exception is when a read-exclusive prefetch follows a read-shared
prefetch of the same line. In this case, if the read-shared prefetch has not
been issued to the memory system yet, it should be converted to a
read-exclusive prefetch. Otherwise, the options are to either drop the
read-exclusive prefetch or else issue it separately. We chose the latter for
our simulations, but it made little difference because our selective
prefetching algorithm is good at avoiding such redundancies.
- ...architecture).
- The reason for this difference is that
our uniprocessor architecture is based on the MIPS R4000 processor, while
the multiprocessor architecture is patterned after DASH, which contains
R3000 processors.
- ...blocks.
- An example of this is the pixie utility provided by
MIPS Computer Systems, Inc. Pixified code generally runs roughly half
as fast as the original code.
- ...effectively.
- Scheduling prefetches the right amount of
time in advance becomes much more difficult when software pipelining cannot
be used (e.g., when traversing a linked list). In such cases, dependencies
may make it quite difficult to move prefetches sufficiently far in advance
to hide the latency.
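A C sketch of the dependence problem; the list type is ours, and __builtin_prefetch is a GCC/Clang intrinsic standing in for a prefetch instruction:

```c
#include <stddef.h>

typedef struct node {
    struct node *next;
    int payload;
} node;

/* The address of node k+1 is known only after node k has been loaded,
 * so a prefetch can run at most one node ahead of the traversal,
 * which is usually far too little lookahead to hide a full miss. */
static long walk(const node *p) {
    long sum = 0;
    while (p != NULL) {
        if (p->next != NULL)
            __builtin_prefetch(p->next);   /* one node of lookahead */
        sum += p->payload;
        p = p->next;
    }
    return sum;
}
```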
- ...cache.
- The exception would be if
the stride detection hardware only looked at references that suffered cache
misses.
- ...simultaneously.
- The RC model would even allow multiple read misses
to occur simultaneously, but this does not occur in our architecture since
it does not support non-blocking loads.
- ...cycles.
- We show OCEAN with only two contexts because the problem size
we use cannot be run with 64 processes.
- ...instructions,
- The instructions category includes both ``useful'' instructions and instructions wasted while spinning on idle work queues. In our previous experiments, these latter instructions were included in the synchronization category rather than the instructions category.