For this study, we have chosen an architecture that resembles the DASH
multiprocessor [54], a large-scale cache-coherent machine built at
Stanford. The figure shows the
high-level organization of the simulated architecture. The architecture
consists of several processing nodes connected through a low-latency
scalable interconnection network. Physical memory is distributed among the
nodes. Cache coherence is maintained using an invalidation-based,
distributed directory protocol. For each memory block, the directory keeps track
of remote nodes caching it. When a write occurs, point-to-point messages
are sent to invalidate remote copies of the block. Acknowledgment messages
are used to inform the originating processing node when an invalidation has
been completed.
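To make the directory bookkeeping concrete, the following minimal sketch in C illustrates the mechanism described above; the types, function names, and 64-node limit are illustrative assumptions of the sketch, not details of the DASH protocol or its implementation. Each block's directory entry records which nodes cache it; a write sends point-to-point invalidations to the remote copies and counts the acknowledgments that report their completion.

    /*
     * Minimal sketch (assumed names; not the DASH implementation) of the
     * directory bookkeeping described above.  Each block's directory entry
     * records which nodes cache it; a write invalidates the remote copies
     * with point-to-point messages and counts the acknowledgments that
     * signal when the invalidations have completed.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_NODES 64            /* illustrative limit, not from DASH */

    typedef struct {
        uint64_t sharers;           /* bit i set => node i caches the block */
        int      pending_acks;      /* invalidation acks still outstanding  */
    } DirEntry;

    /* Hypothetical network hook: send one invalidation message. */
    static void send_invalidate(int node, uint64_t block_addr) {
        printf("invalidate block 0x%llx at node %d\n",
               (unsigned long long)block_addr, node);
    }

    /* A write by `writer` gains exclusive ownership of the block:
     * invalidate every other cached copy and count the expected acks. */
    static void handle_write(DirEntry *e, int writer, uint64_t block_addr) {
        for (int n = 0; n < MAX_NODES; n++) {
            if (n != writer && (e->sharers & (1ULL << n))) {
                send_invalidate(n, block_addr);
                e->pending_acks++;
            }
        }
        e->sharers = 1ULL << writer;    /* writer is now the sole holder */
    }

    /* Each acknowledgment decrements the outstanding count; the write has
     * been performed with respect to all nodes once it reaches zero. */
    static void handle_ack(DirEntry *e) {
        if (e->pending_acks > 0)
            e->pending_acks--;
    }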
We use the actual parameters from the DASH prototype wherever possible, but
have removed some of the limitations that were imposed on the prototype by
design-effort constraints. The figure also
shows the organization of the processor environment we assume for this
study. Each node in the system contains a 33 MHz MIPS R3000/R3010 processor
connected to a 64 Kbyte write-through primary data cache. The write-through
cache enables the processor to perform single-cycle write operations. The
first-level data cache interfaces to a 256 Kbyte second-level write-back
cache. The interface includes read and write buffers. The write buffer is
16 entries deep. Reads can bypass writes in the write buffer if the memory
consistency model allows this. Both the first and second level caches are
lockup-free [45], direct-mapped, and use 16-byte lines. The node bus
bandwidth is 133 Mbytes/sec, and the peak network
bandwidth is approximately 120 Mbytes/sec into and 120 Mbytes/sec out of
each node.
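For reference, the simulated node parameters listed above can be collected into a single configuration record; the sketch below is purely illustrative, and the structure and field names are ours rather than identifiers from the DASH prototype or the simulator.

    /*
     * Illustrative collection of the simulated node parameters quoted above;
     * the structure and field names are assumptions of this sketch.
     */
    typedef struct {
        int clock_mhz;                 /* MIPS R3000/R3010 clock           */
        int l1_size_kbytes;            /* write-through primary data cache */
        int l2_size_kbytes;            /* write-back secondary cache       */
        int line_size_bytes;           /* both cache levels                */
        int write_buffer_entries;
        int bus_bw_mbytes_per_sec;     /* node bus bandwidth               */
        int net_bw_in_mbytes_per_sec;  /* peak network bandwidth, inbound  */
        int net_bw_out_mbytes_per_sec; /* peak network bandwidth, outbound */
    } NodeConfig;

    static const NodeConfig kSimulatedNode = {
        .clock_mhz                 = 33,
        .l1_size_kbytes            = 64,
        .l2_size_kbytes            = 256,
        .line_size_bytes           = 16,
        .write_buffer_entries      = 16,
        .bus_bw_mbytes_per_sec     = 133,
        .net_bw_in_mbytes_per_sec  = 120,   /* approximately */
        .net_bw_out_mbytes_per_sec = 120,   /* approximately */
    };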
The latency of a memory access in the simulated architecture depends on
where in the memory hierarchy the access is serviced.
The table shows the latency for servicing an access at different levels of
the hierarchy, in the absence of contention (the simulations in this study
do model contention, however). The following
naming convention is used for describing the memory hierarchy. The local node is the node that contains the processor originating a given
request, while the home node is the node that contains the main
memory and directory for the given physical memory address. A remote
node is any other node. The latency shown for writes is the time to retire
the request from the write buffer, that is, the time to acquire exclusive
ownership of the line; under the release consistency model [27], this does
not necessarily include the time to receive the acknowledgment messages for
the invalidations.
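The local/home/remote naming convention can be summarized by the small sketch below; the enum and function names are illustrative assumptions, not part of the simulator. A node can be both local and home when the requested address maps to the requesting node's own memory; that case is reported as local here.

    /*
     * Sketch of the naming convention used for the latency table; the enum
     * and function names are illustrative assumptions.
     */
    typedef enum { LOCAL_NODE, HOME_NODE, REMOTE_NODE } NodeRole;

    /* Classify `this_node` for a request issued by `local_node` whose
     * physical address has its memory and directory at `home_node`. */
    static NodeRole classify_node(int this_node, int local_node, int home_node) {
        if (this_node == local_node) return LOCAL_NODE;
        if (this_node == home_node)  return HOME_NODE;
        return REMOTE_NODE;
    }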