For this study, we have chosen an architecture that resembles the DASH multiprocessor [54], a large-scale cache-coherent machine that has been built at Stanford. Figure shows the high-level organization of the simulated architecture. The architecture consists of several processing nodes connected through a low-latency scalable interconnection network. Physical memory is distributed among the nodes. Cache coherence is maintained using an invalidating, distributed directory-based protocol. For each memory block, the directory keeps track of remote nodes caching it. When a write occurs, point-to-point messages are sent to invalidate remote copies of the block. Acknowledgment messages are used to inform the originating processing node when an invalidation has been completed.
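The invalidation step described above can be illustrated with a minimal sketch. It is not the DASH protocol itself: the `Directory` class, the message tuples, and the method names are hypothetical, and the model ignores protocol states, network delays, and races. It only shows the directory recording sharers and, on a write, generating one point-to-point invalidation and one acknowledgment per remote copy.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Hypothetical directory: maps a block address to the set of nodes caching it."""
    sharers: dict = field(default_factory=dict)

    def read(self, block, node):
        # Record that `node` now holds a copy of `block`.
        self.sharers.setdefault(block, set()).add(node)

    def write(self, block, writer):
        # One point-to-point invalidation per copy held by a node other than the writer.
        targets = sorted(self.sharers.get(block, set()) - {writer})
        invalidations = [("INVALIDATE", block, t) for t in targets]
        # Each invalidated node acknowledges back to the originating (writing) node.
        acks = [("ACK", block, t, writer) for t in targets]
        # After the write, the writer holds the only valid copy.
        self.sharers[block] = {writer}
        return invalidations, acks


if __name__ == "__main__":
    d = Directory()
    for node in (1, 2, 3):
        d.read(0x40, node)
    print(d.write(0x40, 2))
```

In this sketch the writer learns that all invalidations have completed once it has received as many acknowledgments as invalidations were sent.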
We use the actual parameters from the DASH prototype wherever possible, but have removed some of the limitations that were imposed on the prototype by design effort constraints. Figure also shows the organization of the processor environment we assume for this study. Each node in the system contains a 33 MHz MIPS R3000/R3010 processor connected to a 64 Kbyte write-through primary data cache. The write-through cache enables processors to do single-cycle write operations. The first-level data cache interfaces to a 256 Kbyte second-level write-back cache. The interface includes read and write buffers. The write buffer is 16 entries deep. Reads can bypass writes in the write buffer if the memory consistency model allows this. Both the first- and second-level caches are lockup-free [45], direct-mapped, and use 16-byte lines. The node bus has a bandwidth of 133 Mbytes/sec, and the peak network bandwidth is approximately 120 Mbytes/sec into and 120 Mbytes/sec out of each node.
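As a small illustration of what the direct-mapped organization with 16-byte lines implies, the following sketch splits an address into tag, index, and offset for the two cache sizes given above. The function name `split_address` and the treatment of addresses as plain byte addresses are assumptions made for the example; the text does not specify how the simulator indexes its caches.

```python
LINE_BYTES = 16            # line size for both cache levels (from the text)
L1_BYTES = 64 * 1024       # primary data cache size (from the text)
L2_BYTES = 256 * 1024      # second-level cache size (from the text)


def split_address(addr, cache_bytes, line_bytes=LINE_BYTES):
    """Return (tag, index, offset) for a direct-mapped cache (illustrative helper)."""
    num_lines = cache_bytes // line_bytes       # 4096 lines for L1, 16384 for L2
    offset = addr % line_bytes                  # byte within the line
    index = (addr // line_bytes) % num_lines    # the single line the address maps to
    tag = addr // (line_bytes * num_lines)      # distinguishes addresses sharing an index
    return tag, index, offset


if __name__ == "__main__":
    print(split_address(0x12345, L1_BYTES))
    print(split_address(0x12345, L2_BYTES))
```

Under these assumptions, two addresses exactly 64 Kbytes apart map to the same line in the primary cache but to different lines in the second-level cache.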
The latency of a memory access in the simulated architecture depends on where in the memory hierarchy the access is serviced. Table shows the latency for servicing an access at different levels of the hierarchy in the absence of contention (the simulations done in this study do model contention, however). The following naming convention is used for describing the memory hierarchy. The local node is the node that contains the processor originating a given request, while the home node is the node that contains the main memory and directory for the given physical memory address. A remote node is any other node. The latency shown for writes is the time for retiring the request from the write buffer, that is, the time for acquiring exclusive ownership of the line. Since the release consistency model is used [27], this time does not necessarily include the time for receiving acknowledgment messages from invalidations.
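The naming convention above can be stated compactly in code. The sketch below is illustrative only: `node_role` is a hypothetical helper, and node identifiers are assumed to be small integers. When the requesting processor's own node is also the home of the address, the function reports it as local, since the local definition is checked first.

```python
def node_role(node, requester, home):
    """Classify `node` relative to one request, following the text's naming convention."""
    if node == requester:
        return "local"    # contains the processor that originated the request
    if node == home:
        return "home"     # holds main memory and the directory for the address
    return "remote"       # any other node


if __name__ == "__main__":
    # Request issued by node 0 for an address whose home is node 2.
    print([node_role(n, requester=0, home=2) for n in range(4)])
```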