The MP3D application spends most of its time executing a loop in which each processor takes a particle and moves it through one time step. The overwhelming majority of cache misses are caused by references to two structures within this loop: (i) the particle being moved (36% of misses), and (ii) the space cell where the particle resides (53% of misses). Particles are statically assigned to processors and are allocated from the shared memory local to each processor, while the memory for the space cells is distributed uniformly among the processors.
We inserted prefetches into MP3D by hand as follows. Since a particle must be referenced to determine the space cell it occupies, we prefetch a particle record two iterations before its turn to be moved. In the iteration following the prefetch, the particle is read, and the associated space cell is determined and prefetched. As a result, when it is time for the particle to be moved, both the particle and space cell records are available in the cache. We also prefetch several other references that occur at time step boundaries. The end result is a coverage factor of 90% for our hand-insertion scheme. Exclusive-mode prefetches are used since the objects are modified during each iteration. Introducing these prefetches required adding 16 lines to the source code.
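The two-stage pipelining described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the actual MP3D source: the `Particle` and `Cell` layouts, the grid dimensions, `cell_index`, `move_particles`, and the `PREFETCH_EX` macro are all hypothetical names, and the GCC/Clang builtin `__builtin_prefetch` with its write hint merely stands in for an exclusive-mode prefetch.

```c
#include <stddef.h>

/* Hypothetical, simplified records; the real MP3D structures differ. */
typedef struct { double x, y, z; double vx, vy, vz; } Particle;
typedef struct { int count; } Cell;

/* Hypothetical grid dimensions, chosen only for illustration. */
#define NX 4
#define NY 4
#define NZ 4

/* Stand-in for an exclusive-mode prefetch: on GCC/Clang the second
 * argument (1) hints that the line will be written. Elsewhere it is a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_EX(addr) __builtin_prefetch((addr), 1, 3)
#else
#define PREFETCH_EX(addr) ((void)(addr))
#endif

/* Clamp a coordinate to a valid index along one axis. */
static size_t clamp_axis(double v, size_t n) {
    if (!(v >= 0.0)) return 0;          /* negative or NaN */
    if (v >= (double)n) return n - 1;
    return (size_t)v;
}

/* The cell address depends on the particle's x, y, z fields,
 * which is what makes the cell reference indirect. */
static size_t cell_index(const Particle *p) {
    return (clamp_axis(p->x, NX) * NY + clamp_axis(p->y, NY)) * NZ
           + clamp_axis(p->z, NZ);
}

void move_particles(Particle *p, size_t n, Cell *cells, double dt) {
    /* Prologue: start the pipeline for the first iterations. */
    if (n > 0) PREFETCH_EX(&p[0]);
    if (n > 1) PREFETCH_EX(&p[1]);
    if (n > 0) PREFETCH_EX(&cells[cell_index(&p[0])]);

    for (size_t i = 0; i < n; i++) {
        /* Stage 1: particle record, two iterations ahead. */
        if (i + 2 < n) PREFETCH_EX(&p[i + 2]);
        /* Stage 2: p[i + 1] was prefetched last iteration, so reading it
         * here to locate and prefetch its cell should hit in the cache. */
        if (i + 1 < n) PREFETCH_EX(&cells[cell_index(&p[i + 1])]);

        /* Stage 3: the actual work; both records should now be cached. */
        Cell *c = &cells[cell_index(&p[i])];
        c->count++;
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
        p[i].z += p[i].vz * dt;
    }
}
```

The bounds checks on `i + 1` and `i + 2` keep the sketch from reading particle records past the end of the array, since locating a cell requires an actual load of the particle and not just a prefetch.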
When our compiler inserted prefetches into MP3D, it recognized that the
address of a space cell is computed based on the x, y, and z fields in a particle record (which represent the coordinates of the
space cell). Since this is an indirect reference, the compiler used the
algorithm described in Section to prefetch
the particles two iterations ahead, and the space cells one iteration
ahead. The scheduler determined that a single iteration is enough to
hide the memory latency, since the loop body is rather large. The compiler
therefore reproduced the hand-inserted scheme for prefetching
particles and space cells, resulting in a coverage factor of 89%. The
compiler also prefetched a few other references at time step boundaries,
but they turned out to be insignificant.
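To make the scheduler's reasoning concrete, the sketch below assumes the prefetch distance is chosen as the memory latency divided by the length of the loop body, rounded up. Both the function name `prefetch_distance` and the cycle counts in the usage note are hypothetical and are not measurements from MP3D.

```c
/* Number of iterations to prefetch ahead, assuming
 * distance = ceil(latency_cycles / body_cycles), with a minimum of 1.
 * Returns 0 if body_cycles is 0, since the ratio is then undefined. */
unsigned prefetch_distance(unsigned latency_cycles, unsigned body_cycles) {
    if (body_cycles == 0) return 0;
    unsigned d = (latency_cycles + body_cycles - 1) / body_cycles;
    return d > 0 ? d : 1;
}
```

Under this assumption, a loop body longer than the latency (say, 250 cycles against a 100-cycle miss) gives a distance of one iteration, whereas a 30-cycle body would need four. With a distance of one, the indirect reference is prefetched one iteration ahead and the particle record that supplies its address one iteration earlier still, which matches the two-ahead and one-ahead schedule described above.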
Figure (a) shows the performance of both the
compiler-based and the hand-inserted prefetching schemes for MP3D. As we
see in this figure, they both do quite well. The hand-inserted case
performs slightly better simply because the scalar optimizer was able to
eliminate a few more instructions in that case, but this difference is
basically in the ``noise''. Therefore the compiler-based scheme appears
to be living up to its potential in this case.