In this section, we present experimental results evaluating the
performance of our multiprocessor prefetching algorithm. We begin with a
high-level overview of the performance improvements. We then focus in
greater detail on several key aspects of the algorithm (locality analysis,
prefetch scheduling, prefetching indirect references, and using
exclusive-mode prefetches) in the subsections that follow. We looked at
several of these same aspects of the algorithm in an earlier chapter, but
only within the context of a uniprocessor environment; many of these issues
change, or deserve a closer look, in a multiprocessor environment. We start
now with a brief performance overview.
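To make these aspects concrete before turning to the results, the loop
below is a minimal sketch of the kinds of prefetches involved. It is our
illustration rather than the compiler's actual output: the arrays a, b,
and idx, the prefetch distance PF_DIST, and the use of GCC's
__builtin_prefetch as a stand-in for the machine's prefetch instructions
are all assumptions.

```c
#define PF_DIST 8  /* how many iterations ahead to prefetch (hypothetical) */

void update(double *a, const double *b, const int *idx, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            /* Exclusive-mode prefetch: a[i] will be written, so request
             * the line with ownership (second argument 1 = write). */
            __builtin_prefetch(&a[i + PF_DIST], 1);
            /* Indirect reference: idx[i + PF_DIST] was itself prefetched
             * PF_DIST iterations earlier, so it can now be loaded and
             * used to compute the address of b to prefetch. */
            __builtin_prefetch(&b[idx[i + PF_DIST]], 0);
        }
        if (i + 2 * PF_DIST < n) {
            /* Stage one of the indirect prefetch: fetch the index array
             * itself, two prefetch distances ahead. */
            __builtin_prefetch(&idx[i + 2 * PF_DIST], 0);
        }
        a[i] += b[idx[i]];
    }
    /* A real scheduling pass would typically peel prologue and epilogue
     * iterations (software pipelining) rather than issue these
     * per-iteration bounds checks. */
}
```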
The performance of the original code (i.e., without prefetching) for
the eight multiprocessor applications is shown in the first figure.
Notice that there is a new component of
execution time: time spent stalled for synchronization. These
synchronization operations include acquiring locks, waiting
for other processors to reach barriers (i.e., poor load balancing), and
spinning on empty task queues. The prefetching compiler
algorithm was applicable to five of the eight multiprocessor applications
(OCEAN, LU, MP3D, CHOLESKY, and LOCUS). We examine those five applications
in this section, and discuss the other three applications (WATER,
PTHOR, and BARNES) in a later section. The
overall performance of the five cases that did improve is shown in the
second figure.
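As a purely illustrative picture of where this synchronization time goes,
the fragment below sketches the three stall sources just named. The
task-queue type and its fields are hypothetical; the pthread lock and
barrier calls are standard POSIX.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical task queue, for illustration only (a real one would need
 * memory barriers; this sketch only marks where stall time accrues). */
typedef struct task { struct task *next; void (*run)(struct task *); } task_t;
typedef struct { task_t *volatile head; } task_queue_t;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;   /* initialized elsewhere */

void worker(task_queue_t *q)
{
    while (q->head == NULL)         /* stall: spinning on an empty task queue */
        ;
    pthread_mutex_lock(&lock);      /* stall: waiting to acquire the lock */
    task_t *t = q->head;
    q->head = t->next;
    pthread_mutex_unlock(&lock);
    t->run(t);
    pthread_barrier_wait(&barrier); /* stall: waiting for slower processors to
                                       reach the barrier (load imbalance) */
}
```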
The speedups of these five applications range from 6% to 113%, with three
of the five speeding up by 45% or more. The memory stall times have been
reduced by 31% to 88% through a combination of lower miss rates and lower
average miss penalties, as shown in the accompanying table. In two cases
(OCEAN and LU), more than half of the synchronization latency was
eliminated. This was due to improved load balancing, since prefetching
helps to reduce the variability in task sizes (and hence the time spent
waiting for the slowest processor at barriers). However, prefetching did
not reduce the synchronization latency of CHOLESKY, where most of the
synchronization time is spent in explicit spin loops waiting for more
tasks to be produced. The same figure also shows that the overheads of
prefetching (both instruction and memory stall overheads) are relatively
small, so much of the memory stall reduction translates into actual
performance benefit. In the following subsections, we examine each aspect
of the multiprocessor prefetching algorithm in greater detail.