In this chapter we present our compiler algorithm for prefetching dense-matrix codes. These applications are a top priority since they consume large numbers of cycles on supercomputers today, they have poor caching behavior, and yet they have regular enough access patterns that prefetching has a reasonable chance of success. Therefore we start by handling these affine array accesses (i.e. where the index functions are affine expressions of the surrounding loop variables), and later build upon this core algorithm in subsequent chapters to handle other important cases such as indirect references and multiprocessor codes.
This chapter is organized in five sections. The first section discusses
some key concepts for compiler-based prefetching, including the need to
avoid unnecessary overhead. In light of these goals,
Section provides an overview of our compiler
algorithm, which includes both an analysis phase and a scheduling phase.
Details of these two phases are presented in
Sections
and
,
respectively. The analysis phase uses locality analysis to predict
which references should be prefetched, and the scheduling phase first uses
loop splitting to isolate those dynamic miss instances, and then uses
software-pipelining to schedule prefetches the proper amount of time
in advance. Finally, Section
makes our
algorithm concrete by showing the output for some example code, and also
discusses issues that arose when implementing the algorithm in the SUIF
compiler, such as whether prefetch insertion should occur before or after
scalar optimization. The success of this algorithm will be evaluated later
in Chapter
.