In LU, the matrix columns are statically assigned to the processors in an interleaved manner. The main computation done by each processor consists of reading a pivot column once it is produced, and applying the pivot column to each column to its right that the processor owns. There are three primary sources of misses in LU: (i) the pivot column when it is read for the first time (9%); (ii) the pivot column when it is replaced by a column it is applied to and needs to be refetched (12%); and (iii) the owned columns that the pivot column is applied to (75%). This last set of misses occurs because the combined size of the owned columns is larger than the size of the cache.
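The computation described above can be sketched in C as follows. This is a minimal illustration rather than the benchmark's actual source: it assumes column-major storage, reads "interleaved" as column j belonging to processor j mod P, and assumes pivot column k already holds the multipliers. The names `owner_of`, `apply_pivot`, and `update_owned_columns` are hypothetical.

```c
#include <stddef.h>

/* Assumed interleaving: column `col` is owned by processor col mod nprocs. */
size_t owner_of(size_t col, size_t nprocs) {
    return col % nprocs;
}

/* Apply pivot column k to column j of an n x n column-major matrix,
 * i.e. a(i,j) -= a(i,k) * a(k,j) for all rows i below the diagonal.
 * Element (i, j) is stored at a[j * n + i]. */
void apply_pivot(double *a, size_t n, size_t k, size_t j) {
    const double *pivot = &a[k * n];  /* read-only pivot column */
    double *col = &a[j * n];          /* owned column being updated */
    double scale = col[k];
    for (size_t i = k + 1; i < n; i++) {
        col[i] -= pivot[i] * scale;
    }
}

/* Work done by processor `me` once pivot column k is available:
 * update every owned column to the right of k.
 * Returns the number of columns updated. */
size_t update_owned_columns(double *a, size_t n, size_t k,
                            size_t me, size_t nprocs) {
    size_t updated = 0;
    for (size_t j = k + 1; j < n; j++) {
        if (owner_of(j, nprocs) == me) {
            apply_pivot(a, n, k, j);
            updated++;
        }
    }
    return updated;
}
```

In this sketch each call to `apply_pivot` streams through one pivot column and one owned column, which is where the three categories of misses listed above would arise.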
Our strategy for prefetching LU by hand was the following. Each time the pivot column is applied to an owned column, we prefetch the pivot column in shared mode and the owned column in exclusive mode. Although prefetching the pivot column each time causes redundant prefetches, it reduces the misses that occur when the pivot column has been replaced from the processor's cache, resulting in a total coverage factor of 96% (the 9%, 12%, and 75% categories above combined). To avoid hot-spotting problems, we found it better to distribute the prefetches evenly throughout the computation rather than prefetching an entire column in a single burst. Since the columns have spatial locality, we also unrolled the loop to minimize instruction overhead. A total of 8 lines were added to the source code.
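The hand-inserted scheme might look roughly like the sketch below. It is not the code used in the study. It substitutes the GCC/Clang builtin `__builtin_prefetch` for the machine's prefetch instructions, using a read hint (second argument 0) as a stand-in for a shared-mode prefetch and a write hint (1) as a stand-in for an exclusive-mode prefetch. The line size, the prefetch distance, and the name `apply_pivot_prefetched` are all assumptions made for illustration.

```c
#include <stddef.h>

#define DOUBLES_PER_LINE 4   /* assumed: 32-byte cache lines, 8-byte doubles */
#define PREFETCH_AHEAD   16  /* assumed prefetch distance in elements; tune per machine */

/* owned[i] -= pivot[i] * scale for i in [0, len), with prefetches spread
 * evenly through the loop instead of issued in one burst per column. */
void apply_pivot_prefetched(const double *pivot, double *owned,
                            double scale, size_t len) {
    size_t i = 0;

    /* Unrolled by one cache line so that only one prefetch per column
     * is issued per line, exploiting spatial locality. */
    for (; i + DOUBLES_PER_LINE <= len; i += DOUBLES_PER_LINE) {
        if (i + PREFETCH_AHEAD < len) {
            __builtin_prefetch(&pivot[i + PREFETCH_AHEAD], 0); /* read hint */
            __builtin_prefetch(&owned[i + PREFETCH_AHEAD], 1); /* write hint */
        }
        owned[i]     -= pivot[i]     * scale;
        owned[i + 1] -= pivot[i + 1] * scale;
        owned[i + 2] -= pivot[i + 2] * scale;
        owned[i + 3] -= pivot[i + 3] * scale;
    }

    /* Remainder when len is not a multiple of the line size. */
    for (; i < len; i++) {
        owned[i] -= pivot[i] * scale;
    }
}
```

The bounds check keeps the prefetch addresses inside the arrays. A tuned version would more likely peel the final iterations into a separate loop so that the steady-state loop carries no extra branch.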
Our compiler chose an identical approach for inserting prefetches: it also prefetches the pivot column and the owned columns each time they are used. Since the key inner loop is inside a separate procedure, the compiler does not recognize that the pivot column may have temporal locality and therefore might not need to be prefetched each time. However, it turns out that prefetching the pivot column every time is the appropriate strategy, so both schemes perform quite well, as we see in Figure (b). The small difference in instruction count arose because the scalar optimizer did a slightly better job of optimizing the compiler-generated code than the hand-inserted code. Both schemes eliminated roughly 90% of the memory stalls and increased overall performance by more than 85%. Therefore the compiler clearly lives up to the potential of hand-inserted prefetching for both MP3D and LU.