Prefetching in LLVM Final Report

Nathan Snyder and Q Hong

Motivation

As processors have greatly increased in speed over time, improvements to memory access speeds have been less significant. This creates a situation in which the processor frequently spends long periods waiting for the memory system to provide it with data. Caches are a powerful and commonplace strategy for mitigating this inefficiency, but cache misses still constitute one of the major hindrances to processor performance. One technique for further reducing memory stalls is to have the software prefetch data that it knows it will soon need into the cache. As there is usually no good way to do this at the source code level, the identification and exploitation of these opportunities falls to the compiler. While prefetching is potentially powerful, it must be employed carefully, because prefetching data which already resides in the cache or which is not actually needed will degrade performance. Our goal in this project was to apply software prefetching techniques within the LLVM compiler.

Our Approach

Our goal was to apply prefetching in several different ways within the LLVM compiler, using the prefetch intrinsic function within the LLVM intermediate representation. We focused primarily on using prefetching in the situations where it has the greatest potential benefits; specifically, the traversal of pointer-based data structures that are scattered through memory, and iteration over large arrays. However, we did experiment with prefetching in situations that were more likely to have a small or possibly even harmful effect on program performance.

Related Work

In creating the recursive data structure prefetching algorithms, we used ideas from Chi-Keung Luk [1]. He worked with a broader definition of what constitutes a recursive data structure access and was not working within the LLVM compiler infrastructure. Mowry et al. [2] propose an algorithm for placing prefetches for the dense-matrix accesses that are common in scientific programs. In that paper, they analyze the locality of each memory reference and insert a prefetch when the reference has either temporal or spatial locality. Our algorithm is similar to their framework in that we insert prefetches for array accesses that exhibit locality. Instead of using their locality analysis framework, however, we simplify the analysis and identify common cases of reuse that are beneficial for prefetching.

Using Prefetching in LLVM

Software prefetching in the LLVM compiler is accomplished through the use of an LLVM intrinsic function. Intrinsic functions in the LLVM intermediate representation are function call instructions where the called function is some member of a set of well-known functions. During compilation to native code, however, rather than being converted into actual function calls they become special instructions or sequences of instructions on the native architecture. Intrinsic functions were designed as a way to extend the LLVM intermediate representation without needing to update every single pass that inspects the instruction types, since a use of an intrinsic function would appear identical to any other function call.
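
For concreteness, the following is a minimal sketch (not our actual pass code) of how a call to the prefetch intrinsic can be created with LLVM's C++ API. The exact header locations, helper functions, and the intrinsic's argument list vary somewhat between LLVM versions, so the details here are illustrative only.

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Insert a prefetch of Addr immediately before InsertPt.  Sketch only: the
// intrinsic arguments and helper calls differ somewhat across LLVM versions.
static void insertPrefetch(Value *Addr, Instruction *InsertPt) {
  IRBuilder<> Builder(InsertPt);
  Module *M = InsertPt->getModule();
  Function *PrefetchFn =
      Intrinsic::getDeclaration(M, Intrinsic::prefetch, Addr->getType());
  Builder.CreateCall(PrefetchFn,
                     {Addr,
                      Builder.getInt32(0),    // rw: 0 = prefetch for a read
                      Builder.getInt32(3),    // locality: 3 = high temporal locality
                      Builder.getInt32(1)});  // cache type: 1 = data cache
  // (Older LLVM versions omit the final cache-type argument.)
}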

In the early planning phase of our project we expected to have one optimization pass that would apply all of our strategies to a program. However, when we began mapping our plans onto the concrete LLVM infrastructure, we found that, because of its concept of a pass that inspects every object of some type (function, loop, etc.) in a program, it was actually easier for us to break the problem up into several independent passes. This included separating recursive data structure prefetching into separate passes targeting either loop traversals or recursive function traversals. While this introduced some issues, like the possibility of different passes redundantly prefetching the same addresses, it kept our code base simpler and more focused.

The capabilities of LLVM also impacted our optimization designs. Luk outlined three approaches to recursive data structure prefetching: greedy prefetching, history pointer prefetching, and data-linearization prefetching. Greedy prefetching was very applicable, but because data-linearization prefetching involves changing the memory allocator, there was no way for us to use it within the scope of this project. We were also unexpectedly unable to use history pointer prefetching, because LLVM did not offer a good way to modify struct types in the way that technique requires.

Prefetch Placement Algorithms

As mentioned above, we used greedy prefetching for recursive data structure traversals, and handled traversals by loops and by recursive functions separately. To identify optimization targets in recursive functions, we looked at each recursive call and checked whether it had an outgoing struct pointer argument that was derived from the corresponding incoming argument. This captures the valuable case of traversals that process a data structure node, obtain pointers to other nodes from it, and recursively process them, while excluding arguments that are likely less valuable to prefetch. Once we have identified a pointer value we would like to prefetch, we must insert that prefetch and possibly rearrange some of the instructions involved in computing its address. This rearrangement is crucial, because the incoming byte code usually computes the address only one or two instructions before the recursive call, and the prefetch must be issued earlier than that to have good results. At the same time, we are constrained by the fact that the address computation is potentially unsafe, since it involves dereferencing the original struct pointer. For example, we cannot place the address computation earlier than a null check that determines whether or not the original pointer should be used. To solve this problem, we adopted the policy of placing the address computation and prefetch at the earliest point in the control flow graph at which the address computation was guaranteed to eventually happen. This location was determined by traversing the CFG backwards from the existing address computation until we encountered a node with some successor that did not lead to the address computation and recursive call. We found that this policy, while simple, usually placed prefetches in good locations (and always in safe ones).
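
The following hypothetical example (written at the source level for readability; the pass itself operates on LLVM byte code) illustrates the effect of this transformation on a simple tree traversal. The child pointers are computed and prefetched just after the null check, the earliest point at which dereferencing the node is known to be safe, rather than immediately before each recursive call.

struct Node { int key; Node *left, *right; };

long sum(Node *n) {
    if (n == nullptr)
        return 0;
    __builtin_prefetch(n->left);   // inserted (conceptually) by the pass:
    __builtin_prefetch(n->right);  // fetch the children while this node is processed
    long s = n->key;
    s += sum(n->left);
    s += sum(n->right);
    return s;
}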

Using RDS prefetching in loop traversals was similar. In identification, the key difference was that we were now looking for phi instructions where one of the incoming values was eventually derived from the result of the phi instruction. This captures the behavior of a “current node” pointer that is updated with each loop iteration. The placement of the prefetch and address computation was also done similarly, with the small added constraint of keeping the prefetch within the loop body.
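
A hypothetical source-level illustration of the loop case follows. In the LLVM byte code, the cur pointer is a phi instruction whose loop-carried incoming value (cur->next) is derived from the phi itself, which is exactly the pattern the pass looks for; the prefetch of the next node is kept inside the loop body.

struct ListNode { int val; ListNode *next; };

long sumList(ListNode *head) {
    long s = 0;
    for (ListNode *cur = head; cur != nullptr; cur = cur->next) {
        __builtin_prefetch(cur->next);  // inserted (conceptually) by the pass
        s += cur->val;                  // work on the current node
    }
    return s;
}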

Another prefetching target we examined was prefetching chunks of arrays that are being iterated over. For this, we insert a prefetch for each instruction that accesses an array inside a loop. Prefetching can benefit such accesses because an array access in a loop follows a regular, predictable memory access pattern, assuming that the array is a contiguous block of memory. When we find array access instructions, we test whether each array index of the instruction is either loop-invariant or an affine function of the induction variables. Each array access instruction is tested with respect to its innermost loop. For the qualifying instructions, we insert prefetches in two ways.

1. Basic prefetching on array accesses: For each array access instruction, we insert a prefetch for the element that will be referenced d loop iterations later. For instance, when we reference A[i] in the innermost loop with induction variable i, we prefetch A[i+d], as shown in Figure 1.

for (i = 0; i < N; i++) {
    use A[i]
}

(a) Before prefetch insertion

for (i = 0; i < N; i++) {
    prefetch(A[i+d])
    use A[i]
}

(b) After prefetch insertion

Figure 1. Basic prefetching algorithm.

The distance d should be determined based on the latency of a memory load and the size of the cache. When d is too small, the prefetched data cannot be loaded into the cache before the actual array access occurs. On the other hand, when d is too big, the prefetched data may not remain in the cache, but instead be replaced by other memory references before it is used.
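
As a rough illustration of this trade-off, in the spirit of Mowry et al. [2], one can pick d so that the prefetch covers the expected load latency. The cycle counts below are assumed values for the example, not measurements from our experiments.

const int kLoadLatencyCycles  = 200;  // assumed latency of a main-memory load
const int kCyclesPerIteration = 25;   // assumed cost of one loop iteration
const int d = (kLoadLatencyCycles + kCyclesPerIteration - 1)
              / kCyclesPerIteration;  // ceil(200 / 25) = 8 iterations ahead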

To reduce the number of prefetches we insert, we consider prefetch placement for all memory instructions of a loop at the same time and remove redundancy. Redundancy is evaluated based on group reuse information: two array access instructions share group reuse when an array location referenced by one instruction may also be referenced by the other. When two array access instructions share group reuse, we insert only one prefetch for the pair. Analyzing group reuse in this way reduces the number of redundant prefetch insertions.
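
A hypothetical example of group reuse, written at the source level: A[i] and A[i-1] touch the same locations on consecutive iterations, so a single prefetch per iteration covers both accesses.

void diff(const int *A, int *B, int N, int d) {
    for (int i = 1; i < N; i++) {
        __builtin_prefetch(&A[i + d]);  // one prefetch covers both accesses below
        B[i] = A[i] - A[i - 1];         // A[i] and A[i-1] share group reuse
    }
}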

2. Advanced prefetching using spatial locality: We further reduce the number of redundant prefetches by analyzing spatial reuse. When an array access has spatial reuse, neighboring memory locations will be referenced in the near future, which makes it redundant to prefetch on every iteration. We test for spatial locality with a simple check: for each array access, if the last index is a function of the innermost loop's induction variable and all of the remaining indices are loop-invariant, we consider the access to have spatial locality. When we find spatial locality in an array access, we unroll the innermost loop by the cache line size and then insert the prefetch, so that only one prefetch is issued per cache line.
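
The sketch below shows, at the source level, the effect we aim for with this transformation (the pass itself unrolls the LLVM loop). The unroll factor is 4 here only for brevity, and N is assumed to be divisible by it; with an unroll factor matching the number of elements per cache line, one prefetch is issued per line rather than per element.

void scale(float *A, int N, int d) {
    for (int i = 0; i < N; i += 4) {
        __builtin_prefetch(&A[i + d]);  // one prefetch for the whole unrolled group
        A[i]     *= 2.0f;
        A[i + 1] *= 2.0f;
        A[i + 2] *= 2.0f;
        A[i + 3] *= 2.0f;
    }
}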

 

One additional application of prefetching that we implemented was less focused on specific high-value opportunities. This pass considered every load from memory in a program and created a prefetch for it if the address being loaded from was computed far enough away from the load site that a prefetch was likely to have a significant effect on latency. Rearranging the address computation code was not necessary here, as it had been with recursive data structure prefetching.
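
A hypothetical example of the kind of load this pass targets: the address is available well before the load that uses it, so a prefetch issued as soon as the address is known can hide part of the latency without any code motion.

int lookup(const int *table, int idx, int work) {
    const int *p = &table[idx];     // the address is known here
    __builtin_prefetch(p);          // inserted (conceptually) by the pass
    int acc = 0;
    for (int i = 0; i < work; i++)  // unrelated work separates the address
        acc += i;                   //   computation from the load
    return acc + *p;                // the load that benefits
}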

Testing Strategy

To test the results of our optimizations, we used both qualitative and quantitative methods. During development, we tested our code by inspecting the LLVM byte code files to see where prefetches were being inserted and comparing these to the locations a human would choose. This let us check that our optimizations were behaving correctly and selecting good positions for prefetches, which (after debugging, of course) was found to be true. The next step was to take quantitative measurements of program performance with and without our optimizations to assess the real impact those prefetches were having. To do this, we compiled our optimized LLVM byte code to x86-64 assembly and then analyzed its performance using PTLsim, a cycle-accurate x86 microprocessor simulator. We installed PTLsim on the 64-bit CMU Linux cluster machines, because the versions we tried on our personal machines behaved incorrectly. We then had llc, LLVM's byte code to native code compiler, produce x86-64 assembly from our optimized programs. These were linked and simulated on the cluster machines using the default settings for PTLsim, and we then inspected statistics on data cache performance and total cycles.

Results

We wrote several tests which contained targets for our optimizations and compared their performance with and without prefetching. While our set of tests was somewhat small, this is because we saw little point in testing optimizations like recursive data structure prefetching on code that did not use recursive data structures.

For recursive traversal prefetching on linked lists, we saw significant improvements due to the addition of prefetches, with the total time going from 8,795,963 cycles to 5,231,629 cycles. When examining the cache statistics to find the underlying cause of the improvement, we found that it was due to the almost complete elimination of loads needing to be replayed because the data was not ready, which constituted 27% of the memory accesses in the unoptimized program. We suspect this may be a result of the hardware trying to prefetch the data but not having enough foresight to get it early enough. We suspect that even greater benefits from prefetching are theoretically possible in this case, because we discovered that the x86-64 prefetch instruction is only able to pull data into the L2 cache rather than all the way into the L1 cache. When testing traversal prefetching for linked lists that used traversal loops rather than recursive calls, we got similar data: total time went from 5,301,792 cycles to 3,046,979 cycles for the same underlying reasons. In testing trees we still saw some improvement in cache behavior, but it was less significant than expected, with some replays persisting. This may be because the left subtree is accessed too quickly after the prefetch is launched (even though our code placed both the left and right prefetches at the beginning of the function, as would be expected), and perhaps the right subtree was forced out of the cache by the time the traversal returned to it.

  For array iteration prefetching we were surprised to find that the performance characteristics for both the unoptimized code and optimized code were almost identical. We suspect that because of the relative predictability of the memory accesses in these situations the hardware was able to prefetch the array data just as well as we were, preventing gains, while at the same time our prefetches fit into the instruction pipeline in a way that didn’t cause noticeable overhead.

We also didn’t see noticeable differences in performance for the generic access prefetching. It is likely that, because these prefetches are executed less frequently and are of lower expected value, we would need to be significantly more sophisticated in our identification of targets, and especially in our scheduling of address computations and prefetches, to see benefits.

Surprises and Lessons Learned

There were a few things that surprised us as we worked on this project. The first was the difficulty of finding decent documentation on the LLVM intrinsics, and on prefetching in particular. Such documentation was largely non-existent, and what we did find was scarce on details about what actually needed to be done to use intrinsics in code. We eventually pieced together the process through experimentation and (unanswered) forum posts by people with similar problems, but it made the beginning of the project go more slowly than expected. However, once we had learned the details of the LLVM system, we were surprised again by what it was capable of. The support for easily manipulating programs and the large library of analyses proved very valuable and made our final solution simpler than it would likely have been in another compiler framework.

The major lesson we learned about executing such projects is to set up and validate the tools needed for all phases of the project at the beginning. We planned on doing our performance testing near the end of the project and so didn’t work with our testing tools beyond identifying which ones would be useful. When it came time to gather data, we were beset by a wave of problems with issues like architecture incompatibilities and missing libraries on the CMU cluster machines that were troublesome to work around.

Conclusions and Future Work

The conclusions that we can draw from this project are that prefetching has the power to greatly improve program performance, but only for a fairly narrow set of applications. It is unnecessary to use software prefetching in situations that today’s sophisticated hardware is able to handle on its own. We also saw evidence of the importance of scheduling prefetches far enough ahead of time to get their full effect; it would have been interesting to implement a more sophisticated, farther-looking prefetching algorithm, such as history pointers, within LLVM. Based on our experience implementing these algorithms, we would conclude that the primary challenge in using prefetching is the scheduling of the prefetches and, very importantly, the adjustment of surrounding code to allow better prefetch placement without breaking the program.

There are several possibilities for future work related to this project. One could apply prefetching to situations that weren’t considered in this project. Further development could also be done in finding optimal placements for prefetch instructions, using more complicated analyses than were utilized here.

References

1. C.-K. Luk and T. C. Mowry. “Compiler-Based Prefetching for Recursive Data Structures.” In Proceedings of ASPLOS-VII, Oct. 1996, pp. 222-233.

2. T. C. Mowry, M. S. Lam, and A. Gupta. “Design and Evaluation of a Compiler Algorithm for Prefetching.” In Proceedings of ASPLOS-V, Oct. 1992, pp. 62-73.