PTHOR is an interesting case because it is second only to LU in the amount of time lost to memory latency (59%), and yet our compiler algorithm fails to insert any prefetches. The problem (as in BARNES) is that PTHOR is an irregular computation containing mainly pointers to linked lists, which are beyond the scope of our algorithm. The approach we take to inserting prefetches by hand was described in detail in an earlier study [61], and we briefly summarize that strategy here.
One of the main data structures in PTHOR is the element record, which stores all information about the type and state of a logic element. Several fields in the record are pointers to linked lists, or are pointers to arrays that in turn point to linked lists. During the main computational loop, each processor picks up an activated logic element, computes any changes to the element's outputs, and schedules new input events for elements that are affected by the changes. Prefetching is complicated by the presence of the linked lists, since to prefetch a list it is necessary to dereference each pointer along the way.
To insert prefetches by hand, we first reorganized the element record and grouped entries based on whether they were likely to be modified, likely to be read but not modified, or likely not to be referenced. Whenever a processor picks an element from a task queue, we prefetch the element record entries accordingly. In addition, we prefetch the first several levels of the more important linked lists. Due to the complex control structure of the application, it is difficult to determine where the misses occur. Despite the aid of profiling information that helped determine which sections of code were generating misses, we were only able to increase the coverage factor to 36%. A total of 29 lines were added to the source code, resulting in a performance improvement of 23%, as shown in Figure .
Adding prefetching to PTHOR was a qualitatively different experience from adding prefetching to MP3D, LU, and even BARNES. While the process of inserting prefetches took considerably more time, the resulting coverage was much lower. Unlike BARNES, it was very difficult to isolate the misses in PTHOR, even with the help of profiling information, since they are spread across so many different references and often occur only sporadically. Even the author of PTHOR had a difficult time predicting which references should be prefetched. This is partly because PTHOR is a significantly more complicated program, both in terms of its data structures and its control flow. In many cases, although we knew from profiling information that a particular reference was suffering from cache misses, we could find no way to schedule the prefetches that would improve performance. Therefore, of the programs we have studied, PTHOR appears to be the most difficult challenge for the compiler. Given that we did achieve a speedup of 23%, it appears to be a challenge worth pursuing.