Primary Thesis Contributions:
- To cope with instruction cache miss latency, I have proposed a
novel prefetching technique called cooperative instruction
prefetching, whereby the hardware and the compiler cooperate to
prefetch far enough in advance without polluting the
cache. Experimental results demonstrate that cooperative prefetching
hides 50% or more of the miss latency that remains under
state-of-the-art instruction prefetching schemes on a modern processor.
- I have designed three compiler-based prefetching schemes for
pointer-based applications; to date, they are the only schemes
of their kind in the literature. I have also implemented
the most widely applicable scheme in a research compiler.
Experimental results demonstrate that automatic compiler-inserted
prefetching significantly improves the execution speed of
pointer-based codes in both uniprocessors and large-scale shared-memory
multiprocessors. In addition, the more sophisticated schemes can
improve performance further by as much as twofold.
- To reduce the runtime overheads of software-based latency tolerance
techniques (e.g., software prefetching and load speculation), I have
proposed a new profiling technique called correlation profiling,
which helps predict which dynamic instances of a static memory reference
will hit or miss in the cache, so that a latency tolerance technique
can be applied only to the dynamic instances that miss. Experimental
results show that roughly half of the 22 non-numeric applications
studied can potentially enjoy significant reductions in memory stall
time by exploiting correlation profiling.
- I have proposed an architectural mechanism called memory forwarding
that guarantees the safety of data relocation in C programs.
With memory forwarding support, software can perform arbitrary dynamic
data-layout optimizations without compromising program correctness.
I have also addressed the key issues in implementing memory forwarding
in modern processors. Experimental results show that the aggressive locality
optimizations enabled by memory forwarding greatly reduce both memory latencies
and memory bandwidth consumption in a set of non-numeric applications, and
hence offer substantial speedups: in some cases more than
twofold.