Primary Thesis Contributions:
- To cope with instruction cache miss latency, I have proposed a
novel prefetching technique called cooperative instruction
prefetching, whereby the hardware and the compiler cooperate to
prefetch far enough in advance without polluting the
cache. Experimental results demonstrate that cooperative prefetching
hides 50% or more of the miss latency that remains under
state-of-the-art instruction prefetching schemes on a modern processor.
- I have designed three compiler-based prefetching schemes for
pointer-based applications; to date, they are the only schemes
of their kind in the literature. I have also implemented
the most widely applicable scheme in a research compiler.
Experimental results demonstrate that automatic compiler-inserted
prefetching significantly improves the execution speed of
pointer-based codes in both uniprocessors and large-scale shared-memory
multiprocessors. In addition, the more sophisticated schemes can
improve performance further by as much as twofold.
- To reduce the runtime overheads of software-based latency tolerance
techniques (e.g., software prefetching and load speculation), I have
proposed a new profiling technique called correlation profiling,
which helps predict which dynamic instances of a static memory reference
will hit or miss in the cache, so that a latency tolerance technique
can be applied only to the dynamic instances that miss. Experimental
results show that roughly half of the 22 non-numeric applications
studied can potentially enjoy significant reductions in memory stall
time by exploiting correlation profiling.
- I have proposed an architectural mechanism called memory forwarding
that guarantees the safety of data relocation in C programs.
With memory forwarding support, software can perform arbitrary dynamic
data-layout optimizations without compromising program correctness.
I have also addressed the key issues in implementing memory forwarding
in modern processors. Experimental results show that the aggressive locality
optimizations enabled by memory forwarding greatly reduce both memory latencies
and memory bandwidth consumption in a set of non-numeric applications, and
hence offer substantial speedups: in some cases more than
twofold.