Suggestions from Prof. John Shen:
(NOTE: there are many variations on this theme of tailoring an architecture to a specific application domain.)
Even a partial answer (i.e., a project that addresses only a subset of the issues) would still be interesting.
To assess the effectiveness of compiler transformations that exploit value locality, it is necessary to run many simulations. Ideally these simulations run as fast as possible, i.e., they use the native instruction set and memory system of the host that executes the simulation. In this project, we are looking for an innovative setup to run such simulations as fast as possible.
Given a number T of threads that are to be executed, and a memory system (consisting of at least one level of caches with a block size of B and a replacement algorithm S), how can we simulate the execution of the T threads for a number of block sizes B1, B2, ..., Bn and a number of replacement strategies S1, S2, ..., Sm as fast as possible? Are there clever memory management techniques that allow us to save some of the work? Note that we need to keep track of the VALUES that are stored in the cache(s) -- that's the major difference from many other cache simulation setups, which focus only on the ADDRESSES that are generated by a program.
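As a concrete starting point, below is a minimal sketch in C of a value-tracking cache model for a single direct-mapped level and one block size; the geometry, the type and function names, and the mem array standing in for the simulated memory image are all assumptions made for illustration, not part of any existing simulator. The project question is then how much of this fill work and state can be shared when the same reference stream is replayed for every combination of block size Bi and replacement strategy Sj.

    /* Minimal sketch of a value-tracking cache model: a single
     * direct-mapped level with one block size.  Unlike address-only
     * simulators, each line keeps a copy of the data so that value
     * locality is visible.  All names and sizes are illustrative. */
    #include <stdint.h>
    #include <string.h>

    #define NUM_SETS 1024            /* assumed cache geometry            */
    #define BLOCK_B  64              /* one of the block sizes B1..Bn     */

    typedef struct {
        int      valid;
        uint64_t tag;
        uint8_t  data[BLOCK_B];      /* the VALUES, not just the address  */
    } line_t;

    typedef struct {
        line_t   sets[NUM_SETS];
        uint64_t hits, misses;
    } cache_t;

    /* Simulate one load; `mem` stands in for the simulated memory image. */
    uint8_t cache_load(cache_t *c, const uint8_t *mem, uint64_t addr)
    {
        uint64_t block = addr / BLOCK_B;
        line_t  *l     = &c->sets[block % NUM_SETS];

        if (l->valid && l->tag == block) {
            c->hits++;
        } else {                                 /* miss: fill the line */
            c->misses++;
            l->valid = 1;
            l->tag   = block;
            memcpy(l->data, mem + block * BLOCK_B, BLOCK_B);
        }
        return l->data[addr % BLOCK_B];          /* value the program sees */
    }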
Traditional 3-issue in-order superscalar:
    ALUA   in stage 0
    ALUB   in stage 0
    BRANCH in stage 0
    AGEN   in stage 0
    DCACHE in stage 1

The sequence below issues at the labeled times:
    r1 = *p;          // issue T=0, use AGEN in 0, DCACHE in 1
    r3 = r1 + r2;     // issue T=2, use ALUA in 2
    r4 = r1 - 1;      // issue T=2, use ALUB in 2
    r5 = r3 < 10;     // issue T=3, use ALUA in 3
    if (r5) goto L;   // issue T=4, use BRANCH in 4

Now consider a machine in which ALUs are cheap, so add 1 more, but put the ALUs in different stages. Multi-issue is still expensive enough to limit to 3-way.
    ALUA0   in stage 0
    BRANCH0 in stage 0
    AGEN    in stage 0
    DCACHE  in stage 1
    ALUA1   in stage 1
    BRANCH1 in stage 1
    ALUA2   in stage 2
    BRANCH2 in stage 2

The sequence below issues at the labeled times:
    r1 = *p;          // issue T=0, use AGEN in 0, DCACHE in 1
    r3 = r1 + r2;     // issue T=0, use ALUA2 in 2
    r4 = r1 - 1;      // issue T=1, use ALUA2 in 3
    r5 = r3 < 10;     // issue T=2, use ALUA1 in 3
    if (r5) goto L;   // issue T=2, use BRANCH2 in 4

The processor tends to issue instructions earlier, but performs the computation at a similar point in absolute time. So what is the advantage over the traditional approach? By issuing earlier you move on to the next instructions sooner. If you ever find an instruction that wasn't scheduled "soon enough", you get ahead of the traditional approach. Why would such an instruction exist? Because most scheduling is restricted to a single basic block.
Figure 1. Horizontal issue

        Slot
    T   0   1   2   3
    0   A   B                A and B are independent
    1   C   D   E            C, D, and E are independent
    2
    3   F
    4   G   H   I   J
    5   K

Figure 2. Vertical issue

        Slot
    T   0   1   2   3
    0   A   C   F   G        A feeds C feeds F feeds G
    1   B   D   H            B feeds D feeds H
    2   E   I   K            E feeds I feeds K
    3   J                    J depends on F

Like #1, this might be advantageous primarily for crossing basic block boundaries more gracefully. To build this, you chain your functional units together in successive stages rather than having them operating in parallel. The major stumbling block is that expensive functional units (e.g. dcache) cannot be replicated, and so must go in a fixed place. For example, you might require all load instructions to start a chain.
Identify other potential fused instruction opportunities, i.e., find candidates where a pair of dependent 3-operand instructions

    z = f(x, y)
    w = g(z, v)

can be profitably done with a single 4-operand fused instruction

    w = g(f(x, y), v)

Give statistics on these. Identify which ones can reduce latency. E.g., f=+, g=+ can be implemented with a carry-save stage in front of a normal adder, and so be done in a single cycle instead of 2 cycles.
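As a sketch of how such candidates might be counted, the fragment below scans a linear trace for adjacent producer/consumer pairs; the instruction record, the opcode set, and the adjacency-only heuristic are assumptions made for illustration (a real study would also check that the intermediate result is dead and would look beyond adjacent instructions).

    /* Sketch: count adjacent producer/consumer pairs that could become a
     * single 4-operand fused instruction.  Instruction format and opcodes
     * are assumed for illustration only. */
    #include <stdio.h>

    typedef enum { OP_ADD, OP_SUB, OP_CMP, OP_OTHER } opcode_t;

    typedef struct {
        opcode_t op;
        int dst, src1, src2;         /* register numbers */
    } insn_t;

    int count_fusable(const insn_t *trace, int n)
    {
        int candidates = 0;
        for (int i = 0; i + 1 < n; i++) {
            const insn_t *p = &trace[i], *c = &trace[i + 1];
            if (p->op == OP_OTHER || c->op == OP_OTHER)
                continue;
            if (c->src1 == p->dst || c->src2 == p->dst)
                candidates++;        /* p's result feeds c: fusion candidate */
        }
        return candidates;
    }

    int main(void)
    {
        /* r3 = r1 + r2; r5 = r3 - r4  ->  one candidate (add feeding sub) */
        insn_t trace[] = { { OP_ADD, 3, 1, 2 }, { OP_SUB, 5, 3, 4 } };
        printf("fusable pairs: %d\n", count_fusable(trace, 2));
        return 0;
    }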
Consider instead a "load when changed" instruction. Processor X executes load when changed and tries to read the containing line from memory. If the line turns out to be in processor Y's cache instead, and the location's value is not equal to what X specified, the line is transferred to X right away. If the location still contains the value X specified, the request is queued in Y, and the next write to the location initiates the line transfer from Y to X. Note that Y can transfer the line early if it wants (e.g. if it runs out of space to queue all the requests it has received).
When a third processor Z requests the line from Y and finds X's request already queued for it, Z's request is forwarded to X. Thus when multiple processors all want a lock, they form a queue in the order they requested the lock. This makes mutual exclusion regions as efficient as a single line transfer from one processor to the next to the next...
Note also that this instruction is useful for multithreaded processors, since it allows the processor to suspend a thread instead of letting it spin.
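To make the intended use concrete, here is a sketch of a spin-free lock built on the proposed instruction; load_when_changed() and compare_and_swap() are hypothetical intrinsics declared only for illustration, and their exact form is an assumption about how the instruction might be exposed to software.

    /* Sketch of a lock using the proposed instruction.  load_when_changed()
     * is a hypothetical intrinsic: it returns the location's value, but only
     * once that value differs from `old` (immediately if it already does).
     * compare_and_swap() is an ordinary atomic, also assumed here. */
    #include <stdint.h>

    extern uint32_t load_when_changed(volatile uint32_t *addr, uint32_t old);
    extern int      compare_and_swap(volatile uint32_t *addr,
                                     uint32_t expect, uint32_t desired);

    #define UNLOCKED 0u
    #define LOCKED   1u

    void acquire(volatile uint32_t *lock)
    {
        for (;;) {
            /* Instead of spinning, wait until the line arrives with a value
             * other than LOCKED; waiting requests queue up in request order. */
            uint32_t v = load_when_changed(lock, LOCKED);
            if (v == UNLOCKED && compare_and_swap(lock, UNLOCKED, LOCKED))
                return;
        }
    }

    void release(volatile uint32_t *lock)
    {
        *lock = UNLOCKED;    /* this write triggers the queued line transfer */
    }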
Let's try a concrete example. An OOO processor fetches instructions, renames them, and dumps them into queues in front of functional unit sets (e.g. a 32-entry ALU queue, a 16-entry branch queue, or a 32-entry load/store queue). 3 ALUs execute the earliest 3 ready instructions from the ALU queue, 2 branch units perform misprediction detection on the earliest 2 ready instructions in the branch queue, and 2 load/store units execute the first 2 ready instructions from the load/store queue. So far, this is just one variety of your basic OOO processor. Now imagine that the search for ready instructions is limited to the top 4 or 8 entries of each queue. This makes the selection circuits much faster. What is the cost in IPC? The machine still retains a significant OOO component. E.g. even with the top 2 entries blocked by an L2 cache miss, other instructions can still be issued from the next 2 entries, which are then refilled from the non-searched part of the queue.
How does simultaneous multithreading (SMT) affect this machine?
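Below is a sketch of the restricted selection logic, written as a simple cycle-level model in C; the queue layout, the WINDOW size, and the field names are assumptions made for illustration, and the only point is that the search never looks past the oldest few entries.

    /* Sketch: pick ready instructions only from the oldest WINDOW entries
     * of one issue queue.  Layout and field names are assumed. */
    #include <stdbool.h>

    #define QUEUE_LEN 32
    #define WINDOW     4             /* only the oldest 4 entries searched */
    #define NUM_ALUS   3

    typedef struct {
        bool valid;
        bool ready;                  /* all source operands available */
        int  uop_id;
    } qentry_t;

    /* Returns how many instructions were selected this cycle. */
    int select_ready(qentry_t q[QUEUE_LEN], int picked[NUM_ALUS])
    {
        int n = 0;
        for (int i = 0; i < WINDOW && n < NUM_ALUS; i++) {
            if (q[i].valid && q[i].ready) {
                picked[n++] = q[i].uop_id;
                q[i].valid = false;  /* freed entry is refilled next cycle
                                        from the non-searched tail */
            }
        }
        return n;
    }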
Could this functionality be used to improve performance? If so, in which cases, and by how much? Perhaps operating systems are an interesting starting point, since they apparently do a large amount of data copying. This approach does restrict the types of copying that can be done, since it works only on entire rows in the DRAM chips. What if we augmented the hardware not only to do copying, but also to quickly zero out a row, or to perform simple Boolean operations on two rows? Would this idea work?
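As an illustration of the interface such hardware might present to software, the sketch below shows row-granularity copy and zero primitives plus a copy routine that falls back to an ordinary memcpy for unaligned pieces; the function names and ROW_SIZE are hypothetical, and the bodies are merely stand-ins for what the DRAM would do internally.

    /* Sketch of row-granularity operations as software might see them.
     * Names and ROW_SIZE are hypothetical; the bodies are stand-ins for
     * transfers that would really happen inside the DRAM chips. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define ROW_SIZE 8192u           /* assumed DRAM row size in bytes */

    void dram_row_copy(void *dst, const void *src)
    {
        memcpy(dst, src, ROW_SIZE);  /* stand-in for an in-DRAM row copy */
    }

    void dram_row_zero(void *dst)
    {
        memset(dst, 0, ROW_SIZE);    /* stand-in for an in-DRAM row clear */
    }

    /* Copy using row operations where both ends are row-aligned, falling
     * back to an ordinary memcpy for the unaligned remainder. */
    void fast_copy(uint8_t *dst, const uint8_t *src, size_t len)
    {
        size_t i = 0;
        if ((uintptr_t)dst % ROW_SIZE == 0 && (uintptr_t)src % ROW_SIZE == 0)
            for (; i + ROW_SIZE <= len; i += ROW_SIZE)
                dram_row_copy(dst + i, src + i);
        memcpy(dst + i, src + i, len - i);
    }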