Suggestions from Prof. Seth Goldstein:
Suggestions from Prof. Babak Falsafi:
Suggestions from Prof. James Hoe:
(NOTE: there are many variations on this theme of tailoring an architecture to a specific application domain.)
Traditional 3-issue in-order superscalar:
    ALUA   in stage 0
    ALUB   in stage 0
    BRANCH in stage 0
    AGEN   in stage 0
    DCACHE in stage 1

The sequence below issues at the labeled times:
    r1 = *p;          // issue T=0, use AGEN in 0, DCACHE in 1
    r3 = r1 + r2;     // issue T=2, use ALUA in 2
    r4 = r1 - 1;      // issue T=2, use ALUB in 2
    r5 = r3 < 10;     // issue T=3, use ALUA in 3
    if (r5) goto L;   // issue T=4, use BRANCH in 4

Now consider a machine in which ALUs are cheap, so we add one more, but put the ALUs in different stages. Multi-issue is still expensive enough to limit the machine to 3-way.
    ALUA0   in stage 0
    BRANCH0 in stage 0
    AGEN    in stage 0
    DCACHE  in stage 1
    ALUA1   in stage 1
    BRANCH1 in stage 1
    ALUA2   in stage 2
    BRANCH2 in stage 2

The sequence below issues at the labeled times:
    r1 = *p;          // issue T=0, use AGEN in 0, DCACHE in 1
    r3 = r1 + r2;     // issue T=0, use ALUA2 in 2
    r4 = r1 - 1;      // issue T=1, use ALUA2 in 3
    r5 = r3 < 10;     // issue T=2, use ALUA1 in 3
    if (r5) goto L;   // issue T=2, use BRANCH2 in 4

The processor tends to issue instructions earlier but performs the computation at a similar point in absolute time. So what is the advantage over the traditional approach? By issuing earlier you move on to the next instructions sooner. If you ever find an instruction that wasn't scheduled "soon enough," you get ahead of the traditional approach. Why would such an instruction exist? Because most scheduling is restricted to stay within a basic block.
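To make the issue-time comparison concrete, here is a minimal C sketch (my own model, not part of the original suggestion) that reproduces the schedule for the staged machine above. It assumes full bypassing, in-order issue limited to 3 per cycle, one instruction per unit per cycle, and that each instruction is pre-assigned to the unit named in the example; the load is reported only by its DCACHE use in stage 1.

    /* Hypothetical issue-time model for the staged machine above. */
    #include <stdio.h>
    #include <string.h>

    #define NUNITS 8
    #define MAXCYC 64

    static const char *unit_name[NUNITS] =
        { "AGEN", "DCACHE", "ALUA0", "ALUA1", "ALUA2",
          "BRANCH0", "BRANCH1", "BRANCH2" };
    static const int unit_stage[NUNITS] = { 0, 1, 0, 1, 2, 0, 1, 2 };

    typedef struct {
        const char *text;
        int unit;        /* index into unit_name[]             */
        int dep[2];      /* producing instruction index, or -1 */
    } Insn;

    int main(void) {
        Insn prog[] = {
            { "r1 = *p",        1, { -1, -1 } },  /* DCACHE            */
            { "r3 = r1 + r2",   4, {  0, -1 } },  /* ALUA2, needs r1   */
            { "r4 = r1 - 1",    4, {  0, -1 } },  /* ALUA2, needs r1   */
            { "r5 = r3 < 10",   3, {  1, -1 } },  /* ALUA1, needs r3   */
            { "if (r5) goto L", 7, {  3, -1 } },  /* BRANCH2, needs r5 */
        };
        int n = sizeof prog / sizeof prog[0];
        int busy[NUNITS][MAXCYC];      /* is the unit occupied in a cycle?  */
        int issued[MAXCYC] = { 0 };    /* instructions issued per cycle     */
        int done[16];                  /* cycle in which each result exists */
        int last_issue = 0;
        memset(busy, 0, sizeof busy);

        for (int i = 0; i < n; i++) {
            Insn *in = &prog[i];
            int s = unit_stage[in->unit];
            int compute = s;           /* cannot compute before reaching stage s */
            for (int d = 0; d < 2; d++)
                if (in->dep[d] >= 0 && done[in->dep[d]] + 1 > compute)
                    compute = done[in->dep[d]] + 1;   /* wait for operands */
            /* Slide later until the issue slot and the unit are both free. */
            while (compute - s < last_issue ||
                   issued[compute - s] >= 3 ||
                   busy[in->unit][compute])
                compute++;
            int t = compute - s;
            issued[t]++;
            busy[in->unit][compute] = 1;
            last_issue = t;
            done[i] = compute;
            printf("%-16s issue T=%d, use %s in %d\n",
                   in->text, t, unit_name[in->unit], compute);
        }
        return 0;
    }

Running it prints the same issue times and unit-use cycles as the example sequence above.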
Figure 1. Horizontal issue

         Slot
    T    0    1    2    3
    0    A    B              A and B are independent
    1    C    D    E         C, D, and E are independent
    2
    3    F
    4    G    H    I    J
    5    K

Figure 2. Vertical issue

         Slot
    T    0    1    2    3
    0    A    C    F    G    A feeds C feeds F feeds G
    1    B    D    H         B feeds D feeds H
    2    E    I    K         E feeds I feeds K
    3    J                   J depends on F

Like #1, this might be advantageous primarily for crossing basic block boundaries more gracefully. To build this, you chain your functional units together in successive stages rather than having them operate in parallel. The major stumbling block is that expensive functional units (e.g. the dcache) cannot be replicated, and so must go in a fixed place. For example, you might require all load instructions to start a chain.
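One way to make the chaining constraint concrete is to treat a "vertical" issue packet as an explicit structure. The encoding below is only a sketch of one possible representation (the names and fields are mine, not from the original note); it captures the rule that the dcache cannot be replicated, so a load may only occupy the first position of a chain.

    /* Hypothetical encoding of a "vertical" issue packet: up to DEPTH
     * chained operations issued together in one slot, where op[i]
     * consumes op[i-1]'s result as its first source. */
    #define DEPTH 4

    typedef enum { OP_NONE, OP_LOAD, OP_ALU, OP_BRANCH } OpKind;

    typedef struct {
        OpKind kind;
        int    dst;         /* register written by this op                   */
        int    src_other;   /* second source; the first comes from the chain */
    } ChainOp;

    typedef struct {
        int     first_src;  /* register feeding position 0 of the chain */
        ChainOp op[DEPTH];  /* op[i] executes in pipeline stage i       */
    } VerticalPacket;

    /* The dcache cannot be replicated, so a load may only start a chain. */
    static int packet_is_legal(const VerticalPacket *p)
    {
        for (int i = 1; i < DEPTH; i++)
            if (p->op[i].kind == OP_LOAD)
                return 0;
        return 1;
    }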
Consider instead a "load when changed" instruction. Processor X executes load when changed and tries to read the containing line from memory. If the line turns out to be in processor Y's cache instead, and the location's value is not equal to what X specified, it is transferred to X right away. If it does contain X's value, the request is queued in Y, and the next write to the location initiates the line transfer from Y to X. Note that Y can transfer the line early if it wants (e.g. if it runs out of space to queue all the requests it has received).
When Z requests the line from Y and finds X already waiting for it, Z's request is forwarded to X. Thus when multiple processors all want a lock, they form a queue in the order in which they requested it. This makes mutual exclusion regions as efficient as a single line transfer from one processor to the next, to the next, and so on.
Note also that this instruction is useful for multithreaded processors, since it allows the processor to suspend a thread instead of letting it spin.
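As a sketch of how software might use the instruction, the lock below waits with a hypothetical __load_when_changed() intrinsic instead of spinning; the intrinsic name, __atomic_swap(), and the calling convention are all invented here for illustration.

    /* Sketch of a spin lock built on the proposed "load when changed"
     * instruction.  __load_when_changed(addr, v) is a hypothetical
     * intrinsic: it returns only once *addr differs from v, letting the
     * processor stall (or a multithreaded core suspend the thread)
     * rather than spin.  __atomic_swap() stands in for an ordinary
     * atomic exchange. */
    #include <stdint.h>

    #define UNLOCKED 0u
    #define LOCKED   1u

    extern uint32_t __load_when_changed(volatile uint32_t *addr, uint32_t v);
    extern uint32_t __atomic_swap(volatile uint32_t *addr, uint32_t v);

    void lock(volatile uint32_t *l)
    {
        for (;;) {
            if (__atomic_swap(l, LOCKED) == UNLOCKED)
                return;                      /* got the lock */
            /* Lock is held: rather than bouncing the line back and
             * forth with ordinary loads, queue a request at the holder
             * and receive the line once the value changes. */
            __load_when_changed(l, LOCKED);
        }
    }

    void unlock(volatile uint32_t *l)
    {
        *l = UNLOCKED;   /* this write triggers the queued transfer */
    }

With several waiters, the queued requests hand the line from one processor to the next in request order, which is exactly the property described above.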
Let's try a concrete example. An OOO processor fetches instructions, renames them, and dumps them into queues in front of sets of functional units (e.g. a 32-entry ALU queue, a 16-entry branch queue, or a 32-entry load/store queue). 3 ALUs execute the earliest 3 ready instructions from the ALU queue, 2 branch units perform misprediction detection on the earliest 2 ready instructions in the branch queue, and 2 load/store units execute the first 2 ready instructions from the load/store queue. So far, this is just one variety of your basic OOO processor. Now imagine that the search for ready instructions is limited to the top 4 or 8 elements of each queue. This makes the selection circuits much faster. What is the cost to IPC? The machine still retains a significant OOO component. E.g. even with the top 2 entries blocked by an L2 cache miss, other instructions can still be issued from the next 2 entries, which are then refilled from the non-searched part of the queue.
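A rough C sketch of the limited-depth select is below; the queue layout, names, and the compaction step are assumptions on my part, meant only to show what "search only the top 4 or 8 entries" would mean for the selection logic.

    /* Limited-depth select over an age-ordered issue queue: only the
     * first SEARCH_DEPTH entries are candidates, which keeps the
     * selection circuit small and fast. */
    #include <stdbool.h>

    #define QUEUE_SIZE   32
    #define SEARCH_DEPTH  8    /* top of the queue that is searched */
    #define ISSUE_WIDTH   3    /* e.g. 3 ALUs fed by this queue     */

    typedef struct {
        int  tag;       /* identifies the instruction    */
        bool valid;     /* entry is occupied             */
        bool ready;     /* all source operands available */
    } IQEntry;

    /* Pick up to ISSUE_WIDTH ready instructions, oldest first, looking
     * at no more than SEARCH_DEPTH entries.  Selected entries are freed. */
    int select_ready(IQEntry q[QUEUE_SIZE], int selected[ISSUE_WIDTH])
    {
        int n = 0;
        for (int i = 0; i < SEARCH_DEPTH && n < ISSUE_WIDTH; i++) {
            if (q[i].valid && q[i].ready) {
                selected[n++] = q[i].tag;
                q[i].valid = false;
            }
        }
        return n;
    }

    /* Compaction then pulls entries from the non-searched part of the
     * queue up into the searched window for the next cycle. */
    void compact(IQEntry q[QUEUE_SIZE])
    {
        int dst = 0;
        for (int src = 0; src < QUEUE_SIZE; src++)
            if (q[src].valid)
                q[dst++] = q[src];
        for (; dst < QUEUE_SIZE; dst++)
            q[dst].valid = false;
    }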
How does simultaneous multithreading (SMT) affect this machine?
Could this functionality be used to improve performance? If so, in which cases, and by how much? Perhaps operating systems are an interesting starting point, since they apparently do a large amount of data copying. This approach does restrict the types of copying that one can do, since it is limited to entire rows in the DRAM chips. What if we augmented the hardware not only to copy, but also to quickly zero out a row, or to perform simple boolean operations on two rows? Would this idea work?
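As a sketch of how the mechanism might be exposed to software, the code below uses hypothetical dram_row_copy()/dram_row_zero()/dram_row_and() primitives; the names, the ROW_BYTES size, and the alignment rules are all assumptions (a real design would also constrain source and destination to suitable banks or chips).

    /* Hypothetical software interface to in-DRAM row operations.  The
     * primitives below are invented for illustration; they operate on
     * whole, row-aligned buffers. */
    #include <stdint.h>
    #include <string.h>

    #define ROW_BYTES 8192   /* assumed DRAM row size */

    extern void dram_row_copy(void *dst, const void *src);  /* dst row = src row */
    extern void dram_row_zero(void *dst);                   /* dst row = 0       */
    extern void dram_row_and(void *dst, const void *a,
                             const void *b);                /* dst row = a & b   */

    static int row_aligned(const void *p)
    {
        return ((uintptr_t)p % ROW_BYTES) == 0;
    }

    /* Copy that uses the row primitive where it applies and falls back
     * to an ordinary CPU memcpy for misaligned or partial-row data
     * (the whole-row restriction mentioned above). */
    void copy_rows(void *dst, const void *src, size_t len)
    {
        while (len >= ROW_BYTES && row_aligned(dst) && row_aligned(src)) {
            dram_row_copy(dst, src);
            dst = (char *)dst + ROW_BYTES;
            src = (const char *)src + ROW_BYTES;
            len -= ROW_BYTES;
        }
        if (len > 0)
            memcpy(dst, src, len);
    }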