Suggestions from Prof. John Shen:
(NOTE: there are many variations on this theme of tailoring an architecture to a specific application domain.)
Even a partial answer (i.e., a project that addresses only a subset of the issues) would still be interesting.
To assess the effectiveness of compiler transformations that exploit value locality, it is necessary to run many simulations. Ideally these simulations run as fast as possible, i.e., they use the native instruction set and memory system of the host that executes the simulation. In this project, we are looking for an innovative setup to run such simulations as fast as possible.
Given a number T of threads that are to be executed, and a memory system (consisting of at least one level of caches with a block size of B and a replacement algorithm S), how can we simulate the execution of the T threads for a number of block sizes B1, B2, ..., Bn and a number of replacement strategies S1, S2, ..., Sm as fast as possible? Are there clever memory management techniques that allow us to save some of the work? Note that we need to keep track of the VALUES that are stored in the cache(s) -- that's the major difference from many other cache simulation setups, which focus only on the ADDRESSES that are generated by a program.
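As a concrete starting point, below is a minimal sketch in C of a value-tracking cache model for a single direct-mapped level and one block size; the geometry, the type and function names, and the mem array standing in for the simulated memory image are all assumptions made for illustration, not part of any existing simulator. The project question is then how much of this fill work and state can be shared when the same reference stream is replayed for every combination of block size Bi and replacement strategy Sj.

    /* Minimal sketch of a value-tracking cache model: a single
     * direct-mapped level with one block size.  Unlike address-only
     * simulators, each line keeps a copy of the data so that value
     * locality is visible.  All names and sizes are illustrative. */
    #include <stdint.h>
    #include <string.h>

    #define NUM_SETS 1024            /* assumed cache geometry            */
    #define BLOCK_B  64              /* one of the block sizes B1..Bn     */

    typedef struct {
        int      valid;
        uint64_t tag;
        uint8_t  data[BLOCK_B];      /* the VALUES, not just the address  */
    } line_t;

    typedef struct {
        line_t   sets[NUM_SETS];
        uint64_t hits, misses;
    } cache_t;

    /* Simulate one load; `mem` stands in for the simulated memory image. */
    uint8_t cache_load(cache_t *c, const uint8_t *mem, uint64_t addr)
    {
        uint64_t block = addr / BLOCK_B;
        line_t  *l     = &c->sets[block % NUM_SETS];

        if (l->valid && l->tag == block) {
            c->hits++;
        } else {                                 /* miss: fill the line */
            c->misses++;
            l->valid = 1;
            l->tag   = block;
            memcpy(l->data, mem + block * BLOCK_B, BLOCK_B);
        }
        return l->data[addr % BLOCK_B];          /* value the program sees */
    }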
Traditional 3-issue in-order superscalar:
    ALUA   in stage 0
    ALUB   in stage 0
    BRANCH in stage 0
    AGEN   in stage 0
    DCACHE in stage 1

The sequence below issues at the labeled times:
    r1 = *p;          // issue T=0, use AGEN in 0, DCACHE in 1
    r3 = r1 + r2;     // issue T=2, use ALUA in 2
    r4 = r1 - 1;      // issue T=2, use ALUB in 2
    r5 = r3 < 10;     // issue T=3, use ALUA in 3
    if (r5) goto L;   // issue T=4, use BRANCH in 4

Now consider a machine in which ALUs are cheap, so add 1 more, but put the ALUs in different stages. Multi-issue is still expensive enough to limit to 3-way.
    ALUA0   in stage 0
    BRANCH0 in stage 0
    AGEN    in stage 0
    DCACHE  in stage 1
    ALUA1   in stage 1
    BRANCH1 in stage 1
    ALUA2   in stage 2
    BRANCH2 in stage 2

The sequence below issues at the labeled times:
    r1 = *p;          // issue T=0, use AGEN in 0, DCACHE in 1
    r3 = r1 + r2;     // issue T=0, use ALUA2 in 2
    r4 = r1 - 1;      // issue T=1, use ALUA2 in 3
    r5 = r3 < 10;     // issue T=2, use ALUA1 in 3
    if (r5) goto L;   // issue T=2, use BRANCH2 in 4

The processor tends to issue instructions earlier, but performs the computation at a similar point in absolute time. So what is the advantage over the traditional approach? By issuing earlier you move on to the next instructions sooner. If you ever find an instruction that wasn't scheduled "soon enough", you get ahead of the traditional approach. Why would such an instruction exist? Because most scheduling is restricted to a single basic block.
Figure 1. Horizontal issue

        Slot
    T   0   1   2   3
    0   A   B                A and B are independent
    1   C   D   E            C, D, and E are independent
    2
    3   F
    4   G   H   I   J
    5   K

Figure 2. Vertical issue

        Slot
    T   0   1   2   3
    0   A   C   F   G        A feeds C feeds F feeds G
    1   B   D   H            B feeds D feeds H
    2   E   I   K            E feeds I feeds K
    3   J                    J depends on F

Like #1, this might be advantageous primarily for crossing basic block boundaries more gracefully. To build this, you chain your functional units together in successive stages rather than having them operating in parallel. The major stumbling block is that expensive functional units (e.g. dcache) cannot be replicated, and so must go in a fixed place. For example, you might require all load instructions to start a chain.
Identify other potential fused instruction opportunities, i.e., find candidates where a pair of dependent 3-operand instructions

    z = f(x, y)
    w = g(z, v)

can be profitably done with a single 4-operand fused instruction

    w = g(f(x, y), v)

Give statistics on these. Identify which ones can reduce latency. E.g., f=+, g=+ can be implemented with a carry-save stage in front of a normal adder, and so be done in a single cycle instead of 2 cycles.
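As a sketch of how such candidates might be counted, the fragment below scans a linear trace for adjacent producer/consumer pairs; the instruction record, the opcode set, and the adjacency-only heuristic are assumptions made for illustration (a real study would also check that the intermediate result is dead and would look beyond adjacent instructions).

    /* Sketch: count adjacent producer/consumer pairs that could become a
     * single 4-operand fused instruction.  Instruction format and opcodes
     * are assumed for illustration only. */
    #include <stdio.h>

    typedef enum { OP_ADD, OP_SUB, OP_CMP, OP_OTHER } opcode_t;

    typedef struct {
        opcode_t op;
        int dst, src1, src2;         /* register numbers */
    } insn_t;

    int count_fusable(const insn_t *trace, int n)
    {
        int candidates = 0;
        for (int i = 0; i + 1 < n; i++) {
            const insn_t *p = &trace[i], *c = &trace[i + 1];
            if (p->op == OP_OTHER || c->op == OP_OTHER)
                continue;
            if (c->src1 == p->dst || c->src2 == p->dst)
                candidates++;        /* p's result feeds c: fusion candidate */
        }
        return candidates;
    }

    int main(void)
    {
        /* r3 = r1 + r2; r5 = r3 - r4  ->  one candidate (add feeding sub) */
        insn_t trace[] = { { OP_ADD, 3, 1, 2 }, { OP_SUB, 5, 3, 4 } };
        printf("fusable pairs: %d\n", count_fusable(trace, 2));
        return 0;
    }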
Consider instead a "load when changed" instruction. Processor X executes load when changed and tries to read the containing line from memory. If the line turns out to be in processor Y's cache instead, and the location's value is not equal to what X specified, the line is transferred to X right away. If the location still contains the value X specified, the request is queued in Y, and the next write to the location initiates the line transfer from Y to X. Note that Y can transfer the line early if it wants (e.g. if it runs out of space to queue all the requests it has received).
When a third processor Z requests the line from Y and finds X's request already queued for it, Z's request is forwarded to X. Thus when multiple processors all want a lock, they form a queue in the order they requested the lock. This makes mutual exclusion regions as efficient as a single line transfer from one processor to the next to the next...
Note also that this instruction is useful for multithreaded processors, since it allows the processor to suspend a thread instead of letting it spin.
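To make the intended use concrete, here is a sketch of a spin-free lock built on the proposed instruction; load_when_changed() and compare_and_swap() are hypothetical intrinsics declared only for illustration, and their exact form is an assumption about how the instruction might be exposed to software.

    /* Sketch of a lock using the proposed instruction.  load_when_changed()
     * is a hypothetical intrinsic: it returns the location's value, but only
     * once that value differs from `old` (immediately if it already does).
     * compare_and_swap() is an ordinary atomic, also assumed here. */
    #include <stdint.h>

    extern uint32_t load_when_changed(volatile uint32_t *addr, uint32_t old);
    extern int      compare_and_swap(volatile uint32_t *addr,
                                     uint32_t expect, uint32_t desired);

    #define UNLOCKED 0u
    #define LOCKED   1u

    void acquire(volatile uint32_t *lock)
    {
        for (;;) {
            /* Instead of spinning, wait until the line arrives with a value
             * other than LOCKED; waiting requests queue up in request order. */
            uint32_t v = load_when_changed(lock, LOCKED);
            if (v == UNLOCKED && compare_and_swap(lock, UNLOCKED, LOCKED))
                return;
        }
    }

    void release(volatile uint32_t *lock)
    {
        *lock = UNLOCKED;    /* this write triggers the queued line transfer */
    }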
Let's try a concrete example. An OOO processor fetches instructions, renames them, and dumps them into queues in front of functional unit sets (e.g. a 32-entry ALU queue, a 16-entry branch queue, or a 32-entry load/store queue). 3 ALUs execute the earliest 3 ready instructions from the ALU queue, 2 branch units perform misprediction detection on the earliest 2 ready instructions in the branch queue, and 2 load/store units execute the first 2 ready instructions from the load/store queue. So far, this is just one variety of your basic OOO processor. Now imagine that the search for ready instructions is limited to the top 4 or 8 entries of each queue. This makes the selection circuits much faster. What is the cost in IPC? The machine still retains a significant OOO component. E.g. even with the top 2 entries blocked by an L2 cache miss, other instructions can still be issued from the next 2 entries, which are then refilled from the non-searched part of the queue.
How does simultaneous multithreading (SMT) affect this machine?
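Below is a sketch of the restricted selection logic, written as a simple cycle-level model in C; the queue layout, the WINDOW size, and the field names are assumptions made for illustration, and the only point is that the search never looks past the oldest few entries.

    /* Sketch: pick ready instructions only from the oldest WINDOW entries
     * of one issue queue.  Layout and field names are assumed. */
    #include <stdbool.h>

    #define QUEUE_LEN 32
    #define WINDOW     4             /* only the oldest 4 entries searched */
    #define NUM_ALUS   3

    typedef struct {
        bool valid;
        bool ready;                  /* all source operands available */
        int  uop_id;
    } qentry_t;

    /* Returns how many instructions were selected this cycle. */
    int select_ready(qentry_t q[QUEUE_LEN], int picked[NUM_ALUS])
    {
        int n = 0;
        for (int i = 0; i < WINDOW && n < NUM_ALUS; i++) {
            if (q[i].valid && q[i].ready) {
                picked[n++] = q[i].uop_id;
                q[i].valid = false;  /* freed entry is refilled next cycle
                                        from the non-searched tail */
            }
        }
        return n;
    }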
Could this functionality be used to improve performance? If so, in which cases, and by how much? Perhaps operating systems are an interesting starting point, since they apparently do a large amount of data copying. This approach does restrict the types of copying that can be done, since it works only on entire rows in the DRAM chips. What if we augmented the hardware not only to do copying, but also to quickly zero out a row, or to perform simple Boolean operations on two rows? Would this idea work?
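As an illustration of the interface such hardware might present to software, the sketch below shows row-granularity copy and zero primitives plus a copy routine that falls back to an ordinary memcpy for unaligned pieces; the function names and ROW_SIZE are hypothetical, and the bodies are merely stand-ins for what the DRAM would do internally.

    /* Sketch of row-granularity operations as software might see them.
     * Names and ROW_SIZE are hypothetical; the bodies are stand-ins for
     * transfers that would really happen inside the DRAM chips. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define ROW_SIZE 8192u           /* assumed DRAM row size in bytes */

    void dram_row_copy(void *dst, const void *src)
    {
        memcpy(dst, src, ROW_SIZE);  /* stand-in for an in-DRAM row copy */
    }

    void dram_row_zero(void *dst)
    {
        memset(dst, 0, ROW_SIZE);    /* stand-in for an in-DRAM row clear */
    }

    /* Copy using row operations where both ends are row-aligned, falling
     * back to an ordinary memcpy for the unaligned remainder. */
    void fast_copy(uint8_t *dst, const uint8_t *src, size_t len)
    {
        size_t i = 0;
        if ((uintptr_t)dst % ROW_SIZE == 0 && (uintptr_t)src % ROW_SIZE == 0)
            for (; i + ROW_SIZE <= len; i += ROW_SIZE)
                dram_row_copy(dst + i, src + i);
        memcpy(dst + i, src + i, len - i);
    }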