Suggestions from Prof. Seth Goldstein:
Suggestions from Prof. Babak Falsafi:
Suggestions from Prof. James Hoe:
(NOTE: there are many variations on this theme of tailoring an architecture to a specific application domain.)
Traditional 3-issue in-order superscalar:
    ALUA   in stage 0
    ALUB   in stage 0
    BRANCH in stage 0
    AGEN   in stage 0
    DCACHE in stage 1

The sequence below issues at the labeled times:
    r1 = *p;          // issue T=0, use AGEN in 0, DCACHE in 1
    r3 = r1 + r2;     // issue T=2, use ALUA in 2
    r4 = r1 - 1;      // issue T=2, use ALUB in 2
    r5 = r3 < 10;     // issue T=3, use ALUA in 3
    if (r5) goto L;   // issue T=4, use BRANCH in 4

Now consider a machine in which ALUs are cheap, so we add one more, but put the ALUs in different stages. Multi-issue is still expensive enough to limit the machine to 3-way.
    ALUA0   in stage 0
    BRANCH0 in stage 0
    AGEN    in stage 0
    DCACHE  in stage 1
    ALUA1   in stage 1
    BRANCH1 in stage 1
    ALUA2   in stage 2
    BRANCH2 in stage 2

The sequence below issues at the labeled times:
    r1 = *p;          // issue T=0, use AGEN in 0, DCACHE in 1
    r3 = r1 + r2;     // issue T=0, use ALUA2 in 2
    r4 = r1 - 1;      // issue T=1, use ALUA2 in 3
    r5 = r3 < 10;     // issue T=2, use ALUA1 in 3
    if (r5) goto L;   // issue T=2, use BRANCH2 in 4

The processor tends to issue instructions earlier but performs the computation at a similar point in absolute time. So what is the advantage over the traditional approach? By issuing earlier you move on to the next instructions sooner. If you ever find an instruction that wasn't scheduled "soon enough," you get ahead of the traditional approach. Why would such an instruction exist? Because most scheduling is restricted to stay within a basic block.
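To make the issue-time comparison concrete, here is a minimal C sketch (my own model, not part of the original suggestion) that reproduces the schedule for the staged machine above. It assumes full bypassing, in-order issue limited to 3 per cycle, one instruction per unit per cycle, and that each instruction is pre-assigned to the unit named in the example; the load is reported only by its DCACHE use in stage 1.

    /* Hypothetical issue-time model for the staged machine above. */
    #include <stdio.h>
    #include <string.h>

    #define NUNITS 8
    #define MAXCYC 64

    static const char *unit_name[NUNITS] =
        { "AGEN", "DCACHE", "ALUA0", "ALUA1", "ALUA2",
          "BRANCH0", "BRANCH1", "BRANCH2" };
    static const int unit_stage[NUNITS] = { 0, 1, 0, 1, 2, 0, 1, 2 };

    typedef struct {
        const char *text;
        int unit;        /* index into unit_name[]             */
        int dep[2];      /* producing instruction index, or -1 */
    } Insn;

    int main(void) {
        Insn prog[] = {
            { "r1 = *p",        1, { -1, -1 } },  /* DCACHE            */
            { "r3 = r1 + r2",   4, {  0, -1 } },  /* ALUA2, needs r1   */
            { "r4 = r1 - 1",    4, {  0, -1 } },  /* ALUA2, needs r1   */
            { "r5 = r3 < 10",   3, {  1, -1 } },  /* ALUA1, needs r3   */
            { "if (r5) goto L", 7, {  3, -1 } },  /* BRANCH2, needs r5 */
        };
        int n = sizeof prog / sizeof prog[0];
        int busy[NUNITS][MAXCYC];      /* is the unit occupied in a cycle?  */
        int issued[MAXCYC] = { 0 };    /* instructions issued per cycle     */
        int done[16];                  /* cycle in which each result exists */
        int last_issue = 0;
        memset(busy, 0, sizeof busy);

        for (int i = 0; i < n; i++) {
            Insn *in = &prog[i];
            int s = unit_stage[in->unit];
            int compute = s;           /* cannot compute before reaching stage s */
            for (int d = 0; d < 2; d++)
                if (in->dep[d] >= 0 && done[in->dep[d]] + 1 > compute)
                    compute = done[in->dep[d]] + 1;   /* wait for operands */
            /* Slide later until the issue slot and the unit are both free. */
            while (compute - s < last_issue ||
                   issued[compute - s] >= 3 ||
                   busy[in->unit][compute])
                compute++;
            int t = compute - s;
            issued[t]++;
            busy[in->unit][compute] = 1;
            last_issue = t;
            done[i] = compute;
            printf("%-16s issue T=%d, use %s in %d\n",
                   in->text, t, unit_name[in->unit], compute);
        }
        return 0;
    }

Running it prints the same issue times and unit-use cycles as the example sequence above.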
Figure 1. Horizontal issue

         Slot
    T    0    1    2    3
    0    A    B              A and B are independent
    1    C    D    E         C, D, and E are independent
    2
    3    F
    4    G    H    I    J
    5    K

Figure 2. Vertical issue

         Slot
    T    0    1    2    3
    0    A    C    F    G    A feeds C feeds F feeds G
    1    B    D    H         B feeds D feeds H
    2    E    I    K         E feeds I feeds K
    3    J                   J depends on F

Like #1, this might be advantageous primarily for crossing basic block boundaries more gracefully. To build this, you chain your functional units together in successive stages rather than having them operate in parallel. The major stumbling block is that expensive functional units (e.g. the dcache) cannot be replicated, and so must go in a fixed place. For example, you might require all load instructions to start a chain.
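One way to make the chaining constraint concrete is to treat a "vertical" issue packet as an explicit structure. The encoding below is only a sketch of one possible representation (the names and fields are mine, not from the original note); it captures the rule that the dcache cannot be replicated, so a load may only occupy the first position of a chain.

    /* Hypothetical encoding of a "vertical" issue packet: up to DEPTH
     * chained operations issued together in one slot, where op[i]
     * consumes op[i-1]'s result as its first source. */
    #define DEPTH 4

    typedef enum { OP_NONE, OP_LOAD, OP_ALU, OP_BRANCH } OpKind;

    typedef struct {
        OpKind kind;
        int    dst;         /* register written by this op                   */
        int    src_other;   /* second source; the first comes from the chain */
    } ChainOp;

    typedef struct {
        int     first_src;  /* register feeding position 0 of the chain */
        ChainOp op[DEPTH];  /* op[i] executes in pipeline stage i       */
    } VerticalPacket;

    /* The dcache cannot be replicated, so a load may only start a chain. */
    static int packet_is_legal(const VerticalPacket *p)
    {
        for (int i = 1; i < DEPTH; i++)
            if (p->op[i].kind == OP_LOAD)
                return 0;
        return 1;
    }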
Consider instead a "load when changed" instruction. Processor X executes load when changed and tries to read the containing line from memory. If the line turns out to be in processor Y's cache instead, and the location's value is not equal to what X specified, it is transferred to X right away. If it does contain X's value, the request is queued in Y, and the next write to the location initiates the line transfer from Y to X. Note that Y can transfer the line early if it wants (e.g. if it runs out of space to queue all the requests it has received).
When Z requests the line from Y and finds X already waiting for it, Z's request is forwarded to X. Thus when multiple processors all want a lock, they form a queue in the order in which they requested it. This makes mutual exclusion regions as efficient as a single line transfer from one processor to the next, to the next, and so on.
Note also that this instruction is useful for multithreaded processors, since it allows the processor to suspend a thread instead of letting it spin.
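As a sketch of how software might use the instruction, the lock below waits with a hypothetical __load_when_changed() intrinsic instead of spinning; the intrinsic name, __atomic_swap(), and the calling convention are all invented here for illustration.

    /* Sketch of a spin lock built on the proposed "load when changed"
     * instruction.  __load_when_changed(addr, v) is a hypothetical
     * intrinsic: it returns only once *addr differs from v, letting the
     * processor stall (or a multithreaded core suspend the thread)
     * rather than spin.  __atomic_swap() stands in for an ordinary
     * atomic exchange. */
    #include <stdint.h>

    #define UNLOCKED 0u
    #define LOCKED   1u

    extern uint32_t __load_when_changed(volatile uint32_t *addr, uint32_t v);
    extern uint32_t __atomic_swap(volatile uint32_t *addr, uint32_t v);

    void lock(volatile uint32_t *l)
    {
        for (;;) {
            if (__atomic_swap(l, LOCKED) == UNLOCKED)
                return;                      /* got the lock */
            /* Lock is held: rather than bouncing the line back and
             * forth with ordinary loads, queue a request at the holder
             * and receive the line once the value changes. */
            __load_when_changed(l, LOCKED);
        }
    }

    void unlock(volatile uint32_t *l)
    {
        *l = UNLOCKED;   /* this write triggers the queued transfer */
    }

With several waiters, the queued requests hand the line from one processor to the next in request order, which is exactly the property described above.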
Let's try a concrete example. An OOO processor fetches instructions, renames them, and dumps them into queues in front of sets of functional units (e.g. a 32-entry ALU queue, a 16-entry branch queue, or a 32-entry load/store queue). 3 ALUs execute the earliest 3 ready instructions from the ALU queue, 2 branch units perform misprediction detection on the earliest 2 ready instructions in the branch queue, and 2 load/store units execute the first 2 ready instructions from the load/store queue. So far, this is just one variety of your basic OOO processor. Now imagine that the search for ready instructions is limited to the top 4 or 8 elements of each queue. This makes the selection circuits much faster. What is the cost to IPC? The machine still retains a significant OOO component. E.g. even with the top 2 entries blocked by an L2 cache miss, other instructions can still be issued from the next 2 entries, which are then refilled from the non-searched part of the queue.
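A rough C sketch of the limited-depth select is below; the queue layout, names, and the compaction step are assumptions on my part, meant only to show what "search only the top 4 or 8 entries" would mean for the selection logic.

    /* Limited-depth select over an age-ordered issue queue: only the
     * first SEARCH_DEPTH entries are candidates, which keeps the
     * selection circuit small and fast. */
    #include <stdbool.h>

    #define QUEUE_SIZE   32
    #define SEARCH_DEPTH  8    /* top of the queue that is searched */
    #define ISSUE_WIDTH   3    /* e.g. 3 ALUs fed by this queue     */

    typedef struct {
        int  tag;       /* identifies the instruction    */
        bool valid;     /* entry is occupied             */
        bool ready;     /* all source operands available */
    } IQEntry;

    /* Pick up to ISSUE_WIDTH ready instructions, oldest first, looking
     * at no more than SEARCH_DEPTH entries.  Selected entries are freed. */
    int select_ready(IQEntry q[QUEUE_SIZE], int selected[ISSUE_WIDTH])
    {
        int n = 0;
        for (int i = 0; i < SEARCH_DEPTH && n < ISSUE_WIDTH; i++) {
            if (q[i].valid && q[i].ready) {
                selected[n++] = q[i].tag;
                q[i].valid = false;
            }
        }
        return n;
    }

    /* Compaction then pulls entries from the non-searched part of the
     * queue up into the searched window for the next cycle. */
    void compact(IQEntry q[QUEUE_SIZE])
    {
        int dst = 0;
        for (int src = 0; src < QUEUE_SIZE; src++)
            if (q[src].valid)
                q[dst++] = q[src];
        for (; dst < QUEUE_SIZE; dst++)
            q[dst].valid = false;
    }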
How does simultaneous multithreading (SMT) affect this machine?
Could this functionality be used to improve performance? If so, in which cases, and by how much? Perhaps operating systems are an interesting starting point, since they apparently do a large amount of data copying. This approach does restrict the types of copying that one can do, since it is limited to entire rows in the DRAM chips. What if we augmented the hardware not only to copy, but also to quickly zero out a row, or to perform simple boolean operations on two rows? Would this idea work?
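As a sketch of how the mechanism might be exposed to software, the code below uses hypothetical dram_row_copy()/dram_row_zero()/dram_row_and() primitives; the names, the ROW_BYTES size, and the alignment rules are all assumptions (a real design would also constrain source and destination to suitable banks or chips).

    /* Hypothetical software interface to in-DRAM row operations.  The
     * primitives below are invented for illustration; they operate on
     * whole, row-aligned buffers. */
    #include <stdint.h>
    #include <string.h>

    #define ROW_BYTES 8192   /* assumed DRAM row size */

    extern void dram_row_copy(void *dst, const void *src);  /* dst row = src row */
    extern void dram_row_zero(void *dst);                   /* dst row = 0       */
    extern void dram_row_and(void *dst, const void *a,
                             const void *b);                /* dst row = a & b   */

    static int row_aligned(const void *p)
    {
        return ((uintptr_t)p % ROW_BYTES) == 0;
    }

    /* Copy that uses the row primitive where it applies and falls back
     * to an ordinary CPU memcpy for misaligned or partial-row data
     * (the whole-row restriction mentioned above). */
    void copy_rows(void *dst, const void *src, size_t len)
    {
        while (len >= ROW_BYTES && row_aligned(dst) && row_aligned(src)) {
            dram_row_copy(dst, src);
            dst = (char *)dst + ROW_BYTES;
            src = (const char *)src + ROW_BYTES;
            len -= ROW_BYTES;
        }
        if (len > 0)
            memcpy(dst, src, len);
    }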