CS 849C Lecture 2
First, a review of the material from the last lecture, which focused on
computer architecture. The main ideas there were:
- Restoring logic, robust devices
- Parallelism (pipelining, superscalar, VLIW)
- Locality (memory hierarchies, caches)
Review of computer architecture
A simplified representation of the organization of a modern microprocessor
is shown in Figure 1. At this level of description, a modern microprocessor
consists of (1) functional units, which perform the actual useful
operations, (2) registers, which hold intermediate values between
computations, (3) control logic, which ties together the operation of the
various pieces, and usually (4) cache memory, which keeps frequently used
data close at hand so that costly accesses to main memory are avoided.

The processor is interfaced to a memory that holds both instructions
and data; it fetches instructions from memory and writes results back
to memory through a single interface. This organization of processor,
memory and interconnection is referred to as the von Neumann (or
Princeton) architecture. An alternative organization is to have separate
program and data memories, or separate program and data buses. This is
often referred to as the Harvard architecture.
The simplest technique for improving performance through hardware
parallelism is pipelining. Pipelining consists of breaking the work to be
performed into simpler, independent steps, much like breaking the assembly
of a car into the stations of an assembly line. A typical processor
pipeline consists of 4 stages: (1) instruction fetch, (2) instruction
decode, (3) instruction execute and (4) register file write-back/memory
access. In practice, however, real architectures have many more physical
pipeline stages, with multiple physical stages corresponding to one of the
above logical stages. For example, the execute stage might occupy 4
physical pipeline stages.
The primary advantages of pipelining are
- Parallelism
- Smaller cycle time
[smaller number of equivalent gates (FO4) per stage]
- Electrons don't have to go from memory to control to execute to
write-back all in one cycle; they now just have to go from, say,
control to register file or register file to execute, and they only have
to travel within the confines of one pipeline stage during each cycle.
There is, however, overhead in pipelining, both in performance (the extra
delay of interfacing with the pipeline latches) and in area (the pipeline
latches themselves), as the sketch below illustrates.
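To make the cycle-time argument concrete, here is a small sketch with
made-up numbers (an 8 ns unpipelined datapath, 4 stages, 0.2 ns of latch
overhead per stage); the ideal speedup would equal the number of stages,
and the latch overhead is what keeps us below it.
Code example:
#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only: total combinational delay of the
       unpipelined datapath, number of pipeline stages, and the extra
       delay added by each pipeline latch. */
    double total_delay_ns = 8.0;
    int    stages         = 4;
    double latch_ns       = 0.2;

    double unpipelined_cycle = total_delay_ns;
    double pipelined_cycle   = total_delay_ns / stages + latch_ns;

    /* The ideal speedup would be 'stages'; the latch overhead eats into it. */
    printf("pipelined cycle time: %.2f ns\n", pipelined_cycle);
    printf("speedup: %.2f (ideal %d)\n",
           unpipelined_cycle / pipelined_cycle, stages);
    return 0;
}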
Caches are the other big thing done in the last two decades to improve
performance: keep data locally if it is going to be used soon. From a
physics point of view, an access to memory, which is almost exclusively off
chip, means signals have to travel that much further. [In practice, since
we do not want to make the system as slow as its slowest component
(memory), the cycle time is not determined by the memory access time;
instead a memory access takes several cycles to complete.]
Pipelining and caches have been the big revolution of the last 20 years.
Caches are now on-chip. It is cheaper to be on chip, and there is a
tremendous amount of bandwidth, since we do not have to go out through
the processor pins; instead all signals travel on the on-chip wires,
which take advantage of device integration via photo-lithography.
In general, a load from the cache [first-level cache] takes 1-2 cycles,
and a load from memory takes 50-500 cycles. If we have to wait 50 cycles,
then the whole pipeline stalls, and we lose a lot of the parallelism
we could have gotten from pipelining.
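A rough way to see how much this hurts, with illustrative numbers only
(a 2-cycle hit, a 100-cycle miss penalty): the average memory access time
is the hit time plus the miss rate times the miss penalty, so even a few
percent of misses dominates the average.
Code example:
#include <stdio.h>

int main(void)
{
    /* Representative, not measured, numbers: a first-level cache hit
       costs 2 cycles and a miss goes all the way to memory for 100 cycles. */
    double hit_cycles   = 2.0;
    double miss_penalty = 100.0;

    for (int pct = 1; pct <= 10; pct += 3) {
        double miss_rate = pct / 100.0;
        /* Average memory access time = hit time + miss rate * miss penalty. */
        double amat = hit_cycles + miss_rate * miss_penalty;
        printf("miss rate %2d%% -> %5.1f cycles per access on average\n",
               pct, amat);
    }
    return 0;
}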
So we got parallelism from pipelining. Two other approaches are
superscalar execution and VLIW, discussed next.
Code example:
a = (e << 3) + c;   /* statement 1 */
b = a + f;          /* statement 2: reads the a produced by statement 1 */
c = 5;              /* statement 3 */
d = n ^ 0xffff;     /* statement 4 */
e = n*n;            /* statement 5 */
If we execute these, the 3rd, 4th and 5th statements can be executed
independently of the first. A superscalar processor exploits what is
called instruction-level parallelism (ILP). It fetches a bunch of
instructions at once and considers them together. These instructions are
taken out of a window of instructions, typically a window of about 16
instructions (the issue window, or reservation stations). A larger window
gives the processor a larger scope in which to find independent
instructions and execute them as soon as possible. The register file,
however, gets much bigger, since we now need more registers to be able
to do more things in parallel.
Instructions are executed in parallel if they are independent. The neat
thing about superscalar architectures is that you can take an old program
and, without recompilation, run it faster, with no extra effort required
from the programmer.
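A minimal sketch of the dependence test the issue logic performs, using a
toy instruction format (one destination and two source registers, with the
register numbering invented for this example): a later instruction cannot
issue together with an earlier one if it reads a register the earlier one
writes.
Code example:
#include <stdbool.h>
#include <stdio.h>

/* Toy instruction: one destination register and two source registers. */
struct insn {
    int dst;
    int src1;
    int src2;
};

/* True if 'later' reads the register that 'earlier' writes
   (a read-after-write dependence), so the two cannot issue together. */
static bool depends_on(struct insn earlier, struct insn later)
{
    return later.src1 == earlier.dst || later.src2 == earlier.dst;
}

int main(void)
{
    /* Registers numbered: a=0, b=1, c=2, d=3, e=4, f=5, n=6. */
    struct insn i1 = { 0, 4, 2 };  /* a = (e << 3) + c */
    struct insn i2 = { 1, 0, 5 };  /* b = a + f        */
    struct insn i4 = { 3, 6, 6 };  /* d = n ^ 0xffff   */

    printf("i2 depends on i1: %d\n", depends_on(i1, i2)); /* 1: must wait  */
    printf("i4 depends on i1: %d\n", depends_on(i1, i4)); /* 0: can go now */
    return 0;
}
The sketch only checks the read-after-write case; real issue logic also
deals with write-after-read and write-after-write cases, typically through
register renaming.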
Another option is VLIW. Here, we do not try to do all of the figuring
out on the fly in hardware. The processor is now a lot simpler, and the
complexity is moved to the compiler. Instead of having 4 separate
instructions with dependences, the compiler arranges the program so that
the instructions sent to the processor are bundles (single wide
instructions) containing, say, 4 independent operations, which are fetched
at the same time, executed at the same time, and so on. The compiler now
determines what has to be done in parallel.
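As a rough picture of what a compiler for a 4-wide VLIW machine might emit
for the code example above (the bundle layout and the explicit no-ops are
illustrative, not any particular machine's encoding): independent
operations are packed into one wide instruction, and slots with nothing
useful to do are filled with no-ops.
Code example:
#include <stdio.h>

/* One 4-wide VLIW bundle; each slot names an operation, purely for display. */
struct bundle {
    const char *slot[4];
};

int main(void)
{
    /* The compiler packs independent operations into each bundle;
       b = a + f must wait for a, so it goes into the second bundle. */
    struct bundle program[] = {
        { { "a = (e << 3) + c", "c = 5", "d = n ^ 0xffff", "e = n * n" } },
        { { "b = a + f",        "nop",   "nop",            "nop"       } },
    };

    for (int i = 0; i < 2; i++) {
        printf("bundle %d:", i);
        for (int j = 0; j < 4; j++)
            printf("  [%s]", program[i].slot[j]);
        printf("\n");
    }
    return 0;
}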
The difference between VLIW and superscalar: in a superscalar machine the
hardware discovers the independent instructions at run time, while in a
VLIW machine the compiler does that analysis statically and encodes it in
the bundles, so an old binary does not automatically run faster on a wider
VLIW machine.
Other types of parallelism:
- (Flynn classification: SIMD, MIMD, SISD, MISD)
- MIMD (message passing): these failed because they could not scale
with Moore's law, due to the large amount of custom hardware involved
- SIMD: one instruction, lots of data elements, e.g. take an array
and increment all of its elements at once. A very limited programming model
- Vector processor: vector math, e.g. add one vector (array) of data
to another vector (array) of data (see the sketch after this list)
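The loop below is the kind of element-wise work that a single vector or
SIMD instruction performs in one go; on a scalar machine it is one add per
element (a sketch, not tied to any particular vector instruction set).
Code example:
#include <stdio.h>

int main(void)
{
    int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int b[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };
    int c[8];

    /* On a scalar machine this is 8 separate adds; a SIMD or vector
       machine expresses the whole loop as one instruction that operates
       on all of the elements at once. */
    for (int i = 0; i < 8; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);
    printf("\n");
    return 0;
}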
These architectures may not be sufficient to provide the continued
increases in performance that we have come to expect from microprocessors.
One reason is process technology; the other is economics. As we go to
sub-100-nanometer technology, fabricating these architectures in silicon
becomes much harder. For example, in superscalar architectures with a
large, complex register file, we now have wires that have to run a much
greater length across the chip. At these small process dimensions such
wires begin to behave like transmission lines as opposed to perfect
conductors. They get slow compared to transistors, and they do not scale
as well as transistors. The on-chip interconnect then begins to become a
limit to performance.
Reconfigurable computing
The main idea: take programs and run them through a compiler, but instead
of getting a list of instructions, we get a circuit. Thus, into the
compiler goes our program and out of the compiler comes custom hardware.
We would not want to take that circuit and implement it directly in
silicon, because then, as you might imagine, debugging would be very
expensive! Instead, you take the circuit description of your program and
load it into a reconfigurable fabric, like an FPGA or CPLD.
The way an FPGA works:
- The configuration for the FPGA is held in a separate configuration
memory. Any function we desire can be implemented by storing an
appropriate configuration in the configuration memory, which determines
how the logic elements are wired up (see the sketch below).
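To make this concrete, here is a sketch of a single 2-input lookup table
(LUT), the basic logic element in most FPGAs; the stored 4-bit
configuration is simply the truth table of the desired function, and the
inputs select which bit drives the output (the configuration values below
are illustrative).
Code example:
#include <stdio.h>

/* A 2-input LUT: 'config' holds the 4-entry truth table, and the two
   inputs select which entry drives the output. */
static int lut2(unsigned config, int in1, int in0)
{
    int index = (in1 << 1) | in0;
    return (config >> index) & 1;
}

int main(void)
{
    /* The truth table 0b0110 makes this LUT an XOR gate; storing 0b1000
       instead would turn the same hardware into an AND gate -- only the
       configuration changes, not the logic element. */
    unsigned xor_config = 0x6;

    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("%d XOR %d = %d\n", a, b, lut2(xor_config, a, b));
    return 0;
}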

Now, going back to ILP. If we had a program with, say, 4 operations
that could be carried out in parallel, then when our program is compiled
to hardware, we have hardware to perform each of those operations in
parallel. Going back to the case of the superscalar and VLIW machines,
the hardware is fixed, so if, for example, we had 2 functional units,
then even though our program has more ILP, we will not be able to exploit
it. Basically, in designing a general-purpose processor, the designers
make a tradeoff as to how many functional units of each type to include
for the general case. For example, an architecture might have two integer
ALUs, a shifter, a load-store unit, an integer multiplier and a
floating-point ALU. If there is some application that would benefit
significantly from having 4 integer ALUs and has no need for floating-point
hardware, then there is nothing we can do, since the hardware is general
(and fixed).
In reconfigurable hardware, rather than having to implement wires going to
memory and so on, we can just map the output of one operation's hardware
to the input of another. There are also many things that are
expensive to do in software but easy to do in hardware:
- Population count in software (see the sketch after this list)
- Bit reversal - just map the wires
- Custom bit-width hardware
- Having the hardware adapt to the data input -- e.g. have custom
hardware generated for specific data inputs.
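To see why the first two items are costly in software, here is roughly
what they look like as loops for 32-bit words; in hardware a population
count is a small tree of adders over the bits, and bit reversal is
literally just a crossing of wires, with no gates at all.
Code example:
#include <stdio.h>
#include <stdint.h>

/* Population count in software: one loop iteration per set bit.
   In hardware this is a small tree of adders over the 32 bits. */
static int popcount32(uint32_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Bit reversal in software: 32 shift/mask/or steps.
   In hardware it is just a crossing of wires, with no logic at all. */
static uint32_t reverse32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

int main(void)
{
    printf("popcount(0xF0F0) = %d\n", popcount32(0xF0F0u));            /* 8 */
    printf("reverse(0x1)     = 0x%08x\n", (unsigned)reverse32(0x1u));  /* 0x80000000 */
    return 0;
}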

Nanofabrics paper
The primary message of the paper is that the method of putting these devices
together is different from photo-lithography,
and because of the inherently different nature of the technology,
we need a different architecture.
Question - how do you implement a memory cell?
The architecture as described in the paper does not say what a memory
cell will look like. Instead, when we want to store a data value, e.g. a
logic '1', we configure the molecule on, and vice versa. To read the
data, we address the line and read off the voltage at that site. What is
missing is how we can configure these devices on and off quickly, and how
we access the multiplexers to address the data.
The reason this is hard is that the wires are, e.g., 10 nm apart, so how
do we address an individual wire? One way would be to have some kind of
demultiplexer, taking in log(n) control wires and driving n wires out, so
we would go from fat photo-lithographic wires down to the nanoscale. Such
multiplexers would have to be 10x bigger than a nanoblock. So instead we
address the wires in the time domain. A method that will be difficult to
get right (because of the need for precise process control) is as follows:
- Given the logic array: if you look at the diagram of the nanoblock in
Figure 2 below, each one of the hashed regions is implemented with
photo-lithography, and we want to minimize the number of these per
nanoblock, because photo-lithography has rules and restrictions on how
close these photo-lithographically made blocks can be placed. With a
'cheap' 100 nm process, we are getting, say, 1.4 um on the side of a
nanoblock if we want to have 7 of these per side. Imagine the red wire is
a big resistor; we run a high voltage between the 2 nodes to configure it.
To get the high voltage across it, say 10 V, we put 5 volts on one side
and -5 volts on the other side, so there is a gradient of voltage across
the wire, and there is only one position that sees the full 10 V. So we
start off with a very high voltage applied to the entire array and slowly
turn the voltage down, leaving behind configured only those switches we
want to program, thus using time multiplexing as opposed to space
multiplexing. This approach does not work; one of the reasons is that you
have to get the process tolerances just right.

The wires are made of carbon nanotubes, which are essentially single
molecules about 1.2 nm in diameter. At the nanometer scale they have a
conductivity of the square of the conductivity of copper. With copper at
that small dimension we get into grain-size issues and the like.
Molecular Switches
The currently proposed molecules for molecular switches are made of rotaxane

If the ring (orange in Figure) moves to the right, the switch is closed
and if the ring is to the left, it is open.
FPGAs require a separate configuration memory, with all of its extra
wires from the memory cells, giving on the order of 10-100x overhead.
With the reconfigurable molecular switch, on the other hand, the
memory/state is the position of the ring, so the switch holds its own
state and we can also program it with the same wires we use to read data.
We use electricity to cause a conformational change in the molecule
(movement of the ring). This programming does not happen quickly, as it
involves molecular changes. Though current molecular electronic devices
may seem a little primitive, the technologies for fabricating these
molecular electronic devices are constantly improving.
A big issue, however, is the contact resistance between the nanoscale
devices and the photo-lithographically fabricated devices. It is on the
order of a gigaohm to a teraohm. Creating a truly ohmic contact is hard.
There have been some claims of success in this, but we do not know of any
published results.
The NanoFabric architecture
Nanoscale devices must still interface with the outside world, which is
dominated by photo-lithographically manufactured devices. Even within
a nanotech-based device, the device dimensions are restricted by the
minimum feature sizes at the micro-scale.
SE block

Inputs and outputs are nanoscale wires. It is easy to have the wires
cross, but rather difficult to connect them end-to-end, with inputs on one
side and outputs on the other side. To make connections between blocks,
the nanoblocks are floated in and allowed to settle into pits in a
substrate; the same thing happens for the NW blocks. There is a reasonable
amount of slop in how accurately each nanoblock is placed, as can be seen
from the Figure below.

The wires are probably coated to begin with, and after the float-in there
is a dissolve step; this makes it possible to use a coating that prevents
the wires from breaking while they are floated in.
In connecting nanoblocks, there are restrictions on which blocks can be
interconnected. Due to the arrangement of inputs and outputs and the
alignment of nanoblocks, each SE block can only send outputs to the NW
block to its upper right and to the SE block to its lower left. All other
communication therefore goes through intermediate SE and NW blocks.
Is this layout sufficient to map any circuit onto it (as we can for
FPGAs, as previously mentioned)? We can think of this layout as a
2-dimensional mesh. It takes 4 hops to get across, which slows
communication from A to B down by a factor of 4.
Defect Tolerance
When making a complex irregular structure, just take out one piece and
the whole thing is broken. If you have regularity in the underlying
structure, then we can program around the defects.
Imagine you are programming your MLA to implement a half-adder, we can do
things to avoid the defect. Finding the defects is however hard, since
we cannot go in and address independent wires. The general approach is
that you test enough of it externally, so that it can then test itself.
Since we can't go in and look at any individual element, what is done
instead is to implement, for example, a Linear Feedback Shift Register
(LFSR) on the fabric. The LFSR is essentially a pseudo-random number
generator that will generate a known sequence of pseudo-random outputs.
Several LFSRs are implemented in the grid in the fashion shown in the
figure [insert figure from notes]. Now we crank each of these LFSRs,
which are mapped onto different (though overlapping) parts of the fabric,
and check the results. If there is a defect in any part of the fabric
onto which an LFSR is mapped, the generated outputs will begin to diverge
from the correct sequence. So, using the outputs from the different LFSRs
mapped to the fabric, we can narrow down which area contains the defect.
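As a sketch of the kind of LFSR involved, here is a 16-bit Fibonacci LFSR
with taps at bits 16, 14, 13 and 11 (a standard maximal-length choice);
starting from a known seed it produces a fixed, reproducible sequence, so
a fabric-mapped copy whose outputs diverge from this sequence must contain
a defect somewhere in the region it was mapped onto.
Code example:
#include <stdio.h>
#include <stdint.h>

/* One step of a 16-bit Fibonacci LFSR with taps 16, 14, 13 and 11:
   the new bit is the XOR of the tap bits, shifted in at the top. */
static uint16_t lfsr_step(uint16_t state)
{
    uint16_t bit =
        ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1u;
    return (uint16_t)((state >> 1) | (bit << 15));
}

int main(void)
{
    uint16_t state = 0xACE1u;   /* any non-zero seed gives a known sequence */

    /* The same seed always produces the same outputs, so a fabric-mapped
       LFSR whose outputs differ from these must contain a defect. */
    for (int i = 0; i < 8; i++) {
        state = lfsr_step(state);
        printf("%04x\n", state);
    }
    return 0;
}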