#### Reconfigurable Computing

#### Seth Copen Goldstein CMU

Joint work with Herman Schmit, Matt Moe, Mihai Budiu, Srihari Cadambi, Reed Taylor, and Ron Laufer

#### Application Base Is Changing

- Domains: embedded, desktop
- Characteristics:
  - Abundant parallelism
  - Variable bit widths
  - Streaming data
  - Demands high performance
- Example applications: compression, optical flow, synthetic aperture radar, filters, search

© 1999 Seth Copen Goldstein

ISCA'99 5/1/99

#### General-Purpose Custom Hardware





#### Why?

|                            | FPGAs           | PipeRench   |
|----------------------------|-----------------|-------------|
| Design Goal                | Glue-logic      | Datapaths   |
| Configuration<br>Time      | Long (> 100 ms) | Almost zero |
| Resources                  | Fixed           | Virtual     |
| Forward<br>Compatibility   | No              | Yes         |
| Programming<br>Model       | None            | Pipelines   |
| Development<br>Environment | Poor            | Good        |



#### Stream-based Function: FIR
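As a rough software illustration of a stream-based FIR filter, the sketch below consumes one sample at a time and emits one output per input, which is the access pattern a pipelined datapath would exploit. The function name `fir_stream` and the tap values are illustrative assumptions, not taken from the talk.

```python
from collections import deque


def fir_stream(samples, taps):
    """Yield one FIR output per input sample (hypothetical helper).

    y[t] = sum_k taps[k] * x[t - k], with x[t] = 0 for t < 0.
    """
    # Sliding window of the most recent len(taps) samples, newest first.
    window = deque([0] * len(taps), maxlen=len(taps))
    for x in samples:
        window.appendleft(x)  # the oldest sample falls off the right end
        yield sum(c * w for c, w in zip(taps, window))


if __name__ == "__main__":
    # An impulse input reproduces the tap values in order.
    print(list(fir_stream([1, 0, 0, 0], [1, 2, 3])))
```

Each tap multiply-accumulate is independent of the others for a given output, so in hardware each could occupy its own pipeline stage.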



#### **Pipelined Reconfiguration**



#### Hardware Virtualization

More Hardware = More Throughput



Forward Compatibility

#### PipeRench Architecture




- On-chip configuration cache
- Fabric divided into stripes
- Each stripe is 1 pipeline stage
- Configure 1 stripe in 1 cycle
- Apparent configuration load time is zero!
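The bullets above can be illustrated with a toy schedule. The sketch assumes one stripe is configured per cycle in round-robin order, so a pipeline with more virtual stripes than physical stripes is time-multiplexed onto the fabric. The function `schedule` and the round-robin policy are simplifying assumptions; the sketch does not model data flow or state save/restore.

```python
def schedule(virtual, physical, cycles):
    """Return, per cycle, which virtual stripe each physical stripe holds.

    Assumption: in cycle t, virtual stripe (t mod virtual) is written
    into physical stripe (t mod physical). None means "not yet configured".
    """
    fabric = [None] * physical
    history = []
    for t in range(cycles):
        fabric[t % physical] = t % virtual  # configure exactly one stripe
        history.append(list(fabric))
    return history


if __name__ == "__main__":
    # Five virtual stripes mapped onto three physical stripes.
    for t, row in enumerate(schedule(virtual=5, physical=3, cycles=6)):
        print(t, row)
```

Because only one stripe changes per cycle, the remaining stripes keep computing while the new configuration loads, which is why the load time is hidden.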

#### The PipeRench Fabric


#### The interconnect



#### Evaluating the Design Space

Methodology: Examine the hardware-software interaction

- Parameterize the architecture
  - Stripe is $n \cdot b$ bits wide, $64 \le n \cdot b \le 256$
  - PE is $b$ bits wide, $2 \le b \le 32$
  - PE has $r$ registers, $2 \le r \le 16$
- Parameterize the compiler
- Compile kernels for each design point.
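A minimal sketch of how the parameter ranges above could be enumerated into design points. Restricting every parameter to powers of two is an assumption made here to keep the sweep small; the talk only gives the ranges. The function name `design_points` is hypothetical.

```python
def design_points():
    """Enumerate (n, b, r) tuples within the stated ranges.

    Assumption: stripe width, PE width, and register count are all
    restricted to powers of two.
    """
    pe_widths = [2, 4, 8, 16, 32]      # b, with 2 <= b <= 32
    stripe_widths = [64, 128, 256]     # n * b, with 64 <= n * b <= 256
    register_counts = [2, 4, 8, 16]    # r, with 2 <= r <= 16
    points = []
    for b in pe_widths:
        for total_bits in stripe_widths:
            n = total_bits // b        # number of PEs per stripe
            for r in register_counts:
                points.append((n, b, r))
    return points


if __name__ == "__main__":
    pts = design_points()
    print(len(pts), "design points, e.g.", pts[0])
```

Each tuple would then drive one compile-and-measure run per kernel.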

#### Hardware Synthesis Flow



#### The Number of Stripes in 50mm<sup>2</sup>


#### The Compiler



#### Throughput



#### PE width tradeoffs

|                             | Narrow | Wide  |
|-----------------------------|--------|-------|
| Utilization                 | High   | Low   |
| Carry Chain Speed           | Slow   | Fast  |
| Configuration Size          | Big    | Small |
| Interconnect<br>Flexibility | More   | Less  |
| Ease of Compilation         | Easy   | Hard  |

#### How Many Registers?









#### Sources of Performance

- Exploit multiple levels of parallelism
  - MIMD, SIMD
  - ILP
  - Pipeline
  - bit-level
- Custom function units
  - Custom sizes
  - Specialized functions
- Improved memory performance
- Data dependent hardware generation



#### What Computer Architects Do

- Given Constraints of:
  - Technology
  - Application
- Use Essential Themes:
  - Exploit locality (AKA caching)
  - Prediction / Speculation
  - Pipelining
  - Parallelism
  - Virtualization / Indirection
  - Specialization
- But, there is progress



#### What We Need To Do

- Decrease design costs
- Improve design verification
- Reduce manufacturing cost
  - Eliminate mask costs
  - Decrease fab plant costs
  - Increase yield
- Keep compilation time constant
- Invent a new technology

PDL 11/00

#### Goal

- Programmable Logic Device with
  - $\ge 10^{10}$ gate-equivalents/cm<sup>2</sup>
  - $\le 1$ W/cm<sup>2</sup>
  - $\le$ nanocents per gate
  - $\ge 10^{11}$ ops/sec
- Replace ASIC

## Less than (100 nm)<sup>2</sup> for each gate and all its associated routing
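The area budget follows directly from the density goal of $10^{10}$ gate-equivalents per square centimeter. The short calculation below is a sketch of that arithmetic; the constant and function names are illustrative.

```python
import math

NM_PER_CM = 10**7                # 1 cm = 10^7 nm
NM2_PER_CM2 = NM_PER_CM**2       # 1 cm^2 = 10^14 nm^2
GATES_PER_CM2 = 10**10           # density goal from the slide


def area_per_gate_nm2(gates_per_cm2=GATES_PER_CM2):
    """Area available to each gate, including its routing, in nm^2."""
    return NM2_PER_CM2 // gates_per_cm2


if __name__ == "__main__":
    area = area_per_gate_nm2()
    side = math.isqrt(area)
    print(f"{area} nm^2 per gate, i.e. a {side} nm x {side} nm footprint")
```

So each gate and its routing must fit in a square roughly 100 nm on a side.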

© 2000 Seth Copen Goldstein

#### Electronic Nanotechnology

- 1-30 nanometer features
- Uses electrons, currents, voltages to accomplish computing
- Can build the components we need (diodes, wires, resistors, latches, switches, and maybe transistors)
- Fabrication by directed self-assembly
- Fabrication method implies
  - defects
  - non-deterministic placement

#### Wires

- 2-30 nm in diameter
- < 2000 nm in length
- good conductors
- excellent current densities
- can be built from
  - organics (carbon nanotubes)
  - metals
- Can be either metallic or semiconducting




#### Diodes


- Conducts current in only one direction
- reasonable on/off ratios
- has only 2 terminals
- Results from contact between 2 different materials




#### Configurable molecules

- Molecule that can be
  - conducting (ON)
  - insulating (OFF)
- Turn on with a voltage drop of $V_{dd} + V_{config}$
- Turn off with $-V_{config}$
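A toy model of the switching behavior described above. The voltage values are hypothetical, and treating every voltage between the two thresholds as "hold the current state" is an assumption made for illustration; real molecular switches will be far less ideal.

```python
V_DD = 1.0      # hypothetical operating voltage
V_CONFIG = 0.5  # hypothetical extra configuration voltage


class ConfigurableMolecule:
    """Two-terminal switch that is either conducting (ON) or insulating (OFF)."""

    def __init__(self):
        self.conducting = False  # start in the OFF state

    def apply(self, voltage):
        """Apply a voltage drop across the molecule and return its state."""
        if voltage >= V_DD + V_CONFIG:
            self.conducting = True    # programming pulse: turn ON
        elif voltage <= -V_CONFIG:
            self.conducting = False   # reverse pulse: turn OFF
        # Assumed: anything in between leaves the state untouched.
        return self.conducting


if __name__ == "__main__":
    m = ConfigurableMolecule()
    print([m.apply(v) for v in (1.0, 1.5, 1.0, -0.5)])
```

The point of requiring more than $V_{dd}$ to switch on is that normal operation at or below $V_{dd}$ does not disturb the stored configuration.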



#### Synthesis primitives

- Wires can be grown
- Molecules can be made

What is missing:

- No end-to-end alignment of wires
- Nothing more complex than a 2-d mesh
- Gain


No complexity!

## Is it enough?


### Underlying Technology

- Lots of wires
- Lots of switches
- All programmable

Scale implies:

- Defects
- Randomness

#### A case for Reconfigurable Devices

#### Where Complexity?





#### Computing with a nanoBlock

- The active component is a diode.
- Use diode-resistor logic
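A minimal sketch of diode-resistor logic using idealized diodes with a fixed forward drop. The supply voltage and diode drop are hypothetical values, and resistor-divider effects are ignored. The example also shows why gain has to come from somewhere else: each stage degrades the signal level.

```python
V_DD = 1.0    # hypothetical supply voltage
V_DROP = 0.1  # hypothetical diode forward drop


def diode_and(*inputs):
    """AND: output pulled up to V_DD through a resistor, with a diode from
    the output to each input. The lowest input drags the output down to
    that input plus one diode drop."""
    return min(V_DD, min(inputs) + V_DROP)


def diode_or(*inputs):
    """OR: output pulled down to ground through a resistor, with a diode
    from each input to the output. The highest input lifts the output to
    that input minus one diode drop."""
    return max(0.0, max(inputs) - V_DROP)


if __name__ == "__main__":
    print("AND(1, 0) =", diode_and(1.0, 0.0))
    print("OR(1, 0)  =", diode_or(1.0, 0.0))
    # Cascading two OR stages loses two diode drops.
    print("OR(OR(1, 0), 0) =", diode_or(diode_or(1.0, 0.0), 0.0))
```

A logic low out of the AND gate sits one drop above ground and a logic high out of the OR gate sits one drop below the supply, so signals need restoring after a few stages.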







#### Requirements for the Matrix

- Good diodes
  - e.g., low voltage drop
- Different resistor values
  - e.g., R and 10R, where $R \sim 10^9\ \Omega$
- Low-R wires
- Reasonable current densities
- Using SPICE (and guesses!):
  - switching at 100 MHz-1 GHz
  - nanowatts











#### The NanoFabric



#### NanoBlock Buildable



#### Densities

- Assuming:
  - 100 nm CMOS process
  - 20 nm centers for nanowires
  - Blocks are 3-input, 8 internal wires
  - 20 long-lines per routing channel
  - Pretty poor diodes

| Quantity               | Value                                   |
|------------------------|-----------------------------------------|
| Blocks                 | $1.6 \times 10^{8}$/cm<sup>2</sup>      |
| Configuration bits     | $4.5 \times 10^{10}$/cm<sup>2</sup>     |
| Avg configuration time |                                         |
| Power (@ 10 MHz)       |                                         |

#### Now what

- Defect discovery
- Compilation

#### What Is a Defect?

- Defects in Manufacturing
  - broken wires
  - crossed wires
  - stuck-at switches
  - etc.
- Defects in Knowledge
  - unknown device
  - unknown characteristics of device

#### Think of this as Capability Mapping

#### Mapping the Device

- Need to discover the characteristics of the individual components, but
- Can't selectively stimulate or probe the components
- Download test machines






#### Mapping the Device

- Download signature generators
- Then, also detectors
- Then, replicators
- Will scale (at worst) linearly
- For proposed device ~1 day to map
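To make the idea of mapping without direct probing concrete, here is a toy sketch. It is not the procedure from the talk: it assumes a square grid of components, test circuits that each span one row or one column, and a test that passes only when every component it uses is good. All names are hypothetical.

```python
def map_defects(size, defective):
    """Return the cells that row and column tests cannot prove good.

    Assumption: a test circuit covering a set of cells passes only if
    none of those cells is defective, and only pass/fail is observable.
    """
    def test_passes(cells):
        return not any(cell in defective for cell in cells)

    suspects = {(r, c) for r in range(size) for c in range(size)}
    for r in range(size):
        row = [(r, c) for c in range(size)]
        if test_passes(row):
            suspects -= set(row)   # every cell in a passing test is good
    for c in range(size):
        col = [(r, c) for r in range(size)]
        if test_passes(col):
            suspects -= set(col)
    return suspects


if __name__ == "__main__":
    print(map_defects(4, {(1, 2)}))
    print(sorted(map_defects(4, {(0, 0), (1, 1)})))
```

With one defect the suspect set is exact. With several defects it is a conservative superset, which is acceptable when spare resources are plentiful. The number of tests here grows with the side of the grid rather than with the number of components.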



# Requirements to support defect tolerance

- Reconfigurable individual components
- Knowledge of
  - defect-free architecture
  - potential characteristics of components
- Digital detection of Characteristics
- Localization of faults

## Architectural Abstractions




#### Split-phase Abstract Machine

- A collection of simple cooperating processes
- All potentially long-latency operations are split-phase
  - procedure calls
  - memory operations
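A split-phase operation separates issuing a request from receiving its reply, so the issuing process is free to do other work in between. The sketch below shows this for a memory read. The `Memory` class, its fixed latency, and the callback interface are assumptions for illustration and do not describe the actual abstract machine.

```python
from collections import deque


class Memory:
    """Memory with a fixed latency and a split-phase read interface."""

    def __init__(self, data, latency):
        self.data = data
        self.latency = latency
        self.pending = deque()  # entries: (ready_time, address, on_reply)

    def request(self, now, address, on_reply):
        """Phase 1: issue the read and return immediately."""
        self.pending.append((now + self.latency, address, on_reply))

    def tick(self, now):
        """Phase 2: deliver every reply whose latency has elapsed."""
        while self.pending and self.pending[0][0] <= now:
            _, address, on_reply = self.pending.popleft()
            on_reply(self.data[address])


def run():
    mem = Memory({0: 10, 1: 20}, latency=3)
    replies, overlapped_work = [], 0
    mem.request(0, 0, replies.append)   # both reads are in flight at once
    mem.request(0, 1, replies.append)
    for now in range(1, 5):
        mem.tick(now)
        if len(replies) < 2:
            overlapped_work += 1        # cycles spent on other work
    return replies, overlapped_work


if __name__ == "__main__":
    print(run())
```

Since no process ever blocks on a long-latency operation, long wires and deep memory hierarchies cost latency but need not cost throughput.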



#### SAM


- Allows compilers to avoid global picture.
- Eliminates need for central control.
- Supports long wires.
- Supports memory hierarchy.
- Can be extended to deal with single-event upsets.

#### Tile Machine

- Map each SAM process to a simple "processor"
  - simple FSM
  - customized circuit
  - local memory
  - static network switch
  - packet network router






#### Compilation

- Separate out into SAM processes
- Localize memory
- Partition
- Map to tiles (local place & route)
- Place and route tiles
- Map tiles to NanoBlocks

#### It is sufficient!

- Matrix components:
  - Diodes
  - different resistors
  - different length wires
- Gain provided by latches
- Connections to CMOS wires
- < 5% defect rate
- Reconfigurable

#### Conclusions


- Can build regular structures
- Exploits reconfigurability to provide
  - Built-in self test
  - Custom circuits
- nanoFabric is scalable
  - Low power
  - MHz
  - Tera-components

#### Chip Layout




#### PipeRench Tile



#### Without M3 or M4









#### Why Not?

## Field Programmable Gate Arrays are not suited for general-purpose custom hardware.

- Fixed resources
- No forward compatibility
- Oriented towards glue-logic
- Long configuration times
- No programming model
- Poor Development Environment


#### The PipeRench Fabric



#### Why Not?

|                            | FPGAs           | PipeRench   |
|----------------------------|-----------------|-------------|
| Design Goal                | Glue-logic      | Datapaths   |
| Configuration<br>Time      | Long (> 100 ms) | Almost zero |
| Resources                  | Fixed           | Virtual     |
| Forward<br>Compatibility   | No              | Yes         |
| Programming<br>Model       | None            | Pipelines   |
| Development<br>Environment | Poor            | Good        |

#### **Future Requirements**

- Embedded Computing
  - COTS part: ↓ manufacturing cost
  - Easy programming model: ↓ design cost
  - Field programmable: ↑ flexibility
  - Hardware performance
- General-Purpose Computing
  - Increased Performance
  - Respect the memory bandwidth gap
- Both
  - Variable bit width operations
  - Exploit all levels of parallelism
  - Use replication

#### Finding the Best Instance

- Number of PEs per stripe, $n$
- Bit width of each PE, $b$
- The number of registers per PE, $p$
- The type of interconnect
- The internals of a PE

#### Space is Time

- Larger configurations
  - $\Rightarrow$  more virtualization
  - $\Rightarrow$  lower throughput
- Tradeoff between circuit delay and circuit size is direct
- E.g., Faster carry chains
  - $\Rightarrow$  wider additions per clock cycle
  - $\Rightarrow$ fewer stripes
  - $\Rightarrow$  better performance
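The carry-chain example can be put in numbers. The sketch assumes that an addition wider than the carry chain can resolve in one clock cycle is split across consecutive stripes, one chunk per stripe. The function name and the bit counts are illustrative.

```python
def stripes_for_add(width_bits, carry_bits_per_cycle):
    """Stripes needed for one addition (hypothetical model).

    Assumption: each stripe resolves at most carry_bits_per_cycle bits
    of carry per clock, so wider additions are chained across stripes.
    """
    # Ceiling division using integers only.
    return -(-width_bits // carry_bits_per_cycle)


if __name__ == "__main__":
    for reach in (8, 16, 32):
        print(f"carry reach {reach:2d} bits -> "
              f"{stripes_for_add(32, reach)} stripe(s) for a 32-bit add")
```

Fewer stripes per operation means a smaller configuration, which in turn means less virtualization and higher throughput on the same fabric.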

#### PE width tradeoffs

|                             | Narrow | Wide  |
|-----------------------------|--------|-------|
| Utilization                 | High   | Low   |
| Carry Chain Speed           | Slow   | Fast  |
| Configuration Size          | Big    | Small |
| Interconnect<br>Flexibility | More   | Less  |
| Ease of Compilation         | Easy   | Hard  |
