Scribe Notes
PipeRench February 4, 1998

By Greg Mann
 
Reconfigurable Computing Seminar 2/4/98

Slide 1: PipeRench February 4, 1998

Slide 2: Lab 2
Goal of Lab 2: Implement a kernel on the PipeRench architecture. The given kernels should all be workable; if someone has another idea for a kernel, it should be discussed with Prof. Goldstein or Dr. Schmit.
Proposals: Proposals for kernel implementations are due on Friday.

Slide 3: Reconfigurable Computing Today
Applications designed for reconfigurable computing have been geared toward FPGAs. As a result, designs are strictly constrained by the size of the chip: once a design requires more CLBs than the chip provides, performance drops to zero because the design cannot fit onto the chip. Moreover, once architectural changes are made, the only way to increase the performance of an existing application is to redesign it.

Slide 4: PipeRench Goals
The concept is to introduce a virtual hardware space, similar to a virtual address space. Just as programmers can write code under the assumption of an infinite amount of available memory, applications targeted for reconfigurable devices should be able to be designed under the assumption of an infinite amount of fabric. The result is time-multiplexed physical hardware, not unlike a time-multiplexed physical memory space. This eases the constraint on size: an application whose virtual hardware is larger than the available physical hardware can still execute, though with a performance hit. By targeting highly pipelined applications, one can develop a model that is "breakable" into pieces.

Slide 5: Virtualization
The architecture needs to maintain a concept of a fundamental piece (analogous to a virtual "page") and be able to schedule these pages and swap them into and out of the physical fabric.
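As a rough illustration of the page analogy (plain Python, with invented stripe counts; not part of any PipeRench tool), the sketch below time-multiplexes more virtual stripes than there are physical ones by cycling them through the physical slots round-robin:

```python
# Minimal sketch of hardware virtualization: more virtual stripes than
# physical stripes, scheduled round-robin onto the physical fabric.
# All counts are invented for illustration.

N_VIRTUAL = 8   # virtual stripes in the application
N_PHYSICAL = 3  # physical stripes on the chip

physical = [None] * N_PHYSICAL  # which virtual stripe occupies each slot

for cycle in range(N_VIRTUAL):
    slot = cycle % N_PHYSICAL    # next physical slot to (re)configure
    evicted = physical[slot]     # virtual stripe being swapped out, if any
    physical[slot] = cycle       # swap the next virtual stripe in
    print(f"cycle {cycle}: configure virtual stripe {cycle} into slot {slot}"
          + (f", evicting virtual stripe {evicted}" if evicted is not None else ""))
```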

Slide 6: Component Reconfiguration
The current concept of reconfigurable computing is component reconfiguration. With a standard FPGA, executing an application larger than the chip requires partitioning it, reconfiguring the chip, and storing off intermediate data. Since reconfiguration of these chips is on the order of milliseconds, one can either compute the first part of one value, reconfigure, and compute the last part (poor throughput), or compute the first part of several values, reconfigure, and compute the last part of those values (poor latency).
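A back-of-the-envelope comparison of the two options, with hypothetical numbers chosen only to show why millisecond-scale reconfiguration hurts:

```python
# Hypothetical numbers contrasting the two component-reconfiguration
# strategies; reconfiguration time dominates when it is paid per value.

t_reconfig = 5e-3   # one full-chip reconfiguration (milliseconds)
t_half = 1e-6       # computing one half of one value
n_values = 1000

# Option 1: reconfigure between the two halves of every value.
# Each value pays two reconfigurations -> poor throughput.
t_per_value = n_values * (2 * t_half + 2 * t_reconfig)

# Option 2: compute the first halves of all values, reconfigure once,
# then compute all the second halves.  Only two reconfigurations total,
# but the first result is not ready until the whole batch finishes.
t_batched = 2 * t_reconfig + 2 * n_values * t_half

print(f"option 1 (per-value reconfiguration): {t_per_value:.2f} s")
print(f"option 2 (batched reconfiguration):   {t_batched:.3f} s total, "
      f"which is also the latency of the first result")
```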

Slide 7: Pipeline Stage Reconfiguration
What if the chip could configure one stage of the application while the previous stage was executing, and reconfigure other pieces as needed?
This would hide the reconfiguration time of each stage behind the execution of the others.
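A toy schedule of this idea (pure illustration; it assumes configuring one stage takes exactly one cycle, which is not claimed in the talk):

```python
# Toy schedule for pipelined reconfiguration: while one stage is being
# configured, the stages configured on earlier cycles keep executing.
# Assumes, purely for illustration, that configuring a stage takes one cycle.

n_stages = 4

for cycle in range(n_stages + 2):
    configuring = cycle if cycle < n_stages else "-"
    executing = list(range(min(cycle, n_stages)))  # stages already configured
    print(f"cycle {cycle}: configuring stage {configuring}, "
          f"executing stages {executing}")
```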

Slide 8: Pipeline Stage Reconfiguration
The application executes under the illusion of having enough hardware; in other words, the hardware stores the state of each stage for the application. This approach is more compiler-oriented.

Slide 9: Pipelined Reconfiguration

Slide 10: Hardware Virtualization

Slide 11: Benefits
Applications with smaller performance wins can be implemented faster, so their benefits become noticeable; before, by the time an application with only a 10X speedup made it to production, hardware technology had already caught up.
In addition, because the configuration time is hidden, an application can begin running as soon as its first stage is configured, reducing the overall latency of the application, which especially benefits applications with small data sets or short run times.
PipeRench is also designed to be forward compatible with future designs.  An application developed now will be able to run on future generations of PipeRench.

Slide 12: Challenges
Each stripe has on the order of several hundred configuration bits, and a stripe may need to be configured every cycle.
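A rough estimate of what that implies for configuration bandwidth; both the bit count and the clock rate below are assumptions for illustration, not PipeRench's actual parameters:

```python
# Rough configuration-bandwidth estimate.  The stripe width and clock
# frequency are assumptions chosen only to show the order of magnitude.

config_bits_per_stripe = 512   # "several hundred" configuration bits
clock_hz = 100e6               # assume a 100 MHz fabric clock

# Worst case: a new stripe must be configured every cycle.
bandwidth_bits_per_s = config_bits_per_stripe * clock_hz
print(f"required configuration bandwidth: "
      f"{bandwidth_bits_per_s / 1e9:.1f} Gbit/s")
```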

Slide 13: Abstract PR Fabric
The fabric looks like a systolic architecture; it is geared for highly pipelined designs. The global bus line runs to every stripe because, as stripes are replaced and restored, the stripe taking the global inputs can end up on any physical stripe in the fabric.

Slide 14: Making It More Concrete
Most applications are geared toward word-sized operations. Configuring each of the B D-LUTs identically achieves this word-oriented behavior and reduces the number of reconfiguration bits necessary to reprogram.
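A small calculation of the savings from configuring all B D-LUTs of a PE identically (B and the LUT size are assumed example values):

```python
# Configuration-bit savings from configuring the B bit-slices of a PE
# identically.  B and the LUT input count are assumed example values.

B = 8                            # bits per word handled by one PE (one D-LUT per bit)
lut_inputs = 3                   # assume 3-input LUTs
bits_per_lut = 2 ** lut_inputs   # truth-table bits for one LUT

independent = B * bits_per_lut   # every bit-slice configured separately
shared = bits_per_lut            # one configuration shared by all B slices

print(f"independent configuration: {independent} bits per PE")
print(f"shared configuration:      {shared} bits per PE")
```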

Slide 15: PE Based PR Fabric
Having multiple PEs within a stripe now requires interconnect support between PEs and from a previous stripe into a particular PE.
The interconnect is currently a full crossbar, but it will be made smaller because of the area the interconnect takes up.

Slide 16: The PE
Since a PE must apply the same function to all input bits, the control line Dc runs to all LUTs, providing a common input.
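A behavioral sketch of that arrangement: a PE modeled as B identical 3-input LUTs, each taking one bit of each operand plus the shared Dc bit. The truth table, word width, and the use of Dc to select between two functions are invented for the example.

```python
# Behavioral model of a bit-sliced PE: B identical 3-input LUTs share the
# same truth table and the same control input Dc.  The truth table and
# word width here are arbitrary examples.

B = 8  # word width / number of LUTs in the PE

def lut3(truth_table, a, b, dc):
    """Look up one output bit; truth_table is an 8-entry list of 0/1."""
    return truth_table[(a << 2) | (b << 1) | dc]

def pe(truth_table, word_a, word_b, dc):
    """Apply the same LUT function to every bit position of the two words."""
    out = 0
    for i in range(B):
        a_i = (word_a >> i) & 1
        b_i = (word_b >> i) & 1
        out |= lut3(truth_table, a_i, b_i, dc) << i
    return out

# Example: with dc = 0 behave as AND, with dc = 1 behave as XOR.
table = [0, 0, 0, 1, 0, 1, 1, 0]      # indexed by (a, b, dc)
print(hex(pe(table, 0xF0, 0xCC, 0)))  # AND -> 0xc0
print(hex(pe(table, 0xF0, 0xCC, 1)))  # XOR -> 0x3c
```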

Slide 17: Virtualizing the Wires
A possible design could have a signal traveling from one stage to a stage several stripes beneath it. This turns out to be a bad thing in this architecture, because the target stripe or the source stripe might not be in the physical device when needed. So why not just use a PE to register the value and retime the design? This wastes processing elements, and more importantly, it wastes the interconnect resources associated with those PEs.

Slide 18: Adding Pass Registers
Instead of having one register after a PE, maintain several registers that directly pass their values to the corresponding registers in the next stripe. This gives a PE the ability to pass its value over the next several stripes without wasting PE or interconnect resources along the way.
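A minimal model of the pass register idea (register-file size and stripe count are arbitrary): each stripe's registers simply copy the corresponding registers of the stripe above it, so a value produced once can ride down the fabric without occupying any PE.

```python
# Minimal model of pass registers: a value written in an early stripe is
# copied register-to-register down the fabric, so later stripes can read
# it without any intervening PE being spent on retiming.
# Register-file size and stripe count are arbitrary example values.

N_REGS = 4      # pass registers per PE
N_STRIPES = 5

# One register file per stripe (for a single PE column).
regs = [[0] * N_REGS for _ in range(N_STRIPES)]

def advance(new_value=None, target_reg=0):
    """One pipeline step: every stripe copies the register file of the
    stripe above it; the top stripe optionally writes a new PE result."""
    for s in range(N_STRIPES - 1, 0, -1):
        regs[s] = list(regs[s - 1])
    if new_value is not None:
        regs[0][target_reg] = new_value

advance(new_value=42)   # a PE in the top stripe produces 42
for _ in range(3):
    advance()           # 42 rides down three more stripes untouched
print(regs[4][0])       # -> 0 (has not reached the bottom stripe yet)
print(regs[3][0])       # -> 42, three stripes below where it was produced
```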

Slide 19: Further Reducing Interconnect
By restricting the sources and destinations of signals, the full crossbar network is greatly reduced. Of note here: a PE can only access a certain set of Global Bus lines, and a PE can only access a subset of the previous stripe's pass registers.
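A hypothetical sketch of the flavor of such a restriction (the neighborhood size and bus mapping below are invented, not PipeRench's actual rules):

```python
# Hypothetical restricted-interconnect check: a PE may only read a small
# window of the previous stripe's pass registers and a fixed subset of
# global bus lines.  The window size and bus mapping are invented.

N_PASS_REGS = 4
N_GLOBAL_BUSES = 4

def can_read_pass_reg(pe_index, src_pe, src_reg):
    """A PE reads pass registers only from nearby PEs in the previous stripe."""
    return abs(pe_index - src_pe) <= 1 and 0 <= src_reg < N_PASS_REGS

def can_read_global_bus(pe_index, bus):
    """Each PE sees only one of the global buses, chosen by its position."""
    return bus == pe_index % N_GLOBAL_BUSES

print(can_read_pass_reg(5, 4, 2))   # True: neighbouring PE
print(can_read_pass_reg(5, 9, 2))   # False: too far away in the stripe
print(can_read_global_bus(5, 1))    # True (5 % 4 == 1)
```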

Slide 20: What is left to specify?
After determining the width of the stripe and the width of the PEs, the number of pass registers and the architecture of the interconnect still need to be determined.

Slide 21: PipeRench

Slide 22: Pipeline Reconfiguration
As stated before, in order to accomplish the reconfiguration of one stage, several hundred configuration bits need to be shipped from the configuration memory to the physical chip.

Slide 23: Fabric Architecture

Slide 24: Stripe Configuration

Slide 25: Processing Element
Xin, cin, and zin can all be routed to the third input on the LUTs.

Slide 26: Save and Restore
If the stripe is going to be saved off of the fabric, its stored register values will also need to be saved away. To cut down on storage bits, only R0 is stored off.
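A quick sketch of why restricting save/restore to R0 helps (all sizes are example values):

```python
# Sketch of the save/restore storage trade-off.  Only register R0 of each
# PE is saved when a stripe is swapped out, so values a design needs
# across a swap must be kept in R0.  All sizes are example values.

N_PES = 16
N_PASS_REGS = 4
REG_WIDTH = 8   # bits

full_state = N_PES * N_PASS_REGS * REG_WIDTH   # saving every pass register
r0_only = N_PES * 1 * REG_WIDTH                # saving only R0

print(f"full register state per stripe: {full_state} bits")
print(f"R0-only state per stripe:       {r0_only} bits")
```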

Slide 27: Save and Restore

Slide 28: Configuration Controller
This is a very important aspect of the system. It needs to control stripe swapping, as well as external I/O when the I/O stripes are not active.

Slide 29: Memory Access Control

Slide 30: Implementation

Slide 31: Architectural Summary

Slide 32: Pass Registers: DCT

Slide 33: Assembly Language
The assembly language was designed to make systolic designs simple, through implicit registering, specification of interconnect, multiple selects of objects, and symbolic renaming of objects in the design. It also abstracts away some of the parameters of the physical hardware.

Slide 34: Assembly Language Features

Slide 35: Simple Example

Slide 36: Using Symbolic Names
By assigning symbolic names to ranges, one can abstract away the physical numerical references to objects in the definitions.
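This is not the actual PipeRench assembly syntax, but the idea of binding symbolic names to numeric ranges can be sketched in a few lines of Python:

```python
# Illustration only (not PipeRench assembly): binding symbolic names to
# numeric ranges of objects so later definitions can refer to names
# instead of physical indices.

names = {
    "coeff": range(0, 4),    # e.g. objects 0..3 hold filter coefficients
    "sample": range(4, 8),   # objects 4..7 hold input samples
}

def resolve(name, i):
    """Map a symbolic reference like coeff[2] back to a numeric index."""
    return names[name][i]

print(resolve("coeff", 2))   # -> 2
print(resolve("sample", 2))  # -> 6
```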

Scribed by Greg Mann