Slide 1: PipeRench February 4, 1998
Slide 2: Lab2 Goal of Lab 2: Implement a kernel on the PipeRench architecture.
The given kernels ought to be workable; if someone has another
idea for a kernel, it should be discussed with Prof. Goldstein or Dr. Schmit.
Proposals: Proposals for kernel implementations are due on
Friday.
Slide 3: Reconfigurable Computing Today Applications designed for reconfigurable computing have been geared
for FPGAs. As a result, designs are strictly constrained by the
size of the chip. Once a design requires more CLBs than the chip provides,
performance drops to zero because the design cannot fit onto the chip. Moreover,
once the underlying architecture changes, the only way to increase the performance
of an existing application is to redesign it.
Slide 4: PipeRench Goals The concept is to introduce a virtual hardware space, similar to a
virtual address space. Just as programmers can write code under the assumption
of an effectively unlimited amount of memory, applications targeted at reconfigurable
devices should be able to be designed under the assumption of an unlimited amount
of fabric. The result is time-multiplexed physical hardware, not
unlike a time-multiplexed physical memory space. This eases the constraint
on size: an application with a larger virtual hardware than the available physical
hardware can still execute, though with a performance hit.
By targeting highly pipelined applications, one can develop a model that
can be broken into pieces.
Slide 5: Virtualization The architecture needs to maintain a concept of a fundamental piece (analogous to a virtual "page") and be able
to schedule and swap these pages into and out of memory.
Slide 6: Component Reconfiguration Current concept of reconfigurable computing: Component Reconfiguration.
If one had a standard FPGA and wanted to execute an application that
was larger than the chip, one would have to partition the application, reconfigure
the chip, and store off intermediate data. Since reconfiguration of
these chips is on the order of milliseconds, one can either compute
the first part of one value, reconfigure, and compute the last part (poor
throughput), or compute the first part of several values, reconfigure, and
compute the last part of all of those values (poor latency).
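A rough sketch of this tradeoff in Python (the timings and problem size below are
made-up assumptions, not numbers from the lecture):

    # Toy model: a computation split into two halves that do not fit on the
    # chip at once, so a full reconfiguration sits between them.
    RECONFIG_MS = 5.0      # assumed full-chip reconfiguration time
    COMPUTE_MS = 0.001     # assumed time to compute one half of one result

    def one_value_at_a_time(n):
        # For every result: first half, reconfigure, second half, reconfigure back.
        return n * (2 * COMPUTE_MS + 2 * RECONFIG_MS)

    def batched(n):
        # First halves of all n results, one reconfiguration, then all second halves.
        return 2 * n * COMPUTE_MS + RECONFIG_MS

    n = 1000
    print("one at a time:", one_value_at_a_time(n), "ms total (poor throughput)")
    print("batched:      ", batched(n), "ms total, but the first result is not")
    print("ready until", n * COMPUTE_MS + RECONFIG_MS + COMPUTE_MS, "ms (poor latency)")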
Slide 7: Pipeline Stage Reconfiguration What if the chip could configure one stage of the application while
the previous stage was executing, and reconfigure other pieces as needed?
This would hide the reconfiguration time of each stage.
Slide 8: Pipeline Stage Reconfiguration The application will execute under the pretense of having enough hardware;
in other words, the hardware stores the state of each stage for the
application. This approach is more compiler-oriented.
Slide 9: Pipelined Reconfiguration
Slide 10: Hardware Virtualization
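A rough sketch of the residency schedule behind slides 7-10 (the stripe counts and
the round-robin policy are assumptions for illustration, not the actual PipeRench
configuration controller):

    # Toy model: a virtual pipeline of V stripes is time-multiplexed onto P
    # physical stripes.  Each cycle, one physical stripe is being configured
    # with the next virtual stripe while the other P-1 stripes execute.
    V, P = 6, 4            # assumed virtual and physical stripe counts

    for cycle in range(10):
        configuring = cycle % V
        executing = [(configuring - k) % V for k in range(1, P)]
        print("cycle %2d: configuring v%d, executing %s"
              % (cycle, configuring, ", ".join("v%d" % s for s in executing)))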
Slide 11: Benefits Applications with smaller performance wins can be implemented faster,
and so their benefits will be noticeable; previously, by the time an
application with only a 10X speedup made it to production, hardware technology
had already caught up.
Plus, with the hiding of the configuration time, applications
can start running as soon as the first stage is configured, thus reducing the overall
latency of the application, especially for those with small data sets or short run times.
PipeRench is also designed to be forward compatible with future designs.
An application developed now will be able to run on future generations
of PipeRench.
Slide 12: Challenges Each stripe can require on the order of several hundred configuration bits
every cycle.
Slide 13: Abstract PR Fabric The fabric looks like a systolic architecture; it is geared toward highly pipelined
designs. The global bus runs to every stripe because,
as stripes are swapped out and restored, the stripe taking the global inputs
can be mapped onto any physical stripe in the architecture.
Slide 14: Making It More Concrete Most applications are targeted toward word-sized operations.
Configuring each of the B D-LUTs identically achieves this word-oriented
behavior and reduces the number of configuration bits necessary
to reprogram.
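A back-of-the-envelope illustration of the savings (the bit counts below are
assumptions, not figures from the lecture):

    # If a group of B D-LUTs shares one configuration, the stripe needs only
    # one copy of those bits instead of B copies.
    B = 8                 # assumed number of D-LUTs configured as a group
    BITS_PER_DLUT = 16    # assumed configuration bits per D-LUT (LUT contents + routing)

    independent = B * BITS_PER_DLUT   # every D-LUT configured separately
    identical = BITS_PER_DLUT         # one configuration shared by all B D-LUTs
    print(independent, "bits vs", identical, "bits")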
Slide 15: PE Based PR Fabric Having multiple PEs within a stripe now requires interconnect support
between PEs and from a previous stripe into a particular PE.
The interconnect is currently a full crossbar, but it will be made smaller because
of the space it takes up.
Slide 16: The PE Since a PE must apply the same function to all input bits, the control
line Dc runs to all LUTs, providing a common input.
Slide 17: Virtualizing the Wires A possible design could have a signal traveling from one stage to several
stages beneath it. This turns out to be a bad thing in this architecture,
because the target stripe or the source stripe might not be in the physical
device when needed. So why not just use a PE to register the value
and retime the design? This wastes processing elements, and more
importantly, it wastes the interconnect resources associated with the PE.
Slide 18: Adding Pass Registers Instead of having one register after a PE, maintain several that directly
pass their values to the next set of registers. This gives a PE the
ability to pass its value past the PEs in the intervening stripes, without
wasting those PE and interconnect resources.
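A small, purely illustrative model of the idea (the register count and names are
assumptions, not the actual PipeRench parameters):

    NUM_REGS = 4    # assumed pass registers per PE position

    def next_stripe(bank, pe_writes):
        # By default every register value passes straight down to the same
        # register in the next stripe; the PE overwrites only the ones it drives.
        out = list(bank)
        for reg, value in pe_writes.items():
            out[reg] = value
        return out

    bank = [None] * NUM_REGS
    bank = next_stripe(bank, {0: "sum"})   # stripe 0: its PE writes a result into R0
    bank = next_stripe(bank, {})           # stripe 1: PE untouched, "sum" rides along
    bank = next_stripe(bank, {})           # stripe 2: still riding along
    print(bank)                            # ['sum', None, None, None]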
Slide 19: Further Reducing Interconnect By restricting the sources and destinations of signals, the full crossbar
network is greatly reduced. Of note here: a PE can only access a
certain set of Global Bus lines, and a PE can only access a subset of the
previous pass registers.
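A back-of-the-envelope sketch of the savings (all sizes here are assumptions for
illustration, not the actual PipeRench parameters):

    sources = 64          # e.g. pass-register outputs visible from the previous stripe
    destinations = 48     # e.g. PE inputs in the current stripe
    fanin = 4             # assumed: each input may select from only 4 of the sources

    full_crossbar = sources * destinations   # a crosspoint for every (source, dest) pair
    restricted = destinations * fanin        # each destination sees only a few sources
    print(full_crossbar, "crosspoints vs", restricted)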
Slide 20: What is left to specify? After determining the width of the stripe and the width of the PEs,
the number of pass registers and the architecture of the interconnect still need
to be determined.
Slide 21: PipeRench
Slide 22: Pipeline Reconfiguration As stated before, in order to accomplish a reconfiguration of one stage,
on the order of several hundred bits will need to be shipped from the configuration
memory to the physical chip.
Slide 23: Fabric Architecture
Slide 24: Stripe Configuration
Slide 25: Processing Element Xin, cin, and zin can all be routed to the third input on the LUTs.
Slide 26: Save and Restore If a stripe is going to be saved off of the fabric, its register values
will also need to be stored away. To cut down on storage bits, only
R0 is stored off.
Slide 27: Save and Restore
Slide 28: Configuration Controller This is a very important aspect of the system. It needs to control
stripe swapping, as well as external I/O when the I/O stripes are not active.
Slide 29: Memory Access Control
Slide 30: Implementation
Slide 31: Architectural Summary
Slide 32: Pass Registers: DCT
Slide 33: Assembly Language The assembly language was designed to make systolic designs simple,
through implicit registering, specification of interconnect, multiple
selects of objects, and symbolic renaming of objects in the design.
It also abstracts away some of the parameters of the physical hardware.
Slide 34: Assembly Language Features
Slide 35: Simple Example
Slide 36: Using Symbolic Names By assigning symbolic names to ranges, one can abstract away
the physical numerical references to objects in the definitions.
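A tiny illustration of the idea in Python (not the actual PipeRench assembler; the
names and ranges below are hypothetical):

    # The design refers to named ranges; only the symbol table knows which
    # physical PE numbers they correspond to.
    symbols = {
        "coeff": range(0, 4),   # hypothetical: "coeff" names PEs 0..3
        "accum": range(4, 8),   # hypothetical: "accum" names PEs 4..7
    }

    def resolve(name, index):
        # Turn a symbolic reference like coeff[2] into a physical PE number.
        return symbols[name][index]

    print(resolve("coeff", 2))   # -> 2
    print(resolve("accum", 2))   # -> 6; the design never mentions "6" directly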
Scribed by Greg Mann