## Two High-Throughput Architectures and the Compilers Who Love Them

A Talk for 15745-s07 by Ronit Slyper - rys@cs.cmu.edu Jim McCann - jmccann@cs.cmu.edu

(with diagrams borrowed from IBM and NVIDIA)

























### Two Architectures





## The "



#### On-die PowerPC Processor



## " Portion

#### Your desktop's main CPU





## " Portion



**Device Memory** 

## Operation

#### "Blocks of Threads"



Single-threaded

#### Dual-issue

"Normal"



#### Local Store (256K)

Memory

DMA Engine to copy data to and from < the local store



### Instruction Latencies

| Instruction                          | Pipe | Latency<br>(cycles) |
|--------------------------------------|------|---------------------|
| arithmetic, logical, compare, select | even | 2                   |
| byte sum/diff/average                | even | 4                   |
| shift/rotate                         | even | 4                   |
| float                                | even | 6                   |
| integer multiply-accumulate          | even | 7                   |
| shift/rotate, shuffle, estimate      | odd  | 4                   |
| load, store                          | odd  | 6                   |
| channel                              | odd  | 6                   |
| branch                               | odd  | 1–18                |

| most float, int ops    | 2      |
|------------------------|--------|
| float inv, 1/sqrt, log | 8      |
| int 32-bit mul         | 8      |
| int 24-bit mul         | 2      |
| int div, mod           | "slow" |
| sin, cos, exp, sqrt    | 16     |
| float div              | 18/10  |
| local load/store       | 2      |
| global load/store      | 200+   |
| sync                   | 2      |

Note: latencies sometimes hiddenby swapping thread blocksIBM Cell BENVIDIA G8x

## Two-ish Compilers



Single Source to PPE + SPE
Complicated optimizations Source Code



"Architecture-Neutral Assembly"



Executable on video card

# The Programmer's Job

(Optionally) indicate parallelism.

Explicitly specify and coordinate threads (no compiler help).

Provide a "Single Program" Abstraction

Deal with the Local Store:
fit code on SPE
handle global memory access
prevent instruction starvation
hide memory latency

Provide a "Single Program" Abstraction

Deal with the Local Store:
fit code on SPE
handle global memory access
prevent instruction starvation
hide memory latency

Vectorization Scalar Variable Overhead

Provide a "Single Program" Abstraction

Deal with the Local Store:
fit code on SPE
handle global memory access
prevent instruction starvation
hide memory latency

Vectorization Scalar Variable Overhead Namely, pad slots and put in registers

## Code Partitioning

Greedily collapse functions into same partition (when they fit) -- minimize interpartition calls.



Call Graph -- Edges are call frequencies

## Code Partitioning (runtime)



## Software Cache

Does the hard case right by emulating a data cache. (4-way set associative, I28-byte lines)



32-bit address





BE

### Instruction Starvation



### Instruction Starvation



# Hiding Memory Latency

Simple: Allocate local variables locally.

Tricky: Use array tiling.



# Array Tiling By Example

"Sum A"

sum = 0; for (i = 0; i < 100; ++i) { sum += A[i]; }



### **Before:**



sum = 0; for (i = 0; i < 100; ++i) { Copy A[i] to Local[0]; sum += Local[0]; }

### **Before:**



sum = 0; for (i = 0; i < 100; ++i) { Copy A[i] to Local[0]; sum += Local[0]; }

### **Before:**



sum = 0; for (i = 0; i < 100; ++i) { Copy A[i] to Local[0]; sum += Local[0]; }











## **Double Buffering:**



#### **Double Buffering: Global Memory:** Local Memory: sum = 0;current = Local; next = Local + 5;Copy A[0:4] to current[0:4]; for (j = 5; j < 100; j += 5) { StartCopy A[j:j+4] to next[0:4]; for (i = 0; i < 5; ++i) sum += current[i]; WaitCopy swap(next, current);



## Triple Buffering

When?



### An Observation

### An Observation

### An Observation

## We had plenty to talk about over here...

...not so much over here.

Why?

# Jim's Opinion: Specialized General-Purpose **Applications** Computation Consider the Applications:

## Your Opinion?