| 15-745                                  | Instruction-level Parallelism                                                                                                |
|-----------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| 10-7-10                                 | <ul> <li>Most modern processors have the ability to<br/>execute several adjacent instructions<br/>simultaneously.</li> </ul> |
| Instruction Scheduling                  | – Pipelined machines.                                                                                                        |
|                                         | - Very-long-instruction-word machines (VLIW).                                                                                |
|                                         | - Superscalar machines.                                                                                                      |
|                                         | - Dynamic scheduling/out-of-order machines.                                                                                  |
|                                         | <ul> <li>ILP is limited by several kinds of <i>execution</i><br/>constraints.</li> </ul>                                     |
|                                         | - Data dependence constraints.                                                                                               |
| Copyright © Seth Copen Goldstein 2000-5 | - Resource constraints ("hazards")                                                                                           |
| copyright o bern copen boldstein 2000-5 | - Control hazards                                                                                                            |
| (some slides borrowed from M. Voss)     | 15-745 ⊕ Seth Copen Goldstein 2000-5                                                                                         |
|                                         | 1                                                                                                                            |

## **Execution Constraints**

- Data-dependence constraints:
  - If instruction A computes a value that is read by instruction B, then B cannot execute before A is completed.
- Resource hazards:
- For example:
- Limited # of function
- ld [%fp-28], %o1 add %o1, %l2, %l3

3

- If there are n functional u multipliers), then only n insofunit can execute at once.
- Limited instruction issue.
  - If the instruction-issue unit can issue only *n* instructions at a time, then this limits ILP.
- Limited register set.
  - Any schedule of instructions must have a valid register allocation.

# Instruction Scheduling

- The purpose of instruction scheduling (IS) is to order the instructions for maximum ILP.
  - Keep all resources busy every cycle.
  - If necessary, eliminate data dependences and resource hazards to accomplish this.
- The IS problem is NP-complete (and bad in practice).
  - So heuristic methods are necessary.

#### How can you tell this is an old slide?

## Instruction Scheduling

- There are *many* different techniques for IS.
  - Still an open area of research.
- Most optimizing compilers perform good local IS, and only simple global IS.
- The biggest opportunities are in scheduling the code for loops.

# Should the Compiler Do IS?

- Many modern machines perform dynamic reordering of instructions.
  - Also called "out-of-order execution" (OOOE).
  - Not yet clear whether this is a good idea.
  - Pro:
    - OOOE can use additional registers and register renaming to eliminate data dependences that no amount of static IS can accomplish.
    - No need to recompile programs when hardware changes.
  - Con:

5

7

- OOOE means more complex hardware (and thus longer cycle times and more wattage).
- And can't be optimal since IS is NP-complete.

What we will cover

15-745 © Seth Copen Goldstein 2000-5

- Scheduling basic blocks
  - List scheduling
  - Long-latency operations
  - Delay slots
- Scheduling for clusters architectures
- Software Pipelining (next week)
- What we need to know
  - pipeline structure
  - data dependencies
  - register renaming

Instruction Scheduling

15-745 © Seth Copen Goldstein 2000-5

In the von Neumann model of execution an instruction starts only after its predecessor completes.



- This is not a very efficient model of execution.
  - von Neumann bottleneck or the memory wall.

#### **Instruction Pipelines**

- Almost all processors today use instructions pipelines to allow overlap of instructions (Pentium 4 has a 20 stage pipeline!!!).
- The execution of an instruction is divided into stages; each stage is performed by a separate part of the processor.



- **F:** Fetch instruction from cache or memory.
- D: Decode instruction.
- E: Execute. ALU operation or address calculation.
- M: Memory access.
- W: Write back result into register.
- Each of these stages completes its operation in one cycle (shorter the the cycle in the von Neumann model).
- An instruction still takes the same time to execute.

15-745 © Seth Copen Goldstein 2000-5

9

11

#### **Pipeline Hazards**

- Structural Hazards
  - two instructions need the same resource at the same time
  - memory or functional units in a superscalar.
- Data Hazards
  - an instructions needs the results of a previous instruction

```
r1 = r2 + r3
r4 = r1 + r1
```

r1 = [r2] r4 = r1 + r1

- solved by forwarding and/or stalling
- cache miss?
- Control Hazards
  - jump & branch address not known until later in pipeline
  - solved by delay slot and/or prediction

#### **Instruction** Pipelines

• However, we overlap these stages in time to complete an instruction every cycle.



#### Jump/Branch Delay Slot(s)

· Control hazards, i.e. jump/branch instructions.

unconditional jump address available only after Decode. conditional branch address available only after Execute.

| jump/branch | F | D | Е | М | W |   |   |   |
|-------------|---|---|---|---|---|---|---|---|
| instr 2     |   | F | D | Ε | Μ | W | ] |   |
| instr 3     |   |   | F | D | Ε | Μ | W | ] |
| instr 4     |   |   |   | F | D | Е | М | W |

#### Jump/Branch Delay Slot(s)

• One option is to stall the pipeline (hardware solution).



Another option is to insert a no-op instructions (software)



13

15

Both degrade performance!

#### Jump/Branch Delay Slots

15-745 © Seth Copen Goldstein 2000-5

- In other words, the instruction(s) in the delay slots of the jump/branch instruction always get(s) executed when the branch is executed (regardless of the branch result).
- Fetching from the branch target begins only after these instructions complete.



What instruction(s) to use?

#### Jump/Branch Delay Slot(s)

- another option is for the branch take effect after the delay slots.
- I.e., some instructions always get executed after the branch but before the branching takes effect.



15-745 © Seth Copen Goldstein 2000-5

#### **Branch** Prediction

- Current processors will speculatively execute at conditional branches
  - if a branch direction is correctly guessed, great!
  - if not, the pipeline is flushed before instructions commit (WB).
- Why not just let compiler schedule?
  - The average number of instructions per basic block in typical C code is about 5 instructions.
  - branches are not statically predictable
  - What happens if you have a 20 stage pipeline?

15-745 © Seth Copen Goldstein 2000-5



19

#### **Example Dependencies**

| S1)a=0;      |                                    |          | (1) |
|--------------|------------------------------------|----------|-----|
| S2) b=a;     |                                    |          | T   |
| S3) c=a+d+e; |                                    |          |     |
| S4)d=b;      |                                    |          |     |
| S5)b=5+e;    | S1 $\delta^{f}$ S2                 | due to a |     |
|              | S1 $\delta^{f}$ S3                 | due to a | 3)  |
|              | S2 $\delta^{f}$ S4                 | due to b |     |
|              | <b>S3</b> δ <sup>a</sup> <b>S4</b> | due to d |     |
|              | S4 $\delta^a$ S5                   | due to b | 4   |
|              | S2 δº S5                           | due to b | t   |
|              | <b>S3</b> δ <sup>i</sup> <b>S5</b> | due to a | 5   |

#### **Renaming of Variables**

- Sometimes constraints are not "real," in the sense that a simple renaming of variables/registers can eliminate them.
  - Output dependence (WW): A and B write to the same variable.
  - Anti dependence (RW): A reads from a variable to which B writes.
- In such cases, the order of A and B cannot be changed unless variables are renamed.
  - Can sometimes be done by the hardware, to a limited extent.



#### Scheduling a BB

| <ul> <li>Assume</li> </ul> | : |
|----------------------------|---|
|----------------------------|---|

·load 5

store 5

- mult 2
- others 1
- operations are nonblocking

| •  | $x \leftarrow w$ | * 2 * x * y * z |
|----|------------------|-----------------|
| 1  | r1               | ← [fp+w]        |
| 2  | r2               | <b>← 2</b>      |
| 6  | r1               | ← r1 * r2       |
| 7  | r2               | ← [fp+x]        |
| 12 | r1               | ← r1 * r2       |
| 13 | r2               | ← [fp+y]        |
| 18 | r1               | ← r1 * r2       |
| 19 | r2               | ← [fp+z]        |
| 24 | r1               | ← r1 * r2       |
| 26 | [fp+x]           | ← r1            |
| 33 | r1 can l         | be used again   |

#### We can do better

- Assume:
  load 5
  store 5
  mult 2
  - others 1
  - operations are nonblocking

We can do even better if we assume what?

| 1  | r1     | ← [fp+w]             |
|----|--------|----------------------|
| 2  | r2     | ← [fp+x]             |
| 3  | r3     | ← [fp+y]             |
| 4  | r4     | ← [fp+z]             |
| 5  | r5     | ← 2                  |
| 6  | r1     | ← r1 * r5            |
| 8  | r1     | $\leftarrow$ r1 * r2 |
| 10 | r1     | ← r1 * r3            |
| 12 | r1     | ← r1 * r4            |
| 14 | [fp+w] | ← r1                 |

19 r1 can be used again

| $\begin{array}{ccccc} & 1 & r1 & \leftarrow [fp+w] \\ 2 & r2 & \leftarrow 2 \\ 6 & r1 & \leftarrow r1 * r2 \\ 7 & r2 & \leftarrow [fp+x] \\ 12 & r1 & \leftarrow r1 * r2 \\ 13 & r2 & \leftarrow [fp+y] \\ 18 & r1 & \leftarrow r1 * r2 \\ 19 & r2 & \leftarrow [fp+z] \\ 24 & r1 & \leftarrow r1 * r2 \\ 26 & [fp+w] & \leftarrow r1 \\ 33 & r1 & can be used again \\ \end{array}$ | <b>9 Better</b><br>1 r1 $\leftarrow$ [fp+w]<br>2 r2 $\leftarrow$ [fp+x]<br>3 r3 $\leftarrow$ [fp+y]<br>4 r4 $\leftarrow$ [fp+z]<br>5 r5 $\leftarrow$ 2<br>6 r1 $\leftarrow$ r1 * r5<br>8 r1 $\leftarrow$ r1 * r5<br>8 r1 $\leftarrow$ r1 * r4<br>10 r1 $\leftarrow$ r1 * r4<br>14 [fp+w] $\leftarrow$ r1<br>19 r1 can be used again | <ul> <li>The Scheduler</li> <li>Given: <ul> <li>Code to schedule</li> <li>Resources available (FU and # of Reg)</li> <li>Latencies of instructions</li> </ul> </li> <li>Goal: <ul> <li>Correct code</li> <li>Better code [fewer cycles, less power, fewer registers,]</li> <li>Do it guickly</li> </ul> </li> </ul>                                                                                  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| More A<br>• Given a graph $G = (V, E)$<br>• nodes are operation<br>• Each operation has an as<br>- edges between node<br>- The number of reso<br>• A schedule assigns to<br>- S(n) $\ge 0$<br>- If (n,m) $\in G$ , S(m) $\ge$<br>- $ \{n \mid S(n) = x \text{ and } t\}$                                                                                                             | s<br>sociated delay and type<br>es represent dependencies<br>ources of type t, R(t)<br>each node a cycle number:<br>s S(n) + delay(n)                                                                                                                                                                                               | <section-header>DescriptionList Scheduling• Keep a list of available instructions, I.e.,• If we are at cycle k, then all predecessors, p,<br/>in graph have all been scheduled so that<br/>S(p)+delay(p) ≤ k• Pick some instruction, n, from queue such that<br/>there are resources for type(n)• Update available instructions and continue• It is all in how we pick instructions</section-header> |

#### **DLS (1995)** Lots of Heuristics forward or backward • Aim: avoid pipeline hazards in load/store unit choose instructions on critical path - load followed by use of target reg - store followed by load ASAP or ALAP Balanced paths Simplifies in two ways - 1 cycle latency for load/store • depth in schedule graph - includes all dependencies (WaW included) 15-745 © Seth Copen Goldstein 2000-5 29 15-745 © Seth Copen Goldstein 2000-5 30 The algorithm Construct Scheduling dag 1) ld r1 $\leftarrow$ [a] • Make srcs of dag candidates 2) Id $r2 \leftarrow [b]$ • Pick a candidate add r1 $\leftarrow$ r1 + r2 3) - Choose an instruction with an interlock ld r2 ←[c] 4) - Choose an instruction with a large number of 5) ld $r3 \leftarrow [d]$ successors mul r4 $\leftarrow$ r2 \* r3 6) - Choose with longest path to root add r1 $\leftarrow$ r1 + r4 7) Add newly available instruction to candidate list add r2 $\leftarrow$ r2 + r3 8) 9) mul r2 $\leftarrow$ r2 \* r3 10) add r1 $\leftarrow$ r1 + r2 11) st $[a] \leftarrow r1$

#### Trace Scheduling

- Basic blocks typically contain a small number of instrs.
- With many FUs, we may not be able to keep all the units busy with just the instructions of a BB.
- Trace scheduling allows block scheduling across BBs.
- The basic idea is to dynamically determine which blocks are executed more frequently. The set of such BBs is called a trace.



The trace is then scheduled as a single BB.

Blocks that are not part of the trace must be modified to restore program semantics if/when execution goes off-trace.

15-745 © Seth Copen Goldstein 2000-5

#### Trace Scheduling





#### VLIW

15-745 © Seth Copen Goldstein 2000-5

- Very Long Instruction Word
- Multiple Function Units
- Statically scheduled
- Examples

33

35

- Itanium
- TI C6x

| Memory A | LU FPU | Branch | ALU | ••• |
|----------|--------|--------|-----|-----|
|----------|--------|--------|-----|-----|

• Scalability Issues?

## Why Clusters?

- Reduce number of register ports
- Reduce length of buses
- Example: C6x



# Phase-Ordering

- Valid assembly must
  - be properly partioned
  - be properly scheduled
  - be properly register allocated
- What order to perform?





- Not all FUs the same
  - -some overlap
  - Add on L.S.D
  - -delay 1 mostly
  - -load, 4. mult, 2
- Not all srcs the same
  - -both srcs from same RF
  - -L,S,M: 1 from other RF -Only L&S used for copy Data Path B
- -S &M: only right from other RF
- Not all dests the same
  - -if 2 D ops, srcs and dests must be to different RFs



## Partitioning/Scheduling Basics

Data Path A

- · Objectives:
  - Balance workload per cluster
  - Minimize critical intercluster communication



39

## Bottom-Up Greedy (BUG) 1985

- Assigns operations to cluster, then schedules
- recurses down DFG
  - assigns ops to cluster based on estimates of resource usage
  - assigns ops on critical path first
  - tries to schedule as early as possible



## **Integrated Approaches**

- Leupers, 2000
  - combine partitioning & scheduling
  - iterative approach
- B-init/B-iter, 2002
  - Initial binding/scheduling
  - Iterative improvement
- RHOP, 2003
  - region-based graph partitioning

#### Leupers Approach

- Integrate partitioning and scheduling
- Use Simulated Annealing to determine partition
- The eval step in the SA loop is the scheduler!
- Deals with details of architecture

#### **Example Result**

15-745 © Seth Copen Goldstein 2000-5



43

41

# Approach: SA



# **Basic Algorthm**

algorithm Partition input DFG G with nodes; output: DP: array [1..N] of 0,1 ; var int i, r, cost, mincost; float T; begin T=10; P:=Randompartitioning; mincost := LISTSCHEDULING(G,P); WHILE\_LOOP; return DP; end.  $\label{eq:while_loop:} WHILE_LOOP: \\ \mbox{while T>0.01 do} \\ \mbox{for i=1 to 50 do} \\ \mbox{r:= RANDOM(1,n);} \\ P[r] := 1-P[r]; \\ \mbox{cost:=LISTSCHEDULING(G,P);} \\ \mbox{delta:=cost-mincost;} \\ \mbox{if delta <0 or} \\ \mbox{RANDOM(0,1)<exp(-delta/T)} \\ \mbox{then mincost:=cost} \\ \mbox{else P[r]:=1-P[r]} \\ \mbox{end for;} \\ \mbox{T:= 0.9 * T;} \\ \mbox{end while;} \\ \end while; \\ \end while; \\ \end mathbf{end for}; \\ \mbox{T:= 0.9 * T;} \\ \mbox{end while;} \\ \end while; \\ \end mathbf{end for}; \\ \mbox{T:= 0.9 * T;} \\ \end while; \\ \end while; \\ \end mathbf{end for}; \\ \mbox{T:= 0.9 * T;} \\ \end while; \\ \end mathbf{end for}; \\ \end mathbf{end for}; \\ \end while; \\ \end mathbf{end for}; \\ \end mathbf{end for}; \\ \end while; \\ \end mathbf{end for}; \\ \end while; \\ \end mathbf{end for}; \\ \end mathbf{end for}$ 

48

15-745 © Seth Copen Goldstein 2000-5

# Scheduling

- Use a List Scheduler
- Tie breaker for next ready node is min ALAP
- Heart of routine is ScheduleNode

```
algorithm ListScheduling(G,P)

input DFG G; parition P;

output: length of schedule

var m: DFG node; S: schedule

begin

mark all nodes unscheduled

S = \emptyset

while (while not all scheduled) do

m = NextReadyNode(G);

S = ScheduleNode(S,m,P);

mark m as scheduled

end

return Length(S)

end

15.745 % Seth Comen Goldstein 2000-5
```

## ScheduleNode

- Goal: insert node m as early as possible
  - don't violate resource constraints
  - don't violate dependence constraints
- First try based on ASAP
- Until it is scheduled

47

- See if there is an FU that can execute m

15-745 © Seth Copen Goldstein 2000-5

- check source registers
  - if both from same RF as FU, done
  - if not: must decide what to do

| <ul> <li>Dealing with x-RF transfers</li> <li>Two ways to XFER: <ul> <li>Source can be Xfered this cycle</li> <li>Source can be copied in previous cycle</li> </ul> </li> <li>If neither is true <ul> <li>maybe commutative?</li> <li>try to schedule next cycle</li> </ul> </li> </ul> |    | <pre>Basic Scheduling of a Node<br/>algorithm ScheduleNode(S<br/>input Schedule S, node m,<br/>output: new schedule with<br/>var cs: control step<br/>begin<br/>cs = EarliestControlStep(m)-1;<br/>repeat<br/>cs++;<br/>f = GetNodeUnit(m, cs, P);<br/>if (f == Ø) continue;<br/>if (m needs arg from other RF) then<br/>CheckArgTransfer();<br/>if (no transfer possible) then continue;<br/>else TryScheduleTransfers();<br/>until (m is scheduled);<br/></pre> TryScheduleTransfers(); |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 15-745 @ Seth Copen Goldstein 2000-5                                                                                                                                                                                                                                                    | 49 | 15-745 © Seth Copen Goldstein 2000-5 50                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|                                                                                                                                                                                                                                                                                         |    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |

51

# Handling loads

- After scheduling an op to a cluster, see if it is a load. Determine the partition of the result
- Scheduling of loads uses the RF of the address
- Scheduling the result
  - check to see if both units are free
  - if so, check to see where it is used most and schedule result in that RF

#### Benefits/Drawbacks

- Does not predetermine the partitioning
- handles many real world details
- local decisions only
- Time consuming
- Very specific to C6x
- may not scale to multiple clusters?

| <ul> <li>Provide the second structure of the second st</li></ul> | <ul> <li>RHOP Approach</li> <li>Opposite approach to conventional clustering</li> <li>Global view <ul> <li>Graph partitioning strategy [Aletà '01, '02]</li> <li>Identify tightly coupled operations - treat uniformly</li> </ul> </li> <li>Non scheduler-centric mindset <ul> <li>Prescheduling technique</li> <li>Doesn't complicate scheduler</li> <li>Enable global view of code</li> <li>Estimate-based approach [Lapinskii '01]</li> </ul> </li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 15-745 © Seth Copen Goldstein 2000-5 53                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 15-745 © Seth Copen Goldstein 2000-5 54                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| <text><figure><list-item></list-item></figure></text>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | Edge Weights<br>• Slack distribution allocates slack to certa<br>• Edge slack = $lstart_{dest} - latency_{edge}$<br>• First come, first serve method use<br>(0,0) (1) (0,0) (2) (0,1) (6) (0,1) (6) (0,1) (7) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1,2) (1) (1) (1) (1) (1) (1) (1) (1) (1) (1 |
| 15-745 © Seth Copen Goldstein 2000-5 55                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 15-745 © Seth Copen Goldstein 2000-5 56                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |

## **RHOP** - Partitioning Phase

- Modified Multilevel-KL algorithm [Kernighan '69]
- Multilevel graph partitioning consists of two stages
  - 1. Coarsening stage
  - 2. Refinement stage



Coasening: meld partitions to

## **Cluster Refinement**

- 3 questions to answer:
  - 1. Which cluster should operations move from?
  - 2. How good is the current partition?
  - 3. How profitable is it to move X from cluster A to



15-745 © Seth Copen Goldstein 2000-5

# Node Weights

• Create a metric to determine resource usage



#### Where Should Operations Move From?





59

57

# Where Should Operations Move



## How Good is this Partition?



## How Good is This Proposed Move?



## **Experimental Evaluation**

15-745 © Seth Copen Goldstein 2000-5

- Trimaran toolset: a retargetable VLIW compiler
- Evaluated DSP kernels and SPECint2000

| Name   | Configuration                                           |
|--------|---------------------------------------------------------|
| 2-1111 | 2 Homogenous clusters<br>1 I, 1 F, 1 M, 1 B per cluster |
| 2-2111 | 2 Homogenous clusters<br>2 I, 1 F, 1 M, 1 B per cluster |
| 4-1111 | 4 Homogenous clusters<br>1 I, 1 F, 1 M, 1 B per cluster |
| 4-2111 | 4 Homogenous clusters<br>2 I, 1 F, 1 M, 1 B per cluster |
| 4-H    | 4 Heterogeneous clusters<br>IM, IF, IB and IMF clusters |

- 64 registers per cluster
- Latencies similar to Itanium
- Perfect caches
- For more detailed results, see paper

63

#### 2 Cluster Results vs 1 Cluster



## 4 Cluster Results vs 1 Cluster



65

## Conclusions

- A new, region-scoped method for clustering operations
  - Prescheduling technique
  - Estimates on schedule length used instead of scheduler
  - Combines slack distribution with multilevel-KL partitioning
- Performs better as number of resources increases

| Machine | RHOP vs BUG |
|---------|-------------|
| 2-1111  | -1.8%       |
| 2-2111  | 3.7%        |
| 4-1111  | 14.3%       |
| 4-2111  | 15.3%       |
| 4-H     | 8.0%        |

#### **Previous Work**

| Algorithm  | Whe    | When (rel. to sched) |        |       | Scope  |       | Desirability Metric |     |       |       | Grouping |  |
|------------|--------|----------------------|--------|-------|--------|-------|---------------------|-----|-------|-------|----------|--|
|            | During | Iterativ<br>e        | Before | Local | Region | Sched | Pseudo              | Est | Count | Hier. | Flat     |  |
| UAS        | X      |                      |        | X     |        | ×     |                     |     |       |       | X        |  |
| CARS       | X      |                      |        |       | ×      |       |                     | X   |       |       | X        |  |
| Convergent | X      |                      |        |       | ×      |       |                     | X   |       |       | X        |  |
| Leupers    |        | X                    |        | X     |        | ×     |                     |     |       |       | X        |  |
| Capitanio  |        | X                    |        |       | X      |       |                     |     | X     |       | X        |  |
| GP(B)      |        | X                    |        |       | X      |       | X                   |     |       | X     |          |  |
| B-ITER     |        |                      | ×      | X     |        |       |                     | X   |       |       | X        |  |
| BUG        |        |                      | X      | X     |        | ×     |                     |     |       |       | X        |  |
| RHOP       |        |                      | X      |       | X      |       |                     | X   |       | X     |          |  |

67