Lecture 8:

### Instruction-Level Parallelism

15-418 Parallel Computer Architecture and Programming CMU 15-418/15-618, Fall 2023

#### Many kinds of processors



#### Why so many? What differentiates these processors?

### Why so many kinds of processors?

#### Each processor is designed for different kinds of programs



- GPUs
  - Programs with lots of independent work  $\rightarrow$  "Embarrassingly parallel"

• Many others: Deep neural networks, Digital signal processing, Etc.

#### Parallelism pervades architecture

- Speeding up programs is all about parallelism
  - 1) Find independent work
  - 2) Execute it in parallel
  - 3) Profit
- Key questions:
  - Where is the parallelism?
  - Whose job is it to find parallelism?

#### Where is the parallelism?

Different processors take radically different approaches

- CPUs: Instruction-level parallelism
  - Implicit
  - Fine-grain
- GPUs: Thread- & data-level parallelism
  - Explicit
  - Coarse-grain

### Whose job to find parallelism?

Different processors take radically different approaches

CPUs: Hardware dynamically schedules instructions

- Expensive, complex hardware → Few cores (tens)
- (Relatively) Easy to write fast software
- GPUs: Software makes parallelism explicit
  - Simple, cheap hardware → Many cores (thousands)
  - (Often) Hard to write fast software

Pentium 4"Northwood" (2002)



- Pentium 4 "Northwood" (2002)
- Highlighted areas actually execute instructions

→ <u>Most area spent</u> on scheduling (not on executing the program)



AMD Fiji (2015)



- AMD Fiji (2015)
- Highlighted areas actually execute instructions
  - ➔ Most area spent executing the program
    - (Rest is mostly I/O & memory, not scheduling)



#### Today you will learn...

How CPUs exploit ILP to speed up sequential code

- Key ideas:
  - Pipelining & Superscalar: Work on multiple instructions at once
  - <u>Out-of-order execution</u>: Dynamically schedule instructions whenever they are "ready"
  - Speculation: Guess what the program will do next to discover more independent work, "rolling back" incorrect guesses
- CPUs must do all of this while preserving the <u>illusion</u> that instructions execute in-order, one-at-a-time

#### In other words... Today is about:



### Buckle up!

### ...But please ask questions!

#### **Example: Polynomial evaluation** Compiling on ARM poly: int poly(int \*coef, int terms, int x) { int power = 1;int value = 0; .L3: for (int j = 0; j < terms; j++) { value += coef[j] \* power; power \*= x;} return value;

```
}
```

r0: value r1: &coef[terms] r2: x r3: &coef[j] r4: power r5: coef[j]

r1, #0 cmp ble .L4 push {r4, r5} r3, r0 mov add r1, r0, r1, lsl #2 r4, #1 movs r0, #0 movs ldr r5, [r3], #4 r1, r3 Cmp mla r0, r4, r5, r0 mul r4, r2, r4 bne .L3 {r4, r5} pop lr bx .L4: r0, #0 movs lr bx

#### Compiling on ARM

r0: value r1: &coef[terms] r2: x r3: &coef[j] r4: power r5: coef[j]

|                           | poly:<br>cmp<br>ble                     | r1, #0<br>.L4                                                 | amble     |
|---------------------------|-----------------------------------------|---------------------------------------------------------------|-----------|
| x) {                      | push<br>mov<br>add<br>movs<br>movs      | {r4, r5}<br>r3, r0<br>r1, r0, r1, lsl<br>r4, #1<br>r0, #0     | #2        |
| erms; j++) {<br>power;    | .L3:<br>ldr<br>cmp<br>mla<br>mul<br>bne | r5, [r3], #4<br>r1, r3<br>r0, r4, r5, r0<br>r4, r2, r4<br>.L3 | lteration |
| CMU 15-418/15-618, Fall 2 | pop<br>bx<br>.L4:<br>movs<br>bx         | {r4, r5}<br>lr<br>r0, #0<br>lr                                | Fini      |

Compiling on ARM

r0: value

r4: power

r5: coef[j]

r3: &coef[j]

r2: x

r1: &coef[terms]

Executing poly(A, 3, x)

cmp r1, #0
ble .L4
push {r4, r5}
mov r3, r0
add r1, r0, r1, lsl #2
movs r4, #1
movs r0, #0
ldr r5, [r3], #4
cmp r1, r3
mla r0, r4, r5, r0
mul r4, r2, r4
bne .L3

Executing poly(A, 3, x)

| cmp<br>ble<br>push<br>mo∨ | r1, #0<br>.L4<br>{r4, r5}<br>r3, r0 | Preamble  |
|---------------------------|-------------------------------------|-----------|
| add                       | r1, r0, r1, lsl                     | #2        |
| movs                      | r4, #1                              |           |
| movs                      | r0, #0                              |           |
| ldr                       | r5, [r3], #4                        | n         |
| стр                       | r1, r3                              | atic      |
| mla                       | r0, r4, r5, r0                      | ter       |
| mul                       | r4, r2, r4                          | 0         |
| bne                       | .L3                                 | <u>  </u> |

Executing poly(A, 3, x)

| cmp     r1, #0     open formula       ble     .L4     ldr     r5, [r3], #4       push     {r4, r5}     cmp     r1, r3 |      |                    |     |                |            |
|-----------------------------------------------------------------------------------------------------------------------|------|--------------------|-----|----------------|------------|
| ble .L4<br>push {r4, r5}                                                                                              | cmp  | r1, #0 😐           |     |                |            |
| push {r4, r5} 2 cmp r1, r3                                                                                            | ble  | .L4                | ldr | r5, [r3], #4   | n          |
|                                                                                                                       | push | {r4, r5}           | cmp | r1, r3         | atic       |
| mov r3, r0 mla r0, r4, r5, r0 <u>₽</u>                                                                                | mo∨  | r3, r0             | mla | r0, r4, r5, r0 | ter        |
| add r1, r0, r1, lsl #2 mul r4, r2, r4                                                                                 | add  | r1, r0, r1, lsl #2 | mul | r4, r2, r4     |            |
| movs r4, #1 bne .L3                                                                                                   | movs | r4, #1             | bne | .L3            |            |
| movs r0, #0 1dr r5, [r3], #4 5                                                                                        | movs | r0, #0             | ldr | r5, [r3], #4   | u          |
| ldr r5, [r3], #4 ទួ cmp r1, r3 🦉                                                                                      | ldr  | r5, [r3], #4 ह     | cmp | r1, r3         | ati        |
| cmp r1, r3 🙀 mla r0, r4, r5, r0 🛓                                                                                     | стр  | r1, r3             | mla | r0, r4, r5, r0 | iter       |
| mla r0, r4, r5, r0 💆 mul r4, r2, r4 😭                                                                                 | mla  | r0, r4, r5, r0 💆   | mul | r4, r2, r4     | 2          |
| mul r4, r2, r4 o bne .L3 "                                                                                            | mul  | r4, r2, r4 🛛 👷     | bne | .L3            | -          |
| bne .L3 <sup>+</sup> pop {r4, r5} -                                                                                   | bne  | .L3 <sup>+</sup>   | рор | {r4, r5}       | . <u> </u> |
| bx lr 🗉                                                                                                               |      |                    | bx  | lr             | ij         |

Executing poly(A, 3, x)



#### The software-hardware boundary

- The instruction set architecture (ISA) is a <u>functional</u> <u>contract</u> between hardware and software
  - It says what each instruction does, but not how
  - Example: Ordered sequence of x86 instructions
- A processor's microarchitecture is how the ISA is implemented

#### Arch : $\mu$ Arch :: Interface : Implementation

### Simple CPU model

- Execute instructions in program order
- Divide instruction execution into stages, e.g.:
  - 1. Fetch get the next instruction from memory
  - 2. Decode figure out what to do & read inputs
  - 3. Execute perform the necessary operations
  - 4. Commit write the results back to registers / memory
  - (Real processors have many more stages)

ldr r5, [r3], #4 r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3 r5, [r3], #4 ldr r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul .L3 bne





. . .



from memory

r5, [r3], #4 ldr r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3 ldr r5, [r3], #4 r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3



2. Decode "ldr r5, [r3] #4" and read input regs

ldr r1, r3 cmp mla mul bne ldr cmp mla mul bne

r4, r2, r4

r4, r2, r4

.L3

.L3

r1, r3



ldr cmp mla mul bne ldr cmp mla mul bne

r5, [r3], #4 r1, r3 r0, r4, r5, r0 r4, r2, r4 .L3 r5, [r3], #4 r1, r3 r0, r4, r5, r0 r4, r2, r4 .L3



4. Write values into regs r5 and r3

r5, [r3], #4 ldr r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3 ldr r5, [r3], #4 r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3



r5, [r3], #4 ldr r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3 ldr r5, [r3], #4 r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3



r5, [r3], #4 ldr r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3 ldr r5, [r3], #4 r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3



• • •

r5, [r3], #4 ldr r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3 ldr r5, [r3], #4 r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3



r5, [r3], #4 ldr r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3 r5, [r3], #4 ldr r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3



# Evaluating polynomial on the simple CPU model How fast is this processor?



### Simple CPU is very wasteful



### Pipelining
Idea: Start on the next instr'n immediately



Idea: Start on the next instr'n immediately



Idea: Start on the next instr'n immediately



Idea: Start on the next instr'n immediately



Idea: Start on the next instr'n immediately



Idea: Start on the next instr'n immediately



# Evaluating polynomial on the pipelined CPU How fast is this processor?

1 ns Latency? Throughput? TIME mla mul bne ldr ldr cmp mla|mul Fetch bne cmp ldr mla|mul bne | 1dr cmp mla mu1 Decode cmp bne ldr mla mul ldr mla Execute cmp cmp bne mla | ldr Commit ldr mul cmp cmp

Latency = 4 ns / instr

Throughput = 1 instr / ns 3, Fall 2023 **4X speedup!** 

CMU 15-418/15-618, Fall 2023

### Speedup achieved through pipeline parallelism

|         |     | TIME instruct |     |     |     |     | or works on 4<br>tions at a time |     |     |     |  |
|---------|-----|---------------|-----|-----|-----|-----|----------------------------------|-----|-----|-----|--|
| Fetch   | ldr | cmp           | mla | mul | bne | ldr | стр                              | mla | mul | bne |  |
| Decode  |     | ldr           | cmp | mla | mul | bne | ldr                              | стр | mla | mul |  |
| Execute |     |               | ldr | стр | mla | mul | bne                              | ldr | стр | mla |  |
| Commit  |     |               |     | ldr | стр | mla | mul                              | bne | ldr | стр |  |

#### Limitations of pipelining

- Parallelism requires <u>independent</u> work
- Q: Are instructions independent?
- A: No! Many possible hazards limit parallelism...

#### Data hazards

ldr ra, [rb], #4 // ra  $\leftarrow$  Memory[rb]; rb  $\leftarrow$  rb + 4 cmp rc, rd // rc  $\leftarrow$  rd + re

Q: When can the CPU pipeline the cmp behind 1dr?

| Fetch   | ldr | стр |     |     |     |  |
|---------|-----|-----|-----|-----|-----|--|
| Decode  |     | ldr | cmp |     |     |  |
| Execute |     |     | ldr | стр |     |  |
| Commit  |     |     |     | ldr | стр |  |

- A: When they use different registers
  - Specifically, when cmp does not read any data written by ldr

Cannot pipeline cmp (1dr writes r3)

r5, [r3], #4 ldr r1,<sup>\*</sup>r3 Cmp mla r0, r4, r5, r0 mul r4, r2, r4 .L3 bne r5, [r3], #4 ldr r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3



Cannot pipeline cmp (1dr writes r3)

r5, **[**r3], #4 ldr r1,<sup>\*</sup>r3 cmp mla r0, r4, r5, r0 mul r4, r2, r4 .L3 bne r5, [r3], #4 ldr r1, r3 cmp mla r0, r4, r5, r0 r4, r2, r4 mul bne .L3



Cannot pipeline cmp (1dr writes r3)



Cannot pipeline cmp (1dr writes r3)



Cannot pipeline cmp (1dr writes r3)



#### Stalling degrades performance

Processor works on <u>3</u>

#### instructions at a time TIME bne 1dr mla mul ldr mla Fetch mu l bne cmp cmp cmp mla mul bne ldr cmp | ldr mla mul Decode cmp mla bne ldr mul ldr Execute 1dr mla mul bne ldr Commit cmp

- But stalling is sometimes unavoidable
  - E.g., long-latency instructions (divide, cache miss)

#### Dealing with data hazards: Forwarding data

Wait a second... data is available after Execute!



Forwarding eliminates many (not all) pipeline stalls

### Speedup achieved through pipeline parallelism

|         |     |     | TIME |     | ir  | Processor works on 4<br>Istructions at a time © |     |     |     |     |
|---------|-----|-----|------|-----|-----|-------------------------------------------------|-----|-----|-----|-----|
| Fetch   | ldr | стр | mla  | mul | bne | ldr                                             | стр | mla | mul | bne |
| Decode  |     | ldr | стр  | mla | mul | bne                                             | ldr | стр | mla | mul |
| Execute |     |     | ldr  | стр | mla | mul                                             | bne | ldr | стр | mla |
| Commit  |     |     |      | ldr | стр | mla                                             | mul | bne | ldr | стр |

#### Pipelining is not free!

- Q: How well does forwarding scale?
- A: Not well... many forwarding paths in deep & complex pipelines



CMU 15-418/15-618, Fall 2023

#### Control hazards + Speculation

- Programs must appear to execute in program order
  All instructions depend on earlier ones
- Most instructions implicitly continue at the next...
- But branches redirect execution to new location

#### What if we always fetch the next instruction?



#### What if we always fetch the next instruction?



#### What if we always fetch the next instruction?



What if we always fetch the next instruction?



(Loop not finished)

#### Pipeline flushes destroy performance Processor works on 2 or 3 instructions at a time

| Fetch   | ldr | стр | mla | mul | bne |     |     | ldr | стр | mla |  |
|---------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|--|
| Decode  |     | ldr | cmp | mla | mul | bne |     |     | ldr | стр |  |
| Execute |     |     | ldr | стр | mla | mul | bne |     |     | ldr |  |
| Commit  |     |     |     | ldr | стр | mla | mul | bne |     |     |  |

Penalty increases with deeper pipelines

### Dealing with control hazards: Speculation!

- Processors do not wait for branches to execute
- Instead, they speculate (i.e., guess) where to go next + start fetching
- Modern processors use very sophisticated mechanisms
  - E.g., speculate in Fetch stage—before processor even knows instrn is a branch!
  - >95% prediction accuracy
  - Still, branch mis-speculation is major problem

### Pipelining Summary

- Pipelining is a simple, effective way to improve throughput
  - N-stage pipeline gives up to  $N \times$  speedup
- Pipelining has limits
  - Hard to keep pipeline busy because of hazards
  - Forwarding is expensive in deep pipelines
  - Pipeline flushes are expensive in deep pipelines

#### $\rightarrow$ Pipelining is ubiquitous, but tops out at $N \approx 15$

#### Software Takeaways

- Processors with a simple "in-order" pipeline are very sensitive to running "good code"
  - Compiler should target a specific model of CPU
  - Low-level assembly hacking
- But very few CPUs are in-order these days
  - E.g., embedded, ultra-low-power applications
- Instead, ≈all modern CPUs are "out-of-order"
  - Even in classic "low-power domains" (like mobile)

### **Out-of-Order Execution**

CMU 15-418/15-618, Fall 2023

### Increasing parallelism via dataflow

- Parallelism limited by many false dependencies, particularly sequential program order
- <u>Dataflow</u> tracks how instructions actually depend on each other
  - True dependence: read-after-write

Dataflow increases parallelism by eliminating unnecessary dependences

### Example: Dataflow in polynomial evaluation







ldr r5, [r3], #4 cmp r1, r3 mla r0, r4, r5, r0 mul r4, r2, r4 bne .L3



### Example: Dataflow polynomial execution

- Execution <u>only</u>, with perfect scheduling & unlimited execution units
  - Idr, mul execute in 2 cycles
  - cmp, bne execute in 1 cycle
  - mla executes in 3 cycles
- Q: Does dataflow speedup execution? By how much?
- Q: What is the performance bottleneck?












CMU 15-418/15-618, Fall 2023





CMU 15-418/15-618, Fall 2023

#### Example: Dataflow polynomial execution

- Q: Does dataflow speedup execution? By how much?
  - Yes! 3 cycles / loop iteration
  - Instructions per cycle (IPC) =  $5/3 \approx 1.67$ (vs. 1 for perfect pipelining)
- Q: What is the performance bottleneck?
  - mla: Each mla depends on previous mla & takes 3 cycles
  - → This program is latency-bound

#### Latency Bound

- What is the "critical path" of the computation?
  - Longest path across iterations in dataflow graph
  - E.g., mla in last slide (but could be multiple ops)
- Critical path limits maximum performance
- Real CPUs may not achieve latency bound, but useful mental model + tool for program analysis

#### Out-of-order (OoO) execution uses dataflow to increase parallelism

Idea: Execute programs in dataflow order, but give the *illusion* of sequential execution

#### High-level OoO microarchitecture



#### OoO is **hidden** behind in-order frontend & commit



Instructions only enter & leave instruction buffer in program order; all bets are off in between!

#### Example: OoO polynomial evaluation

- Q: Does OoO speedup execution? By how much?
- Q: What is the performance bottleneck?
- Assume perfect forwarding & branch prediction



| Fetch &<br>Decode | ldr | стр |    |     |     |
|-------------------|-----|-----|----|-----|-----|
| Execute           |     | 10  | dr | стр |     |
| Commit            |     |     |    | ldr | стр |

| Fetch &<br>Decode | ldr | стр | mla |     |     |     |  |  |  |
|-------------------|-----|-----|-----|-----|-----|-----|--|--|--|
| Execute           |     | ٦٥  | dr  | стр |     | mla |  |  |  |
| Commit            |     |     |     | ldr | стр | стр |  |  |  |

| Fetch &<br>Decode | ldr | стр | mla | mul |     |     |     |    |     |
|-------------------|-----|-----|-----|-----|-----|-----|-----|----|-----|
| Execute           |     | 10  | dr  | стр |     | mla | mı  | ני |     |
| Commit            |     |     |     | ldr | стр |     | mla |    | mul |

| Fetch &<br>Decode | ldr | стр | mla | mul | bne |     |     |   |     |     |
|-------------------|-----|-----|-----|-----|-----|-----|-----|---|-----|-----|
| Execute           |     | 10  | dr  | стр |     | mla | mı  | J | bne |     |
| Commit            |     |     |     | ldr | стр |     | mla |   | mul | bne |

| Fetch &<br>Decode | ldr | стр | mla | mul | bne | ldr | стр | mla | mul | bne | ldr | стр | mla | mul | bne | ldr |
|-------------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| Execute           |     | 10  | dr  | стр |     | mla |     | mul |     | bne | ldr |     | стр | mla |     |     |
| Commit            |     |     |     | ldr | стр |     |     | mla |     | mul | bne |     | ldr | стр |     |     |

| Fetch &<br>Decode | ldr | стр | mla | mul | bne | ldr | стр | mla | mul | bne | ldr   | стр | mla | mul | bne | ldr |
|-------------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-------|-----|-----|-----|-----|-----|
| Execute           |     | 10  | dr  | стр |     | mla |     | mul |     | bne | e ldr |     | стр | mla |     |     |
| Commit            |     |     |     | ldr | стр |     |     | mla |     | mul | bne   |     | ldr | стр |     |     |

- Wait a minute... this isn't OoO... or even faster than a simple pipeline!
- Q: What went wrong?
- A: We're throughput-limited: can only exec 1 instrn

### High-level **Superscalar** OoO microarchitecture

• Must increase pipeline width to increase ILP > 1



CMU 15-418/15-618, Fall 2023

#### Focus on Execution, not Fetch & Commit

- Goal of OoO design is to only be limited by dataflow execution
- Fetch and commit are over-provisioned so that they (usually) do not limit performance
  Programmers can (usually) ignore fetch/commit
- Big Caveat: Programs with inherently unpredictable control flow will often be limited by fetch stalls (branch misprediction)
  - E.g., branching based on random data

| Fetch & | ldr | ldr | r5, | [r3], # |
|---------|-----|-----|-----|---------|
| Decode  |     | Стр | r1, | r3      |
| 200040  | cmp | mla | r0, | r4, r5, |
|         |     | mul | r4, | r2, r4  |
|         |     | bne | .L3 |         |
|         |     | ldr | r5, | [r3], # |
| _       |     | cmp | r1, | r3      |
| Execute |     | mla | r0, | r4, r5, |
|         |     | mul | r4, | r2, r4  |
|         |     | bne | .L3 | ·       |
|         |     |     |     |         |
|         |     |     |     |         |
| Commit  |     |     |     |         |
|         |     |     |     |         |
|         |     |     |     |         |

r0

r0

| Fetch & | ldr | mla | bne | стр | mul |
|---------|-----|-----|-----|-----|-----|
| Decode  | стр | mul | ldr | mla | bne |
| Execute |     |     |     |     |     |
| Commit  |     |     |     |     |     |

| ldr | r5, [r | ^3], #4             |
|-----|--------|---------------------|
| cmp | r1, r3 | 3                   |
| mla | r0, r4 | 1, r5, r0           |
| mul | r4, r2 | 2, r4               |
| bne | .L3    |                     |
| ldr | r5, [r | <sup>-</sup> 3], #4 |
| cmp | r1, r3 | 3                   |
| mla | r0, r4 | 1, r5, r0           |
| mul | r4, r2 | 2, r4               |
| bne | .L3    |                     |

| Fetch & | ldr | mla | bne | стр | mul |
|---------|-----|-----|-----|-----|-----|
| Decode  | стр | mul | ldr | mla | bne |
|         |     | 10  | dr  |     |     |
| Execute |     |     |     |     |     |
|         |     |     |     |     |     |
|         |     |     |     |     |     |
| Commit  |     |     |     |     |     |



| Fetch & | ldr | mla | bne | стр | mul |
|---------|-----|-----|-----|-----|-----|
| Decode  | стр | mul | ldr | mla | bne |
|         |     | ٦٢  | dr  |     |     |
| Execute |     |     |     |     |     |
|         |     |     | mu  | 1]  |     |
| Commit  |     |     |     |     |     |









Commit





Commit





CMU 15-418/15-618, Fall 2023



CMU 15-418/15-618, Fall 2023

| Fetch & | ldr | mla | bne | стр | mul | ldr | mla | bne | стр | mul | ldr | mla | bne  | стр | mul | ldr |
|---------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|-----|-----|-----|
| Decode  | стр | mul | ldr | mla | bne | стр | mul | ldr | mla | bne | стр | mul | ldr  | mla | bne | стр |
|         |     | ٦٢  | dr  | стр | bne | mı  | J   | 10  | lr  | стр | bne | mı  | ul   | 10  | dr  | стр |
| Execute |     |     |     |     | mla |     |     | mla |     |     | mla |     |      | mla |     | mla |
|         | m   |     |     | u]  | ٦٢  | dr  | стр | bne | mı  | u]  | ٦٢  | dr  | стр  | bne | mı  | u]  |
|         |     |     |     | ldr | стр |     | mla | bne | стр | mla | bne | стр | mla  | bne | стр | mla |
| Commit  |     |     |     |     |     |     | mul | ldr |     | mul | ldr |     | mu 1 | ldr |     | mul |

CMU 15-418/15-618, Fall 2023

One loop iteration / 3 cycles!

### Structural hazards: Other throughput limitations

- Execution units are specialized
  - Floating-point (add/multiply)
  - Integer (add/multiply/compare)
  - Memory (load/store)
- Processor designers must choose <u>which</u> execution units to include and <u>how many</u>
- Structural hazard: Data is ready, but instr cannot issue because no hardware is available

### Example: Structural hazards can severely limit performance

| Fetch &<br>Decode | ldr | mla | bne | стр | mul | ldr | mla | bne | стр | mul | ldr | mla | bne | стр | mul | ldr |
|-------------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|                   | стр | mul | ldr | mla | bne | стр | mul | ldr | mla | bne | стр | mul | ldr | mla | bne | стр |
| Mem<br>Execute    |     | ldr |     | ldr |     |     | 10  | dr  | ldr |     |     | 10  | ldr |     | ldr |     |
| Int<br>Execute    |     |     |     | стр | bne | стр | bne |     | стр | bne | стр | bne |     | стр | bne | стр |
| Mult<br>Execute   |     |     |     | mla |     | mul |     | mla |     |     | mul |     | mla |     | mul |     |
| Commit            |     |     |     | ldr | стр | mla |     | mul | ldr |     | mla |     | mul | ldr |     | mla |
|                   |     |     |     |     |     |     | •   | bne | SmD |     |     |     | bne | стр |     |     |

CMU 15-418/15-618, Fall 2023

One loop iteration / 5 cycles 😕

#### Throughput Bound

- Ingredients:
  - Number of operations to perform (of each type)
  - Number & issue rate of "execution ports"/"functional units" (of each type)
- Throughput bound = ops / issue rate
  - E.g., (1 mla + 1 mul) / (2 + 3 cycles)
- Again, a real CPU might not exactly meet this bound

#### Software Takeaway

- OoO is much less sensitive to "good code"
  - Better performance portability
  - Of course, compiler still matters, but much less
- OoO makes performance analysis much simpler
  - Throughput bound: Availability of execution ports
  - Latency bound: "Critical path" latency
  - Slowest gives good approximation of program perf

#### Scaling Instruction-Level Parallelism

CMU 15-418/15-618, Fall 2023

#### **Recall from last time:** ILP & pipelining tapped out... why?



CMU 15-418/15-618, Fall 2023
# Superscalar scheduling is complex & hard to scale

- Q: When is it safe to issue two instructions?
- A: When they are independent
  - Must compare <u>all pairs</u> of input and output registers
- Scalability:  $O(W^2)$  comparisons where W is "issue width" of processor
  - Not great!

## Limitations of ILP

#### Programs have limited ILP

- Even with perfect scheduling, >8-wide issue doesn't help
- 4-wide superscalar  $\times$  20-stage pipeline = **80** instrns in flight
- High-performance OoO buffers hundreds of instructions
- Pipelines can only go so deep
  - Branch misprediction penalty grows
  - Frequency (GHz) limited by power
- Dynamic scheduling overheads are significant
- Out-of-order scheduling is expensive

## Limitations of ILP → Multicore

- ILP works great! ...But is complex + hard to scale
- From hardware perspective, multicore is much more efficient, but...

### Parallel software is hard!

- Industry resisted multicore for as long as possible
- When multicore finally happened, CPU µarch simplified
  → more cores
- Many program(mer)s still struggle to use multicore effectively