

## **Virtual Memory: Concepts**

18-213/18-613: Introduction to Computer Systems 12<sup>th</sup> Lecture, October 1, 2024

## Announcements

- "Low-stakes take-home midterm" goes out on Monday evening (after small groups finish)
  - 80 minutes self-timed. Covers through virtual memory.
  - Questions similar to homeworks, but only one attempt.
  - Tests what you've learned, as in a real midterm (and as in the Final).
  - Low-stakes: Only 4% of grade (could even be "half dropped").

## **Caching Wrap-Up**

- Quick review
- Miss-Rate Analysis
- Blocked Operations





## **Matrix Multiplication Example**

#### Description:

- Multiply N x N matrices
- Matrix elements are doubles (8 bytes)
- O(N<sup>3</sup>) total operations
- N reads per source element
- N values summed per destination
  - but may be able to hold in register

In the ijk loop order (matmult/mm.c), the variable `sum` is held in a register across the inner loop over `k`, and `c[i][j]` is written once after that loop finishes.

## Miss Rate Analysis for Matrix Multiply

#### Assume:

- Block size = 64B (big enough for eight doubles)
- Matrix dimension (N) is very large
  - Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

#### Analysis Method:

Look at access pattern of inner loop



## Matrix Multiplication (ijk)

```
/* ijk (matmult/mm.c) */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;                 /* sum is held in a register */
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;             /* one store per (i, j), outside the inner loop */
  }
}
```



#### Miss rate for inner loop iterations:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 0.125    | 1.0      | 0.0      |

Avg misses/iter = 1.125

Block size = 64B (eight doubles)

## Matrix Multiplication (kij)





#### Miss rate for inner loop iterations:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 0.0      | 0.125    | 0.125    |

Avg misses/iter = 0.25

Block size = 64B (eight doubles)

## Matrix Multiplication (jki)



#### Miss rate for inner loop iterations:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 1.0      | 0.0      | 1.0      |

Avg misses/iter = 2.0

Block size = 64B (eight doubles)

## **Summary of Matrix Multiplication**

```
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
```

```
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
  }
}
```

```
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
```


#### ijk (& jik):

- 2 loads, 0 stores
- avg misses/iter = 1.125

#### kij (& ikj):

- 2 loads, 1 store
- avg misses/iter = 0.25

#### jki (& kji):

- 2 loads, 1 store
- avg misses/iter = 2.0
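
The per-array miss rates in this summary follow from how the innermost loop walks each array. As a rough sketch (the `access` enum and both function names are illustrative, not from the lecture), the model below assumes a very large N and a cache too small to hold multiple rows, as stated earlier:

```c
/* How the innermost loop traverses one array. */
enum access { FIXED, ROW_WISE, COLUMN_WISE };

/* Misses per inner-loop iteration for a single array. */
double miss_rate(enum access a, int doubles_per_block)
{
    switch (a) {
    case ROW_WISE:    return 1.0 / doubles_per_block; /* one miss per block */
    case COLUMN_WISE: return 1.0;                     /* new block on every access */
    default:          return 0.0;                     /* fixed element: no new misses */
    }
}

/* Average misses per inner-loop iteration, summed over arrays a, b, c. */
double misses_per_iter(enum access a, enum access b, enum access c,
                       int doubles_per_block)
{
    return miss_rate(a, doubles_per_block)
         + miss_rate(b, doubles_per_block)
         + miss_rate(c, doubles_per_block);
}
```

With 8 doubles per block, ijk corresponds to `(ROW_WISE, COLUMN_WISE, FIXED)`, kij to `(FIXED, ROW_WISE, ROW_WISE)`, and jki to `(COLUMN_WISE, FIXED, COLUMN_WISE)`.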

## **Core i7 Matrix Multiply Performance**

#### Cycles per inner loop iteration



Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, Third Edition

## Matrix Multiplication Cache Miss Analysis

#### Assume:

- Matrix elements are doubles. Cache line = 8 doubles
- Cache size C << n (much smaller than n)




## **Cache Miss Analysis (cont)**

#### Assume:

- Matrix elements are doubles. Cache line = 8 doubles
- Cache size C << n (much smaller than n)



#### Total misses:

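- Each of the $n^2$ output elements costs $n/8 + n = 9n/8$ misses (one row of a, read with stride 1, plus one column of b)
- $(9n/8) n^2 = (9/8) n^3$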

## **Blocked Matrix Multiplication**



Block size L x L
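
A minimal sketch of blocked multiplication, assuming row-major n x n matrices stored as flat `double` buffers, n a multiple of the block size `L`, and a zero-initialized `c`. The function names `mm_ijk` and `mm_blocked` are chosen here for illustration; `mm_ijk` is included only as a reference to compare results against:

```c
/* Unblocked ijk product: c = a * b. */
void mm_ijk(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}

/* Blocked product: c += a * b, computed one L x L block at a time.
 * Assumes c starts at zero and n is a multiple of L. */
void mm_blocked(const double *a, const double *b, double *c, int n, int L)
{
    for (int i = 0; i < n; i += L)
        for (int j = 0; j < n; j += L)
            for (int k = 0; k < n; k += L)
                /* L x L mini matrix multiplication */
                for (int i1 = i; i1 < i + L; i1++)
                    for (int j1 = j; j1 < j + L; j1++)
                        for (int k1 = k; k1 < k + L; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}
```

The three inner loops touch only one L x L block each of `a`, `b`, and `c`, which is why the analysis asks for three blocks to fit in the cache at once.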

## **Cache Miss Analysis**

#### Assume:

- Cache line = 8 doubles. Blocking size  $L \ge 8$
- Cache size C << n (much smaller than n)
- Three blocks fit into cache: 3L<sup>2</sup> < C

#### First (block) iteration:

- Misses per block: L<sup>2</sup>/8
- Blocks per iteration: 2n/L (omitting matrix c)
- Misses per iteration: 2n/L x L<sup>2</sup>/8 = nL/4
- Afterwards in cache (schematic)





## **Cache Miss Analysis**

#### Assume:

- Cache line = 8 doubles. Blocking size  $L \ge 8$
- Cache size C << n (much smaller than n)
- Three blocks fit into cache: 3L<sup>2</sup> < C



#### Total misses:

- nL/4 misses per iteration x $(n/L)^2$ iterations = $n^3/(4L)$ misses

## **Blocking Summary**

- No blocking (ijk): (9/8) n<sup>3</sup> misses
- Blocking: (1/(4L)) n<sup>3</sup> misses

#### Use largest block size L, such that L satisfies 3L<sup>2</sup> < C

Fit three blocks in cache! Two input, one output.
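
To make the two miss counts concrete, here is a small sketch that picks the largest L with 3L<sup>2</sup> < C and evaluates both formulas. The function names are illustrative, and C is assumed to be measured in doubles (the slides do not state the unit):

```c
/* Largest blocking size L with 3*L*L < C, where C is the cache
 * capacity in doubles. Returns 0 if not even L = 1 fits. */
long largest_block(long cache_doubles)
{
    long L = 0;
    while (3 * (L + 1) * (L + 1) < cache_doubles)
        L++;
    return L;
}

/* Predicted misses without blocking: (9/8) n^3. */
double misses_unblocked(double n) { return 9.0 / 8.0 * n * n * n; }

/* Predicted misses with L x L blocking: n^3 / (4L). */
double misses_blocked(double n, double L) { return n * n * n / (4.0 * L); }
```

For a hypothetical 32 KB cache (4096 doubles), `largest_block(4096)` gives L = 36. The ratio of the two predictions is 9L/2 regardless of n, so about 162x fewer misses for that L.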

#### Reason for dramatic difference:

- Matrix multiplication has inherent temporal locality:
  - Input data:  $3n^2$ , computation  $2n^3$
  - Every array element is used O(n) times!
- But program has to be written properly

## Hmmm, How Does This Work?!



#### Solution: Virtual Memory (today and next lecture)

## **Virtual Memory**

- Address spaces (CSAPP 9.1-9.2)
- VM as a tool for caching (CSAPP 9.3)
- VM as a tool for memory management (CSAPP 9.4)
- VM as a tool for memory protection (CSAPP 9.5)
- Address translation (CSAPP 9.6)


## **A System Using Physical Addressing**



- Used in "simple" systems such as embedded microcontrollers in devices like elevators and digital picture frames


## **A System Using Virtual Addressing**




- Used in all modern servers, laptops, and smart phones
- One of the great ideas in computer science

## **Address Spaces**

Linear address space: Ordered set of contiguous non-negative integer addresses:

- Virtual address space: Set of N = 2<sup>n</sup> virtual addresses {0, 1, 2, 3, ..., N-1}
- Physical address space: Set of M = 2<sup>m</sup> physical addresses {0, 1, 2, 3, ..., M-1}

## Why Virtual Memory (VM)?

#### Uses main memory efficiently

- Use DRAM as a cache for parts of a virtual address space

#### Simplifies memory management

- Each process gets the same uniform linear address space

#### Isolates address spaces

- One process can't interfere with another's memory
- User program cannot access privileged kernel information and code

## Today

- Address spaces
- VM as a tool for caching
- VM as a tool for memory management
- VM as a tool for memory protection
- Address translation

## VM as a Tool for Caching

- Conceptually, virtual memory is an array of N contiguous bytes stored on disk.
- The contents of the array on disk are cached in *physical memory* (*DRAM cache*)
  - These cache blocks are called *pages* (size is P = 2<sup>p</sup> bytes)



## **DRAM Cache Organization**

DRAM cache organization driven by the enormous miss penalty

- DRAM is about **10x** slower than SRAM
- Disk is about **10,000x** slower than DRAM
- Time to load block from disk > 1ms (> 1 million clock cycles)
  - CPU can do a lot of computation during that time

#### Consequences

- Large page (block) size: typically 4 KB
  - Linux "huge pages" are 2 MB (default) to 1 GB
- Fully associative. *Why*?
  - Any VP can be placed in any PP
  - Requires a "large" mapping function different from cache memories
- Highly sophisticated, expensive replacement algorithms. Why?
  - Too complicated and open-ended to be implemented in hardware
- Write-back rather than write-through. *Why*?

## **Enabling Data Structure: Page Table**

- A page table is an array of page table entries (PTEs) that maps virtual pages to physical pages.
  - Per-process kernel data structure in DRAM



## Page Hit

 Page hit: reference to VM word that is in physical memory (DRAM cache hit)



## Page Fault

 Page fault: reference to VM word that is not in physical memory (DRAM cache miss)



## **Triggering a Page Fault**

- User writes to memory location, e.g. `a[500] = 13;` in `main ()` with `int a[1000];`, compiled to `80483b7: c7 05 10 9d 04 08 0d  movl $0xd,0x8049d10`
- That portion (page) of memory is currently on disk
- MMU triggers page fault exception
  - (More details in later lecture)
  - Raise privilege level
  - Causes procedure call to page fault handler

| User code | Event                 | Kernel code                |
|-----------|-----------------------|----------------------------|
| movl      | Exception: page fault | Execute page fault handler |

- Page miss causes page fault (an exception)
- Page fault handler selects a victim to be evicted (here VP 4)
- Offending instruction is restarted: page hit!



## **Completing page fault**

- Page fault handler executes return from interrupt (iret) instruction
  - Like ret instruction, but also restores privilege level
  - Return to instruction that caused fault
  - But, this time there is no page fault

```
int a[1000];

int main(void)
{
    a[500] = 13;   /* faulting store; re-executed after the handler returns */
    return 0;
}
```

| Address  | Bytes                | Instruction          |
|----------|----------------------|----------------------|
| 80483b7: | c7 05 10 9d 04 08 0d | movl \$0xd,0x8049d10 |



## **Allocating Pages**

#### Allocating a new page (VP 5) of virtual memory.





## Locality to the Rescue Again!

- Virtual memory seems terribly inefficient, but it works because of locality.
- At any point in time, programs tend to access a set of active virtual pages called the *working set* 
  - Programs with better temporal locality will have smaller working sets
- If (working set size < main memory size)
  - Good performance for one process (after cold misses)
- If (working set size > main memory size)
  - Thrashing: Performance meltdown where pages are swapped (copied) in and out continuously
  - If multiple processes run at the same time, thrashing occurs if their total working set size > main memory size

## Today

- Address spaces
- VM as a tool for caching
- VM as a tool for memory management
- VM as a tool for memory protection
- Address translation

## VM as a Tool for Memory Management

Key idea: each process has its own virtual address space

- It can view memory as a simple linear array
- Mapping function scatters addresses through physical memory
  - Well-chosen mappings can improve locality



## VM as a Tool for Memory Management

#### Simplifying memory allocation

- Each virtual page can be mapped to any physical page
- A virtual page can be stored in different physical pages at different times
- Can allocate the same virtual addresses on the heap for multiple processes



## VM as a Tool for Memory Management

#### Sharing code and data among processes

Map virtual pages to the same physical page (here: PP 6)



## **Simplifying Linking and Loading**

#### Linking

- Each program has a similar virtual address space
- Code, data, and heap always start at the same addresses.

#### Loading

- execve allocates virtual pages for .text and .data sections & creates PTEs marked as invalid
- The .text and .data sections are copied, page by page, on demand by the virtual memory system

#### Discussed later in lecture on Linking and Loading



## Today

- Address spaces
- VM as a tool for caching
- VM as a tool for memory management
- VM as a tool for memory protection
- Address translation

## VM as a Tool for Memory Protection

- Extend page table entries (PTEs) with permission bits
- MMU checks these bits on each access
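
As a sketch of what "permission bits in the PTE" could look like in code: the struct layout, field names, and `access_allowed` below are hypothetical and chosen for illustration only; real PTE formats are architecture-specific and the check is done by MMU hardware, not by a C function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PTE with permission bits (not a real hardware format). */
struct pte {
    bool     valid;  /* page is resident in physical memory */
    bool     sup;    /* accessible only in supervisor (kernel) mode */
    bool     read;
    bool     write;
    bool     exec;
    uint64_t ppn;    /* physical page number */
};

enum access_kind { ACCESS_READ, ACCESS_WRITE, ACCESS_EXEC };

/* Returns true if the access is permitted. Where this returns false,
 * a real MMU would raise a protection fault instead. */
bool access_allowed(const struct pte *e, enum access_kind kind, bool kernel_mode)
{
    if (e->sup && !kernel_mode)
        return false;
    switch (kind) {
    case ACCESS_READ:  return e->read;
    case ACCESS_WRITE: return e->write;
    case ACCESS_EXEC:  return e->exec;
    }
    return false;
}
```

A read-only code page would set `read` and `exec` but not `write`, so a user-mode store to it is rejected while instruction fetches succeed.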




## Today

- Address spaces
- VM as a tool for caching
- VM as a tool for memory management
- VM as a tool for memory protection
- Address translation

## **VM Address Translation**

- Virtual Address Space
  - *V* = {0, 1, ..., *N*−1}
- Physical Address Space
  - *P* = {0, 1, ..., *M*−1}
- Address Translation
  - MAP: $V \rightarrow P \cup \{\emptyset\}$
  - For virtual address a:
    - MAP(a) = a' if data at virtual address a is at physical address a' in P
    - $MAP(a) = \emptyset$  if data at virtual address a is not in physical memory
      - Either invalid or stored on disk

## Summary of Address Translation Symbols

#### Basic Parameters

- N = 2<sup>n</sup>: Number of addresses in virtual address space
- M = 2<sup>m</sup>: Number of addresses in physical address space
- P = 2<sup>p</sup> : Page size (bytes)

#### Components of the virtual address (VA)

- VPO: Virtual page offset
- VPN: Virtual page number

#### Components of the physical address (PA)

- **PPO**: Physical page offset (same as VPO)
- PPN: Physical page number
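
The symbols above map directly onto bit operations. A minimal sketch, assuming 4 KB pages (p = 12); the macro and function names are illustrative:

```c
#include <stdint.h>

#define P_BITS 12   /* p: assumes P = 2^12 = 4 KB pages */

/* VPN: the high bits of the virtual address. */
uint64_t vpn(uint64_t va) { return va >> P_BITS; }

/* VPO: the low p bits; identical to the PPO after translation. */
uint64_t vpo(uint64_t va) { return va & ((1ULL << P_BITS) - 1); }

/* Physical address: PPN concatenated with the unchanged offset. */
uint64_t make_pa(uint64_t ppn, uint64_t ppo) { return (ppn << P_BITS) | ppo; }
```

For the address `0x8049d10` used earlier in the lecture, this gives VPN `0x8049` and VPO `0xd10`; if that page were mapped to a hypothetical PPN `0x2a`, the physical address would be `0x2ad10`.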

## **Address Translation With a Page Table**




## **Address Translation: Page Hit**



- 1) Processor sends virtual address to MMU

- 2-3) MMU fetches PTE from page table in memory
- 4) MMU sends physical address to cache/memory
- 5) Cache/memory sends data word to processor

## **Address Translation: Page Fault**



- 1) Processor sends virtual address to MMU
- 2-3) MMU fetches PTE from page table in memory
- 4) Valid bit is zero, so MMU triggers page fault exception
- 5) Handler identifies victim to page out (if dirty, writes the page to disk)
- 6) Handler pages in new page and updates PTE in memory
- 7) Handler returns to original process, restarting faulting instruction
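
The hit and fault paths can be sketched in software with a single-level page table. Everything here is a simplification under stated assumptions: 4 KB pages, a tiny 16-page virtual address space, and hypothetical names (`translate`, `page_in`). Victim selection and disk I/O are left out, and a `false` return stands in for the page fault exception.

```c
#include <stdbool.h>
#include <stdint.h>

#define P_BITS    12   /* assumes 4 KB pages */
#define NUM_PAGES 16   /* tiny hypothetical virtual address space */

struct pte { bool valid; uint64_t ppn; };

/* Page hit: stores the physical address in *pa and returns true.
 * Valid bit clear (or address out of range): returns false, which
 * models the MMU triggering a page fault exception. */
bool translate(const struct pte table[NUM_PAGES], uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> P_BITS;
    uint64_t vpo = va & ((1ULL << P_BITS) - 1);
    if (vpn >= NUM_PAGES || !table[vpn].valid)
        return false;
    *pa = (table[vpn].ppn << P_BITS) | vpo;
    return true;
}

/* Stand-in for the handler's last steps: record that the page now
 * lives at ppn and mark the PTE valid. */
void page_in(struct pte table[NUM_PAGES], uint64_t vpn, uint64_t ppn)
{
    table[vpn].valid = true;
    table[vpn].ppn = ppn;
}
```

Calling `translate` again after `page_in` mirrors step 7: the restarted instruction repeats the same lookup and this time hits.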

## **Integrating VM and Cache**



VA: virtual address, PA: physical address, PTE: page table entry, PTEA: PTE address

## Speeding up Translation with a TLB

- Page table entries (PTEs) are cached in L1 like any other memory word
  - PTEs may be evicted by other data references
  - PTE hit still requires a small L1 delay

#### Solution: Translation Lookaside Buffer (TLB)

- Small set-associative hardware cache in MMU
- Maps virtual page numbers to physical page numbers
- Contains complete page table entries for small number of pages

## Summary of Address Translation Symbols

#### Basic Parameters

- N = 2<sup>n</sup>: Number of addresses in virtual address space
- M = 2<sup>m</sup>: Number of addresses in physical address space
- P = 2<sup>p</sup> : Page size (bytes)

#### Components of the virtual address (VA)

- TLBI: TLB index
- TLBT: TLB tag
- **VPO**: Virtual page offset
- VPN: Virtual page number

#### Components of the physical address (PA)

- **PPO**: Physical page offset (same as VPO)
- PPN: Physical page number

## Accessing the TLB

MMU uses the VPN portion of the virtual address to access the TLB:
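
A sketch of how the VPN splits into TLB index and tag, assuming 4 KB pages and a TLB with 16 sets (t = 4). Both sizes and all names are assumptions for illustration; the TLBI is taken from the low t bits of the VPN and the TLBT from the remaining high bits:

```c
#include <stdint.h>

#define P_BITS   12   /* assumes 4 KB pages */
#define TLB_BITS 4    /* t: assumes T = 2^4 = 16 TLB sets */

uint64_t vpn(uint64_t va)  { return va >> P_BITS; }

/* TLBI: low t bits of the VPN select the TLB set. */
uint64_t tlbi(uint64_t va) { return vpn(va) & ((1ULL << TLB_BITS) - 1); }

/* TLBT: remaining high bits of the VPN, compared against stored tags. */
uint64_t tlbt(uint64_t va) { return vpn(va) >> TLB_BITS; }
```

For `0x8049d10`, the VPN is `0x8049`, so the lookup goes to set `0x9` and compares against tag `0x804`.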



## **TLB Hit**



#### A TLB hit eliminates a cache/memory access

## **TLB Miss**



**A TLB miss incurs an additional cache/memory access (the PTE)**

Fortunately, TLB misses are rare. *Why?*

## **Multi-Level Page Tables**

#### Suppose:

4KB (2<sup>12</sup>) page size, 48-bit address space, 8-byte PTE

#### Problem:

- Would need a 512 GB page table!
  - 2<sup>48</sup> \* 2<sup>-12</sup> \* 2<sup>3</sup> = 2<sup>39</sup> bytes
- Common solution: Multi-level page table
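
The size calculation above generalizes to any address width, page size, and PTE size. A small sketch (the function name is illustrative):

```c
#include <stdint.h>

/* Bytes needed for a flat (single-level) page table: one PTE per
 * virtual page, i.e. 2^(va_bits - page_bits) entries. */
uint64_t flat_page_table_bytes(unsigned va_bits, unsigned page_bits,
                               uint64_t pte_bytes)
{
    return (1ULL << (va_bits - page_bits)) * pte_bytes;
}
```

`flat_page_table_bytes(48, 12, 8)` reproduces the 2<sup>39</sup> bytes (512 GB) figure, while a 32-bit address space with 4-byte PTEs needs only 4 MB.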

#### Example: 2-level page table

- Level 1 table: each PTE points to a page table (always memory resident)
- Level 2 table: each PTE points to a page (paged in and out like any other data)


## **A Two-Level Page Table Hierarchy**



## **Translating with a k-level Page Table**
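
With a k-level table, the VPN is cut into k index fields, one per level. The sketch below assumes the parameters from the multi-level example (4 KB pages, 48-bit addresses, 8-byte PTEs), which gives 512 PTEs per page-sized table, 9 index bits per level, and k = 4. The macro and function names are illustrative:

```c
#include <stdint.h>

#define P_BITS     12  /* 4 KB pages */
#define LEVEL_BITS 9   /* 512 eight-byte PTEs fit in one 4 KB page */
#define LEVELS     4   /* 4 * 9 + 12 = 48 address bits */

/* Index into the table at the given level (1 = top level,
 * LEVELS = the level whose PTEs hold the final PPN). */
unsigned vpn_index(uint64_t va, int level)
{
    unsigned shift = P_BITS + LEVEL_BITS * (LEVELS - level);
    return (unsigned)((va >> shift) & ((1u << LEVEL_BITS) - 1));
}
```

A translation walks the levels in order: the level-1 index selects a PTE that points to a level-2 table, and so on, until the last level supplies the PPN that is combined with the unchanged page offset.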



## Summary

#### Programmer's view of virtual memory

- Each process has its own private linear address space
- Cannot be corrupted by other processes

#### System view of virtual memory

- Uses memory efficiently by caching virtual memory pages
  - Efficient only because of locality
- Simplifies memory management and programming
- Simplifies protection by providing a convenient interpositioning point to check permissions

#### Implemented via a combination of hardware and software

- MMU, TLB, exception handling mechanisms part of hardware
- Page fault handlers, TLB management performed in software