









# Traditional Bus Structure Connecting CPU and Memory

A bus is a collection of parallel wires that carry address, data, and control signals.

Buses are typically shared by multiple devices.

















### **Conventional DRAM Organization**

#### d x w DRAM:

dw total bits organized as d supercells of size w bits
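The capacity arithmetic above can be sketched in C. This is only an illustration: the function names and the assumption that supercells are numbered row by row in a grid with `cols` supercells per row are hypothetical and not taken from the slide.

```c
#include <assert.h>

/* Total bits in a d x w DRAM: d supercells of w bits each. */
unsigned long dram_total_bits(unsigned long d, unsigned long w)
{
    return d * w;
}

/* Assumption: the d supercells form a 2-D array with `cols`
 * supercells per row, numbered row by row starting at 0. */
unsigned long supercell_row(unsigned long index, unsigned long cols)
{
    assert(cols > 0);
    return index / cols;
}

unsigned long supercell_col(unsigned long index, unsigned long cols)
{
    assert(cols > 0);
    return index % cols;
}
```

For example, a 16 x 8 DRAM holds `dram_total_bits(16, 8)` = 128 bits, and with 4 supercells per row, supercell 9 sits at row 2, column 1 under this numbering.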














- Step 1(a): Row access strobe (RAS) selects row 2
- Step 1(b): Row 2 copied from DRAM array to row buffer
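A minimal software model of the row access step might look like the sketch below. The 4 x 4 geometry, the `struct dram` layout, and all function names are assumptions made for illustration. The final read from the row buffer is an assumed follow-up step that does not appear in the text above.

```c
#include <assert.h>
#include <string.h>

#define ROWS 4
#define COLS 4

/* Hypothetical model: a 4 x 4 array of 8-bit supercells plus a
 * buffer that holds one full row at a time. */
struct dram {
    unsigned char cells[ROWS][COLS];
    unsigned char row_buffer[COLS];
    int open_row;               /* row held in row_buffer, or -1 */
};

void dram_init(struct dram *d)
{
    memset(d->cells, 0, sizeof d->cells);
    memset(d->row_buffer, 0, sizeof d->row_buffer);
    d->open_row = -1;
}

/* Step 1(a) and 1(b): select a row, then copy that whole row
 * from the array into the row buffer. */
void dram_row_access(struct dram *d, int row)
{
    assert(row >= 0 && row < ROWS);
    memcpy(d->row_buffer, d->cells[row], COLS);
    d->open_row = row;
}

/* Assumed follow-up: read one supercell out of the buffered row. */
unsigned char dram_read_buffered(const struct dram *d, int col)
{
    assert(d->open_row >= 0 && col >= 0 && col < COLS);
    return d->row_buffer[col];
}
```

In this model, repeated reads from the same row need only one call to `dram_row_access`, which is one way to picture why row-heavy access patterns are cheaper.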


















#### Locality: why caches work

Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently

- Temporal locality: Recently referenced items are likely to be referenced again in the near future
- Spatial locality: Items with nearby addresses tend to be referenced close together in time

Locality Example:

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

- Data
  - Reference array elements in succession (stride-1 reference pattern): Spatial locality
  - Reference sum each iteration: Temporal locality
- Instructions
  - Reference instructions in sequence: Spatial locality
  - Cycle through loop repeatedly: Temporal locality

#### Types of cache misses

- Cold (compulsory) miss
  - Cold misses occur on first accesses to given blocks
- Conflict miss
  - Most hardware caches limit blocks to a small subset (sometimes a singleton) of the available cache slots
    - e.g., block i must be placed in slot (i mod 4)
  - Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot
    - e.g., referencing blocks 0, 8, 0, 8, ... would miss every time
- Capacity miss
  - Occurs when the set of active cache blocks (working set) is larger than the cache
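The conflict-miss example (blocks 0, 8, 0, 8 with placement in slot i mod 4) can be checked with a small simulation. The sketch below assumes a direct-mapped cache with four slots. The function name `count_misses` and the trace representation are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS 4

/* Count misses for a trace of block numbers in a direct-mapped
 * cache where block i must be placed in slot (i mod NUM_SLOTS).
 * The cache starts empty, so the first touch of a block is a
 * cold miss. */
size_t count_misses(const unsigned *trace, size_t n)
{
    unsigned slot_block[NUM_SLOTS] = {0};
    int slot_valid[NUM_SLOTS] = {0};
    size_t misses = 0;

    assert(trace != NULL || n == 0);
    for (size_t k = 0; k < n; k++) {
        unsigned block = trace[k];
        unsigned slot = block % NUM_SLOTS;

        if (!slot_valid[slot] || slot_block[slot] != block) {
            misses++;                   /* cold or conflict miss */
            slot_block[slot] = block;   /* evict and replace */
            slot_valid[slot] = 1;
        }
    }
    return misses;
}
```

With this model, the trace 0, 8, 0, 8 misses on all four references because both blocks map to slot 0. The trace 0, 1, 0, 1 incurs only the two cold misses.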

15-213, F'08













| Cache Type           | What is Cached?      | Where is it Cached? | Latency (cycles) | Managed By       |
|----------------------|----------------------|---------------------|------------------|------------------|
| Registers            | 4-byte words         | CPU core            | 0                | Compiler         |
| TLB                  | Address translations | On-Chip TLB         | 0                | Hardware         |
| L1 cache             | 64-byte block        | On-Chip L1          | 1                | Hardware         |
| L2 cache             | 64-byte block        | Off-Chip L2         | 10               | Hardware         |
| Virtual Memory       | 4-KB page            | Main memory         | 100              | Hardware + OS    |
| Buffer cache         | Parts of files       | Main memory         | 100              | OS               |
| Network buffer cache | Parts of files       | Local disk          | 10,000,000       | AFS/NFS client   |
| Browser cache        | Web pages            | Local disk          | 10,000,000       | Web browser      |
| Web cache            | Web pages            | Remote server disks | 1,000,000,000    | Web proxy server |

### **Summary**

- The memory hierarchy is a fundamental consequence of maintaining the random access memory abstraction and practical limits on cost and power consumption
- Locality makes caching effective
- Programming for good *temporal* and *spatial* locality is critical for high performance
  - For caching and for row-heavy access to DRAM
- Trend: the speed gaps between levels of the memory hierarchy continue to widen
  - Consequence: inducing locality becomes even more important
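As one illustration of programming for spatial locality, the sketch below sums the same row-major array in two loop orders. The function names and the flat `rows * cols` array layout are assumptions for this example. Both functions return the same result and differ only in their reference pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Stride-1 traversal: the inner loop walks adjacent addresses,
 * matching the row-major layout of the array. */
long sum_row_major(const int *a, int rows, int cols)
{
    long sum = 0;

    assert(a != NULL && rows >= 0 && cols >= 0);
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            sum += a[i * cols + j];
    return sum;
}

/* Stride-cols traversal: the inner loop jumps a whole row between
 * references, so consecutive accesses are far apart in memory. */
long sum_col_major(const int *a, int rows, int cols)
{
    long sum = 0;

    assert(a != NULL && rows >= 0 && cols >= 0);
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++)
            sum += a[i * cols + j];
    return sum;
}
```

On large arrays the first version would typically be expected to run faster, since each cached block is fully used before the loop moves on. The actual gap depends on the machine and is not measured here.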
