18742 SPRING TERM PROJECT
Cache Replacement Policy using Dynamic Exclusion
Sourav Ghosh and Dong Wang
Abstract
Introduction
Experimental Methodology
Results and Analysis
Conclusions
References
Modern high-speed CPUs require fast access to the cache. Direct-mapped caches have a faster access time, but their performance may be limited by high conflict misses. The Dynamic Exclusion replacement policy, proposed by Scott McFarling [1], is a technique to reduce the misses caused by conflicts in a direct-mapped cache. It uses a finite state machine (FSM) to recognize common instruction reference patterns and dynamically exclude some instructions from the cache in order to reduce conflict misses.
The effectiveness of Dynamic Exclusion depends on the instruction access pattern of a program. We attempt to recreate the original tests and to improve on the original method under various cache configurations.
At first, we simulated the L1 cache assuming a comparatively very large L2 cache. Our simulation results with Dynamic Exclusion showed about an 11% to 38% reduction in instruction miss rate compared to a conventional direct-mapped L1 cache, measured across the SPEC95 and IBS benchmarks. The simulation also showed an improvement in the data miss rate of up to 11%. Next, we used an L2 cache size of 96KB. Finally, we modified the Dynamic Exclusion policy and reduced its cost in number of bits in the L1 and L2 caches by 78.5%, with only a 33% loss in performance compared to the original policy.
The objective of our research is to explore ways of making the memory system faster. The speed of the memory system has not kept pace with the drastic rise in CPU speed in recent years, a trend that emphasizes the importance of cache design. A modern high-speed CPU requires an on-chip cache in order to avoid large access times, but the on-chip cache must necessarily be small, which gives rise to a high miss rate and large miss penalties. This calls for an intelligent cache replacement algorithm to reduce the miss rate.
For an on-chip Level-1 cache, direct-mapped caches are generally preferred to set-associative caches because of their lower access time, but they suffer more conflict misses. The Dynamic Exclusion replacement policy, first proposed by Scott McFarling [1] in 1991, is a hardware technique to reduce conflict misses in a direct-mapped cache. It recognizes common instruction reference patterns and tries to reduce conflict misses by keeping one instruction in the cache when two instructions alternate in execution and compete for the same cache line.
2.1 Common Instruction Reference Patterns
First, we need to look at the common instruction reference patterns. In general, misses are caused by interference between a pair of instructions that map to the same line in the cache; these are known as conflicting instructions. If two conflicting instructions are executed alternately, they knock each other out of the cache each time they are accessed. There are three common instruction reference patterns that cause such conflicts:
[In this report, an exponent gives the number of times a sequence is repeated. The subscripts m and h denote references that miss or hit, respectively.]
The first pattern is shown below for two conflicting instructions 'A' and 'B'.
for i = 1 to 10
    for j = 1 to 10
        A
    for j = 1 to 10
        B
Ignoring the loop overhead, the above example has the execution sequence (A^10 B^10)^10.
For this sequence, both the conventional and the optimal direct-mapped cache have the same behavior: (A_m A_h^9 B_m B_h^9)^10.
With miss rates: A: 10%, B: 10%, overall: 10%.
Every time an instruction is executed, it is either already in the cache or should be placed in the cache, since it will be executed nine more times. Thus a conventional direct-mapped cache gives the optimal performance in this case.
The second pattern is shown below:
for i = 1 to 10
    for j = 1 to 10
        A
    B
In this case, the conventional direct-mapped cache gives (A_m A_h^9 B_m)^10, with miss rates A: 10%, B: 100%, overall: 18%.
The optimal direct-mapped cache behavior would be (A_m A_h^9 B_m)(A_h^10 B_m)^9, with miss rates A: 1%, B: 100%, overall: 10%.
In the normal direct-mapped cache, each access to 'B' results in two conflict misses: 'B' knocks 'A' out of the cache and causes another miss the next time 'A' is executed. The optimal direct-mapped cache keeps 'A' in the cache even when 'B' is executed, which reduces the miss ratio of 'A' and the overall miss ratio.
The third pattern is shown below:
for i = 1 to 10
    A
    B
The behavior of the normal direct-mapped cache is (A_m B_m)^10, i.e., a 100% miss rate.
The performance of the optimal direct-mapped cache would be A_m B_m (A_h B_m)^9, with miss rates A: 10%, B: 100%, overall: 55%.
In this case, a conventional direct-mapped cache has a 100% miss ratio, as the two instructions knock each other out. The optimal direct-mapped cache instead chooses one instruction to keep in the cache, and that instruction hits after its first miss. Hence, in order to improve on direct-mapped cache performance, a replacement policy should recognize these instruction reference patterns and select which of two conflicting instructions to keep in the cache.
2.2 Dynamic Exclusion Replacement Policy
Originally described by Scott McFarling [1] in a technical report of the Western Research Laboratory (Digital Equipment Corporation), this policy tries to recognize the instruction reference patterns above and emulate the behavior of an optimal direct-mapped cache. For the time being, we assume each cache line holds only one instruction at a time; the policy for larger cache lines is discussed later.
The Dynamic Exclusion replacement policy is a simple hardware technique that reduces the number of misses by dynamically determining which instructions should be excluded from the cache. It uses a small finite-state machine (FSM) to recognize the instruction reference patterns.
In this implementation, each cache line at Level 1 (L1) is associated with a hit-last bit 'h[]' and a sticky bit 's'; a cache line at Level 2 (L2) carries only the 'h[]' bit. The sticky bit allows the cache to retain an instruction during the execution of a conflicting instruction. Whenever there is a hit, the sticky bit is set, and in general an instruction is held in the cache while its sticky bit is set. The sticky bit is reset whenever there is a miss; therefore, two sequential misses to a cache line will always remove an instruction from the cache.
The hit-last bit records whether an instruction hit the last time it was in the cache. This enables some instructions to be brought into the cache immediately, even when the sticky bit is set for the conflicting instruction. Whenever an instruction is not in the L1 cache, the next-level cache (L2) or a lower level of the memory hierarchy holds its hit-last bit. The hit-last bit is set on every access to the instruction in the L1 cache, and it is transferred to the L2 cache whenever that instruction is replaced in L1.
The state diagram in Figure 2 describes the Dynamic Exclusion policy in detail for two conflicting instructions 'A' and 'B'. The states 'A' and 'B' indicate that instruction 'A' or 'B', respectively, is in the cache. The '!s' and 's' denote whether the sticky bit associated with the current cache line is reset or set, and the symbol 'h[p]' stands for the hit-last bit of instruction p. For example, the state 'A, s' means that instruction 'A' is in the L1 cache with its sticky bit set. Overall, there are four states for two conflicting instructions 'A' and 'B'.
The state diagram can be understood by tracing the behavior of the three common instruction reference patterns described earlier.
One important point to note is that the h[] bit does not always mean that the corresponding instruction hit the last time it was in the L1 cache. For example, the transition from 'A, !s' to 'B, s' does not alter the h[] bit of 'A'; it remains set. On the other hand, the transition from 'B, s' to 'A, s' does reset the h[] bit of 'A'. This aberration helps more random references enter the cache faster, so that performance does not fall below that of a conventional direct-mapped cache on such reference streams.
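As an illustration (our own trace through the state diagram, not taken from [1]), consider the alternating pattern (A B)^n. A is installed first, giving 'A, s'. The first reference to B also installs it ('B, s'), since a never-seen instruction has no hit-last history held against it. On the next few alternations each instruction re-enters past the other's sticky bit because its own h[] bit is still set, but each such entry resets that h[] bit. After two rounds both h[] bits are clear, so the next reference to A finds h[A] = 0 and is excluded, leaving 'B, !s'; B then hits, restoring 'B, s'. From this point on B always hits and A always misses, which is exactly the optimal direct-mapped behavior for the third pattern above.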
2.3 Dinero Simulation Methodology
In order to verify the Dynamic Exclusion replacement policy, the simulation tool Dinero was used. Dinero [3], written by Mark D. Hill of the University of Wisconsin, implements only the conventional direct-mapped replacement policy, along with FIFO, LRU and Random for set-associative caches. Two extra bits, 'sticky' and 'hit-last', were added to the 'stack-node' data structure in Dinero's stack-based cache implementation. In order to simulate the L2 cache (or the next level of the memory hierarchy), the hit-last bit was stored in a hash table of unbounded size. Effectively, the next level of the memory hierarchy was assumed to be so large that, on a cache miss, the corresponding hit-last bit could always be found in the hash table and used by the finite state machine implementing the Dynamic Exclusion replacement policy. The pseudo-code for the algorithm is given below; the first part deals with a cache miss and the second part with a cache hit.
// Part 1: the reference misses in the L1 cache
if (sticky bit is zero for the current L1 line) {
    put the new line in L1 with hit-last = 1, sticky = 1
    update the L2 hash table for the new line and for the hit-last
        bit of the line thrown out of L1
}
else { // the sticky bit is 1
    if (the line is found in the L2 cache/hash table) {
        if (the hit-last bit of the line in L2 is 1) {
            put the new line in L1 with hit-last = 0
            change the hit-last bit of the line in L2 to 0
            update the L2 hash table for the new line and for the
                hit-last bit of the line thrown out of L1
        }
        else { // hit-last of the line in L2 is 0
            exclude the new line: change the sticky bit of the
                resident L1 line to zero
        }
    }
    else { // the line is found in neither L1 nor L2
        put the line in both the L2 and the L1 cache with both
            the hit-last bit and the sticky bit set to one
    }
}

// Part 2: the reference hits in the L1 cache
set the L1 line's hit-last bit = 1 and sticky bit = 1
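To make the policy concrete, here is a minimal C sketch of the same per-line state update, written for this report rather than taken from our Dinero patch. The type and helper names (l1_line_t, l2_lookup_hit_last, l2_set_hit_last) are our own; the two L2 helpers stand in for the hash table described above.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;    /* tag of the instruction currently cached          */
    bool valid;
    bool sticky;     /* 's': retain this line across a conflicting miss  */
    bool hit_last;   /* 'h[]': did this line hit last time it was in L1? */
} l1_line_t;

/* Assumed helpers backed by the L2 hash table: look up and store the
   hit-last bit kept for a line that is not currently in L1. */
extern bool l2_lookup_hit_last(uint32_t tag, bool *h);  /* true if found */
extern void l2_set_hit_last(uint32_t tag, bool h);

/* Apply one instruction reference to its direct-mapped L1 line;
   returns true on an L1 hit. */
bool access_line(l1_line_t *line, uint32_t tag)
{
    if (line->valid && line->tag == tag) {
        line->sticky = true;      /* a hit sets both bits */
        line->hit_last = true;
        return true;
    }

    /* L1 miss: fetch the reference's hit-last bit from L2. */
    bool h_in_l2 = false;
    bool in_l2 = l2_lookup_hit_last(tag, &h_in_l2);

    if (line->valid && line->sticky && in_l2 && !h_in_l2) {
        /* Exclude the new reference: the resident line stays, but its
           sticky bit is cleared, so a second consecutive miss replaces it. */
        line->sticky = false;
        return false;
    }

    /* Replace: first transfer the victim's hit-last bit to L2. */
    if (line->valid)
        l2_set_hit_last(line->tag, line->hit_last);

    /* A line entering past a sticky resident (possible because its own
       h[] bit was set) starts with h[] = 0; any other fill starts with
       h[] = 1, as in the pseudo-code above. */
    bool past_sticky = line->valid && line->sticky && in_l2 && h_in_l2;
    line->tag = tag;
    line->valid = true;
    line->sticky = true;
    line->hit_last = !past_sticky;
    l2_set_hit_last(tag, line->hit_last);  /* keep L2's copy consistent */
    return false;
}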
2.4 Dynamic Exclusion with L1 and L2 caches
In the algorithm presented in the last section, we kept the hit-last bits of already-referenced instructions in a hash table. This is equivalent to a very large L2 in which every evicted hit-last bit can be found. In this section, we simulate both the L1 and the L2 cache, using a Unix socket to transfer hit-last bits between the two simulated levels. When a line is evicted from L1, its hit-last bit is saved to L2; when there is an L1 miss, the corresponding hit-last bit is transferred back to L1, so that L1 can make exclusion decisions dynamically.
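The sketch below is our own simplified illustration of the socket plumbing involved, not the simulator source; the hl_msg layout and the use of a socketpair between an "L1" parent process and an "L2" child process are assumptions made for the example.

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* One message per hit-last-bit transfer between the simulated levels. */
struct hl_msg {
    unsigned long tag;   /* address tag of the line in question */
    int hit_last;        /* the hit-last bit travelling with it  */
};

int main(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return 1;

    if (fork() == 0) {               /* child: plays the L2 side */
        struct hl_msg m;
        close(fds[0]);
        read(fds[1], &m, sizeof m);  /* an eviction: receive the bit */
        /* ...store m.hit_last with the line in the L2 simulator... */
        m.hit_last = 1;              /* pretend the L2 lookup found h[] = 1 */
        write(fds[1], &m, sizeof m); /* an L1 miss: send the bit back */
        return 0;
    }

    close(fds[1]);                   /* parent: plays the L1 side */
    struct hl_msg m = { 0x1234, 0 };
    write(fds[0], &m, sizeof m);     /* eviction: push the bit to L2 */
    read(fds[0], &m, sizeof m);      /* miss: pull the bit back */
    printf("hit-last bit for tag %#lx is %d\n", m.tag, m.hit_last);
    return 0;
}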
2.5 Dynamic Exclusion with longer line size
When the line size is larger than the word size, excluding a line makes sequential accesses within that line generate new misses. McFarling suggested a solution: add a special line to L1, called the recently referenced line (RRL), to hold the most recently referenced line, so that sequential accesses hit in it.
The RRL does solve the sequential-access miss problem, but if the most recently referenced line is already in L1, there is no need to place it in the RRL as well. We propose a new method: add a line to L1, called the excluded line (EL), that holds only the most recently excluded line. Since the excluded line is by definition not in L1, this is an inexpensive way to increase the associativity of L1, just like a victim cache; a sketch of the lookup follows.
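The following sketch (again with our own naming, not the simulator source) shows how the EL fits into the L1 lookup and exclusion paths:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;
    bool valid;
} line_t;

static line_t el;    /* the single excluded-line (EL) entry */

/* Called when the Dynamic Exclusion FSM decides to exclude a line:
   instead of being dropped, the excluded line is parked in the EL. */
void exclude_line(uint32_t tag)
{
    el.tag = tag;
    el.valid = true;
}

/* Lookup checks the direct-mapped entry first and then the EL, so
   sequential accesses within an excluded line still hit. */
bool l1_lookup(const line_t lines[], size_t idx, uint32_t tag)
{
    if (lines[idx].valid && lines[idx].tag == tag)
        return true;                    /* ordinary direct-mapped hit */
    return el.valid && el.tag == tag;   /* hit in the excluded line */
}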
2.6 Dynamic Exclusion with reduced cost
In McFarling's paper, two bits, sticky and hit-last, are added to each L1 line and one bit, hit-last, is added to each L2 line, and the hit-last bit must be transferred between L1 and L2. The cost is (2 bits × number of L1 lines + 1 bit × number of L2 lines).
There are also problems in having to transfer the hit-last bit between L1 and L2. First, if L2 uses a larger line size than L1, hit-last bits from different L1 lines get mixed together in L2. Second, whenever a line in L1 is evicted, its hit-last bit must be transferred to L2 whether or not the line is dirty, which consumes bus bandwidth.
We modified the Dynamic Exclusion policy to add only three bits to each L1 line, for a cost of (3 bits × number of L1 lines). Besides the sticky bit and the hit-last bit, the third bit holds the hit-last bit of the most recently evicted line. This greatly reduces the cost in number of bits, but since we do not save the tag of the evicted line, the third bit is consulted regardless of address, so the miss rate will certainly rise. Our simulation results show, however, that the loss of performance is small.
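As a concrete check of the cost figures (assuming 32-byte lines at both levels, as used in Section 3.4): an 8KB L1 has 256 lines and a 96KB L2 has 3,072 lines, so the original scheme costs 2 × 256 + 3,072 = 3,584 bits, while the modified scheme costs 3 × 256 = 768 bits, a reduction of about 78.6%, consistent with the 78.5% figure reported below.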
3.1 Results with hashed L2 cache
Several benchmarks were used for simulation with Dinero. At first, the block size (line size) was fixed at 4 bytes, equal to the size of one instruction, and the L1 cache size was varied from 4KB to 32KB.
The results show a very large reduction in the instruction miss rate for small direct-mapped caches, a reduction that shrinks as the cache size increases: as the cache grows, the chance of conflict misses drops, and the performance gap between Dynamic Exclusion and the conventional direct-mapped cache narrows. At an L1 cache size of 8KB, we observe a maximum reduction of 38% for 'gs' and a minimum reduction of 11% for 'video_play'.
We also observe a small reduction in the data miss ratio at smaller cache sizes. Similar to McFarling's [1] findings, we observed worse performance than the conventional direct-mapped cache at larger cache sizes for some benchmarks, viz. 'spice', 'gs' and 'cc'.
The main reason is that data reference patterns differ from instruction reference patterns: it is much less likely for the same data item to be referenced many times in the same loop. Hence the scope of the Dynamic Exclusion policy for data is limited, and a plain direct-mapped cache may give better data-cache performance, especially when the cache size is big.
3.2 Results with L1 and L2 Caches
In this case, we used a 96KB L2 and varied L1 from 4KB to 32KB. The results are shown in Figure 5 (excl. is Dynamic Exclusion; d.m. is direct-mapped without Dynamic Exclusion). When L1 is 8KB, the reductions in miss rate for Dynamic Exclusion over direct-mapped are: spice 18.6%, cc 10%, hotjava 19.1%.
From the results, we can see that Dynamic Exclusion does not need a very large L2 to obtain a reasonable improvement, though the rate of improvement varies across benchmarks. As before, Dynamic Exclusion gives good performance on Java byte code.
In terms of run time, although our two-level simulation must transfer hit-last bits between L1 and L2, the simulator does not slow down very much. For the two-level spice benchmark simulation, the overall run time is 117 seconds with the Dynamic Exclusion replacement policy, versus 91 seconds with the conventional LRU policy; hence, our simulator runs only 28.5% slower.
3.3 Results with longer line size
For Dynamic Exclusion with a longer line size, we used a 32-byte line size, again with a 96KB L2. The results are shown in Figure 6 (r.r.l. is McFarling's most recently referenced line; e.l. is our excluded line; d.m. is direct-mapped without Dynamic Exclusion).
When L1 is 8KB, the reduction in miss rate for the RRL over direct-mapped is: spice 6.8%, cc 2.9%, hotjava 6.8%. The reduction for the EL over direct-mapped is: spice 11.1%, cc 5.2%, hotjava 12%.
These results confirm that adding a single excluded-line entry provides additional associativity to the direct-mapped cache, like a victim cache. This is an extra improvement over McFarling's findings.
3.4 Results with a 3-bit overhead per L1 line
To evaluate the reduced-cost Dynamic Exclusion replacement policy, we ran Dinero simulations for three benchmarks. The results are shown in Figure 7 (L1 3 bits stands for three bits per L1 line and no L2 cache; L1&L2 uses two bits in L1 and a 96KB L2 cache; d.m. is direct-mapped without Dynamic Exclusion).
At a 32-byte line size, the reduction in miss rate for L1 3 bits over direct-mapped is: spice 9.0%, cc 1.5%, hotjava 10.8%; the reduction for L1 & L2 over direct-mapped is: spice 11.1%, cc 5.2%, hotjava 12.1%. The average miss rate of L1 3 bits increased by 33.9% over L1 & L2, but the cost in number of bits dropped by 78.5%.
Since our method uses two hit-last bits for each L1 cache line, it can capture the behavior of two conflicting cache lines; as soon as the conflict involves more than two lines, our method behaves worse than McFarling's. The simulation results show, however, that repeated conflicts among more than two instructions are infrequent.
From our Dinero simulations over different cache configurations, we find that the Dynamic Exclusion replacement policy is effective in reducing conflict misses in a direct-mapped cache: when the L1 size is 8KB, the average reduction in miss rate is 23%. However, the policy is sensitive to program behavior.
The Dynamic Exclusion policy also improves the performance of small data caches (<= 16KB). But since data access patterns are different and more random in nature, the policy cannot improve data-cache performance to the same extent. For larger caches, especially above 16KB, conventional direct-mapped data caches perform better than caches with Dynamic Exclusion, though caches with Dynamic Exclusion still offer a better overall (instruction and data) miss ratio.
Our proposed method of keeping an excluded line in the cache not only solves the longer-line-size problem but also increases the associativity of the direct-mapped cache like a victim cache; this method is 42.1% better in miss rate than McFarling's when the line size is 32 bytes.
Our method of using three bits in L1 eliminates the communication of the hit-last bit between L1 and L2, and reduces the cost of Dynamic Exclusion by 78.5% (for an 8KB L1 and a 96KB L2) while the miss rate increases by only 33.9%. This is a win in today's cost-sensitive personal computer market.
References
[1] Scott McFarling, "Cache Replacement with Dynamic Exclusion", Western Research Laboratory Technical Report, Digital Equipment Corporation, November 1991.
[2] Amitabh Srivastava and Alan Eustace, "ATOM: A System for Building Customized Program Analysis Tools", Western Research Laboratory Technical Report, Digital Equipment Corporation, March 1994.
[3] Class listing of tools used in our project, including Dinero and Atom. Dinero is a cache simulator; Atom allows Dinero traces to be created as applications run on Alpha workstations.
[4] List of URLs used as the source of benchmarks.