The localized iteration space is simply the set of innermost loops whose volume of data accessed in a single iteration does not exceed the cache size. Figure shows our algorithm for computing the localized iteration space, which consists of two major substeps. First, we use our reuse vector information to estimate the amount of data accessed by each loop (i.e. Tally_Loop_Traffic in Figure ). This involves visiting each array reference and projecting the amount of data it references onto the surrounding loops. We associate a running total of data accessed with each array reference; this running total is potentially multiplied by the number of iterations of each surrounding loop, depending on the type of reuse the reference has with respect to that loop. This first substep is a simplified version of algorithms proposed previously [64][24][21]. Once the amount of data accessed by each loop has been estimated, the second substep compares these amounts against the expected cache capacity to decide which loops are localized (i.e. Is_Localized in Figure ). This second substep is a separate top-down and then bottom-up pass over the loops, which allows us to decide that (i) a loop is definitely localized if any surrounding loop is definitely localized (perhaps because it has been actively ``blocked'' or ``tiled'' by the locality optimizer), and (ii) a loop is definitely not localized if any internal loop is definitely not localized (perhaps because it references too much data). Such information is particularly helpful when the amount of data referenced by a loop is a symbolic range, and therefore cannot be compared directly against a constant cache size.
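To make these two substeps concrete, the following is a minimal C sketch of the tallying and localization passes, under an assumed per-loop representation; the identifiers (tally_ref_traffic, mark_localized) and constants are illustrative, not the names used in Figure . Loops are numbered from 0 (outermost) to depth-1 (innermost).

\begin{verbatim}
enum reuse { NO_REUSE, SPATIAL_REUSE, TEMPORAL_REUSE };

#define LINE_SIZE       16L  /* cache line size in bytes (from the example) */
#define SPATIAL_SAVINGS 2L   /* iterations that share one line (assumed) */

/* Substep 1 (cf. Tally_Loop_Traffic): project one reference's traffic
 * onto its surrounding loops.  traffic[l] accumulates the bytes accessed
 * during a single iteration of loop l, summed over all leading references. */
void tally_ref_traffic(const enum reuse reuse[], const long trips[],
                       int depth, long traffic[])
{
    long running = LINE_SIZE;          /* one line per innermost iteration */
    for (int l = depth - 1; l >= 0; l--) {
        traffic[l] += running;
        switch (reuse[l]) {
        case TEMPORAL_REUSE:           /* same data each iteration: no growth */
            break;
        case SPATIAL_REUSE:            /* consecutive iterations share a line */
            running = running * trips[l] / SPATIAL_SAVINGS;
            break;
        case NO_REUSE:                 /* fresh data every iteration */
            running *= trips[l];
            break;
        }
    }
}

/* Substep 2 (cf. Is_Localized): compare per-loop traffic against the
 * effective cache size, then propagate "definitely localized" inward
 * (top-down) and "definitely not localized" outward (bottom-up). */
void mark_localized(const long traffic[], int depth,
                    long effective_cache, int localized[])
{
    for (int l = 0; l < depth; l++)
        localized[l] = (traffic[l] <= effective_cache);
    for (int l = 1; l < depth; l++)      /* outer localized => inner localized */
        if (localized[l - 1])
            localized[l] = 1;
    for (int l = depth - 2; l >= 0; l--) /* inner not localized => outer not */
        if (!localized[l + 1])
            localized[l] = 0;
}
\end{verbatim}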
Two complications arise in this algorithm: (i) estimating the amount of data accessed in the presence of symbolic loop bounds, and (ii) modeling cache capacity given the potential for mapping conflicts in direct-mapped or set-associative caches. To address the symbolic loop bound problem, we do three things. First, we use interprocedural constant propagation to eliminate as many symbolic loop bounds as possible. Second, we represent the loop iterations and the resulting data traffic as symbolic ranges, which can include simple linear expressions of symbolic variables. In some cases we can reason about these symbolic values, such as when they are indices of surrounding loops (an example of this will be shown later in Figure ). Third, when all else fails, we fall back on a default policy of assuming loop iteration counts to be either large or small. (A fourth option, which we explore later in Section , is to use control-flow feedback in such cases.) To address cache mapping conflicts, we approximate their effect simply by setting the ``effective'' cache size to a fraction of the actual cache size. We will discuss the robustness of these approximations later in Section , and will suggest further improvements in Section .
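As a rough sketch of how these two approximations could look in code (the fraction and the default trip counts below are assumed placeholders, not values taken from this paper):

\begin{verbatim}
#define CACHE_SIZE          8192L  /* the 8 Kbyte cache used in the examples */
#define EFFECTIVE_FRACTION  0.5    /* assumed discount for mapping conflicts */
#define DEFAULT_SMALL_TRIPS 1L     /* assumed "small" iteration count */
#define DEFAULT_LARGE_TRIPS 1000L  /* assumed "large" iteration count */

/* Effective capacity used by Is_Localized in place of the true cache size. */
long effective_cache_size(void)
{
    return (long)(CACHE_SIZE * EFFECTIVE_FRACTION);
}

/* Resolve a trip count that remains symbolic after interprocedural
 * constant propagation, using the default large/small policy. */
long resolve_trips(long trips, int is_known, int assume_large)
{
    return is_known ? trips
                    : (assume_large ? DEFAULT_LARGE_TRIPS : DEFAULT_SMALL_TRIPS);
}
\end{verbatim}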
Figure illustrates how our algorithm computes the localized iteration space for the example code introduced earlier in Section . We begin with the A[i][j] reference, which has spatial reuse along the inner loop and no reuse along the outer loop. During a single j iteration, A[i][j] will bring (at most) a single line into the cache (16 bytes). To compute the total data accessed by a single pass through the j loop, we multiply the current ``running total'' (16 bytes) by the total number of j iterations (100), and divide by the savings factor due to spatial reuse (2), yielding a total of 800 bytes. This is also the amount of data accessed by the A[i][j] reference during a single iteration of the i loop. Since A[i][j] has no reuse along the i loop, we simply multiply the running total (800 bytes) by the number of i iterations (3) to get a total of 2400 bytes accessed by A[i][j] in the entire loop nest. Next, we consider the B[j+1][0] reference, which has no reuse along the j loop and temporal reuse along the i loop. The data accessed during a single j iteration is again a single cache line (16 bytes), and because there is no reuse along j, we multiply by the total number of j iterations (100) to get 1600 bytes accessed during a single i iteration. Since B[j+1][0] has temporal reuse along i, the running total does not grow any further, and the total data accessed by B[j+1][0] during the loop nest is therefore 1600 bytes. We can ignore the B[j][0] reference, since it has group reuse with B[j+1][0], and B[j+1][0] is the leading reference. After tallying each reference's contribution to the total data traffic, we see that a single iteration of j accesses only 32 bytes (and therefore is definitely localized), and a single iteration of i accesses 2400 bytes, which is also likely to fit in an 8 Kbyte cache. Therefore our algorithm would conclude that both loops are within the localized iteration space.
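For reference, a plausible reconstruction of this loop nest in C is shown below; the element type and array shapes are assumptions inferred from the arithmetic above (8-byte elements, 16-byte lines, 3 i iterations, 100 j iterations):

\begin{verbatim}
double A[3][100], B[101][100];

int main(void)
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 100; j++)
            A[i][j] = B[j][0] + B[j + 1][0];
    return 0;
}
\end{verbatim}

Driving the tally sketch above with trip counts {3, 100}, reuse (none, spatial) for A[i][j], and reuse (temporal, none) for B[j+1][0] reproduces exactly these totals: 32 bytes per j iteration and 2400 bytes per i iteration.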
Figure presents a more complex example, where the upper bound of the inner loop is a symbolic variable (the outer loop index i). In this case, the amount of data accessed during a single i iteration is a symbolic expression (8i bytes), which cannot readily be compared against a fixed cache size. However, since the compiler understands how the variable i behaves, it recognizes that a triangular region within the A matrix is being accessed, and that i can be substituted with its average value to compute a total of 840 bytes accessed by the A[i][j] reference in this loop nest. Since 840 bytes is likely to fit in the 8 Kbyte cache, the compiler would again conclude that both loops are within the localized iteration space.
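Assuming bounds of the form i = 1..14 with the inner loop running i times (the bounds implied by the 840-byte total at 8 effective bytes per j iteration), the compiler's substitution is equivalent to evaluating the triangular sum
\[
\sum_{i=1}^{14} 8i \;=\; 8 \cdot \frac{14 \cdot 15}{2} \;=\; 840 \mbox{ bytes},
\]
which is the same as replacing i by its average value $(1+14)/2 = 7.5$ over the 14 outer iterations ($14 \times 8 \times 7.5 = 840$).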
We represent the localized iteration space as a vector space, so that it can be compared directly with our vector space representation of data reuse. For example, if both loops in a two-level loop nest are within the localized iteration space (as in Figures (a), , and ), the corresponding localized vector space would be represented as $\mathrm{span}\{(1,0),(0,1)\}$. If only the innermost loop is localized (as in Figure (b)), then the localized vector space would be $\mathrm{span}\{(0,1)\}$. We will now describe how the intrinsic data reuse and the localized iteration space can be combined to compute data locality.