Ideally, we would like to incorporate the notion of coherence misses into
our framework of locality analysis, so that a single model could
predict all forms of misses. How can this be accomplished? As you may
recall from Section , data locality occurs
whenever a cache line is reused, provided that the line has not been
ejected from the cache since it was last used. In a uniprocessor
architecture, a cache line will only be ejected from the cache if it is
replaced by another reference that maps into the same cache entry. Our
algorithm predicts that such replacements will occur whenever the volume of
data accessed between reuses exceeds the effective size of the cache. In a
multiprocessor architecture, a cache line can also be ejected from the
cache if the line is modified by another processor during the
interval between reuses, hence invalidating it from the given processor's
cache (we assume an invalidation-based coherence protocol). Under locality
analysis, the ejection of data from the cache is modeled through the localized iteration space, which is the set of loops that can carry data
locality. Therefore, to predict coherence misses, we extend the concept of
the localized iteration space to include not only the volume of data
accessed, but also whether lines are being modified by other processors.
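As a concrete illustration of the data-volume test, the following minimal C sketch works outward from the innermost loop of a nest and counts how many loops are localized. The names LoopNest, volume_bytes, and EFFECTIVE_CACHE_SIZE are illustrative placeholders, not the data structures actually used by our compiler, and the volumes are assumed to have been estimated already by the locality analysis.

```c
#include <stddef.h>
#include <stdio.h>

#define EFFECTIVE_CACHE_SIZE (64u * 1024u)   /* assumed effective capacity, in bytes */

typedef struct {
    int    depth;                 /* number of loops in the nest, innermost last */
    size_t volume_bytes[8];       /* estimated data accessed between reuses at each level */
} LoopNest;

/* Returns how many of the innermost loops are localized, i.e. how many
 * loops carry reuse that is predicted to survive in the cache without
 * being replaced. */
static int localized_depth(const LoopNest *nest)
{
    int localized = 0;
    for (int level = nest->depth - 1; level >= 0; level--) {
        if (nest->volume_bytes[level] > EFFECTIVE_CACHE_SIZE)
            break;                /* volume between reuses exceeds the cache */
        localized++;
    }
    return localized;
}

int main(void)
{
    /* Hypothetical 2-deep nest: the inner loop touches 32 KB between reuses,
     * the outer loop 4 MB, so only the inner loop is localized. */
    LoopNest nest = { .depth = 2,
                      .volume_bytes = { 4u * 1024u * 1024u, 32u * 1024u } };
    printf("localized loops: %d\n", localized_depth(&nest));
    return 0;
}
```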
Accurately predicting when another processor modifies a given data line is
a very difficult problem for the compiler, and requires a complete
understanding of the communication and synchronization patterns of an
application, as well as the data addresses being accessed. In cases where the
application has been parallelized by the compiler, this may be feasible
since presumably the compiler must understand the communication patterns
very well to perform the parallelization. However, even in such cases,
factors such as dynamic scheduling may
make it difficult to predict exactly when the modifications occur
relative to other processors, and which lines are being modified. In
addition, while it may be tractable to understand when particular data
items are shared among processors (i.e., true sharing), it is more
difficult to predict the coherence misses that occur only because separate
items happen to fall within the same cache line (i.e., false sharing; see
the example below) [81][19]. Finally, when the compiler is dealing with
explicitly-parallelized programs, as is the case in our experiments, it is
difficult (if not impossible) to extract the communication patterns, since
much of this semantic information is contained only in the programmer's
head.
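The following sketch is our own illustration of false sharing, not code drawn from the benchmark programs: two threads update different elements of a shared array, so no data item is truly shared, yet the elements fall within one cache line and each write invalidates that line in the other processor's cache.

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000L

/* counters[0] and counters[1] are adjacent, so they almost certainly
 * fall within the same cache line. */
static long counters[2];

static void *worker(void *arg)
{
    long id = *(long *)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id]++;   /* each write invalidates the line in the other processor's cache */
    return NULL;
}

int main(void)
{
    long ids[2] = { 0, 1 };
    pthread_t t0, t1;

    pthread_create(&t0, NULL, worker, &ids[0]);
    pthread_create(&t1, NULL, worker, &ids[1]);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);

    printf("%ld %ld\n", counters[0], counters[1]);
    return 0;
}
```

Padding the two counters so that they fall on separate cache lines would eliminate these coherence misses without changing the program's semantics.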
Although the compiler cannot precisely understand the communication
patterns in the explicitly-parallelized codes used in our experiments, a
useful hint is provided by the explicit synchronization statements
inserted by the programmer to ensure that shared data are
accessed safely. Assuming that a program is ``properly
labeled'' [27] or
``data-race-free'' [2], synchronization statements should
exist between the time when one processor modifies data and the time when
a second processor reads that data. Therefore, the compiler interprets explicit
synchronization as a hint that data communication may be taking place.
Ideally, the compiler would know which data are protected by any given
synchronization variable, since such information would allow it
to reason more precisely about which particular variables may have been
modified. However, since such semantic information is kept only in the
programmer's head, our approach is to conservatively assume the worst,
which is that all shared objects may have been modified, and therefore no
locality is carried across explicit synchronization. We incorporate this
notion into our localized iteration space model by saying that a loop
is not localized if it either (i) accesses too much data (to model
replacement misses, as described earlier in
Section ), or (ii) contains explicit
synchronization (to model coherence misses).
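As a minimal sketch of this extended test, again with illustrative names rather than our compiler's actual data structures, a loop is localized only if it passes both the data-volume check and the synchronization check:

```c
#include <stdbool.h>
#include <stddef.h>

#define EFFECTIVE_CACHE_SIZE (64u * 1024u)   /* assumed effective capacity, in bytes */

typedef struct {
    size_t data_volume_bytes;       /* estimated data accessed between reuses */
    bool   contains_explicit_sync;  /* locks, barriers, etc. inside the loop body */
} Loop;

/* A loop is localized only if (i) the data accessed between reuses fits in
 * the effective cache (modeling replacement misses) and (ii) its body
 * contains no explicit synchronization (modeling coherence misses, since
 * all shared data are conservatively assumed to be modified across
 * synchronization). */
static bool loop_is_localized(const Loop *loop)
{
    if (loop->data_volume_bytes > EFFECTIVE_CACHE_SIZE)
        return false;   /* (i) too much data: replacement misses predicted */
    if (loop->contains_explicit_sync)
        return false;   /* (ii) coherence misses predicted across synchronization */
    return true;
}
```

Treating synchronization this way is conservative: it may discard locality that is in fact preserved, but it never predicts locality across a point where another processor could have invalidated the line.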