Incremental Path Profiling

Optimizing Compilers project by Laura Hiatt and Kevin Bierhoff (Spring 2006).

Project Description

The goal of our project is to look at what we will call “incremental path profiling”. Path profiling is an important part of compiler optimization; however, its computational overhead is quite high. Our project’s aim is to lessen the computational overhead of path profiling while still providing useful information about the hot paths of the program being compiled. We will do this by incrementally increasing the number of paths we profile by slowly adding instrumentation along high-use paths. By doing this, we hope to show that partial instrumentation of paths incurs less overhead than full path profiling while still providing enough information about which paths are hot to be useful in optimizations.

Full project proposal
Results poster

Path Profiling

Path profiling counts how often each path through a function is taken at runtime. Path profiles can be used by optimizing compilers. Functions can be compiled such that the “hot path”, the most frequently taken path through the function, executes particularly fast (Ammons & Larus, 1998).

While somewhat impractical for normal compilers, path profiling and subsequent recompilation to accommodate hot paths is feasible for modern Just in Time (JIT) compilers. Unfortunately, path profiling is expensive. For example, the original “efficient” algorithm proposed by Ball & Larus (1996) is reported by the authors to have 30% overhead. For a JIT compiler, 30% are a very high cost because they can be directly perceived by the end users of the software. One remedy is to work with edge profiles. Edge profiles are cheaper to acquire, but they also often fail to identify the hot path correctly (Ball & Larus, 1996, p. 10). Path profiling results in “complete counts”: After execution, the compiler knows exactly how often each path was taken.

Incremental Approach

In the incremental approach we only find the hot path and ignore other paths. We incrementally instrument functions just enough to identify the hot path. Based on Ball & Larus (1996) we use the following algorithm.

Introduce a unique node representing all EXITs. Remove all edges from the graph except the dummy back edge from EXIT to the entry node.
Take the first branch in the program, put its edges back into the graph, and connect their target nodes directly to EXIT.
Do the standard Ball & Larus instrumentation on the current graph.
Determine path counts for the current graph. A path number, e.g. 4, will in general represent multiple paths that start with the blocks included in the current graph.
Find the path that was taken the most, take its last node in the graph, remove that node's edge to EXIT and all instrumentation, add the next branch after that node into the graph, and connect that branch's targets to EXIT.
Repeat starting at 3 until a full path through the program is found.

The poster contains an example sequence of path graphs that lead to finding the complete hot path.

One benefit of this approach is that the Ball & Larus algorithms for path numbering and placing instrumentation can be used as they are described in their original paper (1996). Another benefit is that we always find the true hot path with this scheme. This is because we always select the (partial) path with the highest count for extention. Other heuristics for selecting the path to extend are conceivable. They may result in smaller graphs (and therefore less instrumentation overhead) but could potentially miss the true hot path. We defer an exploration of such heuristics to future work.

Experimental Results

Our implementation always finds the “true” hot path. It usually runs with less overhead than full path profiling. Ignoring file I/O the runtime of a partially instrumented program compares to full path profiling as follows.

test2: 80% - 113%
test5: 80% - 93%
test10: 68% - 76%
test20: 59% - 87%

This compares one run of a partially instrumented program with running a fully instrumented program. The number of iterations needed to determine the full hot path is linear in the length of the hot path. File I/O overhead accounts for almost 50% of the profiling overhead and can be ignored in a JIT compiler. It is unknown how many runs through the program one iteration needs to count before it can reliably identify the branch to extend. It should be smaller than the number of runs needed in full path profiling to find the complete hot path. We leave this question to future work.

Implementation

The implementation is based on the L3 compiler used in class. It hooks into the canonicalizer to insert instrumentation based on an incrementally extended path graph. The instrumentation uses an additional runtime library to write to a trace file that is analyzed in the next compiler run.

Results Spreadsheet

The experimental results are based on the timing (big) test cases from the L3 test suite. The spreadsheet can be read in the following way.

test*_path

Path (a)
0    XXX
1    XXX

Path (b)
0    XXX
1    XXX
Path    (c)
0    XXX
1    XXX

(a) section is the result of l3_test*.hot, so the full path. (b), (c), etc. are the result of hot_inc, so the result after each iteration.

Future Work

We think this is a promising result that can be further investigated in a number of ways.

Investigate the approach in a JIT compiler. Overhead due to recompilations can be a real problem. On the other hand the compiler could use partial path profiles early for optimizations.
Add instrumentation as soon as possible and see how often a function needs to be executed before its complete hot path can be determined. The number of runs necessary should not be too large compared to the number of runs necessary in full path profiling.
Investigate heuristics for path-to-extend selection that can potentially miss the hot path. Right now we instrument additional blocks that are not on that hot path. This lets us always find the true hot path.
Leverage opportunities to reduce the tracing overhead. For example, most paths remain unchanged from one iteration in the incremental algorithm to the next. It is not strictly necessary to count those unchanged paths again (assuming stable program behavior).