Loop Optimizations and Inlining
===============================

Loop optimizations
------------------

Loop invariant code motion (ch 15, uses reaching defs)

Applies to an expression if all of its reaching definitions are outside the
loop (or if some are inside the loop, but are themselves loop-invariant).

Example:

    void copy_offset(int *a, int offset, int n) {
        for (int i = 0; i < n; ++i) {
            // conceptually: d = a + offset + i, f = a + offset + i + n
            int *c = a + offset;
            int *d = c + i;
            int *e = c + n;
            int *f = e + i;
            *d = *f;
        }
    }

Version after loop invariant code motion:

    void copy_offset(int *a, int offset, int n) {
        int *c = a + offset;
        int *e = c + n;
        for (int i = 0; i < n; ++i) {
            int *d = c + i;
            int *f = e + i;
            *d = *f;
        }
    }

Strength reduction

Turn an expensive operation into a cheaper one, using loop structure. For
example, consider induction variables, which are increased or decreased by
a constant amount in each loop iteration. A common application of strength
reduction is multiplying an induction variable by a loop-invariant value:
the product can be replaced by another induction variable, to which the
loop-invariant value is added each iteration.

Example:

    void compute_multiples(int count, int factor, int output[]) {
        for (int i = 0; i < count; ++i) {
            output[i] = i * factor;
        }
    }

Version with strength reduction:

    void compute_multiples(int count, int factor, int output[]) {
        for (int i = 0, m = 0; i < count; ++i) {
            output[i] = m;
            m = m + factor;
        }
    }

Loop unrolling

Duplicate the body of a loop. This can reduce the overhead of the loop
header and allow optimization across two loop bodies. In some cases the
loop is known to execute exactly n times, so it can be unrolled n times,
yielding straight-line code. In other cases a loop remains, but its
overhead is reduced.
Example: see compute_multiples above

Unrolled version of compute_multiples:

    void compute_multiples(int count, int factor, int output[]) {
        for (int i = 0, m = 0; i < count; ++i) {
            output[i] = m;
            m = m + factor;
            ++i;
            if (i >= count) break;
            output[i] = m;
            m = m + factor;
        }
    }

Software pipelining

Issue: a single iteration of a loop may perform a sequence of operations
that depend on each other. If the operations take more than one cycle,
whether due to computation or memory access delays, there will be
pipeline stalls.

Software pipelining overlaps the computations of two iterations of a
loop. We must rename variables to avoid conflicts, and we must add
special code to set up ("prime") and shut down ("flush") the pipeline.

Example:

    loop                  // 9 cycles per loop iteration
      load a              // 1 cycle + 3 stalls
      t := a * 15         // 1 cycle + 3 stalls
      sum := sum + t      // 1 cycle

Conceptual software pipelining - consider 3 iterations overlapping:

    loop
      1. load a
      2. load b
      3. load c
           2. t := a * 15
           3. u := b * 15
           4. v := c * 15
                3. sum := sum + t
                4. sum := sum + u
                5. sum := sum + v

Software pipelining - resulting code:

    load a                // prime the pipeline
    load b
    t := a * 15
    loop                  // 7 cycles per loop iteration
      load c              // 1 cycle + 1 stall
      u := b * 15         // 1 cycle + 1 stall
      sum := sum + t      // 1 cycle
      b := c              // 1 cycle
      t := u              // 1 cycle
    u := b * 15           // flush the pipeline
    sum := sum + t
    sum := sum + u

Exercise: apply one of the above optimizations to the following code

  - Loop invariant code motion
  - Loop unrolling
  - Strength reduction
  - Software pipelining

Interprocedural optimization
----------------------------

Interprocedural optimization means optimization across procedure
boundaries. It requires the compiler to have the whole program available.
JIT compilers have the whole program available, although they must be
prepared for more code to be loaded at any time. While C and Java
compilers normally operate one file at a time, we can construct compilers
for these languages that accept a whole program and assume it is
complete.
Many local optimizations can be done interprocedurally if we gather the
correct information. E.g. interprocedural constant propagation can detect
that a function is only ever called with constant arguments, and optimize
the function body accordingly.

If a function is sometimes called with a constant value and sometimes
not, we can *specialize* the function for the constant value.
Specialization makes a copy of the function and optimizes it based on
some information, in this case the constant argument. At the call sites
that invoke the function with that constant, we use the specialized
version of the function instead of the normal one.

Inlining

One of the most important interprocedural optimizations is inlining. This
optimization substitutes a call to a function with the body of that
function.

  benefit: eliminates the overhead of the function call (good for small
           functions)
  benefit: enables optimization of the function based on call-site
           information (like specialization)
  cost: duplicates the function (worse for large functions)

Requirement for inlining: we need to know what function is called. This
is typically discovered through dataflow analysis:

  - In functional languages: starting from function definitions, we trace
    each function value through program variables to call sites. If only
    one function definition flows to a call site, we can inline it (or
    alternatively make a direct call instead of an indirect call).
  - In OO languages: starting from new statements, we track the class of
    an object through variables until a method is invoked on it. If there
    is only one possible class, we know which method is invoked and can
    inline it (or make a direct call instead of using OO dispatch).

Example (functional programming):

    map(list, fn x => x+1)

Original code in map(f):

    current = list
    while (current != nil) do {
      ...
      new_val = f(current.value)
      ...
    }

Inlined map code:

    current = list
    while (current != nil) do {
      ...
      new_val = current.value + 1
      ...
    }

Example of inlining combined with other optimizations above:

    let fact n = {
      var result:int = 1;
      for (var i = n; i > 0; --i)
        result = result * i
      return result
    }

    let apply (f, v) = f(v)

    apply fact 4

OPTIMIZATION STEPS:

inline apply -> last line becomes

    fact(4)

inline fact ->

    var result:int = 1;
    for (var i = 4; i > 0; --i)
      result = result * i
    return result

loop unrolling ->

    var result:int = 1;
    i = 4
    result = result * i
    --i;
    result = result * i
    --i;
    result = result * i
    --i;
    result = result * i
    --i;
    result

constant propagation and folding ->

    24