Lecture 17, 18, 19 - Optimizing JIT compilers

Optimizing compilers
- look a lot like a traditional compiler
  - designed for better code quality
  - SSA or other IR designed for optimization
  - perform one or more optimizations on the IR
  - full-fledged register allocator
    - linear scan, graph coloring, or another algorithm
  - multiple backends for multiple architectures
- make use of profiling and type feedback
- differentiation from static compilers
  - speculative optimization and deopt
  - higher emphasis on compile speed
  - assemble bytes of machine code directly in memory

Design of an optimizing compiler
- typical design looks like a traditional compiler
  - Frontend: parse bytecode into IR directly, or walk the AST
  - Middle: optimizations and analysis
  - Backend: generate code from the IR
- possible to skip the IR, but that falls into the baseline or single-pass design
- will focus on the basic "three ends" model

First design choice: what does the IR look like?
- Instructions and basic blocks
- Static single assignment (SSA) form
  - every variable is assigned exactly once in the program
  - source-level variables become multiple IR-level variables
  - phi nodes merge data flow from different branches at merge points
  - every variable is defined by exactly one instruction => the instruction *is* the value
  - a definition must *dominate* its uses
    - dominance: X dom Y if all paths from the start to Y must first pass through X
    - guarantees that a value is computed before any of its uses
  - without SSA, many optimizations require reasoning about the overwriting of variables
  - ultimately another form of abstract interpretation
  (a small code sketch of SSA appears after the optimization list below)
- Graph-based IR, often called "sea of nodes"
  - "SSA on steroids"
  - control and effect edges represent ordering constraints
  - requires a scheduling pass before generating code

Implications of IR design:
- how easy/fast is it to build the IR?
  - transform from AST or bytecode into the IR
- how easy/fast are optimizations?
  - reasoning based on facts about inputs: a forward dataflow problem
  - reasoning based on uses: a backward dataflow problem
- how easy/fast is it to generate code?
  - maybe a second, lower-level IR is necessary before machine code?

Optimizations in the optimizing compiler
- Constant folding
- Common subexpression elimination
  (folding and CSE are sketched in code after this list)
- Speculative optimizations
  - speculate on types (source-level types, shapes)
  - speculate on numeric ranges (e.g. int, not double)
- Load elimination
- Type propagation
- Devirtualization
- Inlining
- Check elimination
- Write barrier elimination
- Dead code elimination
- Code layout
  - hot/cold path layout
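To make the SSA bullets concrete, here is a minimal sketch in Python (the `Instr` class and all variable names are invented for illustration) of an IR where each instruction *is* the value it defines. A source variable `x` reassigned on one branch becomes two SSA values, merged by a phi at the join point.

```python
# Minimal SSA-style IR sketch: each instruction *is* the value it defines,
# and use-def edges (the `inputs` list) replace variable names entirely.

class Instr:
    counter = 0
    def __init__(self, op, inputs=(), const=None):
        Instr.counter += 1
        self.id = Instr.counter
        self.op = op                # "const", "add", "phi", ...
        self.inputs = list(inputs)  # edges pointing at defining instructions
        self.const = const
    def __repr__(self):
        if self.op == "const":
            return f"v{self.id} = const {self.const}"
        args = ", ".join(f"v{i.id}" for i in self.inputs)
        return f"v{self.id} = {self.op}({args})"

# Source program:            SSA form:
#   x = 1                      x0 = const 1
#   if (c) x = x + 2           x1 = add(x0, const 2)   # in the then-block
#   use(x)                     x2 = phi(x0, x1)        # at the merge point
x0  = Instr("const", const=1)
two = Instr("const", const=2)
x1  = Instr("add", [x0, two])   # the redefinition of x is a new SSA value
x2  = Instr("phi", [x0, x1])    # merges whichever definition reached here
for v in (x0, two, x1, x2):
    print(v)
```

Because the instruction is the value, there are no variable names left to reason about: any later pass that wants "the value of x here" just follows an edge.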
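Building on the hypothetical `Instr` class above, constant folding and CSE can be sketched as a single forward walk: because SSA definitions dominate their uses, visiting instructions in schedule order sees every input before its uses. This is a toy value-numbering pass under those assumptions, not any particular engine's implementation.

```python
def fold_and_cse(instrs):
    canon = {}   # original Instr -> the value that replaces it
    seen = {}    # value-number key -> canonical Instr
    out = []
    for orig in instrs:
        ins = [canon.get(x, x) for x in orig.inputs]
        if orig.op == "add" and all(x.op == "const" for x in ins):
            # constant folding: add of two constants becomes a constant
            cand = Instr("const", const=ins[0].const + ins[1].const)
        else:
            orig.inputs = ins
            cand = orig
        # CSE via value numbering: same op on the same inputs is the
        # same SSA value (a real pass would also key phis on their block)
        key = (cand.op, cand.const, tuple(x.id for x in cand.inputs))
        if key in seen:
            canon[orig] = seen[key]
        else:
            seen[key] = cand
            canon[orig] = cand
            out.append(cand)
    return out

a  = Instr("const", const=1)
b  = Instr("const", const=2)
s1 = Instr("add", [a, b])
s2 = Instr("add", [a, b])          # repeated subexpression
for v in fold_and_cse([a, b, s1, s2]):
    print(v)                       # const 1, const 2, and a single const 3
```

Both adds fold to `const 3`, and value numbering then collapses the duplicates into one value, showing why SSA makes these two optimizations nearly free.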
Speculation
- Idea: incorporate information recorded during interpretation/baseline execution
- turn previous observations into assumptions
- insert guards into the code to check that the assumptions hold
- optimize the code under the new assumptions
- usually happens fairly early in the compilation pipeline
  - helps global type analysis do better
- what to do if assumptions fail?
  - compile a slow path (compensating code)
  - deoptimize to a lower tier (so the optimized code can assume the slow path never happens)
  (a guard/deopt sketch appears at the end of these notes)

Deoptimization
- requires reconstructing execution frame(s) for the lower tiers
- "side-exits" or "deopt points": places where a guard fails and execution returns to a lower tier
- every deopt point carries metadata or compensating code to allow frame reconstruction
- can consume a lot of space that is never used unless a deopt actually happens

Lowering
- while single-pass compilers directly generate the machine code sequence for each bytecode or AST node, optimizing compilers usually *lower* operations in stages:
  - translate high-level operations into low-level operations within the same IR (TurboFan)
  - translate a high-level IR into a separate low-level IR (Crankshaft and TurboFan)

Instruction selection
- low-level IR often has simpler operations than those available as CPU instructions
- or, conversely, most ISAs have more complex instructions than just load/store/arith
  - examples: x86 addressing modes, arith-with-immediate, load/store at an offset, etc.
- instruction selection takes 1 or more low-level operations and generates 1 machine instruction
- tree-based pattern matching
  - work over the instructions in reverse order
  - try to "cover" expression trees with "tiles"
    - a "tile" corresponds to a single machine instruction (typically)
    - a tile is connected to other tiles via virtual registers (vregs)
    - an unbounded supply of virtual registers is usually assumed
  (a tiling sketch appears at the end of these notes)

SSA deconstruction
- need to convert SSA form into non-SSA form for execution on a real machine
- typical approach: insert moves in the predecessors of the blocks containing phis (merge points)

Register allocation
- typically done after instruction selection, when the machine instructions are known
- task: map each virtual register to a machine register
- constraint: two vregs that are live at the same time cannot be in the same physical register
- too many live vregs at once: split and spill vregs to the stack frame
  - spill: the vreg lives on the stack for its entire lifetime and is loaded just before each use
  - split: the vreg's live range is broken into shorter ranges that are (re)allocated individually
  (a linear-scan sketch appears at the end of these notes)
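The guard-and-side-exit idea from the Speculation and Deoptimization sections can be sketched as follows (all names invented; a real VM reconstructs an interpreter frame from deopt metadata rather than calling a function): type feedback said `+` only ever saw ints, so the optimized code guards that assumption and side-exits to the generic path when it fails.

```python
# Sketch of speculation with a guard and a side exit.

def generic_add(a, b):          # stands in for the interpreter's slow path
    return a + b                # full semantics: strings, doubles, etc.

def deoptimize(reason, a, b):
    # a real VM reconstructs the lower-tier frame here from deopt metadata;
    # this sketch just re-enters the generic path directly
    print(f"deopt: {reason}")
    return generic_add(a, b)

def optimized_add(a, b):
    # guard: check the speculated types before taking the fast path
    if type(a) is int and type(b) is int:
        return a + b            # code compiled under the int assumption
    return deoptimize("unexpected operand types", a, b)

print(optimized_add(3, 4))      # fast path
print(optimized_add("a", "b"))  # guard fails -> side exit -> slow path
```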
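For the instruction-selection section, here is a small greedy "maximal munch" tiling sketch, one common variant of tree-based pattern matching (the tuple IR and instruction mnemonics are invented): each matched tile emits one machine-like instruction, and tiles connect through fresh vregs. Note how the load tile swallows the inner address computation, mirroring the x86 base+offset addressing-mode example.

```python
# Greedy "maximal munch" tiling: bigger tiles are tried first, so one
# machine instruction can cover several IR operations.
from itertools import count

vreg = count()                     # unbounded supply of virtual registers
code = []

def select(node):
    """Cover `node` with tiles; return the vreg holding its value."""
    op = node[0]
    # tile: load(add(base, const k)) => a single load with an offset
    if op == "load" and node[1][0] == "add" and node[1][2][0] == "const":
        base = select(node[1][1])
        r = next(vreg)
        code.append(f"v{r} = mov [v{base} + {node[1][2][1]}]")
        return r
    if op == "load":
        addr = select(node[1])
        r = next(vreg)
        code.append(f"v{r} = mov [v{addr}]")
        return r
    # tile: add(x, const k) => add-with-immediate
    if op == "add" and node[2][0] == "const":
        x = select(node[1])
        r = next(vreg)
        code.append(f"v{r} = add v{x}, #{node[2][1]}")
        return r
    if op == "add":
        x, y = select(node[1]), select(node[2])
        r = next(vreg)
        code.append(f"v{r} = add v{x}, v{y}")
        return r
    if op == "param":
        r = next(vreg)
        code.append(f"v{r} = param {node[1]}")
        return r
    raise ValueError(op)

# p + mem[p + 8]: the load+offset tile covers the inner add for free
tree = ("add", ("param", "p"), ("load", ("add", ("param", "p"), ("const", 8))))
select(tree)
print("\n".join(code))
```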
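Finally, a minimal linear-scan allocator in the spirit of Poletto and Sarkar's algorithm, over precomputed live intervals `(vreg, start, end)`; the 2-register machine and all names are invented. It spills a whole vreg lifetime to a stack slot, i.e. the "spill" case in the notes above; a production allocator would instead *split* live ranges and reallocate the pieces.

```python
# Linear scan: walk intervals by start point, keep an `active` set sorted
# by end point, and when out of registers evict whichever live interval
# ends last (it blocks a register for the longest time).

def linear_scan(intervals, regs=("r0", "r1")):
    intervals = sorted(intervals, key=lambda iv: iv[1])  # by start point
    free, active, loc = list(regs), [], {}
    slot = 0
    for vreg, start, end in intervals:
        # expire intervals that ended before this one starts
        for old in [a for a in active if a[2] <= start]:
            active.remove(old)
            free.append(loc[old[0]])
        if free:
            loc[vreg] = free.pop()
            active.append((vreg, start, end))
        else:
            victim = max(active, key=lambda iv: iv[2])   # ends last
            if victim[2] > end:
                loc[vreg] = loc[victim[0]]               # steal its register
                loc[victim[0]] = f"stack{slot}"; slot += 1
                active.remove(victim)
                active.append((vreg, start, end))
            else:
                loc[vreg] = f"stack{slot}"; slot += 1    # spill the new vreg
    return loc

# three overlapping vregs, two registers: one must live on the stack
print(linear_scan([("v0", 0, 10), ("v1", 2, 4), ("v2", 3, 6)]))
# -> v0 is spilled, since its interval ends last of the three
```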