Lecture 17, 18, 19 - Optimizing JIT compilers

Optimizing compilers
- look a lot like a traditional compiler
  - designed for better code quality
  - SSA or other IR designed for optimization
  - perform one or more optimizations on the IR
  - full-fledged register allocator
    - linear scan, graph coloring, or another algorithm
  - multiple backends for multiple architectures
- make use of profiling and type feedback
- differentiation from static compilers
  - speculative optimization and deopt
  - higher emphasis on compile speed
  - assemble bytes of machine code directly in memory

Design of an optimizing compiler
- typical design looks like a traditional compiler
  - Frontend: parse bytecode into IR directly, or walk the AST
  - Middle: optimizations and analysis
  - Backend: generate code from the IR
- possible to skip the IR, but that falls into the baseline or single-pass design
- will focus on the basic "three ends" model

First design choice: what does the IR look like?
- Instructions and basic blocks
- Static single assignment (SSA) form
  - every variable is assigned exactly once in the program
  - source-level variables become multiple IR-level variables
  - phi nodes merge data flow from different branches at merge points
  - every variable is defined by exactly one instruction => the instruction *is* the value
  - a definition must *dominate* its uses
    - dominance: X dom Y if all paths from the start to Y must first pass through X
    - guarantees that a value is computed before any of its uses
  - without SSA, many optimizations require reasoning about the overwriting of variables
  - ultimately another form of abstract interpretation
  (a small code sketch of SSA appears after the optimization list below)
- Graph-based IR, often called "sea of nodes"
  - "SSA on steroids"
  - control and effect edges represent ordering constraints
  - requires a scheduling pass before generating code

Implications of IR design:
- how easy/fast is it to build the IR?
  - transform from AST or bytecode into the IR
- how easy/fast are optimizations?
  - reasoning based on facts about inputs: a forward dataflow problem
  - reasoning based on uses: a backward dataflow problem
- how easy/fast is it to generate code?
  - maybe a second, lower-level IR is necessary before machine code?

Optimizations in the optimizing compiler
- Constant folding
- Common subexpression elimination
  (folding and CSE are sketched in code after this list)
- Speculative optimizations
  - speculate on types (source-level types, shapes)
  - speculate on numeric ranges (e.g. int, not double)
- Load elimination
- Type propagation
- Devirtualization
- Inlining
- Check elimination
- Write barrier elimination
- Dead code elimination
- Code layout
  - hot/cold path layout
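To make the SSA bullets concrete, here is a minimal sketch in Python (the `Instr` class and all variable names are invented for illustration) of an IR where each instruction *is* the value it defines. A source variable `x` reassigned on one branch becomes two SSA values, merged by a phi at the join point.

```python
# Minimal SSA-style IR sketch: each instruction *is* the value it defines,
# and use-def edges (the `inputs` list) replace variable names entirely.

class Instr:
    counter = 0
    def __init__(self, op, inputs=(), const=None):
        Instr.counter += 1
        self.id = Instr.counter
        self.op = op                # "const", "add", "phi", ...
        self.inputs = list(inputs)  # edges pointing at defining instructions
        self.const = const
    def __repr__(self):
        if self.op == "const":
            return f"v{self.id} = const {self.const}"
        args = ", ".join(f"v{i.id}" for i in self.inputs)
        return f"v{self.id} = {self.op}({args})"

# Source program:            SSA form:
#   x = 1                      x0 = const 1
#   if (c) x = x + 2           x1 = add(x0, const 2)   # in the then-block
#   use(x)                     x2 = phi(x0, x1)        # at the merge point
x0  = Instr("const", const=1)
two = Instr("const", const=2)
x1  = Instr("add", [x0, two])   # the redefinition of x is a new SSA value
x2  = Instr("phi", [x0, x1])    # merges whichever definition reached here
for v in (x0, two, x1, x2):
    print(v)
```

Because the instruction is the value, there are no variable names left to reason about: any later pass that wants "the value of x here" just follows an edge.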
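Building on the hypothetical `Instr` class above, constant folding and CSE can be sketched as a single forward walk: because SSA definitions dominate their uses, visiting instructions in schedule order sees every input before its uses. This is a toy value-numbering pass under those assumptions, not any particular engine's implementation.

```python
def fold_and_cse(instrs):
    canon = {}   # original Instr -> the value that replaces it
    seen = {}    # value-number key -> canonical Instr
    out = []
    for orig in instrs:
        ins = [canon.get(x, x) for x in orig.inputs]
        if orig.op == "add" and all(x.op == "const" for x in ins):
            # constant folding: add of two constants becomes a constant
            cand = Instr("const", const=ins[0].const + ins[1].const)
        else:
            orig.inputs = ins
            cand = orig
        # CSE via value numbering: same op on the same inputs is the
        # same SSA value (a real pass would also key phis on their block)
        key = (cand.op, cand.const, tuple(x.id for x in cand.inputs))
        if key in seen:
            canon[orig] = seen[key]
        else:
            seen[key] = cand
            canon[orig] = cand
            out.append(cand)
    return out

a  = Instr("const", const=1)
b  = Instr("const", const=2)
s1 = Instr("add", [a, b])
s2 = Instr("add", [a, b])          # repeated subexpression
for v in fold_and_cse([a, b, s1, s2]):
    print(v)                       # const 1, const 2, and a single const 3
```

Both adds fold to `const 3`, and value numbering then collapses the duplicates into one value, showing why SSA makes these two optimizations nearly free.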
Speculation
- Idea: incorporate information recorded during interpretation/baseline execution
- turn previous observations into assumptions
- insert guards into the code to check that the assumptions hold
- optimize the code under the new assumptions
- usually happens fairly early in the compilation pipeline
  - helps global type analysis do better
- what to do if assumptions fail?
  - compile a slow path (compensating code)
  - deoptimize to a lower tier (so the optimized code can assume the slow path never happens)
  (a guard/deopt sketch appears at the end of these notes)

Deoptimization
- requires reconstructing execution frame(s) for the lower tiers
- "side-exits" or "deopt points": places where a guard fails and execution returns to a lower tier
- every deopt point carries metadata or compensating code to allow frame reconstruction
- can consume a lot of space that is never used unless a deopt actually happens

Lowering
- while single-pass compilers directly generate the machine code sequence for each bytecode or AST node, optimizing compilers usually *lower* operations in stages:
  - translate high-level operations into low-level operations within the same IR (TurboFan)
  - translate a high-level IR into a separate low-level IR (Crankshaft and TurboFan)

Instruction selection
- low-level IR often has simpler operations than those available as CPU instructions
- or, conversely, most ISAs have more complex instructions than just load/store/arith
  - examples: x86 addressing modes, arith-with-immediate, load/store at an offset, etc.
- instruction selection takes 1 or more low-level operations and generates 1 machine instruction
- tree-based pattern matching
  - work over the instructions in reverse order
  - try to "cover" expression trees with "tiles"
    - a "tile" corresponds to a single machine instruction (typically)
    - a tile is connected to other tiles via virtual registers (vregs)
    - an unbounded supply of virtual registers is usually assumed
  (a tiling sketch appears at the end of these notes)

SSA deconstruction
- need to convert SSA form into non-SSA form for execution on a real machine
- typical approach: insert moves in the predecessors of the blocks containing phis (merge points)

Register allocation
- typically done after instruction selection, when the machine instructions are known
- task: map each virtual register to a machine register
- constraint: two vregs that are live at the same time cannot be in the same physical register
- too many live vregs at once: split and spill vregs to the stack frame
  - spill: the vreg lives on the stack for its entire lifetime and is loaded just before each use
  - split: the vreg's live range is broken into shorter ranges that are (re)allocated individually
  (a linear-scan sketch appears at the end of these notes)
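The guard-and-side-exit idea from the Speculation and Deoptimization sections can be sketched as follows (all names invented; a real VM reconstructs an interpreter frame from deopt metadata rather than calling a function): type feedback said `+` only ever saw ints, so the optimized code guards that assumption and side-exits to the generic path when it fails.

```python
# Sketch of speculation with a guard and a side exit.

def generic_add(a, b):          # stands in for the interpreter's slow path
    return a + b                # full semantics: strings, doubles, etc.

def deoptimize(reason, a, b):
    # a real VM reconstructs the lower-tier frame here from deopt metadata;
    # this sketch just re-enters the generic path directly
    print(f"deopt: {reason}")
    return generic_add(a, b)

def optimized_add(a, b):
    # guard: check the speculated types before taking the fast path
    if type(a) is int and type(b) is int:
        return a + b            # code compiled under the int assumption
    return deoptimize("unexpected operand types", a, b)

print(optimized_add(3, 4))      # fast path
print(optimized_add("a", "b"))  # guard fails -> side exit -> slow path
```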
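For the instruction-selection section, here is a small greedy "maximal munch" tiling sketch, one common variant of tree-based pattern matching (the tuple IR and instruction mnemonics are invented): each matched tile emits one machine-like instruction, and tiles connect through fresh vregs. Note how the load tile swallows the inner address computation, mirroring the x86 base+offset addressing-mode example.

```python
# Greedy "maximal munch" tiling: bigger tiles are tried first, so one
# machine instruction can cover several IR operations.
from itertools import count

vreg = count()                     # unbounded supply of virtual registers
code = []

def select(node):
    """Cover `node` with tiles; return the vreg holding its value."""
    op = node[0]
    # tile: load(add(base, const k)) => a single load with an offset
    if op == "load" and node[1][0] == "add" and node[1][2][0] == "const":
        base = select(node[1][1])
        r = next(vreg)
        code.append(f"v{r} = mov [v{base} + {node[1][2][1]}]")
        return r
    if op == "load":
        addr = select(node[1])
        r = next(vreg)
        code.append(f"v{r} = mov [v{addr}]")
        return r
    # tile: add(x, const k) => add-with-immediate
    if op == "add" and node[2][0] == "const":
        x = select(node[1])
        r = next(vreg)
        code.append(f"v{r} = add v{x}, #{node[2][1]}")
        return r
    if op == "add":
        x, y = select(node[1]), select(node[2])
        r = next(vreg)
        code.append(f"v{r} = add v{x}, v{y}")
        return r
    if op == "param":
        r = next(vreg)
        code.append(f"v{r} = param {node[1]}")
        return r
    raise ValueError(op)

# p + mem[p + 8]: the load+offset tile covers the inner add for free
tree = ("add", ("param", "p"), ("load", ("add", ("param", "p"), ("const", 8))))
select(tree)
print("\n".join(code))
```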
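Finally, a minimal linear-scan allocator in the spirit of Poletto and Sarkar's algorithm, over precomputed live intervals `(vreg, start, end)`; the 2-register machine and all names are invented. It spills a whole vreg lifetime to a stack slot, i.e. the "spill" case in the notes above; a production allocator would instead *split* live ranges and reallocate the pieces.

```python
# Linear scan: walk intervals by start point, keep an `active` set sorted
# by end point, and when out of registers evict whichever live interval
# ends last (it blocks a register for the longest time).

def linear_scan(intervals, regs=("r0", "r1")):
    intervals = sorted(intervals, key=lambda iv: iv[1])  # by start point
    free, active, loc = list(regs), [], {}
    slot = 0
    for vreg, start, end in intervals:
        # expire intervals that ended before this one starts
        for old in [a for a in active if a[2] <= start]:
            active.remove(old)
            free.append(loc[old[0]])
        if free:
            loc[vreg] = free.pop()
            active.append((vreg, start, end))
        else:
            victim = max(active, key=lambda iv: iv[2])   # ends last
            if victim[2] > end:
                loc[vreg] = loc[victim[0]]               # steal its register
                loc[victim[0]] = f"stack{slot}"; slot += 1
                active.remove(victim)
                active.append((vreg, start, end))
            else:
                loc[vreg] = f"stack{slot}"; slot += 1    # spill the new vreg
    return loc

# three overlapping vregs, two registers: one must live on the stack
print(linear_scan([("v0", 0, 10), ("v1", 2, 4), ("v2", 3, 6)]))
# -> v0 is spilled, since its interval ends last of the three
```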