Lectures 15+ - Garbage Collection
- Most safe languages with objects and/or closures need a VM with memory management
  - Manual memory management is generally *unsafe*--leads to use-after-free bugs
  - Some form of automatic memory management is inevitably needed
  - Java, C#, JavaScript, Smalltalk, Lisp, ML, Haskell, Python, Ruby, Virgil, almost everything: garbage collection
  - C/C++: manual, no safety guaranteed by the runtime system
  - Swift: automatic reference counting
  - Rust: ownership and borrowing system, plus reference counts
- Two fundamental concepts in automatic memory management:
  - reference counting: associate a count with every "object"
    - manage the count on ref'ing and unref'ing
    - delete on reaching zero; may transitively delete
  - tracing: find live objects by traversing references at runtime
  - most garbage collection algorithms are actually a mix of both
    - "A Unified Theory of Garbage Collection" by Bacon et al.
  - reference counting cannot deal with cyclic garbage
  - tracing may repeatedly touch live objects, wasting runtime
- Terminology
  - heap: the storage for objects
    - a set of one or more contiguous regions of memory
    - usually organized as blocks of "words" (sometimes called "cells")
    - may have alignment restrictions (e.g. a word or two words, as we've seen)
  - mutator: the program that changes the heap
    - may have one or more threads of execution
    - allocates new objects and changes them, changing the object graph
  - roots: global variables, the stack, and other references that are not in objects
  - collector: one or more threads that execute garbage-collector logic, determining unreachable objects and reclaiming their storage
  - liveness: an object is live if it will be accessed in the future execution of the mutator => liveness is an *undecidable* problem
  - reachability: an object is reachable if it can be reached by transitively following references from the roots
    - an approximation of liveness; a (safe) program cannot access objects that it has no references to
  - allocator: the collection of logic associated with obtaining and freeing chunks of the heap
- Mutator operations:
  - new(): obtain new storage of a given size, usually associated with a program type
  - read(src, i): access a field of an object and return the value stored there
  - write(src, i, val): modify a field in an object by writing a value into it
  - atomic: in multi-threaded scenarios, collectors require that some code sequences appear to execute atomically: all or nothing, all at once, no intermediate observations.
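The reference-counting bookkeeping above can be sketched in a few lines. This is a minimal illustration, not any particular VM's implementation; the `Obj` class and the `write` helper (mirroring the write(src, i, val) mutator operation) are invented for this sketch.

```python
class Obj:
    """A heap object: a reference count plus outgoing references."""
    def __init__(self):
        self.rc = 0
        self.fields = []

def inc_ref(obj):
    # A new reference to obj was created.
    if obj is not None:
        obj.rc += 1

def dec_ref(obj):
    # A reference to obj was destroyed. On reaching zero, "delete"
    # the object and transitively decrement everything it referenced.
    if obj is None:
        return
    obj.rc -= 1
    if obj.rc == 0:
        for child in obj.fields:
            dec_ref(child)
        obj.fields = []  # storage reclaimed (sketch)

def write(src, i, val):
    # The mutator's write operation: adjust counts when a field
    # is overwritten (inc the new value, dec the old one).
    inc_ref(val)
    dec_ref(src.fields[i])
    src.fields[i] = val
```

Note that two objects referencing each other keep each other's counts nonzero forever, which is exactly the cyclic-garbage limitation noted above.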
- Overview of collection algorithms
- Mark-sweep
  - tri-color scheme: objects are either white, gray, or black
    - white: not seen yet / not reachable
    - gray: reached, but not scanned yet
    - black: reached, completely scanned
  - marking: assigning colors to objects by tracing the graph
    - every object needs mark bits, either in-object or in a bitmap
      - bitmap design needed for uncooperative environments (conservative)
    - proceeds from the roots: globals, stack(s) of the program
    - objects reached from roots are colored gray and added to a list
    - in each round, a gray object is chosen, its outgoing references are marked, and then it is colored black
    - marking terminates when there are no more gray objects
    - marking produces a set of reachable (black) objects and unreachable (white) objects
  - sweep: collect all the white objects and make their storage available
  - mark-sweep does not move objects
    - can suffer from fragmentation
  - the entire heap doesn't need to be eagerly swept; sweeping can be lazy or incremental
    - saves the big pause while cleaning up the whole heap
- Mark-compact
  - mark phase similar or identical to that of mark-sweep
  - objects can be moved (i.e. compacted) afterward, reducing fragmentation
  - multiple compaction algorithms
    - two-finger compaction algorithm: 2 passes
      - pass 1:
        - free pointer starts at the beginning of the heap, moves toward the end
        - scan pointer starts at the end, moves backward
        - at each step, copy the object at scan into free
        - when free meets scan, terminate copying
        - use the old location of an object to store a forwarding pointer
      - pass 2: iterate over the heap, fixing all pointers
      - drawbacks: rearranges heap topology somewhat randomly, bad for locality
    - three-pass (Lisp 2) algorithm
      - pass 1: compute a forwarding address for every object and store it in a separate field
      - pass 2: update all references to all objects
      - pass 3: copy objects to their final locations
      - drawbacks: 3 passes, requires a separate field in every object
    - threaded compaction
      - turn all pointers to an object into a list of pointers by threading through the fields
      - use the header field of an object as the head of the linked list
        - the last pointer stores the info that was overwritten
      - needs two passes over the heap: the first pass threads forward pointers, the second backward pointers
      - major drawbacks: passes that modify pointers; hard to parallelize due to linearization
    - compressor algorithm
      - divide the heap into smaller (e.g. 256-byte) chunks
      - compact each chunk individually
      - store metadata to make recomputing forwarding pointers easy (from mark bits plus an offset)
      - possible to parallelize
- Semispace copying
  - separate the object space into at least two halves
  - allocate in one half only
  - one half is always completely empty during mutator execution
  - when the current space gets full, copy all live objects to the to-space
  - instead of marking with mark bits, immediately copy an object upon reaching it
  - use the to-space as a queue
- Issues for all algorithms
  - handling conservative references
  - fundamental metrics
    - overhead during mutator execution
    - throughput (overall execution time)
    - pause time (responsiveness)
    - memory consumption
  - visitation order
    - mark stack: depth-first graph traversal
      - mark stack can overflow
    - semispace: Cheney-style is breadth-first
    - hierarchical decomposition: breadth-first up to a page limit
    - Detlefs-Printezis bitmap marking: avoids mark stack overflow
  - locality
    - original allocation order is pretty good
    - breadth-first is generally not good
    - depth-first is not particularly good either
    - hierarchical decomposition and hot fields may or may not be worth it
  - number of passes over the heap
  - fragmentation
  - parallelism
- Reference counting
  - basic idea: every object gets a reference count
    - incremented when a new reference is created, decremented on destroy
    - upon reaching 0, destroy/delete the object, then decrement the counts of objects it references
  - multithreaded mutator: needs atomic operations to update reference counts
  - deferred counting: don't count (exact) references from locals (the stack) until an epoch, when all of them are counted
  - coalesced counting: save the (entire) state of an object before doing many writes, then "commit" the new state of the object after some time
- Allocation
  - key differences in a GC context
    - space is usually freed all at once
    - more information is available statically: size, type, layout of objects
    - applications tend to allocate on the heap more often
  - Sequential allocation
    - use a large block of memory
    - maintain "free" and "limit" variables
    - free reaches limit => perform GC or report out-of-memory
  - Free-list allocation
    - sequential fits: single list, always try the first block
      - split if the block at the head of the list is larger, reusing the smaller remainder
      - has a tendency to accumulate small leftover blocks near the front of the list
      - ends up utilizing space well, similar to best fit
    - next fit: single list, always try the last successful block, loop back around at the end
      - downside: objects from different phases of mutator execution get mixed up
      - poor locality while allocating, because the pointer cycles through all cells
      - allocated objects are spread out through memory
    - best fit: try to find the block that most closely matches the allocation size
      - generally has the best space utilization
      - bad worst case: the naive algorithm iterates over all blocks
    - tree-based schemes:
      - Cartesian first fit: use a "Cartesian tree" to organize the free list
        - not explained well in the book; the code doesn't compile
      - splay tree: automatically rotates the last searched-for element to the top
    - bitmapped fits:
      - an additional bit for each granule (word) of the heap
      - can use a lookup table to return the length of the longest free run of granules
      - can additionally use a run-length encoding to skip many bytes
      - advantages: compact and "on the side"
    - segregated lists: multiple lists for the same space
      - segregation by size into "size classes"
      - k size classes => k + 1 lists
        - round an allocation up to the next size class before searching
        - the last list stores blocks that are larger than the largest class
      - need to make the sizeClass(size) operation fast => typically use power-of-two size classes
      - if the size classes are fixed at compile time (VM build time), then sizeClass(K) is also a constant => allocate from a single free list
      - each size class can be exhausted independently => allocation for a smaller size can use a larger block and split it
    - Big Bag of Pages (BiBoP)
      - a large block (typically more than a page) contains only one size of objects (can have one mark/free bit per object, rather than per word)
    - Buddy allocation: recombine two adjacent blocks of size 2^k into one of size 2^(k+1)
  - considerations:
    - alignment: often want to align on word or cache-line boundaries, e.g. for doubles
    - size constraints: a compacting collector may need space for a forwarding pointer
    - boundary tag: may need to keep a tag to mark the start or end of an object and its size
    - heap parseability: may need to iterate from one end of the heap to the other, e.g. for mark-sweep
    - thunks in Haskell: overwrite the object (including the header) with its value
    - concurrent allocation: usually each thread gets its own area; only synchronize on refill
- Generational garbage collection
  - primary goal: reduce pauses and improve throughput
    - avoid collecting the entire heap
  - basic idea: partition the heap into more than 1 generation, organized by object age
    - what is age: allocation time, measured in bytes
  - write barrier dimensions
    - accuracy: filtering out unnecessary entries
    - granularity: field, object, card
    - duplicates?
  - remembered sets
    - hashsets
    - card tables
    - sequential store buffers
    - virtual memory tricks
- Parallel collection
- Concurrent collection
- Incremental collection
- Real-time collection
- Runtime interface
  - allocation
    - are objects fully initialized?
    - 1. allocate an object of the proper size and alignment
    - 2. initialize the metadata of the object needed by the runtime; not yet escaped
    - 3. initialize program-level fields; may have escaped
    - arguments
      - size of the allocation
      - alignment
      - kind of object: array or struct
      - type of the object
    - inline allocation
      - fast path: bump-pointer allocation succeeds
      - slow path: call the GC to get more memory
    - zeroing
      - often more profitable to zero in bulk
      - use demand-zero pages from the operating system
      - lazy zeroing: just ahead of the allocator
  - finding pointers
    - conservative: treat all words that look like pointers as pointers
      - range test: within the heap
      - alignment check
      - allocated check: consult an allocated bitmap
      - BDW conservative collector uses blocks and object bits
      - blacklist: whenever a non-pointer (wrongly) points to a block, never allocate that block
    - tagged values
      - pointer tagging: one or two bits to distinguish pointers vs. non-pointers
      - big bag of pages: all objects in a block or page have the same type
    - fields in objects
      - header includes the type, which indicates the object "shape": where the references are
      - the locations of pointers can be stored in an offset table (can be reordered for different tracing orders)
      - butterfly layout: pointers on one side, non-pointers on the other
      - store a bitmap
      - generate a custom tracing method
    - globals
    - registers and stack
      - heap-allocate activation frames
      - need stack-walking logic
        - one use of the frame pointer
      - stackmap information: a bitmap for every live reference
        - may be call-site specific: frame layout differs
        - one of the trickiest parts of the whole mess
        - easier if registers are caller-save
          - callee-save registers can be handled by emulation while walking
        - may need compression to save space
          - can use delta encoding
      - as constants (immediates) in code
    - interior pointers
      - table (or mark bits) that records the start of every object
      - use 2-bit tags on tagged values to record object starts
      - heap scan from the first object on a page/block
      - big bag of pages
    - derived pointers
      - p +/- i
      - the compiler emits a table of expressions to reverse the derivation
  - object tables
    - make every object an indirection!
    - moving or copying an object doesn't require fixup
    - the table can be compacted as well
  - references from external code
    - native code of the VM (e.g. C++) might reference objects
    - introduce a Handle mechanism, which holds an indirection and keeps the object alive
    - may need to pin an object (keep it from moving) while a native call (or kernel call) runs
  - stack barriers
    - avoid scanning the entire stack repeatedly
    - overwrite a return address somewhere in the frame to cause a jump into the runtime
    - more often used for deoptimization
  - GC safe points
    - some points in compiled code are not safe for GC (interior, derived pointers)
    - allocation points
    - calls to allocating routines
    - loop backedges
    - how to force a thread to stop?
      - need a "handshake" mechanism
        - synchronous or asynchronous
        - insert a safepoint poll
        - send a signal
  - collecting code
    - code may be dynamically generated (eval in JS, classloading in Java)
    - unloading or GC'ing code needs to be supported in those systems
    - best to reuse as much of the same logic as possible
  - read and write barriers
- Parallel collection
  - continue to assume "stop the world"
  - will require synchronization, which implies overhead
  - some things are inherently serial, e.g. tracing a linked list
    - tracing a tree or graph generates much more work for each object scanned
  - not all marking work is the same; some objects are really big, e.g. arrays
  - data suggests real programs tend to have pretty shallow object graphs
  - parallelizing tracing is harder; parallelizing sweeping or compaction is easier, by dividing the heap
  - load balancing
    - static: e.g. divide the heap into N regions
      - may be suboptimal because the work depends on the number of live objects inside
    - dynamic: e.g. divide into N regions containing the same amount of live objects
    - further: overdivide the work into smaller units and balance the units
      - granularity: with units too small, e.g. individual objects, the overhead is too much
    - thread-local mark stacks
      - acquire, perform, and generate work units
    - locality is important: try to have threads work on their own dedicated regions of the heap
  - Parallel marking
    - acquire an object from a worklist, add its children to the worklist
    - the worklist should be thread-local
    - marking an object twice is "OK"
    - work-stealing techniques
      - each thread, when markStackEmpty, searches other threads' mark stacks and tries to steal
      - each thread must also periodically put some work into its stealable queue for other threads
      - have a lock on each queue; give up trying to steal if the lock cannot be acquired
    - must atomically set the mark bits; try reading first
    - large objects are a problem; e.g. split them into smaller chunks so the mark stack has (address, size) pairs
  - Parallel copying
    - same setup; need per-thread copying stacks and one shared stack
    - use atomic compare-and-swap on the object header, and TLABs
    - block-structured heaps
      - over-partition the heap into small blocks (say 512 words)
      - perform copying and scanning on small blocks
      - use work-stealing techniques to load-balance across blocks
      - issues: fragmentation, poor resulting locality, and synchronization overheads
      - hierarchical decomposition helps a lot here too
    - can tune the parallel scan of the card table
  - Parallel sweeping
    - obvious heap-partitioning solution
    - lazy sweeping already parallelizes, as mutator threads do the sweeping
  - Parallel compaction
    - parallel sliding compaction is somewhat easier than copying
    - can divide the heap into regions, with sliding done by each thread
    - alternate the direction of sliding to open bigger holes in the middle
    - requires the three phases: update pointers first, then compact
- Concurrent collection
  - the collection cycle might be broken into smaller increments
  - the cycle might execute at the same time as mutator threads
  - tri-color abstraction
    - white: not yet reached
    - gray: reached, maybe partially scanned
    - black: fully scanned, finished
    - no black-to-white references!
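The tri-color abstraction can be made concrete as a worklist algorithm. Below is a minimal sequential Python sketch (the dict-based object representation and the `obj` helper are invented for illustration); the parallel versions above partition and steal from this worklist.

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

def obj(*refs):
    # An object is just a color plus its outgoing references (sketch only).
    return {"color": WHITE, "refs": list(refs)}

def mark(roots):
    # Shade the roots gray and push them onto the mark stack.
    worklist = [o for o in roots if o["color"] == WHITE]
    for o in worklist:
        o["color"] = GRAY
    # Marking terminates when no gray objects remain.
    while worklist:
        o = worklist.pop()            # depth-first via a mark stack
        for child in o["refs"]:
            if child["color"] == WHITE:
                child["color"] = GRAY  # reached, not yet scanned
                worklist.append(child)
        o["color"] = BLACK            # fully scanned
```

After `mark`, every reachable object is black and every white object is garbage; a sweep (or copy) phase reclaims the white set.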
  - collector is advancing a gray wavefront
  - lost object problem: a black-to-white reference with no remaining gray-to-white path (pg 311)
  - weak tricolor invariant (gray-protected): every white object pointed to by a black object is reachable via a chain of refs from some gray object
  - strong tricolor invariant: no black-to-white pointers
  - mutator color: treat the roots as one big object, which can be gray or black
  - Gray mutator
    - Write barriers:
      - Steele: a black-to-white write reverts the black object to gray
      - Boehm: a black-to-anything write reverts the black object to gray
      - Dijkstra: a black-to-anything write shades the destination object
  - Black mutator
    - Write barriers:
      - Abraham/Patel: a write into a white or gray object shades the previous value
    - Read barriers:
      - Baker: reading out a gray reference shades it
      - Appel: a read from a gray object shades the gray object
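As a concrete example, a Dijkstra-style insertion barrier might look like the sketch below (the dict-based object representation, color names, and `dijkstra_write` are invented for this illustration, not taken from any particular collector):

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

def dijkstra_write(src, i, val, worklist):
    # Insertion barrier: a write that would create a black-to-white
    # pointer shades the destination (the value being stored) gray,
    # so the strong tricolor invariant is preserved.
    if src["color"] == BLACK and val is not None and val["color"] == WHITE:
        val["color"] = GRAY
        worklist.append(val)   # the collector will scan it later
    src["refs"][i] = val       # the actual field write
```

A Steele-style barrier would instead revert `src` itself to gray so the collector re-scans it, trading extra re-scanning for a more precise wavefront.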