Lectures 15+ - Garbage Collection
- Most safe languages with objects and/or closures need a VM with memory management
  - Manual memory management is generally *unsafe*--leads to use-after-free bugs
  - Some form of automatic memory management is inevitably needed
  - Java, C#, JavaScript, Smalltalk, Lisp, ML, Haskell, Python, Ruby, Virgil, almost everything: garbage collection
  - C/C++: manual, no safety guaranteed by the runtime system
  - Swift: automatic reference counting
  - Rust: ownership and borrowing system, plus reference counts
- Two fundamental concepts in automatic memory management:
  - reference counting: associate a count with every "object"
    - manage the count on ref'ing and unref'ing
    - delete on reaching zero; may transitively delete
  - tracing: find live objects by traversing references at runtime
  - most garbage collection algorithms are actually a mix of both
    - "A Unified Theory of Garbage Collection" by Bacon et al.
  - reference counting cannot deal with cyclic garbage
  - tracing may repeatedly touch live objects, wasting runtime
- Terminology
  - heap: the storage for objects
    - a set of one or more contiguous regions of memory
    - usually organized as blocks of "words" (sometimes called "cells")
    - may have alignment restrictions (e.g. a word or two words, as we've seen)
  - mutator: the program that changes the heap
    - may have one or more threads of execution
    - allocates new objects and changes them, changing the object graph
  - roots: global variables, the stack, and other references that are not in objects
  - collector: one or more threads that execute garbage-collector logic, determining unreachable objects and reclaiming their storage
  - liveness: an object is live if it will be accessed in the future execution of the mutator => liveness is an *undecidable* problem
  - reachability: an object is reachable if it can be reached by transitively following references from the roots
    - an approximation of liveness; a (safe) program cannot access objects that it has no references to
  - allocator: the collection of logic associated with obtaining and freeing chunks of the heap
- Mutator operations:
  - new(): obtain new storage of a given size, usually associated with a program type
  - read(src, i): access a field of an object and return the value stored there
  - write(src, i, val): modify a field in an object by writing a value into it
  - atomic: in multi-threaded scenarios, collectors require that some code sequences appear to execute atomically: all or nothing, all at once, no intermediate observations.
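The reference-counting bookkeeping above can be sketched in a few lines. This is a minimal illustration, not any particular VM's implementation; the `Obj` class and the `write` helper (mirroring the write(src, i, val) mutator operation) are invented for this sketch.

```python
class Obj:
    """A heap object: a reference count plus outgoing references."""
    def __init__(self):
        self.rc = 0
        self.fields = []

def inc_ref(obj):
    # A new reference to obj was created.
    if obj is not None:
        obj.rc += 1

def dec_ref(obj):
    # A reference to obj was destroyed. On reaching zero, "delete"
    # the object and transitively decrement everything it referenced.
    if obj is None:
        return
    obj.rc -= 1
    if obj.rc == 0:
        for child in obj.fields:
            dec_ref(child)
        obj.fields = []  # storage reclaimed (sketch)

def write(src, i, val):
    # The mutator's write operation: adjust counts when a field
    # is overwritten (inc the new value, dec the old one).
    inc_ref(val)
    dec_ref(src.fields[i])
    src.fields[i] = val
```

Note that two objects referencing each other keep each other's counts nonzero forever, which is exactly the cyclic-garbage limitation noted above.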
- Overview of collection algorithms
- Mark-sweep
  - tri-color scheme: objects are either white, gray, or black
    - white: not seen yet / not reachable
    - gray: reached, but not scanned yet
    - black: reached, completely scanned
  - marking: assigning colors to objects by tracing the graph
    - every object needs mark bits, either in-object or in a bitmap
      - bitmap design needed for uncooperative environments (conservative)
    - proceeds from the roots: globals, stack(s) of the program
    - objects reached from roots are colored gray and added to a list
    - in each round, a gray object is chosen, its outgoing references are marked, and then it is colored black
    - marking terminates when there are no more gray objects
    - marking produces a set of reachable (black) objects and unreachable (white) objects
  - sweep: collect all the white objects and make their storage available
  - mark-sweep does not move objects
    - can suffer from fragmentation
  - the entire heap doesn't need to be eagerly swept; sweeping can be lazy or incremental
    - saves the big pause while cleaning up the whole heap
- Mark-compact
  - mark phase similar or identical to that of mark-sweep
  - objects can be moved (i.e. compacted) afterward, reducing fragmentation
  - multiple compaction algorithms
    - two-finger compaction algorithm: 2 passes
      - pass 1:
        - free pointer starts at the beginning of the heap, moves toward the end
        - scan pointer starts at the end, moves backward
        - at each step, copy the object at scan into free
        - when free meets scan, terminate copying
        - use the old location of an object to store a forwarding pointer
      - pass 2: iterate over the heap, fixing all pointers
      - drawbacks: rearranges heap topology somewhat randomly, bad for locality
    - three-pass (Lisp 2) algorithm
      - pass 1: compute a forwarding address for every object and store it in a separate field
      - pass 2: update all references to all objects
      - pass 3: copy objects to their final locations
      - drawbacks: 3 passes, requires a separate field in every object
    - threaded compaction
      - turn all pointers to an object into a list of pointers by threading through the fields
      - use the header field of an object as the head of the linked list
        - the last pointer stores the info that was overwritten
      - needs two passes over the heap: the first pass threads forward pointers, the second backward pointers
      - major drawbacks: passes that modify pointers; hard to parallelize due to linearization
    - compressor algorithm
      - divide the heap into smaller (e.g. 256-byte) chunks
      - compact each chunk individually
      - store metadata to make recomputing forwarding pointers easy (from mark bits plus an offset)
      - possible to parallelize
- Semispace copying
  - separate the object space into at least two halves
  - allocate in one half only
  - one half is always completely empty during mutator execution
  - when the current space gets full, copy all live objects to the to-space
  - instead of marking with mark bits, immediately copy an object upon reaching it
  - use the to-space as a queue
- Issues for all algorithms
  - handling conservative references
  - fundamental metrics
    - overhead during mutator execution
    - throughput (overall execution time)
    - pause time (responsiveness)
    - memory consumption
  - visitation order
    - mark stack: depth-first graph traversal
      - mark stack can overflow
    - semispace: Cheney-style is breadth-first
    - hierarchical decomposition: breadth-first up to a page limit
    - Detlefs-Printezis bitmap marking: avoids mark stack overflow
  - locality
    - original allocation order is pretty good
    - breadth-first is generally not good
    - depth-first is not particularly good either
    - hierarchical decomposition and hot fields may or may not be worth it
  - number of passes over the heap
  - fragmentation
  - parallelism
- Reference counting
  - basic idea: every object gets a reference count
    - incremented when a new reference is created, decremented on destroy
    - upon reaching 0, destroy/delete the object, then decrement the counts of objects it references
  - multithreaded mutator: needs atomic operations to update reference counts
  - deferred counting: don't count (exact) references from locals (the stack) until an epoch, when all of them are counted
  - coalesced counting: save the (entire) state of an object before doing many writes, then "commit" the new state of the object after some time
- Allocation
  - key differences in a GC context
    - space is usually freed all at once
    - more information is available statically: size, type, layout of objects
    - applications tend to allocate on the heap more often
  - Sequential allocation
    - use a large block of memory
    - maintain "free" and "limit" variables
    - free reaches limit => perform GC or report out-of-memory
  - Free-list allocation
    - sequential fits: single list, always try the first block
      - split if the block at the head of the list is larger, reusing the smaller remainder
      - has a tendency to accumulate small leftover blocks near the front of the list
      - ends up utilizing space well, similar to best fit
    - next fit: single list, always try the last successful block, loop back around at the end
      - downside: objects from different phases of mutator execution get mixed up
      - poor locality while allocating, because the pointer cycles through all cells
      - allocated objects are spread out through memory
    - best fit: try to find the block that most closely matches the allocation size
      - generally has the best space utilization
      - bad worst case: the naive algorithm iterates over all blocks
    - tree-based schemes:
      - Cartesian first fit: use a "Cartesian tree" to organize the free list
        - not explained well in the book; the code doesn't compile
      - splay tree: automatically rotates the last searched-for element to the top
    - bitmapped fits:
      - an additional bit for each granule (word) of the heap
      - can use a lookup table to return the length of the longest free run of granules
      - can additionally use a run-length encoding to skip many bytes
      - advantages: compact and "on the side"
    - segregated lists: multiple lists for the same space
      - segregation by size into "size classes"
      - k size classes => k + 1 lists
        - round an allocation up to the next size class before searching
        - the last list stores blocks that are larger than the largest class
      - need to make the sizeClass(size) operation fast => typically use power-of-two size classes
      - if the size classes are fixed at compile time (VM build time), then sizeClass(K) is also a constant => allocate from a single free list
      - each size class can be exhausted independently => allocation for a smaller size can use a larger block and split it
    - Big Bag of Pages (BiBoP)
      - a large block (typically more than a page) contains only one size of objects (can have one mark/free bit per object, rather than per word)
    - Buddy allocation: recombine two adjacent blocks of size 2^k into one of size 2^(k+1)
  - considerations:
    - alignment: often want to align on word or cache-line boundaries, e.g. for doubles
    - size constraints: a compacting collector may need space for a forwarding pointer
    - boundary tag: may need to keep a tag to mark the start or end of an object and its size
    - heap parseability: may need to iterate from one end of the heap to the other, e.g. for mark-sweep
    - thunks in Haskell: overwrite the object (including the header) with its value
    - concurrent allocation: usually each thread gets its own area; only synchronize on refill
- Generational garbage collection
  - primary goal: reduce pauses and improve throughput
    - avoid collecting the entire heap
  - basic idea: partition the heap into more than 1 generation, organized by object age
    - what is age: allocation time, measured in bytes
  - write barrier dimensions
    - accuracy: filtering out unnecessary entries
    - granularity: field, object, card
    - duplicates?
  - remembered sets
    - hashsets
    - card tables
    - sequential store buffers
    - virtual memory tricks
- Parallel collection
- Concurrent collection
- Incremental collection
- Real-time collection
- Runtime interface
  - allocation
    - are objects fully initialized?
    - 1. allocate an object of the proper size and alignment
    - 2. initialize the metadata of the object needed by the runtime; not yet escaped
    - 3. initialize program-level fields; may have escaped
    - arguments
      - size of the allocation
      - alignment
      - kind of object: array or struct
      - type of the object
    - inline allocation
      - fast path: bump-pointer allocation succeeds
      - slow path: call the GC to get more memory
    - zeroing
      - often more profitable to zero in bulk
      - use demand-zero pages from the operating system
      - lazy zeroing: just ahead of the allocator
  - finding pointers
    - conservative: treat all words that look like pointers as pointers
      - range test: within the heap
      - alignment check
      - allocated check: consult an allocated bitmap
      - BDW conservative collector uses blocks and object bits
      - blacklist: whenever a non-pointer (wrongly) points to a block, never allocate that block
    - tagged values
      - pointer tagging: one or two bits to distinguish pointers vs. non-pointers
      - big bag of pages: all objects in a block or page have the same type
    - fields in objects
      - header includes the type, which indicates the object "shape": where the references are
      - the locations of pointers can be stored in an offset table (can be reordered for different tracing orders)
      - butterfly layout: pointers on one side, non-pointers on the other
      - store a bitmap
      - generate a custom tracing method
    - globals
    - registers and stack
      - heap-allocate activation frames
      - need stack-walking logic
        - one use of the frame pointer
      - stackmap information: a bitmap for every live reference
        - may be call-site specific: frame layout differs
        - one of the trickiest parts of the whole mess
        - easier if registers are caller-save
          - callee-save registers can be handled by emulation while walking
        - may need compression to save space
          - can use delta encoding
      - as constants (immediates) in code
    - interior pointers
      - table (or mark bits) that records the start of every object
      - use 2-bit tags on tagged values to record object starts
      - heap scan from the first object on a page/block
      - big bag of pages
    - derived pointers
      - p +/- i
      - the compiler emits a table of expressions to reverse the derivation
  - object tables
    - make every object an indirection!
    - moving or copying an object doesn't require fixup
    - the table can be compacted as well
  - references from external code
    - native code of the VM (e.g. C++) might reference objects
    - introduce a Handle mechanism, which holds an indirection and keeps the object alive
    - may need to pin an object (keep it from moving) while a native call (or kernel call) runs
  - stack barriers
    - avoid scanning the entire stack repeatedly
    - overwrite a return address somewhere in the frame to cause a jump into the runtime
    - more often used for deoptimization
  - GC safe points
    - some points in compiled code are not safe for GC (interior, derived pointers)
    - allocation points
    - calls to allocating routines
    - loop backedges
    - how to force a thread to stop?
      - need a "handshake" mechanism
        - synchronous or asynchronous
        - insert a safepoint poll
        - send a signal
  - collecting code
    - code may be dynamically generated (eval in JS, classloading in Java)
    - unloading or GC'ing code needs to be supported in those systems
    - best to reuse as much of the same logic as possible
  - read and write barriers
- Parallel collection
  - continue to assume "stop the world"
  - will require synchronization, which implies overhead
  - some things are inherently serial, e.g. tracing a linked list
    - tracing a tree or graph generates much more work for each object scanned
  - not all marking work is the same; some objects are really big, e.g. arrays
  - data suggests real programs tend to have pretty shallow object graphs
  - parallelizing tracing is harder; parallelizing sweeping or compaction is easier, by dividing the heap
  - load balancing
    - static: e.g. divide the heap into N regions
      - may be suboptimal because the work depends on the number of live objects inside
    - dynamic: e.g. divide into N regions containing the same amount of live objects
    - further: overdivide the work into smaller units and balance the units
      - granularity: with units too small, e.g. individual objects, the overhead is too much
    - thread-local mark stacks
      - acquire, perform, and generate work units
    - locality is important: try to have threads work on their own dedicated regions of the heap
  - Parallel marking
    - acquire an object from a worklist, add its children to the worklist
    - the worklist should be thread-local
    - marking an object twice is "OK"
    - work-stealing techniques
      - each thread, when markStackEmpty, searches other threads' mark stacks and tries to steal
      - each thread must also periodically put some work into its stealable queue for other threads
      - have a lock on each queue; give up trying to steal if the lock cannot be acquired
    - must atomically set the mark bits; try reading first
    - large objects are a problem; e.g. split them into smaller chunks so the mark stack has (address, size) pairs
  - Parallel copying
    - same setup; need per-thread copying stacks and one shared stack
    - use atomic compare-and-swap on the object header, and TLABs
    - block-structured heaps
      - over-partition the heap into small blocks (say 512 words)
      - perform copying and scanning on small blocks
      - use work-stealing techniques to load-balance across blocks
      - issues: fragmentation, poor resulting locality, and synchronization overheads
      - hierarchical decomposition helps a lot here too
    - can tune the parallel scan of the card table
  - Parallel sweeping
    - obvious heap-partitioning solution
    - lazy sweeping already parallelizes, as mutator threads do the sweeping
  - Parallel compaction
    - parallel sliding compaction is somewhat easier than copying
    - can divide the heap into regions, with sliding done by each thread
    - alternate the direction of sliding to open bigger holes in the middle
    - requires the three phases: update pointers first, then compact
- Concurrent collection
  - the collection cycle might be broken into smaller increments
  - the cycle might execute at the same time as mutator threads
  - tri-color abstraction
    - white: not yet reached
    - gray: reached, maybe partially scanned
    - black: fully scanned, finished
    - no black-to-white references!
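The tri-color abstraction can be made concrete as a worklist algorithm. Below is a minimal sequential Python sketch (the dict-based object representation and the `obj` helper are invented for illustration); the parallel versions above partition and steal from this worklist.

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

def obj(*refs):
    # An object is just a color plus its outgoing references (sketch only).
    return {"color": WHITE, "refs": list(refs)}

def mark(roots):
    # Shade the roots gray and push them onto the mark stack.
    worklist = [o for o in roots if o["color"] == WHITE]
    for o in worklist:
        o["color"] = GRAY
    # Marking terminates when no gray objects remain.
    while worklist:
        o = worklist.pop()            # depth-first via a mark stack
        for child in o["refs"]:
            if child["color"] == WHITE:
                child["color"] = GRAY  # reached, not yet scanned
                worklist.append(child)
        o["color"] = BLACK            # fully scanned
```

After `mark`, every reachable object is black and every white object is garbage; a sweep (or copy) phase reclaims the white set.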
  - collector is advancing a gray wavefront
  - lost object problem: a black-to-white reference with no remaining gray-to-white path (pg 311)
  - weak tricolor invariant (gray-protected): every white object pointed to by a black object is reachable via a chain of refs from some gray object
  - strong tricolor invariant: no black-to-white pointers
  - mutator color: treat the roots as one big object, which can be gray or black
  - Gray mutator
    - Write barriers:
      - Steele: a black-to-white write reverts the black object to gray
      - Boehm: a black-to-anything write reverts the black object to gray
      - Dijkstra: a black-to-anything write shades the destination object
  - Black mutator
    - Write barriers:
      - Abraham/Patel: a write into a white or gray object shades the previous value
    - Read barriers:
      - Baker: reading out a gray reference shades it
      - Appel: a read from a gray object shades the gray object
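As a concrete example, a Dijkstra-style insertion barrier might look like the sketch below (the dict-based object representation, color names, and `dijkstra_write` are invented for this illustration, not taken from any particular collector):

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

def dijkstra_write(src, i, val, worklist):
    # Insertion barrier: a write that would create a black-to-white
    # pointer shades the destination (the value being stored) gray,
    # so the strong tricolor invariant is preserved.
    if src["color"] == BLACK and val is not None and val["color"] == WHITE:
        val["color"] = GRAY
        worklist.append(val)   # the collector will scan it later
    src["refs"][i] = val       # the actual field write
```

A Steele-style barrier would instead revert `src` itself to gray so the collector re-scans it, trading extra re-scanning for a more precise wavefront.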