Lecture 2: Code Formats

- Levels of abstraction
  - High: source code with names, higher-order functions, classes, fields, templates
  - Medium: functions, records, fields, objects
  - Low: assembly, memory, registers
  - Lower: machine code, bits, memory
  - Lowerer: micro-ops, reorder buffer, caches, memory interconnects
  - Lowererer: silicon, circuits, bits, voltages, gates and flip-flops
- Recap of Virtual Machine Diagram
  - Code format
  - Code loader
    - could be text parser, or binary loader
  - Code verifier
  - Intermediate representation (IR)
  - Object model
  - Memory management
  - Garbage collector
  - Interpreter
  - JIT Compiler
  - Debugger
  - Profiler
  - Libraries
- Code formats:
  - What does the input to a VM look like?
  - Disk (or network) format: text or binary
    - Text: needs to be parsed, can use favorite parsing techniques
      - can use parser generator, or write recursive descent
    - Binary: needs to be *decoded*, lots of byte-oriented code
      - magic numbers to identify files
      - headers, sections, metadata, code
      - often allows skipping over sections with offsets / sizes
      - no good parser-generator tools for binary
      - sometimes comes with a checksum
  - Concepts: functions, classes, bytecode, data, metadata, names
    - inside functions: instructions, control flow, data flow
  - Example binary formats:
    - ELF: machine code format for Unix OSes
      - Linux, BSD, Solaris, AIX
      - can be big- or little-endian
      - sections for code, data, or symbols
      - processed by kernel and user-space linker
      - undefined symbols are resolved by the linker
    - Mach-O: machine code format for macOS
      - similar to ELF, but different
      - can have code for multiple architectures
    - COFF/EXE: machine code format for Windows
    - JVM: .class file format
      - https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html
      - single source file produces multiple class files
      - multiple .class files can be grouped into a JAR
      - names significant for execution and linking
      - .class files are linked with each other using names
      - weirdly big-endian integers
      - constant pool: strings, longs, doubles
      - class definition: methods, fields, superclass, interfaces
      - no support for nested (i.e. inner) classes
      - code bundled into method definitions
      - stack machine for local values
    - Android: Dalvik DEX format
      - a better classfile format for JVM code
      - all classes compiled into a single DEX file
      - names significant for execution and linking
      - supports either big-endian or little-endian
      - register machine for local values
    - CIL, CLR: .NET assemblies
      - contain classes, interfaces, resources (such as JPEGs)
      - https://docs.microsoft.com/en-us/dotnet/standard/assembly/
    - WebAssembly: .wasm format
      - single .wasm file contains many functions, globals
      - no concept of classes, objects
      - names not significant to runtime semantics
      - list of sections that must appear in order
      - imports are explicitly listed rather than relying on undefined names
      - variable-length integers for all quantities
      - stack machine
    - Others:
      - .swf (Flash)
      - BPF, eBPF
      - CLISP
      - .pyc
      - Ethereum Virtual Machine (EVM)
      - LLVM IR bitcode format
      - Lua
      - OCaml
- What makes a good code format?
  - compact: dense encoding, eliminate redundancy, compressible
  - fast to load: simpl(er) than text, can skip, not bitpacked
  - fast to link: names and types, if necessary, readily available
  - fast to verify: simple type system
  - fast to execute: low-overhead operations, storage organized for speed
- What makes a good bytecode design?
  - Key design choice: register machine, stack machine, or accumulator?
    - all code formats have some basic operations on values
    - but where are the values actually stored?
    - how do instructions operate on values?
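Sketch (not from the lecture): one way to make the "where are the values stored?" question concrete is to evaluate the same expression, (a + b) * c, on a tiny stack machine and on a tiny register machine. The opcode names, the tuple encoding, and the helpers `run_stack` and `run_register` are invented for illustration and do not correspond to any real bytecode.

```python
# Hypothetical mini-machines; every opcode here is made up for this sketch.

def run_stack(code, local_vars):
    """Stack machine: operands are implicit (the top of the operand stack)."""
    stack = []
    for instr in code:
        op = instr[0]
        if op == "load":            # push a local variable onto the stack
            stack.append(local_vars[instr[1]])
        elif op == "add":           # pop two values, push their sum
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs + rhs)
        elif op == "mul":           # pop two values, push their product
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs * rhs)
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack.pop()


def run_register(code, regs):
    """Register machine: every instruction names its operands explicitly."""
    regs = list(regs)               # work on a copy of the register file
    for op, dst, lhs, rhs in code:
        if op == "add":
            regs[dst] = regs[lhs] + regs[rhs]
        elif op == "mul":
            regs[dst] = regs[lhs] * regs[rhs]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return regs


# (a + b) * c with a = 2, b = 3, c = 4
stack_code = [("load", 0), ("load", 1), ("add",), ("load", 2), ("mul",)]
register_code = [("add", 3, 0, 1), ("mul", 3, 3, 2)]   # r3 is a scratch register

stack_result = run_stack(stack_code, [2, 3, 4])
register_result = run_register(register_code, [2, 3, 4, 0])[3]
print(stack_result, register_result)
```

In this sketch the stack version takes five short instructions with implicit operands, while the register version takes two wider instructions that each name three registers.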
  - Semantics
    - a code format is like a programming language
    - contains concepts for types, data, and code
    - the meaning and, more importantly, the behavior of these concepts are the *semantics*
    - semantics should fully specify the behavior of a machine for the input program
    - "wiggle room" in the implementation can lead to undesirable effects
      - non-portability: program does different things on different impl/hardware
      - hard to debug
      - security vulnerabilities: program can be "owned" by malicious inputs
    - memory safety (runtime type safety), strong typing
      - cannot write "out of bounds" and trash parts of the program, or the VM
      - prevents privilege escalation
      - key for building layers of trust; all security above would be compromised otherwise
  - Values
  - Operations
  - Control flow
  - Storage locations
  - Types
  - Functions
  - Data structures
    - tuples
    - arrays
    - structs
    - classes
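Sketch (an assumption, not from the lecture): the "cannot write out of bounds" point can be read as a rule that every guest memory access is checked, so each access either succeeds or traps in a defined way. The `LinearMemory` class, the `Trap` exception, and the 32-bit little-endian layout below are hypothetical choices made for this example only.

```python
class Trap(Exception):
    """Raised when the guest program performs an illegal operation."""


class LinearMemory:
    """Hypothetical guest memory: a fixed-size byte array owned by the VM."""

    def __init__(self, size):
        self._bytes = bytearray(size)

    def _check(self, addr, length):
        # Validate before touching memory, so a bad access has exactly one
        # specified outcome (a trap) and never corrupts the VM's own state.
        if addr < 0 or length < 0 or addr + length > len(self._bytes):
            raise Trap(f"out-of-bounds access at {addr} (+{length})")

    def store32(self, addr, value):
        self._check(addr, 4)
        self._bytes[addr:addr + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")

    def load32(self, addr):
        self._check(addr, 4)
        return int.from_bytes(self._bytes[addr:addr + 4], "little")


mem = LinearMemory(16)
mem.store32(12, 0xDEADBEEF)     # last valid 4-byte slot in a 16-byte memory
in_bounds = mem.load32(12)

try:
    mem.store32(13, 1)          # bytes 13..16 would spill past the end
    trapped = False
except Trap:
    trapped = True
```

Because the check runs before the write, the rejected store leaves the memory contents unchanged.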