This assignment (and the compiler implementation sequence) is heavily based on assignments written by Joe Gibbs Politz for his compilers class at UCSD. We appreciate his permission to use and adapt them for this course!
In this assignment you'll implement a compiler for a small language called
Adder, that supports 32-bit integers and three operations – add1
, sub1
,
and negate
. There is no starter code for this assignment; you'll do it from
scratch based on the instructions here.
To do this assignment, you will need to install rust
and cargo
:
https://www.rust-lang.org/tools/install
You may also (depending on your system) need to install
nasm
. On OSX you can use brew install nasm
; on other
systems your package manager of choice likely has a version. On Windows you
should use Windows Subsystem for Linux (WSL)
The assignments assume that your computer can build and run x86-64-bit binaries. This is true of most (but not all) mass-market Windows and Linux laptops. Newer Macs use a different ARM architecture, but can also run legacy x86-64-bit binaries, so those are fine as well.
The first few sections of the Rust Book walk you through installing Rust, as well. We'll assume you've gone through the “Programming a Guessing Game” chapter of that book before you go on, so writing and running a Rust program isn't too weird to you.
In each of the next several assignments, we'll introduce a language that we'll implement. We'll start small, and build up features incrementally. We're starting with Adder, which has just a few features –numbers and three operations.
There are a few pieces that go into defining a language for us to compile:
The concrete syntax of Adder is:
<expr> :=
| <number>
| (add1 <expr>)
| (sub1 <expr>)
| (negate <expr>)
The abstract syntax of Adder is a Rust datatype, and corresponds nearly
one-to-one with the concrete syntax. We'll show just the parts for add1
and
sub1
in this tutorial, and leave it up to you to include negate
to get
practice.
enum Expr {
Num(i32),
Add1(Box<Expr>),
Sub1(Box<Expr>)
}
The Box
type is necessary in Rust to create recursive types like these (see
Enabling Recursive Types with
Boxes).
If you're familiar with C, it serves roughly the same role as introducing a
pointer type in a struct field to allow recursive fields in structs.
The reason this is necessary is that the Rust compiler calculates a size and
tracks the contents of each field in each variant of the enum
. Since an
Expr
could be an Add1
that contains another Add1
that contains another
Add1
... and so on, there's no way to calculate the size of an enum variant like
Add1(Expr)
(What error do you get if you try?)
Values of the Box
type always have the size of a single reference (probably
represented as a 64-bit address on the systems we're using). The address will
refer to an Expr
that has already been allocated somewhere. Box
is one of
several smart pointer types whose memory are carefully, and automatically,
memory-managed by Rust.
A ``semantics'' describes the languages' behavior without giving all of the assembly code for each instruction.
An Adder program always evaluates to a single i32.
Num(5)
should evaluate to the
integer 5
).add1
and sub1
expressions perform addition or subtraction by one on
their argument.negate
produces the result of the argument multiplied by -1
There are several examples further down to make this concrete.
Here are some examples of Adder programs:
Concrete Syntax
(add1 (sub1 5))
Abstract Syntax
Add1(Sub1(Number(5)))
Result
5
Concrete Syntax
4
Abstract Syntax
Number(4)
Result
4
Concrete Syntax
(negate (add1 3))
Abstract Syntax
Negate(Add1(Number(3)))
Result
-4
We're going to start by just compiling numbers, so we can see how all the infrastructure works. We won't give starter code for this so that you see how to build this up from scratch.
By just compiling numbers, we mean that we'll write a compiler for a language where the “program” file just contains a single number, rather than full-fledged programs with function definitions, loops, and so on. (We'll get there!)
First, make a new project with
$ cargo new adder
This creates a new directory, called adder
, set up to be a Rust project.
The main entry point is in src/main.rs
, which is where we'll develop the
compiler. There's also a file called Cargo.toml
that we'll use in a little
bit, and a few other directories related to building that we won't be too
concerned with in this assignment.
We'll start by just focusing on numbers.
It's useful to set up the goal of our compiler, which we'll come back to repeatedly in this course:
“Compiling” an expression means generating assembly instructions that evaluate it and leave the answer in the
rax
register.
Given this, before writing the compiler, it's useful to spend some time thinking about how we'll run these assembly programs we're going to generate. That is, what commands do we run at the command line in order to get from our soon-to-be-generated assembly to a running program?
We're going to use a little Rust program to kick things off. It will look like
this; you can put this into a file called runtime/start.rs
:
#[link(name = "our_code")]
extern "C" {
// The \x01 here is an undocumented feature of LLVM (which Rust uses) that ensures
// it does not add an underscore in front of the name, which happens on OSX
// Courtesy of Max New
// (https://maxsnew.com/teaching/eecs-483-fa22/hw_adder_assignment.html)
#[link_name = "\x01our_code_starts_here"]
fn our_code_starts_here() -> i64;
}
fn main() {
let i : i64 = unsafe {
our_code_starts_here()
};
println!("{i}");
}
This file says:
libour_code
that we
can load and link against (we'll make it in a few steps)our_code_starts_here
. It
takes no arguments and returns a 64-bit integer.main
of this Rust program, we will call the our_code_starts_here
function in an unsafe
block. It has to be in an unsafe
block because Rust
doesn't and cannot check that our_code_starts_here
actually takes no
arguments and returns an integer; it's trusting us, the programmer, to ensure
that, which is unsafe
from its point of view. The unsafe
block lets us do
some kinds of operations that would otherwise by compile errors in Rust.Let's next see how to build a libour_code
file out of some x86-64 assembly
that will work with this file. Here's a simple assembly program that has a
global label for our_code_starts_here
that has a “function body” that
returns the value 31
:
section .text
global our_code_starts_here
our_code_starts_here:
mov rax, 31
ret
Put this into a file called test/31.s
if you like, to test things out (you
should now have a runtime/
and a test/
directory that you created).
We can create a standalone binary program that combines these with these
commands (substitute macho64
for elf64
on OSX, see below if you have an ARM-based Mac):
$ nasm -f elf64 test/31.s -o runtime/our_code.o
$ ar rcs runtime/libour_code.a runtime/our_code.o
$ ls runtime
libour_code.a our_code.o start.rs
$ rustc -L runtime/ runtime/start.rs -o test/31.run
$ ./test/31.run
31
The first command assembles the assembly code to an object file. The basic
work there is generating the machine instructions for each assembly
instruction, and enough information about labels like our_code_starts_here
to
do later linking. The ar
command takes this object file and puts it in a
standard format for library linking used by #[link
in Rust. Then
rustc
combines that .a
file and start.rs
into a single executable binary
that we named 31.run
.
If you have a newer Mac with a M1 or M2 chip, you need to modify this slightly. Start by installing the x86 target for Rust:
$ rustup target install x86_64-apple-darwin
Then, use the following sequence of commands:
$ nasm -f macho64 test/31.s -o runtime/our_code.o
$ ar rcs runtime/libour_code.a runtime/our_code.o
$ ls runtime
libour_code.a our_code.o start.rs
$ rustc -L runtime/ runtime/start.rs -o test/31.run --target=x86_64-apple-darwin
$ ./test/31.run
31
We haven't written a compiler yet, but we do know how to go from files
containing assembly code to runnable binaries with the help of nasm
and
rustc
. Our next task is going to be writing a program that generates assembly
files like these.
Let's revisit our definition of compiling:
“Compiling” an expression means generating assembly instructions that evaluate it and leave the answer in the
rax
register.
Since, for now, our programs are going to be single expressions (in fact just
single numbers), this means that for a program like “5”, we want to generate
assembly instructions that put the constant 5
into rax
.
Let's write a Rust function that does that, with a simple main
function that
shows it working on a single hardcoded input; this goes in src/main.rs
and is
the start of our compiler:
/// Compile a source program into a string of x86-64 assembly
fn compile(program: String) -> String {
let num = program.trim().parse::<i32>().unwrap();
return format!("mov rax, {}", num);
}
fn main() {
let program = "37";
let compiled = compile(String::from(program));
println!("{}", compiled);
}
You can compile and run this with cargo run
:
$ cargo run
Compiling adder v0.1.0 (/Users/joe/src/adder)
mov rax, 37
Really all I did here was look up the documentation in Rust about converting a
string to an integer and template the number into a mov
command. The input
37
is hardcoded, and to use the output like we did above, we'd need to
copy-paste the mov
command into a larger assembly file with
our_code_starts_here
, and so on.
Here's a more sophisticated main
that takes two command-line arguments: a
source file to read and a target file to write the resulting assembly to. It
also puts the generated command into the template we designed for our generated
assembly:
use std::env;
use std::fs::File;
use std::io::prelude::*;
fn main() -> std::io::Result<()> {
let args: Vec<String> = env::args().collect();
let in_name = &args[1];
let out_name = &args[2];
let mut in_file = File::open(in_name)?;
let mut in_contents = String::new();
in_file.read_to_string(&mut in_contents)?;
let result = compile(in_contents);
let asm_program = format!("
section .text
global our_code_starts_here
our_code_starts_here:
{}
ret
", result);
let mut out_file = File::create(out_name)?;
out_file.write_all(asm_program.as_bytes())?;
Ok(())
}
Since this now expects files rather than hardcoded input, let's make a test
file in test/37.snek
that just contains 37
as contents. Then we'll read the
“program” (still just a number) from 37.snek
and store the resulting
assembly in 37.s
. (snek
is a meme-y spelling of snake, which is a theme of
the languages in this course.)
Then we can run our compiler with these command line arguments:
$ cat test/37.snek
37
$ cargo run -- test/37.snek test/37.s
$ cat test/37.s
section .text
global our_code_starts_here
our_code_starts_here:
mov rax, 37
ret
Then we can use the same sequence of commands from before to run the program:
$ nasm -f elf64 test/37.s -o runtime/our_code.o
$ ar rcs runtime/libour_code.a runtime/our_code.o
$ rustc -L runtime/ runtime/start.rs -o test/37.run
$ ./test/37.run
37
We're close to saying we've credibly built a “compiler”, in that we've taken some source program and gone all the way to a generated binary.
The next steps will be to clean up the clumsiness of running 3 post-processing
commands (nasm
, ar
, and rustc
), and then adding some nontrivial
functionality.
There are a lot of thing we could do to try and assemble and run the program,
and we'll discuss some later in the course. For now, we'll simply tidy up our
workflow by creating a Makefile that runs through the compile-assemble-link
steps for us. Put these rules into a file called Makefile
in the root of the
repository (use elf64
on Linux):
test/%.s: test/%.snek src/main.rs
cargo run -- $< test/$*.s
test/%.run: test/%.s runtime/start.rs
nasm -f macho64 test/$*.s -o runtime/our_code.o
ar rcs runtime/libour_code.a runtime/our_code.o
rustc -L runtime/ runtime/start.rs -o test/$*.run
(Note that make
requires tabs not spaces, but we can only use spaces on the
website, so please replace the four spaces indentation with tab characters when
you copy it -- or copy it from Github.)
And then you can run just make test/<file>.run
to do the build steps:
$ make test/37.run
cargo run -- test/37.snek test/37.s
Finished dev [unoptimized + debuginfo] target(s) in 0.07s
Running `target/x86_64-apple-darwin/debug/adder test/37.snek test/37.s`
nasm -f macho64 test/37.s -o runtime/our_code.o
ar rcs runtime/libour_code.a runtime/our_code.o
rustc -L runtime/ runtime/start.rs -o test/37.run
The cargo run
command will re-run if the .snek
file or the compiler
(src/main.rs
) change, and the assemble-and-link commands will re-run if the
assembly (.s
file) or the runtime (runtime/start.rs
) change.
The overall syntax for the Adder language admits many more features than just
numbers. With the definition of Adder above, we can have programs like (add1 (sub1 6))
, for example. There can be any numbers of layers of nesting of the
parentheses, which means we need to think about parsing a little bit more.
We're going to design our syntax carefully to avoid thinking too much about parsing, though. The parenthesized style of Adder is a subset of what's called s-expressions. The Scheme and Lisp family of languages are some of the more famous examples of languages built in s-expressions, but recent ones like WebAssembly also use this syntax, and it's a common choice for language development to simplify decision around syntax, which can become quite tricky and won't be our focus in this course.
A grammar for s-expressions looks something like:
s-exp := number
| symbol
| string
| ( <s-exp>* )
That is, an s-expression is either a number, symbol (think of symbol like an identifier name), string, or a parenthesized sequence of s-expressions. Here are some s-expressions:
(1 2 3)
(a (b c d) e "f" "g")
(hash-table ("a" 100) ("b" 1000) ("c" 37"))
(define (factorial n)
(if (== n 1)
1
(factorial (* n (- n 1)))))
(class Point
(int x)
(int y))
(add1 (sub1 37))
One attractive feature of s-expressions is that most programming languages have
libraries for parsing them. There are several crates available for parsing
s-expressions in Rust. You're free to pick another one if you like it, but I'm
going to use sexp because its type definitions
work pretty well with pattern-matching and I find that helpful.
(lexpr also looks interesting, but the
Value
type is really clumsy with pattern matching so it's not great for this
tutorial.)
We can add ths package to our project by adding it to Cargo.toml
, which was
created when you used cargo new
. Make it so your Cargo.toml
looks like this:
[package]
name = "adder"
version = "0.1.0"
edition = "2021"
[dependencies]
sexp = "1.1.4"
Then you can run cargo build
and you should see stuff related to the sexp
crate be downloaded.
We can then use it in our program like this:
use sexp::*;
use sexp::Atom::*;
Then, a function call like this can turn a string into a Sexp
:
let sexp = parse("(add1 (sub1 (add1 73)))").unwrap()
(As a reminder, the .unwrap()
is our way of telling Rust that we are trusting
this parsing to succeed, and we'll panic!
and stop the program if the parse
doesn't succeed. We will talk about giving better error messages in these cases
later.)
Our goal, though, is to use a datatype that we design for our expressions, which we introduced as:
enum Expr {
Num(i32),
Add1(Box<Expr>),
Sub1(Box<Expr>),
}
So we should next write a function that takes Sexp
s and turns them into
Expr
s (or gives an error if we give an s-expression that doesn't match the
grammar of Adder). Here's a function that will do the trick:
fn parse_expr(s: &Sexp) -> Expr {
match s {
Sexp::Atom(I(n)) => Expr::Num(i32::try_from(*n).unwrap()),
Sexp::List(vec) => {
match &vec[..] {
[Sexp::Atom(S(op)), e] if op == "add1" => Expr::Add1(Box::new(parse_expr(e))),
[Sexp::Atom(S(op)), e] if op == "sub1" => Expr::Sub1(Box::new(parse_expr(e))),
_ => panic!("parse error"),
}
},
_ => panic!("parse error"),
}
}
(A Rust note – the parse_expr
function takes a reference to Sexp
(the type
&Sexp
) which means parse_expr
will have read-only, borrowed access to some
Sexp
that was allocated and stored somewhere else.)
This uses Rust pattern matching to match the specific cases we care about
for Adder – plain numbers and lists of s-expressions. In the case of lists, we
match on two the two specific cases that look like add1
or sub1
followed by
some other s-expression. In those cases, we recursively parse, and use
Box::new
to match the signature we set up in enum Expr
.
So we've got a way to go from more structure text—s-expressions—stored in
files and produce our Expr
structure. Now we just need to go from the Expr
ASTs to generated assembly. Here's one way to do that:
fn compile_expr(e: &Expr) -> String {
match e {
Expr::Num(n) => format!("mov rax, {}", *n),
Expr::Add1(subexpr) => compile_expr(subexpr) + "\nadd rax, 1",
Expr::Sub1(subexpr) => compile_expr(subexpr) + "\nsub rax, 1",
}
}
And putting it all together in main
:
fn main() -> std::io::Result<()> {
let args: Vec<String> = env::args().collect();
let in_name = &args[1];
let out_name = &args[2];
let mut in_file = File::open(in_name)?;
let mut in_contents = String::new();
in_file.read_to_string(&mut in_contents)?;
let expr = parse_expr(&parse(&in_contents).unwrap());
let result = compile_expr(&expr);
let asm_program = format!("
section .text
global our_code_starts_here
our_code_starts_here:
{}
ret
", result);
let mut out_file = File::create(out_name)?;
out_file.write_all(asm_program.as_bytes())?;
Ok(())
}
Then we can write tests like this add.snek
:
$ cat test/add.snek
(sub1 (sub1 (add1 73)))
And run our whole compiler end-to-end:
$ make test/add.run
cargo run -- test/add.snek test/add.s
Finished dev [unoptimized + debuginfo] target(s) in 0.02s
Running `target/x86_64-apple-darwin/debug/adder test/add.snek test/add.s`
nasm -f macho64 test/add.s -o runtime/our_code.o
ar rcs runtime/libour_code.a runtime/our_code.o
rustc -L runtime/ runtime/start.rs -o test/add.run
$ cat test/add.s
section .text
global our_code_starts_here
our_code_starts_here:
mov rax, 73
add rax, 1
sub rax, 1
sub rax, 1
ret
$ ./test/add.run
72
This is, of course, a very simple language. This tutorial serves mainly to make us use all the pieces of infrastructure that we'll build on throughout the quarter:
nasm
) and a Rust main program (runtime/start.rs
) to build
binariesenum Expr
)parse
from the sexp
crate) and a parser for our
abstract syntax (parse_expr
)compile_expr
) that generates assembly from Expr
sMost of our future assignments will be built from just these pieces, plus extra infrastructure added as we need it.
negate
as described in the beginning, and write several
tests for negate
as well.cat
on a source snek
file, then showing make
running,
using cat
on the resulting .s
file, and then running the resulting binary.
Copy this interactino into a file called transcript.txt
Hand in a zip of your entire repository to the Homework 0 assignment on Canvas.