Project #3: SL Parser

15-212, Spring 1998

This assignment is due before 5:00pm on Friday, 20 Mar 1998. Please note that this is a firm deadline; no late submissions will be accepted. This assignment is arguably the most intricate (but also the most interesting) of the three SL assignments. Get started early!

Follow the instruction on the course web page for submitting your solution. Submit one solution per group. All your source files should be in the same directory. Do not use subdirectories. To submit your solution, you should copy just your source files to the submission directory. In particular, do not carry out your program development in the submission area.

Points will be given for good documentation and style. See the coding style document on the course web page. For this assignment, there'll be a 80-20 split for correctness and style, respectively, and there will be another 40 points allocated for the written answers from part 1, for a total of 140 points.

See the course web page on collaboration and late submission policies. Make sure your working directory is private and cannot be read by others!

Overview of the Assignment:

You already know, from homework 4, that given a context-free grammar G and a string T, a parser can do the following:

Decide whether G can generate T (i.e. "T is in the grammar G").
Produce a parse tree of T, which shows how to rewrite the start symbol S in G to produce T.

Your job will be to construct a parser to do both of the above tasks. Your program will take as its input T programs in SL; if T is in G, your program will produce a parse tree for T.

We will grade your program in two ways. First, we will feed it certain SL programs--some valid, some not--and check that your program outputs a valid parse for syntactically valid programs, and a meaningful error message for invalid programs. Second, we will plug your parser code into a working SL interpreter, replacing our interpreter's existing (correct) parser with yours; we'll then run the new interpreter on some SL code to see if it still runs properly. To pass the second test, your code will have to adhere tightly to the specs below.

1. The Grammar for SL:

Let us first examine the context-free grammar (SIGMA, V, S, R) for SL:

SIGMA = { PROGRAM, FUNCTION, PARAMLIST, PARAMREST, BLOCK, STATEMENT, BREAK, CALL, ARGLIST, ARGREST, IF, ELSE, LET, READ, RETURN, WHILE, WRITE, EXPR, BINOP, UNOP, number, identifier, function, break, call, if, else, let, read, return, while, write, (, ), {, }, ,, ;, =, +, -, *, /, %, <, >, <=, >=, ==, !=, &, |, ~, !}

V = { number, identifier, function, break, call, if, else, let, read, return, while, write, (, ), {, }, ,, ;, =, +, -, *, /, %, <, >, <=, >=, ==, !=, &, |, ~, !}

S = PROGRAM

R is given below:

PROGRAM -> FUNCTION* BLOCK
FUNCTION -> function identifier ( PARAMLIST ) BLOCK
PARAMLIST -> identifier PARAMREST | e
PARAMREST -> , identifier PARAMREST | e
BLOCK -> { STATEMENT* }
STATEMENT -> BREAK | CALL ; | IF | LET | READ | RETURN | WHILE | WRITE
BREAK -> break ;
CALL -> call identifier ( ARGLIST )
ARGLIST -> EXPR ARGREST | e
ARGREST -> , EXPR ARGREST | e
IF -> if EXPR BLOCK ELSE
ELSE -> else BLOCK | e
LET -> let identifier = EXPR ; | let identifier = CALL ;
READ -> read identifier ;
RETURN -> return EXPR ;
WHILE -> while EXPR BLOCK
WRITE -> write EXPR ;
EXPR -> number | identifier | ( EXPR ) | ( UNOP EXPR ) | ( BINOP EXPR EXPR )
BINOP -> + | - | * | / | % | & | | | < | > | <= | >= | == | !=
UNOP -> ~ | !

As always, e is epsilon, the empty string. Also, note that an SL program is a sequence of function declarations followed by the main body.

This description describes the syntax (how to write valid programs) of SL, but not the semantics (what a syntactically valid program actually does). We'll get to the semantics in the next assignment, but drawing on your knowledge of Java, the semantics of SL should be fairly obvious just from the grammar.

Your first task is to answer the following problems based on the grammar above. Put your answers in a text file called proj3-grammar.txt . Just one write-up per group is needed. See the note at the very end of this document concerning the header information you need at the top of this file.

Write the grammatically correct SL expressions for the following Java arithmetic expressions (watch your operator precedence for some of these!), and also give their final values. You may want to run it through Java to double-check. Assume that a==1, b==2, c==3, d==4, e==0.
- 1+2+3+4+5
- 1*(2+3)-4/5
- -1+2+-3+4+-5
- a && (!b || c) || !d && e
- (a==1 && b==-2 || c!=-3) && !d!=!-e
- a>=1 || b<=c || d>3 || e%5<4
Mr. Bar is confused about the difference between Java grammar and SL grammar. Locate everything that is wrong with his not-quite-SL program below, and for each: 1) indicate which SL grammar rule(s) it disobeys; and 2) rewrite it correctly in SL.
Extra credit (1 pt): What is the final output of this program?
(Note: This question is actually a parser question, not a grammar question, so you may want to wait until you're writing the parser to answer this one.)
- A) As we've noted before, a parser is really just a push-down automaton (DPDA). Briefly demonstrate this with regard to the top-down parser that you are now writing by answering these questions. In the parser:
  - What corresponds to the DPDA states?
  - What corresponds to the push-down store?
  - What corresponds to the input tape?
  - What is in each square of the input tape?
- B) The one lexeme lookahead feature that we use in our parser may strike you as being bogus, because in DPDAs, you can't push an already read input back into the input stream. Explain briefly why this is actually acceptable; i.e., explain how you would simulate this lookahead feature using a DPDA.

2. Patching Up the Lexer

Some of the keywords in the grammar above should be unfamiliar to you, and you'll notice that one of the keywords you implemented in Project #2 is no longer used. The first part of this assignment is to make some modifications to your lexer to accept this revised SL specification. Assuming that you adhered closely to OOP principles while writing your lexer, making these changes will be trivial.

Things to do to the lexer (in addition to whatever else you might find necessary):

Remove declare as a keyword.
Add call and function, which will allow you to make function calls.
Add let, which will allow you to make variable assignments.
Add "&", which will be a logical and operator (like && in C).
Add "|", which will be a logical or operator (like || in C).
Add "!", which will be the logical not operator.
Add "~", which is the unary negation operator.
Add a method called void putBack(Token t), which puts the token t back into the stream, so to speak. Notice that this backing-up capability is only one token deep; in other words, if you do multiple putBack() calls, only the most recent one matters. For example:

3. Setting Up the Parser

You will probably want to implement a top-down parser for this assignment. If you really want to implement a bottom-up parser, feel free to do so, but be warned that the TAs may not be able to help you as effectively if you get stuck. The top-down parser is also more intuitive and visually similar to the grammar.

First, you must create a Parser class, the main class for this assignment:

public class Parser { public Parser(InputStream s) { /* to be filled in by you */ } public void parse() { /* t.b.f.i.b.y. */ } public void print() { /* t.b.f.i.b.y. */ } public void interpret() { /* t.b.f.i.b.y. */ } private ParseNode tree; private SymbolTable symtbl; /* add other stuff as you need them */ }

This is the primary class for this assignment. When you create a Parser object, you pass it a stream that it needs to parse. The parser should create for itself a lexer so that it can chop up the stream into tokens. When parse() is called, the parser actually performs the parsing of the stream, then holds onto the root of the resulting parse tree as tree. At the end of the parsing, the lexer will also have completely filled out a symbol table. The parser needs access to that table, and saves it as symtbl. If your implementation has the symbol table separate from the lexer, this is obviously easy to do. If your implementation has the symbol table as a member object inside the lexer class, you'll need to add a method to your lexer that allows the parser to get a copy of the table.

print() simply prints out the parse tree in a readable text format. The tree should be printed out using pre-order, depth-first traversal. (The criterion for readability is that someone who is familiar with this project -- e.g., a TA -- can stare at the output and deduce what your printout means without having to ask you for an explanation.) This method will also, without a doubt, be your friend in debugging.

interpret() begins executing the parse tree; it assumes that tree is fully parsed and symtbl is filled in properly. You will concern yourself with interpret() in the next assignment.

So far so good. The only tricky part seems to be parse(). Before we get into that, however, we need to look at this mysterious class called ParseNode. Notice that tree is of that type, so perhaps you have already guessed that ParseNode is what the parse tree is made of; they are in fact the nodes of the parse tree, each containing pointers to children nodes. Let's look at it a bit more closely:

public abstract class ParseNode implements GrammarTypes { public ParseNode() {...} public int numChildren() {...} public ParseNode getChild(int i) {...} public void addChild(ParseNode node) {...} public abstract int getType(); public abstract ParseNode interpret(SymbolTable st) throws InterpretException, BreakException, ReturnException; public abstract void print(PrintStream ps); /* define other stuff as you need them */ }

public interface GrammarTypes { public final static int G_NOP = 0; public final static int G_NUMBER = 1; public final static int G_IDENT = 2; public final static int G_PROGRAM = 3; public final static int G_FUNCTION = 4; public final static int G_BLOCK = 5; public final static int G_BREAK = 6; public final static int G_CALL = 7; public final static int G_IF = 8; public final static int G_LET = 9; public final static int G_READ = 10; public final static int G_RETURN = 11; public final static int G_WHILE = 12; public final static int G_WRITE = 13; public final static int G_EXPR = 14; // you may NOT modify this interface!!! }

Whew! Lots of stuff here. First, notice that ParseNode is an abstract class, so you can't instantiate it. What you will do is create many subclasses of ParseNode (more details below), but all of them will have at least the stuff defined above. Remember that each node maintains a list of pointers to its children. numChildren() returns the number of children that this node has. getChild(i) returns the ith child of this node. addChild(node) appends node to the list of children of this node. getType() returns the type of the node, as defined by one of the constants in GrammarTypes (again, see below for details). print() prints out the contents of the node in a readable format; this will be useful for debugging. (Hint: if you implement these print() methods recursively, you'll be able to greatly simplify writing Parser.print().)

Notice that you'll need some data structure to store an ordered list of children. (Please note that it must be ordered; in other words, something like a Vector.) Once you have this, you can implement the necessary subclasses of ParseNode. There is a subclass of each node type defined in GrammarTypes, for a total of 15 subclasses (each in a separate file, of course):

public class SL_NopNode extends ParseNode {...}
public class SL_NumberNode extends ParseNode {...}
public class SL_IdentNode extends ParseNode {...}
public class SL_ProgramNode extends ParseNode {...}
public class SL_FunctionNode extends ParseNode {...}
public class SL_BlockNode extends ParseNode {...}
public class SL_BreakNode extends ParseNode {...}
public class SL_CallNode extends ParseNode {...}
public class SL_IfNode extends ParseNode {...}
public class SL_LetNode extends ParseNode {...}
public class SL_ReadNode extends ParseNode {...}
public class SL_ReturnNode extends ParseNode {...}
public class SL_WhileNode extends ParseNode {...}
public class SL_WriteNode extends ParseNode {...}
public class SL_ExprNode extends ParseNode {...}

The getType() and interpret() methods for each of these will be simple. For example:

public class SL_NopNode extends ParseNode { public int getType() { return G_NOP; } public ParseNode interpret(SymbolTable st) throws InterpretException, BreakException, ReturnException { return null; // to be implemented correctly in project #4 } }

So writing these classes will take a lot of cutting and pasting, but it'll be pretty simple. Here are a few twists, though, to keep things interesting:

To SL_NumberNode, add a constructor that allows you to set what number or value this node represents. Also add a getValue() method which returns the number. (Remember that the value is an arbitrary precision integer, which can be implemented using the BigInteger class in java.math.)
To SL_IdentNode, add a constructor that allows you to initialize the node with a string name. Also add a String getName() method.
To SL_FunctionNode, add several methods. int numArgs() should return the number of arguments this function accepts. String getArgName(i) should return the name of the ith parameter. ParseNode getBody() should return the body block of the function.
To SL_ExprNode, add a constructor that allows you to initialize the node with any operator type (both binary and unary allowed). Also add an int getOperator() method.

Now there's one last issue to take care of: what do we do with children? For each of the 15 subclasses, you'll need to come up with a layout of what its children are. For example, the children of SL_WhileNode are (in order): SL_ExprNode, followed by SL_BlockNode. Why? Because a while statement is made up of two things in this order: first, a conditional expression, followed by a body to be executed. Notice that order is extremely important! The first child of SL_WhileNode contains a conditional expression, while the second child of SL_WhileNode contains a block of statements.

Another example is SL_FunctionNode, which has 1+n+1 children. The first is of type SL_IdentNode and represents the name of the function; next are n SL_IdentNodes, each naming the parameters for the function, in order; finally, the last child is of type SL_BlockNode, and it represents the body of the function. You should work out such schematics for the 13 remaining node types. For the order of the children, keep it in the same order that you would write them in SL code. The above two examples should serve as good guides.

One last word about all these classes. As noted, you are allowed to add more stuff to all of them (except GrammarTypes). However, be sure that those things you add are all private, so that only the publicly published methods are used outside that class. Say, for example, that you add a method public int foo() to SL_WhileNode, then you make calls to foo() from inside your parser. Well, if someone else then takes your parser and tries to run it on his or her own version of SL_WhileNode, it won't work anymore, because that version of SL_WhileNode doesn't define foo(), which is not in the specs. This is precisely what we will be doing to code to grade it: after we take it for a quick spin to make sure your code all works together, we'll then replace your SL_*.java with our versions of them that implement all the interpreter functions. Then we'll run it again, and if you followed the specs correctly, it should actually not only parse but also execute SL programs.

4. From Grammar to Code

And now, we get to the heart of the project, the Parse.parse() method. To construct the parser, you should basically write a bunch of mutually recursive routines that look like the grammar. Say, just as an example, that you're interested in parsing a for statement with this grammar (which is not in SL):

FOR -> for IDENT = EXPR to EXPR do BLOCK Then you'd write something like:

parse_FOR() { match_for(); // match the keyword "for" parse_IDENT(); // match an identifier match_assn(); // match the assignment symbol parse_EXPR(); // match an expression match_to(); // match the keyword "to" parse_EXPR(); // match an expression match_do(); // match the keyword "do" parse_BLOCK(); // match a block }

You will have a bunch of private methods that look like the above. Some will match terminal tokens, which you can check by calling getToken() on your lexer and doing a comparison; others will match nonterminals, which translate into more recursive calls on the appropriate functions.

As written above, this parser won't be very useful. Why? Because it doesn't construct the parse tree as it goes along. So at the end of the parsing, the parser will be able to tell you whether your program is syntactically well-formed, but it won't be able to tell you which grammar rules it used to parse it. This is a lot like the CYK algorithm as demonstrated in section: as you discovered in Homework 4 part III, you have to modify the algorithm slightly to make it possible to keep track of your parse history. In this case, you'll want to build up a tree-like data structure (rooted at the Parser's tree object) as you parse the input.

You'll notice as you're constructing your parse tree that GrammarTypes doesn't define a node type for every possible nonterminal and terminal in the grammar. This is because not all of them are really needed. For example, STATEMENT nodes are not needed, because we can separately instantiate each of the possible types of statements: BREAK, CALL, IF, etc. You should be able to implement SL quite handily with just the 15 types of nodes defined above.

One issue of some concern is that of error detection. If you have a syntactically flawed program, you should output a message that includes the line number and actual line text, and as useful a diagnosis as possible. You should use exceptions wisely for this purpose (see below for more details on exceptions). Of course, a lexically flawed program should behave as before: the lexer should take care of lexical errors. You should halt execution after the first lexical or syntactic error.

One last note: consider a case where you have an arbitrarily long list of something, like an argument list or a parameter declaration list. Here's an example using a hypothetical grammar for a list of numbers:

LIST = ELEMENT $

number

parse_LIST() { while (lex.getToken() != '$') { // This is a bug! parse_ELEMENT(); } } parse_ELEMENT() { match_number(); }

Do you see why this fails? The problem is in calling lex.getToken() to test to see if the next token is the list termination character, '$'. If it is, we're happy, because we can break out of the while loop. But what if it isn't? That means that there's at least one more number in the list, and as a matter of fact, we just read it when we called lex.getToken(); because that gave me the next number in the list, it obviously didn't match '$'. But now we go to parse the number, except we've lost it, because it was read for testing and is no longer in the stream!

The problem here is that we cannot detect the end of an arbitrarily long list of elements unless we can "peek" ahead one token. You've already seen this happen in your lexer, where you had to peek ahead one character to distinguish between the ">" and ">=" operators. Now you must look ahead by a single token, instead of a single character; but the idea is still the same. This is why we asked you to implement the pushBack() method in the lexer.

5. Loose Ends

This project needs a couple of exception classes. You'll need to write the run-time exceptions public class InterpretException, public class BreakException, and public class ReturnException just to get everything to compile, but you won't be throwing any of them for now. These are thrown by interpret() in class ParseNode. But since you won't be invoking method interpret(), you shouldn't need to catch these exceptions either.

However, you must write and use public class ParseException to handle parsing errors. Make sure that your parser handles parsing errors with grace and intelligence. At the very least tell the user where the parsing error occurred. After the parser detects a parsing error, it should halt execution immediately. Finally, you must not throw any exceptions at the main() level.

Next, as with the previous project, you will need to write a front-end main class to be able to test drive your parser. Call it public class Proj3. It should read in an SL source code file from standard input, parse it, and print out the final parse tree using pre-order, depth-first traversal. Nonexistent or inscrutable print output will be penalized.

Your final tally of files to hand in:

proj3-grammar.txt
all the *.java files needed for the parser
all the *.java files needed for the revised lexer

All these files should be in the same submission directory. Do not use subdirectories.

Don't forget that all your files must begin with a header that contains:

your names
your group ID
the project name (Project #3: SL Parser)
your grader's name (same as your TA for most of you)
the name of the file

For those of you whose graders are requiring you to hand in a hard copy, please order your printouts in the order listed above, then staple them all together. (You may cat the files together to save paper, so long there's some clear mark where one file ends and the next one starts.)

Sample SL programs and parse trees.
Parse tree format.

Keep an eye on the class bboard for clarifications and updates.

END OF ASSIGNMENT