Project #3: SL Parser
15-212, Spring 1998
This assignment is due before 5:00pm on Friday, 20 Mar 1998.
Please note that this is a firm deadline; no late
submissions will be accepted. This assignment is
arguably the most intricate (but also the most
interesting) of the three SL assignments. Get started early!
|
Follow the instruction on the course web page for
submitting your solution. Submit one solution per
group. All your source files should be in the same
directory. Do not use subdirectories. To submit your
solution, you should copy just your source files to the
submission directory. In particular, do not carry out
your program development in the submission area.
|
Points will be given for good documentation and style.
See the coding style document on the course web page. For
this assignment, there'll be a 80-20 split for correctness
and style, respectively, and there will be another 40
points allocated for the written answers from part 1, for
a total of 140 points.
|
See the course web page on collaboration and late
submission policies. Make sure your working directory is
private and cannot be read by others!
|
Overview of the Assignment:
You already know, from homework 4, that given a context-free grammar G and
a string T, a parser can do the following:
- Decide whether G can generate T (i.e. "T is in the grammar G").
- Produce a parse tree of T, which shows how to rewrite the start symbol S
in G to produce T.
Your job will be to construct a parser to do both of the above tasks. Your
program will take as its input T programs in SL; if T is in G, your program
will produce a parse tree for T.
We will grade your program in two ways. First, we will feed it certain SL
programs--some valid, some not--and check that your program outputs a valid
parse for syntactically valid programs, and a meaningful error message
for invalid programs. Second, we will plug your parser code into a working SL
interpreter, replacing our interpreter's existing (correct) parser with yours;
we'll then run the new interpreter on some SL code to see if it still runs
properly. To pass the second test, your code will have to adhere tightly to
the specs below.
1. The Grammar for SL:
Let us first examine the context-free grammar (SIGMA, V, S, R) for SL:
SIGMA = { PROGRAM, FUNCTION, PARAMLIST, PARAMREST, BLOCK,
STATEMENT, BREAK, CALL, ARGLIST, ARGREST, IF, ELSE, LET, READ,
RETURN, WHILE, WRITE, EXPR, BINOP, UNOP, number,
identifier, function, break, call, if, else, let, read,
return, while, write, (, ), {, }, ,, ;, =, +, -, *, /, %, <,
>, <=, >=, ==, !=, &, |, ~, !}
V = { number, identifier, function, break, call, if,
else, let, read, return, while, write, (, ), {, }, ,, ;, =, +, -,
*, /, %, <, >, <=, >=, ==, !=, &, |, ~, !}
S = PROGRAM
R is given below:
- PROGRAM -> FUNCTION* BLOCK
- FUNCTION -> function identifier ( PARAMLIST ) BLOCK
- PARAMLIST -> identifier PARAMREST | e
- PARAMREST -> , identifier PARAMREST | e
- BLOCK -> { STATEMENT* }
- STATEMENT -> BREAK | CALL ; | IF | LET | READ
| RETURN | WHILE | WRITE
- BREAK -> break ;
- CALL -> call identifier ( ARGLIST )
- ARGLIST -> EXPR ARGREST | e
- ARGREST -> , EXPR ARGREST | e
- IF -> if EXPR BLOCK ELSE
- ELSE -> else BLOCK | e
- LET -> let identifier = EXPR ; | let
identifier = CALL ;
- READ -> read identifier ;
- RETURN -> return EXPR ;
- WHILE -> while EXPR BLOCK
- WRITE -> write EXPR ;
- EXPR -> number | identifier | ( EXPR ) |
( UNOP EXPR ) | ( BINOP EXPR EXPR )
- BINOP -> + | - | * | / | % | & | | | < | > |
<= | >= | == | !=
- UNOP -> ~ | !
As always, e is epsilon, the empty string. Also, note that an SL program
is a sequence of function declarations followed by the main body.
This description describes the syntax (how to write valid programs) of
SL, but not the semantics (what a syntactically valid program actually
does). We'll get to the semantics in the next assignment, but drawing on
your knowledge of Java, the semantics of SL should be fairly obvious just from
the grammar.
Your first task is to answer the following problems based on the
grammar above. Put your answers in a text file called
proj3-grammar.txt . Just one write-up per group is
needed. See the note at the very end of this document concerning
the header information you need at the top of this file.
- Write the grammatically correct SL expressions for the
following Java arithmetic expressions (watch your operator
precedence for some of these!), and also give their final
values. You may want to run it through Java to double-check.
Assume that a==1, b==2, c==3, d==4, e==0.
- 1+2+3+4+5
- 1*(2+3)-4/5
- -1+2+-3+4+-5
- a && (!b || c) || !d && e
- (a==1 && b==-2 || c!=-3) && !d!=!-e
- a>=1 || b<=c || d>3 || e%5<4
- Mr. Bar is confused about the difference between
Java grammar and SL grammar. Locate everything that is wrong
with his not-quite-SL program below, and for
each: 1) indicate which SL grammar rule(s) it disobeys; and 2)
rewrite it correctly in SL.
{
function A(a,b) {
if (a < b)
else
return 0;
}
main {
}
function B(a) {
n = - A(a,3) + 2 * A(a,4);
return n+6;
}
}
Extra credit (1 pt): What is the final output of this
program?
- (Note: This question is actually a parser question, not a
grammar question, so you may want to wait until you're writing
the parser to answer this one.)
- A) As we've noted before, a parser is really just a
push-down automaton (DPDA). Briefly demonstrate this
with regard to the top-down parser that you are now writing
by answering these questions. In the parser:
- What corresponds to the DPDA states?
- What corresponds to the push-down store?
- What corresponds to the input tape?
- What is in each square of the input tape?
- B) The one lexeme lookahead feature that we use in our
parser may strike you as being bogus, because in DPDAs, you
can't push an already read input back into the input stream.
Explain briefly why this is actually acceptable;
i.e., explain how you would simulate this lookahead feature
using a DPDA.
2. Patching Up the Lexer
Some of the keywords in the grammar above should be unfamiliar to
you, and you'll notice that one of the keywords you implemented in
Project #2 is no longer used. The first part of this assignment
is to make some modifications to your lexer to accept this revised
SL specification. Assuming that you adhered closely to OOP
principles while writing your lexer, making these changes will
be trivial.
Things to do to the lexer (in addition to whatever else you might
find necessary):
- Remove declare as a keyword.
- Add call and function, which will allow you
to make function calls.
- Add let, which will allow you to make variable
assignments.
- Add "&", which will be a logical and operator (like && in C).
- Add "|", which will be a logical or operator (like || in C).
- Add "!", which will be the logical not operator.
- Add "~", which is the unary negation operator.
- Add a method called void putBack(Token t), which puts the
token t back into the stream, so to speak. Notice that
this backing-up capability is only one token deep; in other words,
if you do multiple putBack() calls, only the most recent
one matters. For example:
lex.putBack(t1);
lex.putBack(t2);
lex.getToken(); // == t2
lex.getToken(); // != t1; it's the next unread token from the stream
3. Setting Up the Parser
You will probably want to implement a top-down parser for this
assignment. If you really want to implement a bottom-up parser,
feel free to do so, but be warned that the TAs may not be able to
help you as effectively if you get stuck. The top-down parser is
also more intuitive and visually similar to the grammar.
First, you must create a Parser class, the main class for this assignment:
public class Parser {
public Parser(InputStream s) { /* to be filled in by you */ }
public void parse() { /* t.b.f.i.b.y. */ }
public void print() { /* t.b.f.i.b.y. */ }
public void interpret() { /* t.b.f.i.b.y. */ }
private ParseNode tree;
private SymbolTable symtbl;
/* add other stuff as you need them */
}
This is the primary class for this assignment. When you create a
Parser object, you pass it a stream that it needs to
parse. The parser should create for itself a lexer so that it can
chop up the stream into tokens. When parse() is called,
the parser actually performs the parsing of the stream, then holds
onto the root of the resulting parse tree as tree. At the
end of the parsing, the lexer will also have completely filled out
a symbol table. The parser needs access to that table, and saves it
as symtbl. If your implementation has the symbol table
separate from the lexer, this is obviously easy to do. If your
implementation has the symbol table as a member object inside the
lexer class, you'll need to add a method to your lexer that allows
the parser to get a copy of the table.
print() simply
prints out the parse tree in a readable text format. The tree should
be printed out using pre-order, depth-first traversal. (The
criterion for readability is that someone who is familiar with this
project -- e.g., a TA -- can stare at the output and deduce what your
printout means without having to ask you for an explanation.) This
method will also, without a doubt, be your friend in debugging.
interpret() begins executing the parse tree; it assumes
that tree is fully parsed and symtbl is filled in
properly. You will concern yourself with interpret() in the next
assignment.
So far so good. The only tricky part seems to be parse().
Before we get into that, however, we need to look at this
mysterious class called ParseNode. Notice that
tree is of that type, so perhaps you have already guessed
that ParseNode is what the parse tree is made of; they are
in fact the nodes of the parse tree, each containing pointers to
children nodes. Let's look at it a bit more closely:
public abstract class ParseNode implements GrammarTypes {
public ParseNode() {...}
public int numChildren() {...}
public ParseNode getChild(int i) {...}
public void addChild(ParseNode node) {...}
public abstract int getType();
public abstract ParseNode interpret(SymbolTable st) throws
InterpretException, BreakException, ReturnException;
public abstract void print(PrintStream ps);
/* define other stuff as you need them */
}
public interface GrammarTypes {
public final static int G_NOP = 0;
public final static int G_NUMBER = 1;
public final static int G_IDENT = 2;
public final static int G_PROGRAM = 3;
public final static int G_FUNCTION = 4;
public final static int G_BLOCK = 5;
public final static int G_BREAK = 6;
public final static int G_CALL = 7;
public final static int G_IF = 8;
public final static int G_LET = 9;
public final static int G_READ = 10;
public final static int G_RETURN = 11;
public final static int G_WHILE = 12;
public final static int G_WRITE = 13;
public final static int G_EXPR = 14;
// you may NOT modify this interface!!!
}
Whew! Lots of stuff here. First, notice that ParseNode
is an abstract class, so you can't instantiate it. What you will
do is create many subclasses of ParseNode (more details
below), but all of them will have at least the stuff defined above.
Remember that each node maintains a list of pointers to its
children. numChildren() returns the number of children
that this node has. getChild(i) returns the
ith child of this node. addChild(node)
appends node to the list of children of this node.
getType() returns the type of the node, as defined by one
of the constants in GrammarTypes (again, see below for
details). print() prints out the contents of the node in
a readable format; this will be useful for debugging. (Hint: if
you implement these print() methods recursively, you'll be
able to greatly simplify writing Parser.print().)
Notice that you'll need some data structure to store an ordered
list of children. (Please note that it must be ordered; in
other words, something like a Vector.) Once you have this, you can
implement the necessary subclasses of ParseNode. There is a
subclass of each node type defined in GrammarTypes, for a
total of 15 subclasses (each in a separate file, of course):
- public class SL_NopNode extends ParseNode {...}
- public class SL_NumberNode extends ParseNode {...}
- public class SL_IdentNode extends ParseNode {...}
- public class SL_ProgramNode extends ParseNode {...}
- public class SL_FunctionNode extends ParseNode {...}
- public class SL_BlockNode extends ParseNode {...}
- public class SL_BreakNode extends ParseNode {...}
- public class SL_CallNode extends ParseNode {...}
- public class SL_IfNode extends ParseNode {...}
- public class SL_LetNode extends ParseNode {...}
- public class SL_ReadNode extends ParseNode {...}
- public class SL_ReturnNode extends ParseNode {...}
- public class SL_WhileNode extends ParseNode {...}
- public class SL_WriteNode extends ParseNode {...}
- public class SL_ExprNode extends ParseNode {...}
The getType() and interpret() methods for each of
these will be simple. For example:
public class SL_NopNode extends ParseNode {
public int getType() { return G_NOP; }
public ParseNode interpret(SymbolTable st) throws
InterpretException, BreakException, ReturnException
{
return null; // to be implemented correctly in
project #4
}
}
So writing these classes will take a lot of cutting and pasting,
but it'll be pretty simple. Here are a few twists, though, to keep
things interesting:
- To SL_NumberNode, add a constructor that allows
you to set what number or value this node represents. Also add
a getValue() method which returns the number.
(Remember that the value is an arbitrary precision
integer, which can be implemented using the
BigInteger
class in java.math
.)
- To SL_IdentNode, add a constructor that allows
you to initialize the node with a string name. Also add a
String getName() method.
- To SL_FunctionNode, add several methods. int
numArgs() should return the number of arguments this
function accepts. String getArgName(i) should return
the name of the ith parameter. ParseNode
getBody() should return the body block of the function.
- To SL_ExprNode, add a constructor that allows you
to initialize the node with any operator type (both binary and
unary allowed). Also add an int getOperator()
method.
Now there's one last issue to take care of: what do we do with
children? For each of the 15 subclasses, you'll need to come up
with a layout of what its children are. For example, the children
of SL_WhileNode are (in order): SL_ExprNode,
followed by SL_BlockNode. Why? Because a while
statement is made up of two things in this order: first, a
conditional expression, followed by a body to be executed. Notice
that order is extremely important! The first child of
SL_WhileNode contains a conditional expression, while the
second child of SL_WhileNode contains a block of
statements.
Another example is SL_FunctionNode, which has 1+n+1
children. The first is of type SL_IdentNode and
represents the name of the function; next are n
SL_IdentNodes, each naming the parameters for the
function, in order; finally, the last child is of type
SL_BlockNode, and it represents the body of the function.
You should work out such schematics for the 13 remaining node
types. For the order of the children, keep it in the same order
that you would write them in SL code. The above two examples
should serve as good guides.
One last word about all these classes. As noted, you are allowed
to add more stuff to all of them (except GrammarTypes).
However, be sure that those things you add are all private, so that
only the publicly published methods are used outside that class.
Say, for example, that you add a method public int foo()
to SL_WhileNode, then you make calls to foo()
from inside your parser. Well, if someone else then takes your
parser and tries to run it on his or her own version of
SL_WhileNode, it won't work anymore, because that version
of SL_WhileNode doesn't define foo(), which is
not in the specs. This is precisely what we will be doing to code
to grade it: after we take it for a quick spin to make sure your
code all works together, we'll then replace your SL_*.java
with our versions of them that implement all the interpreter
functions. Then we'll run it again, and if you followed the specs
correctly, it should actually not only parse but also execute SL
programs.
4. From Grammar to Code
And now, we get to the heart of the project, the
Parse.parse() method. To construct the parser, you should
basically write a bunch of mutually recursive routines that look
like the grammar. Say, just as an example, that you're interested in
parsing a for statement with this grammar (which is not in SL):
FOR -> for IDENT = EXPR to EXPR do BLOCK
Then you'd write something like:
parse_FOR() {
match_for(); // match the keyword "for"
parse_IDENT(); // match an identifier
match_assn(); // match the assignment symbol
parse_EXPR(); // match an expression
match_to(); // match the keyword "to"
parse_EXPR(); // match an expression
match_do(); // match the keyword "do"
parse_BLOCK(); // match a block
}
You will have a bunch of private methods that look like
the above. Some will match terminal tokens, which you can check by
calling getToken() on your lexer and doing a comparison;
others will match nonterminals, which translate into more recursive
calls on the appropriate functions.
As written above, this parser won't be very useful. Why? Because
it doesn't construct the parse tree as it
goes along. So at the end of the parsing, the parser
will be able to tell you whether your program is syntactically
well-formed, but it won't be able to tell you which grammar rules
it used to parse it. This is a lot like the CYK algorithm as
demonstrated in section: as you discovered in Homework 4 part III,
you have to modify the algorithm slightly to make it possible to
keep track of your parse history. In this case, you'll want to
build up a tree-like data structure (rooted at the Parser's
tree object) as you parse the input.
You'll notice as you're constructing your parse tree that
GrammarTypes doesn't define a node type for every possible
nonterminal and terminal in the grammar. This is because not all
of them are really needed. For example, STATEMENT nodes are not
needed, because we can separately instantiate each of the possible
types of statements: BREAK, CALL, IF, etc. You should be able to
implement SL quite handily with just the 15 types of nodes
defined above.
One issue of some concern is that of error detection. If you have
a syntactically flawed program, you should output a message that includes
the line number and actual line text, and as useful a diagnosis as
possible. You should use exceptions wisely for this purpose (see below
for more details on exceptions). Of course, a lexically flawed program
should behave as before: the lexer should take care of lexical errors.
You should halt execution after the first lexical or syntactic error.
One last note: consider a case where you have an arbitrarily long
list of something, like an argument list or a parameter declaration
list. Here's an example using a hypothetical grammar for a list of
numbers:
ELEMENT = number ELEMENT | e
parse_LIST() {
while (lex.getToken() != '$') { // This is a bug!
}
}
parse_ELEMENT() {
}
Do you see why this fails? The problem is in calling
lex.getToken() to test to see if the next token is the
list termination character, '$'. If it is, we're happy, because we
can break out of the while loop. But what if it isn't? That means
that there's at least one more number in the list, and as a matter
of fact, we just read it when we called lex.getToken();
because that gave me the next number in the list, it obviously
didn't match '$'. But now we go to parse the number, except we've
lost it, because it was read for testing and is no longer in the
stream!
The problem here is that we cannot detect the end of an arbitrarily
long list of elements unless we can "peek" ahead one token. You've
already seen this happen in your lexer, where you had to peek ahead
one character to distinguish between the ">" and ">="
operators. Now you must look ahead by a single token, instead of a
single character; but the idea is still the same. This is why we asked you
to implement the pushBack() method in the lexer.
5. Loose Ends
This project needs a couple of exception classes. You'll need to
write the run-time exceptions public class
InterpretException, public class BreakException, and
public class ReturnException just to get everything to
compile, but you won't be throwing any of them for now. These are
thrown by interpret()
in class ParseNode
.
But since you won't be invoking method interpret()
, you
shouldn't need to catch these exceptions either.
However, you must write and use public class
ParseException to handle parsing errors. Make sure that your
parser handles parsing errors with grace and intelligence. At the
very least tell the user where the parsing error occurred. After
the parser detects a parsing error, it should halt execution
immediately. Finally, you must not throw any exceptions at
the main() level.
Next, as with the previous project, you will need to write a
front-end main class to be able to test drive your parser. Call it
public class Proj3. It should read in an SL source code
file from standard input, parse it, and print out the final parse
tree using pre-order, depth-first traversal. Nonexistent or
inscrutable print output will be penalized.
Your final tally of files to hand in:
- proj3-grammar.txt
- all the *.java files needed for the parser
- all the *.java files needed for the revised lexer
All these files should be in the same submission directory. Do not use
subdirectories.
Don't forget that all your files must begin with a header that contains:
- your names
- your group ID
- the project name (Project #3: SL Parser)
- your grader's name (same as your TA for most of you)
- the name of the file
For those of you whose graders are requiring you to hand in a hard
copy, please order your printouts in the order listed above, then
staple them all together. (You may cat the files together to save
paper, so long there's some clear mark where one file ends and the next one
starts.)
Sample SL programs and parse trees.
Parse tree format.
Keep an eye on the class bboard for clarifications and
updates.
END OF ASSIGNMENT