Programming Assignment 2

15-212, Spring 1998

This assignment is due before 9am on 6 Mar 1998.

Follow the instruction on the course web page for submitting your solution. Submit one solution per group. All your source files should be in the same directory. Do not use subdirectories. To submit your solution, you should copy just your source files to the submission directory. In particular, do not carry out your program development in the submission area.

Points will be given for good documentation and style. See the coding style document on the course web page. For this assignment, there'll be a 60-40 split for correctness and style, respectively.

See the course web page on collaboration and late submission policies. Make sure your working directory is private and cannot be read by others!

Prelude to the next few assignments:

This assignment is the first in a series of three on building a compiler for a small language we will call SL. As you tackle these projects, you will get to put the theory you have learned into practice. The projects will build on one another, and thus it will be to your advantage to write clean, modular, and well documented code.

A compiler consists of three broad stages, which will roughly correspond to the three assignments:

A lexical analyzer that breaks up the raw character stream in a source program into a stream of tokens, which include keywords (such as if else while), delimiters (such as + - * / = ; { >= ), identifiers (such as x foo nStudents), and integers (such as 8 11 0 178123222432). Tokens form the terminal symbols of the context free grammar for the language.
A parser that takes the stream of tokens generated by the lexical analyzer and constructs a parse tree for it, according to the CFG for the programming language.
A code generator that turns the parse tree into compiled code that can be executed or interpreted.

The three stages can be related to FSMs, PDA, and Turing machines. You will learn more about these relationships in the forthcoming weeks. In this assignment you will build the lexical analyzer. By the end of the semester, you'll have built an SL machine you can call your own.

The language SL

SL is a procedural language, a simplified version of C. SL supports only one type, namely integers. However, there is no bound on its size. (Java provides class BigInteger in java.math to support such objects.)

The basic lexical units or lexemes in SL are the following strings:

Reserved keywords:
if else while break read write function declare return
Identifiers, given by the regular expression:
([a-z] | [A-Z]) ([a-z] | [A-Z] | [0-9])^* (Note that keywords are not identifiers, even though they are covered by the above regular expression.)
Integers, given by the regular expression: [0-9][0-9]^*
Various operator or delimiter symbols:
(numeric) + - * / %
(relation) < > != == >= <=
(assignment) =
(others) { } ( ) ; ,

Note that according to the above definition, both identifiers and integers can be arbitrarily long strings.

Whitespace characters (tabs, newlines, spaces) may be present between, but not within lexemes. (That is, re turn is different from return, and 123 456 is two integer strings, not one.) Also, one or more whitespaces are required between any adjacent pair of keywords, identifiers, and integers. (Otherwise return 23 would be indistinguishable from return23.)

This brief description of SL doesn't describe the syntax (how to write valid programs) or semantics (what a program actually does) of SL, but only the lexicon (the valid lexeme strings) of SL. However, this is all you need to write a lexer for SL.

This Assignment: The Lexical Analyzer:

The input to the lexical analyzer will be from the standard input (usually a redirected file) as in assignment 1. The lexical analyzer (or lexer) breaks up the input text stream into lexemes and returns a token object for each one. Basically, the lexer is a variant of an FSA: it is a finite state transducer. Its FSA reads the input characters, and whenever it recognizes or accepts a certain lexeme, it outputs or returns a corresponding token object.

You can think of a token as a type or class of lexemes. SL has the following tokens:

A distinct token corresponding to each keyword string in SL.
The token Ident, representing the entire set of identifier strings.
The token Number, representing the entire set of integer strings.
A distinct token for each operator or delimiter symbol string in SL.

Tokens have associated attributes. For example, the Ident token has an associated identifier string. The Number token has an associated digit string. And the keyword and operator symbol tokens have their corresponding strings shown above.

The major part of this assignment is to construct the classes described below. (You may, of course, construct further ones if deemed necessary). Classes Token, SymbolTable, and Lexer will be needed in subsequent assignments.

Class Token

This is the main class for all tokens. You should also define subclasses for each distinct token type described earlier. Each subclass may include an appropriate set of attributes.

Class Lexer

The lexical analyzer. Among its members it should include the following public method:

    Token getToken()

that scans for and returns a token object for the next lexeme found in the input. Also, this function should return an appropriate indication upon end of input (such as null).

The input stream might contain data that cannot be recognized as any lexeme. For example, a quote character is illegal input (there is no quote symbol in SL). Likewise, the "lexeme" 123a is illegal since it is not any of the valid tokens. Upon any such error, the getToken() method should print an error message giving the line number where the error occurred, and the offending lexeme or substring itself. It should then immediately exit, ending program execution.

Class SymbolTable

The symbol table is one that contains an entry for each unique lexeme found in the input source program. For example, if the identifier xyzzy is encountered one or more times in the input, the symbol table should contain exactly one entry for it. Likewise, if the input contains the integer string 1234567890 (perhaps several occurrences), there should be one entry for it in the table. Each entry in the symbol table should also contain some means of identifying the associated token type.

This class should include the following public methods:

insertSymbol(lexeme), where lexeme is some string type. (Remember there are many ways to build strings in Java. Pick an appropriate one.)
lookupSymbol(lexeme), which looks for the given lexeme in the table, and returns the table entry if found, and some other indication if not.
printSymbolTable(), which prints the entire symbol table in some readable format.

Because the set of identifier and integer strings is unlimited, there is no maximum limit to the size of the symbol table. It may grow to arbitrary size depending on the number of entries created. Because of this, the lookupSymbol() method should be implemented efficiently. It should not simply search through the entire set of entries. (Hint: Use a hash table.)

Class Proj2

This class contains the main method for this assignment. It is just a wrapper for exercising the lexer, building the symbol table, and producing test output. It will not be used in future assignments.

In addition to whatever initialization is needed for the Lexer and Token classes, the symbol table should be created inside the main method. The main method should then initialize it with entries for all the reserved keywords and operator symbols in the language. It should then repeatedly invoke the getToken() method to obtain the next token in the input, and insert the associated lexeme in the SymbolTable if not already there. In addition, it should print the lexeme information as described in the following section.

Output From Your Program:

The main method in class Proj2 should print the following information. For each token returned when it invokes getToken, it should print one line containing two strings, the lexeme and the token.

To illustrate, let's say the file sample.sl contains the following:

	{
	    while (i < 0)))) {i = anIdent234 12
	    if i22!=0 return 4;
        }

Then, after compiling your code into Proj2.class and typing cat sample.sl | java Proj2, you should get the following output:

	{		{
	while		while
	(		(
	i		Ident
	<		<
	0		Number
	)		)
	)		)
	)		)
	)		)
	{		{
	i		Ident
	=		=
	anIdent234	Ident
	12		Number
	if		if
	i22		Ident
	!=		!=
	0		Number
	return		return
	4		Number
	;		;
	}		}

(We have deliberately chosen the above example input to emphasize that the lexical analyzer is concerned merely with separating the tokens, not with syntax checking.)

Some programs are not lexically correct. For example, since SL doesn't allow string literals, the following SL code is lexically incorrect:

	{
	    while (i < 0)))) {i = anIdent234 12
	    if i22!=0 return "foobar";
        }

As mentioned earlier your lexical analyzer should, if it detects a lexically illegal input, print out an error message. This should include the source line number containing the error, and the erroneous text. It should then halt execution immediately.

Keep an eye on the class bboard for clarifications and updates.

END OF ASSIGNMENT