This assignment is due before 9am on 6 Mar 1998. |
Follow the instruction on the course web page for submitting your solution. Submit one solution per group. All your source files should be in the same directory. Do not use subdirectories. To submit your solution, you should copy just your source files to the submission directory. In particular, do not carry out your program development in the submission area. |
Points will be given for good documentation and style. See the coding style document on the course web page. For this assignment, there'll be a 60-40 split for correctness and style, respectively. |
See the course web page on collaboration and late submission policies. Make sure your working directory is private and cannot be read by others! |
if else while
), delimiters (such as
+ - * / = ; { >=
), identifiers (such as x foo
nStudents
), and integers (such as 8 11 0 178123222432
).
Tokens form the terminal symbols of the context free grammar for the language.
BigInteger
in java.math
to support such objects.)
The basic lexical units or lexemes in SL are the following strings:
if else while break read write function declare return
([a-z] | [A-Z]) ([a-z] | [A-Z] | [0-9])*
[0-9][0-9]*
+ - * / %
< > != == >= <=
=
(others) { } ( ) ; ,
re turn
is different from return
,
and 123 456
is two integer strings, not one.) Also, one or more
whitespaces are required between any adjacent pair of keywords, identifiers,
and integers. (Otherwise return 23
would be indistinguishable from
return23
.)
This brief description of SL doesn't describe the syntax (how to write
valid programs) or semantics (what a program actually does) of SL, but
only the lexicon (the valid lexeme strings) of SL. However, this is all
you need to write a lexer for SL.
Token getToken()
that scans for and returns a token object for the next lexeme found in the input.
Also, this function should return an appropriate indication upon end of input
(such as null
).
The input stream might contain data that cannot be recognized as any lexeme.
For example, a quote character is illegal input (there is no quote symbol in SL).
Likewise, the "lexeme" 123a
is illegal since it is not any of the
valid tokens. Upon any such error, the getToken()
method should
print an error message giving the line number where the error occurred, and the
offending lexeme or substring itself. It should then immediately exit, ending
program execution.
xyzzy
is encountered one or more times in the input, the symbol table
should contain exactly one entry for it. Likewise, if the input contains the
integer string 1234567890
(perhaps several occurrences), there should
be one entry for it in the table. Each entry in the symbol table should also
contain some means of identifying the associated token type.
This class should include the following public methods:
insertSymbol(lexeme)
, where lexeme is some string type.
(Remember there are many ways to build strings in Java. Pick an appropriate
one.)
lookupSymbol(lexeme)
, which looks for the given lexeme in the
table, and returns the table entry if found, and some other indication if not.
printSymbolTable()
, which prints the entire symbol table in some
readable format.
lookupSymbol()
method should be implemented efficiently. It should not
simply search through the entire set of entries. (Hint: Use a hash table.)
main
method for this assignment. It is just
a wrapper for exercising the lexer, building the symbol table, and producing
test output. It will not be used in future assignments.
In addition to whatever initialization is needed for the Lexer and Token classes,
the symbol table should be created inside the main
method. The
main
method should then initialize it with entries for all the
reserved keywords and operator symbols in the language. It should then repeatedly
invoke the getToken()
method to obtain the next token in the input,
and insert the associated lexeme in the SymbolTable
if not already
there. In addition, it should print the lexeme information as described in the
following section.
main
method in class Proj2
should print the following
information. For each token returned when it invokes getToken
, it
should print one line containing two strings, the lexeme and the token.
To illustrate, let's say the file sample.sl
contains the following:
{ while (i < 0)))) {i = anIdent234 12 if i22!=0 return 4; }Then, after compiling your code into
Proj2.class
and
typing cat sample.sl | java Proj2
, you should get the following
output:
{ { while while ( ( i Ident < < 0 Number ) ) ) ) ) ) ) ) { { i Ident = = anIdent234 Ident 12 Number if if i22 Ident != != 0 Number return return 4 Number ; ; } }(We have deliberately chosen the above example input to emphasize that the lexical analyzer is concerned merely with separating the tokens, not with syntax checking.) Some programs are not lexically correct. For example, since SL doesn't allow string literals, the following SL code is lexically incorrect:
{ while (i < 0)))) {i = anIdent234 12 if i22!=0 return "foobar"; }As mentioned earlier your lexical analyzer should, if it detects a lexically illegal input, print out an error message. This should include the source line number containing the error, and the erroneous text. It should then halt execution immediately.