Due Tue Nov 26, Noon (electronically).
No late assignments will be accepted!
Maximum Points: 100 [+20 extra credit]
In this part of the assignment, you will write a program that reads a file containing an ML signature, strips all the comments, and writes an HTML file containing slots to be filled in with documentation for the various specifications.
    <sigs>     ::= <sigdecl> | <sigdecl> ;
    <sigdecl>  ::= signature <idf> = <sigexp>
    <sigexp>   ::= sig <spec> end | <idfseq>
    <spec>     ::= val <idf> : <idfseq>
                 | type <typedesc>
                 | eqtype <typedesc>
                 | datatype <datadesc>
                 | exception <idfseq>
                 | structure <idf> : <sigexp>
                 | include <idf>
                 | <spec> <spec>
    <typedesc> ::= <idfseq> | <idfseq> = <idfseq>
    <datadesc> ::= <idfseq> = <idfseq> | <datadesc> and <datadesc>
    <idfseq>   ::= <idf> | <misc> | <idf> <idfseq> | <misc> <idfseq>

The terminal symbols you are requested to recognize are highlighted. In addition, the lexical analysis should recognize and report identifiers (<idf>) and miscellaneous characters (<misc>), and handle (possibly nested) comments; we will come back to these points below. This grammar focuses on the identifiers that are declared in a signature and glosses over the remaining parts of their declaration. For example, when analyzing a val declaration, we look for this keyword, then for an identifier followed by a colon, but we ignore the structure of the type that comes afterward: we interpret it simply as a sequence of identifiers and special characters.
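As a concrete illustration (our own example, not part of the handout), consider how a small signature fits this grammar:

```sml
(* Hypothetical input signature, used only to illustrate the grammar. *)
signature QUEUE =
sig
  type 'a queue
  val empty : 'a queue
  val enqueue : 'a * 'a queue -> 'a queue
end
```

In the last spec, the parser only records that `enqueue` is declared; everything after the colon is read as an <idfseq>, a flat sequence of identifiers and special characters, with no further structure.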
In the SML'96 specification, identifiers are distinguished into alphanumeric and symbolic:
A special character is any among ()[]{}.,".
A comment starts with (* and ends with *). Comments can be nested.
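One standard way to handle nesting is to keep a depth counter while skipping a comment. The sketch below (our own illustration, not the required interface) works over a plain char list rather than the required MStream type; the stream version follows the same structure.

```sml
exception Unterminated

(* skip (depth, cs): consume characters until the comment currently
   open at nesting level `depth` is closed; return the remaining
   characters.  Raises Unterminated if the input ends first. *)
fun skip (_, []) = raise Unterminated
  | skip (depth, #"(" :: #"*" :: rest) = skip (depth + 1, rest)
  | skip (depth, #"*" :: #")" :: rest) =
      if depth = 1 then rest else skip (depth - 1, rest)
  | skip (depth, _ :: rest) = skip (depth, rest)
```

For example, `skip (1, explode "a *) b")` returns the characters after the first `*)`, namely `[#" ", #"b"]`. In your lexer the analogous failure case should raise `Error` rather than a private exception.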
You are requested to write a lexer that, given a stream of characters from a file containing an SML signature, produces a stream of tokens as described above. More precisely, you are requested to implement a functor Lexer () :> LEXER realizing the following signature (it can be found in the file lexer.sml):
    signature LEXER =
    sig
      datatype token =
          IDF of string           (* Identifiers *)
        | DOC of string * string  (* (* EXPRESSION ... MEANING ... *) *)
        | MISC of string          (* Misc characters *)
        | SIGNATURE               (* signature *)
        | EQUAL                   (* = *)
        | SIG                     (* sig *)
        | END                     (* end *)
        | SEMICOLON               (* ; *)
        | INCLUDE                 (* include *)
        | VAL                     (* val *)
        | COLON                   (* : *)
        | EXCEPTION               (* exception *)
        | TYPE                    (* type *)
        | DATATYPE                (* datatype *)
        | EQTYPE                  (* eqtype *)
        | STRUCTURE               (* structure *)
        | AND                     (* and *)

      exception Error of string

      val lex : char MStream.stream -> token MStream.stream
      val toString : token MStream.stream -> string
    end; (* signature LEXER *)

The constructors of type token represent the input tokens displayed on their right. Both alphanumeric and symbolic identifiers are scanned to the token IDF, whose argument is the identifier itself. Ignore the token DOC unless you want to tackle Question 1.4. Special characters are returned as the argument of the token MISC.
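As a small illustration of the expected token stream (our own example, assuming the obvious tokenization): lexing the fragment

    sig val x : int end

should produce the tokens

    SIG  VAL  IDF "x"  COLON  IDF "int"  END

where `x` and `int` are alphanumeric identifiers and the keywords map to their dedicated constructors.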
The exception Error should be raised for either of the following two reasons: a character not mentioned above has been encountered, or comments are not properly nested (in particular, if the input stream ends while reading a comment).
The function lex performs lexical analysis as described above. The file parser.lex describes the stream of lexical tokens expected from running lex on the parser you are going to use (parser.sml). A smaller example is contained in example.lex.
The function toString renders a stream of tokens as a string, one token per line. It is a great debugging tool.
You can find implementations of streams in the file stream.sml, and of the functions to transform a file into a stream of characters in the file mstream_io.sml.
The type of parse trees is given as follows:
    datatype sigs
      = Sigs of sigdecl list
    and sigdecl
      = Sigdecl of string * sigexp            (* one element *)
    and sigexp
      = SigexpSpec of spec
      | SigexpIdf of idfseq
    and spec
      = SpecVal of string * idfseq
      | SpecType of typedesc
      | SpecEType of typedesc
      | SpecDType of datadesc
      | SpecEx of idfseq
      | SpecStr of string * sigexp
      | SpecIncl of string
      | SpecSpec of spec * spec
    and typedesc
      = Typedesc of idfseq * idfseq option
    and datadesc
      = Datadesc of (idfseq * idfseq) list    (* non empty *)
    and idfseq
      = Idfseq of string list                 (* non empty *)

Each datatype corresponds to the non-terminal symbol with the same name in the grammar in Question 1.1. Each constructor corresponds to a grammatical production. Constructors and productions are given in the same order. There are the following exceptions to this rule:
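To make the correspondence concrete (our own example, not from the handout), the one-signature input `signature S = sig val x : int end` would be represented by the tree:

```sml
Sigs [Sigdecl ("S", SigexpSpec (SpecVal ("x", Idfseq ["int"])))]
```

Here the `val` spec becomes a `SpecVal` carrying the declared identifier `"x"` and the type read after the colon as the identifier sequence `["int"]`.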
The resulting HTML document should be divided into three parts separated by horizontal lines (HTML tag <HR>): some header (something similar to or better than what is provided in the example), the formatted signature, and a documentation area.
The minimum formatting we require is as follows.
You are requested to implement the functor Ml2Html (structure Parser : PARSER) :> ML2HTML that realizes the signature ML2HTML below (you can find it in the file ml2html.sml).
    signature ML2HTML =
    sig
      structure Parser : PARSER   (* parameter *)
      exception Error of string
      val ml2html : Parser.sigs * Parser.documentation -> string
    end; (* signature ML2HTML *)

Given a structure Ml2Html realizing this signature, the function Ml2Html.ml2html generates HTML code as described above. As already said, you should not be concerned with the second argument of this function unless you tackle Question 4.1.
The exception Error should be raised if the input parse tree violates the previously stated constraints (for example if a list of identifiers is empty).
    signature TOP =
    sig
      structure Parser : PARSER
      val document : string -> unit
    end; (* signature TOP *)

Once realized, the function document should take as input a file name ending with the extension .sig or .sml, scan it, parse it, and produce an HTML file according to the specifications in Question 1.2. The name of the output file should be identical to the name of the input file, with the extension changed to .html. An error message should be printed if any error occurs (the input file cannot be opened, or the lexer, the parser, or the output generator raises an error). Remember to close all your files in case something goes wrong.
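One way to honor the last requirement is to bracket all work on a file handle so that it is closed on both the normal and the exceptional path. The helper below is a sketch of that idea; the name withOutstream and its use are our own, not prescribed by the handout.

```sml
(* Open `name` for output, run `f` on the resulting outstream, and
   close the stream whether or not `f` raises.  Any exception raised
   by `f` is re-raised after the close. *)
fun withOutstream (name : string) (f : TextIO.outstream -> unit) : unit =
    let
      val out = TextIO.openOut name
    in
      (f out; TextIO.closeOut out)
      handle e => (TextIO.closeOut out; raise e)
    end
```

The same pattern applies to the input stream opened for the .sig or .sml file.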
The parser will take these tokens into account and return them together with the generated parse tree as a list of pairs (key,meaning) (of type documentation).
The output generation function accepts documentation in this format and fills the appropriate slots of the output HTML code with it. For example, assume that your input signature declares the value fact and that the documentation list contains an item of the form ("fact n", "computes the factorial of n"), then the key corresponding to fact should be set to the first component of this pair ("fact n") rather than simply to "fact", and the definition part should be set to the second component of the pair.
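A plausible way to match documentation keys against declared identifiers is a prefix search over the (key, meaning) list. In this sketch, String.isPrefix is from the Standard Basis; the function name and the fallback behavior are our own assumptions.

```sml
(* Return the (key, meaning) pair whose key starts with the declared
   identifier, or fall back to the bare identifier with empty text. *)
fun lookupDoc (idf : string) (docs : (string * string) list) =
    case List.find (fn (key, _) => String.isPrefix idf key) docs of
        SOME entry => entry
      | NONE => (idf, "")
```

For example, lookupDoc "fact" [("fact n", "computes the factorial of n")] evaluates to ("fact n", "computes the factorial of n"), matching the behavior described above.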
If you intend to answer this question, state it clearly at the beginning of all your modules.
We think of a parser as transforming a stream of tokens into a stream of abstract syntax trees. This intuition can be captured by a function which reads some tokens from a stream and produces the first element of the output stream together with the remaining stream of tokens. Since a general module should be independent of the particular set of tokens or abstract syntax trees, we define
    type ('a, 'b) parser = 'a MStream.stream -> ('b * 'a MStream.stream)

where we think of 'a as the type of tokens, and 'b as the type of abstract syntax trees.
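To illustrate the shape of this type with a self-contained analogue (using plain lists in place of MStream, which is our simplification), two tiny inhabitants:

```sml
(* List-based analogue of ('a, 'b) parser, for illustration only. *)
type ('a, 'b) lparser = 'a list -> 'b * 'a list

exception SyntaxError

(* succeed v: consume no input, produce v. *)
fun succeed (v : 'b) : ('a, 'b) lparser = fn ts => (v, ts)

(* any: consume and return the first token, failing on empty input. *)
fun any [] = raise SyntaxError
  | any (t :: ts) = (t, ts)
```

Real parsers of this type are built by composing such primitives: each one returns its result paired with the tokens it did not consume, so the next parser can pick up where it left off.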
The task of a precedence parser is to transform a stream of operators and operands into the correct abstract syntax tree, according to the precedence of the operators and grouping constructs such as parentheses. For an infix operator @, we also need an associativity, which determines whether a @ b @ c is parsed as (a @ b) @ c (left associative) or a @ (b @ c) (right associative). If we think of an operand as simply an operator without arguments, we obtain the following definitions.
    type prec = int                               (* Precedence *)
    datatype assoc = Left | Right | None          (* Associativity *)

    datatype 'b operator
      = Infix of prec * assoc * ('b * 'b -> 'b)   (* binary operator *)
      | Prefix of prec * ('b -> 'b)               (* unary prefix operator *)
      | Postfix of prec * ('b -> 'b)              (* unary postfix operator *)
      | Atom of 'b                                (* nullary operator = operand *)
      | LeftDelimiter                             (* left delimiter, often "(" *)
      | RightDelimiter                            (* right delimiter, often ")" *)
      | Terminator                                (* end of 'b operators *)

We see that the meaning of each operator or operand is supplied as a function on abstract syntax trees of appropriate arity. The precedence parser now has type
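As an illustration of how a token stream is mapped into this datatype (the token and tree types below are our own miniature example, assuming the 'b operator datatype above):

```sml
datatype tok = NUM of int | PLUSTOK | STARTOK | EOFTOK
datatype exp = Num of int | Plus of exp * exp | Times of exp * exp

(* classify: interpret each token as an exp operator, pairing it with
   its precedence, associativity, and tree-building function. *)
fun classify (NUM n) = Atom (Num n)
  | classify PLUSTOK = Infix (4, Left, Plus)
  | classify STARTOK = Infix (5, Left, Times)
  | classify EOFTOK  = Terminator
```

With such a classification, the precedence parser can build `Plus (Num 1, Times (Num 2, Num 3))` from the tokens of `1 + 2 * 3`, since `*` binds more tightly than `+`.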
    val precParse : ('a, 'b operator) parser -> ('a, 'b) parser

that is, given a parser for operators it returns a parser for abstract syntax trees.
                          Precedence  Associativity  Token
    e   ::= e1 & e2           1          right       AMPERSAND
          | e1 | e2           1          right       BAR
          | ~ e               2                      TILDE
          | e1 = e2           3          none        EQUAL
          | e1 < e2           3          none        LESS
          | e1 + e2           4          left        PLUS
          | e1 - e2           4          left        MINUS
          | e1 * e2           5          left        STAR
          | e1 / e2           5          left        SLASH
          | e1 ^ e2           6          right       UPARROW
          | # e               7                      HASH
          | e !               7                      EXCL
          | e %               7                      PERCENT
          | <integer>                                INTEGER(n)
          | ( e )                                    LPAREN RPAREN
    exp ::= e ;                                      SEMICOLON

Modify the lexer code from class (see the file /afs/andrew/scs/cs/15-212-X/code/lecture21.sml) for this revised language and set of tokens.
Some examples:
/afs/andrew/scs/cs/15-212-X/studentdir/<your andrew id>/ass6,
Complete the signature files without changing their names. In particular, do not collapse the whole code into a single file.