Project 4: SmallC Parser
CMSC 330, Fall 2017 Due November 6th, 2017
Ground Rules and Extra Info
This is NOTa pair project. You must work on it alone.
In your code, you may use any non-imperative standard library functions (with the exception of printing), but the ones that will be useful to you will be found in the Pervasives
moduleand the List
module. You may choose to use imperative OCaml, but are not required to.
Introduction
In this project, you will implement the lexer and parser for the SmallC language. This parser will be capable of parsing expressions, statements, and full programs. The parser will operate on a flat token list
assembled by your lexer and create a correct stmt
and/or expr
corresponding to the input. When youre done, you will have written the complete pipeline to turn a text file into a running SmallC program!
The only requirement for error handling is that input that cannot be lexed/parsed according to the provided rules should raise an InvalidInputException
. We recommend adding reasonable error messages to these exceptions, as it will make debugging during development infinitely easier. As you test your program, you may input an invalid program by mistake or perform an incorrect parse that leads to a dead end and a silent error, but adding meaningful error messages to your program will help you get past these issues.
All tests will be run on direct calls to your code comparing return values to expected values. Any other output (e.g., for your own debugging) will be ignored. You are free and encouraged to have additional output.
Project Files
To begin this project, you will need to commit any uncommitted changes to your local branch and pull updates from the git repository. Click here for directions on working with the Git repository.The following are the relevant files:
- OCaml files you should edit
- lexer.ml : This file is the sole place you will implement lexer code for the first half of this project.
- parser.ml : This file is the sole place you will implement parser code for the second half of this project.
- Provided OCaml Files (No need to edit, changes will be overwritten!)
- interface.ml : This driver can be used to output your lexer/parser results on standard input or supplied files.
- public.ml and public_inputs/ : The public test driver file and the SmallC input files to go with it, respectively.
- smallCTypes.ml : This file contains all type definitions used in this project.
- utils.ml and testUtils.ml : These files contain functions that we have written for you and for us that aid in testing and debugging. The small part of utils.ml that concerns you in implementing this project is called out very explicitly when it is needed later in the document, and otherwise you should not need to look at either of these files.
- Submission Scripts and Other Files
- submit.rb : Execute this script to submit your project to the submit server.
- submit.jar and .submit : Dont worry about these files, but make sure you have them.
- Makefile : This is used to build the public tests and other project-specific targets by simply running the command
make
, just as in 216.
Compilation, Tests, and Running
In order to compile your project, simply run the make
command and our Makefile
will handle the compilation process for you. After compiling your code, two executable files will be created:
- public.byte
- interface.byte
The public tests can be run by simply running public.byte
(i.e. ./public.byte
in the terminal; think of this just like with a.out in C).
You can run your lexer or parser directly on a SmallC program by running ./interface.byte <lex/parse>
. This driver, provided by us, reads in a program from standard input (or from a file, if a second argument is provided) and performs the requested action (i.e. lex or parse) and prints information about the result of running the SmallC lexer and parser that you will write.
Note that you dont need to touch interface.ml
yourself, as it only functions as an entry point for your code and is structured independent of your exact implementation.
The Lexer
Before your parser can process input, the raw file must be transformed into logical units called tokens. This process is readily handled by use of regular expressions. Information about OCamls regular expressions library can be found in the Str
module documentation. You arent required to use it, but you may find it very helpful.
Your lexer must be written in lexer.ml
and implement the function tokenize : string -> token list
which takes a program as a string and outputs the associated token list. The token
type is implemented in smallCTypes.ml
.
Your lexer must meet these general requirements:
- Tokens can be separated by arbitrary amounts of whitespace, which your lexer should discard. Spaces, tabs (t) and newlines (
) are all considered whitespace. - The lexer should be case sensitive.
- Lexer input should be terminated by the
EOF
token, meaning that the shortest possible output from the lexer is[EOF]
. - If the beginning of a string could be multiple things, the longest match should be preferred, for example:
- while0 should not be lexed as
Tok_While
, but asTok_ID("while0")
, since it is an identifier - 1-1 should be lexed as
[Tok_Int(1); Tok_Int(-1)]
since -1 is a longer match than just -.
- while0 should not be lexed as
Most tokens only exist in one form (for example, the only way for Tok_Pow
to appear in the program is as ^
and the only way for Tok_While
to appear in the program is as while
). However, a few tokens have more complex rules. The regular expressions for these more complex rules are provided here:
-
Tok_Bool of bool
: The value will be set totrue
on the input string true andfalse
on the input string false.- Regular Expression :
true|false
- Regular Expression :
-
Tok_Int of int
: Valid ints may be positive or negative and consist of 1 or more digits. You may find the functionint_of_string
useful in lexing this token type.- Regular Expression :
-?[0-9]+
- Regular Expression :
-
Tok_ID of string
: Valid IDs must start with a letter and can be followed by any number of letters or numbers. Note that keywords may be contained within IDs and they should be counted as IDs unless they perfectly match a keyword!- Regular Expression :
[a-zA-Z][a-zA-Z0-9]*
- Valid examples :
- a
- ABC
- a1b2c3DEF6
- while1
- ifelsewhile
- Regular Expression :
In grammars given later in this project description, we use the lexical representation of tokens instead of the token name; e.g. we write (
instead of Tok_LParen
. This table shows all mappings of tokens to their lexical representations, save for the three variant tokens specified above:
Token Name (in OCaml) | Lexical Representation (in grammars below) |
---|---|
Tok_LParen |
( |
Tok_RParen |
) |
Tok_LBrace |
{ |
Tok_RBrace |
} |
Tok_Equal |
== |
Tok_NotEqual |
!= |
Tok_Assign |
= |
Tok_Greater |
> |
Tok_Less |
< |
Tok_GreaterEqual |
>= |
Tok_LessEqual |
<= |
Tok_Or |
|| |
Tok_And |
&& |
Tok_Not |
! |
Tok_Semi |
; |
Tok_Type_Int |
int |
Tok_Type_Bool |
bool |
Tok_Print |
printf |
Tok_Main |
main |
Tok_If |
if |
Tok_Else |
else |
Tok_While |
while |
Tok_Plus |
+ |
Tok_Sub |
- |
Tok_Mult |
* |
Tok_Div |
/ |
Tok_Pow |
^ |
Your lexing code will feed the tokens into your parser, so a broken lexer will render the parser useless. Test your lexer thoroughly before moving on to the parser!
The Parser
Once the program has been transformed from a string of raw characters into more manageable tokens, youre ready to parse. The parser must be implemented in parser.ml
in accordance with the signatures for parse_expr
, parse_stmt
and parse_main
found in parser.mli
. parser.ml
is the only file you will write code in. The functions should be left in the order they are provided, as a good implementation will rely heavily on earlier functions.
We provide an ambiguous CFG for the language that you must convert to be right-recursive and right-associative, so it can be parsed by recursive descent. (By right associative, we are referring to binary infix operatorsso something like 1 + 2 + 3
will parse as Plus(Int(1), Plus(Int(2), Int(3)))
, essentially implying parentheses in the form (1 + (2 + 3))
.) As convention, in the given CFG all non-terminals are capitalized, all syntax literals (terminals) are formatted as non-italicized code
and will come in to the parser as tokens from your lexer. Variant token types (i.e. Tok_Bool, Tok_Int, and Tok_ID) will be printed as italicized code
.
Parser Part 1: parse_expr
Expressions are a self-contained subset of the SmallC grammar. As such, implementing them first will allow us to build the rest of the language on top of them later. The following is the expression type:
type expr =| Id of string| Int of int| Bool of bool| Plus of expr * expr| Sub of expr * expr| Mult of expr * expr| Div of expr * expr| Pow ofexpr * expr| Greater of expr * expr| Less of expr * expr| GreaterEqual of expr * expr| LessEqual of expr * expr| Equal of expr * expr| NotEqual of expr * expr| Or of expr * expr| And of expr * expr| Not of expr
The function parse_expr : token list -> token list * expr
takes a list of tokens and returns a tuple of the remaining tokens and the expr
that was parsed. Examples in class used a more imperative style with a global reference, but the parse_expr
and parse_stmt
functions in this project use a purely functional style where remaining tokens are returned along with the produced AST types. How you choose to use this part of the return value is up to you, but it must satisfy the same property of finally returning all remaining tokens regardless of your design decisions around it.
The (ambiguous) CFG of expressions, from which you should produce a value of expr
AST type, is as follows:
- Expression ::= OrExpression
- OrExpression ::= OrExpression
||
OrExpression | AndExpression - AndExpression ::= AndExpression
&&
AndExpression | EqualityExpression - EqualityExpression ::= EqualityExpression EqualityOperator EqualityExpression | RelationalExpression
- EqualityOperator ::=
==
|!=
- EqualityOperator ::=
- RelationalExpression ::= RelationalExpression RelationalOperator RelationalExpression | AdditiveExpression
- RelationalOperator ::=
<
|>
|<=
|>=
- RelationalOperator ::=
- AdditiveExpression ::= AdditiveExpression AdditiveOperator AdditiveExpression | MultiplicativeExpression
- AdditiveOperator ::=
+
|-
- AdditiveOperator ::=
- MultiplicativeExpression ::= MultiplicativeExpression MultiplicativeOperator MultiplicativeExpression | PowerExpression
- MultiplicativeOperator ::=
*
|/
- MultiplicativeOperator ::=
- PowerExpression ::= PowerExpression
^
PowerExpression | UnaryExpression - UnaryExpression ::=
!
UnaryExpression | PrimaryExpression - PrimaryExpression ::=
Tok_Int
|Tok_Bool
|Tok_ID
|(
Expression)
As an example, see how the parser will break down an input mixing a few different operators with different precedence:
Input:
2 * 3 ^ 5 + 4
Output (Stylized to show order):
Plus(Mult(Int(2),Pow(Int(3),Int(5))),Int(4))
Parser Part 2: parse_stmt
The next step to parsing is to build statements up around your expression parser. When parsing, a sequence of statements should be terminated as a NoOp
, which is a do-nothing operator. Here is the stmt
type:
type stmt =| NoOp| Seq of stmt * stmt| Declare of data_type * string| Assign of string * expr| If of expr * stmt * stmt| While of expr * stmt| Print of expr
The function parse_stmt : token list -> token list * stmt
takes a token list as input and returns a tuple of the tokens remaining and the stmt
that was parsed from the consumed input tokens. The stmt
type isnt self contained like the expr
type, and instead refers both to itself and to expr
; use your parse_expr
function to avoid duplicate code!
Again, we provide a grammar that is ambiguous and must be adjusted to be parsable by your recursive descent parser:
- Statement ::= Statement Statement | DeclareStatement | AssignStatement | PrintStatement | IfStatement | WhileStatement
- DeclareStatement ::= BasicType ID
;
- BasicType ::=
int
|bool
- BasicType ::=
- AssignStatement ::= ID
=
Expression;
- PrintStatement ::=
printf
(
Expression)
;
- IfStatement ::=
if
(
Expression)
{
Statement}
ElseBranch- ElseBranch ::=
else
{
Statement}
|
- ElseBranch ::=
- WhileStatement ::=
while
(
Expression)
{
Statement}
- DeclareStatement ::= BasicType ID
If we expand on our previous example, we can see how the expression parser integrates directly into the statement parser:
Input:
int x;x = 2 * 3 ^ 5 + 4;printf(x > 100);
Output (Stylized to show order):
Seq(Declare(Type_Int, "x"),Seq(Assign("x",Plus(Mult(Int(2),Pow(Int(3),Int(5))),Int(4))),Seq(Print(Greater(Id("x"), Int(100))), NoOp)))
Parser Part 3: parse_main
The last and shortest step is to have your parser handle the function entry point. This is where parse_main : token list -> stmt
comes in. This function behaves the exact same way as parse_stmt
, except for two key semantic details:
-
parse_main
will parse the function declaration for main, not just the body. -
parse_main
validates that a successful parse terminates inEOF
. A parse not ending inEOF
should raise anInvalidInputException
inparse_main
. As such,parse_main
does NOT return remaining tokens, since it validates ensures that the token list is emptied by the parse.
The grammar for this parse is provided here:
- Main ::=
int
main
(
)
{
Statement}
EOF
For this slightly modified input to the example used in the previous two sections, the exact same output would be produced:
Input:
int main() {int x;x = 2 * 3 ^ 5 + 4;printf(x > 100);}
The output is the exact same as in the statement parser, but parse_main
also trims off the function header and verifies that all tokens are consumed.
Reviews
There are no reviews yet.