In this project, you are asked to write a parser and a type checker for a small language. The parser checks that the input is syntactically correct and the type checker enforces the semantic rules of the language. The semantic rules that your program will enforce relate to declarations and types. In addition, your program will check for unused variables or variables used before they are assigned a value and output corresponding error messages.
The input to your code will be a program and the output will be: syntax error message if the input program is not syntactically correct
- if the input program has no syntax error, the output is:
- semantic error messages if there is a declaration error or a type mismatch in the input program,
- error messages if there are variables that are declared but never used or variables that are used before they are assigned a value, or
- information about the types of the symbols declared in the input program if there is no semantic error.
In what follows, I describe the language syntax and semantic rules.
1. Language Syntax
1.1. Grammar Description
program | scope | ||
scope | LBRACE scope list RBRACE | ||
scope list | scope | ||
scope list | var decl | ||
scope list | stmt | ||
scope list | scope scope list | ||
scope list | var decl scope list | ||
scope list | stmt scope list | ||
var decl | id list COLON type name SEMICOLON | ||
id list | ID | ||
id listtype name stmt list stmt list stmt stmt assign stmt while stmt while stmt expr expr | ID COMMA id list REAL | INT | BOOLEAN | STRING stmt stmt stmt list assign stmt while stmt ID EQUAL expr SEMICOLON WHILE condition LBRACE stmt list RBRACE WHILE condition stmt arithmetic operator expr expr binary boolean operator expr expr | ll | |
expr | relational operator expr expr | ||
expr | NOT expr | ||
expr | primary | ||
arithmetic operator | PLUS | MINUS | MULT | DIV | ||
binary boolean operator | AND | OR | XOR | ||
relational operator | GREATER | GTEQ | LESS | NOTEQUAL | LTEQ | ||
primary | ID | NUM | REALNUM | STRING CONSTANT | bool const | ||
bool const | TRUE | FALSE | ||
condition | LPAREN expr RPAREN | ||
Notice that expressions are in prefix notation in this language. This makes parsing easier.
1.2. Tokens
The tokens are included for completeness. You are provided with a getToken() function for the following tokens used in the grammar description:
LBRACE | = | { |
RBRACE | = | } |
COLON | = | : |
SEMICOLON | = | ; |
REAL | = | REAL |
INT | = | INT |
BOOLEAN | = | BOOLEAN |
STRING | = | STRING |
WHILE | = | WHILE |
COMMA | = | , |
LPAREN | = | ( |
RPAREN | = | ) |
EQUAL | = | = |
PLUS | = | + |
MULT | = | * |
DIV | = | / |
AND | = | ^ |
OR | = | | |
XOR | = | & |
NOT | = | ~ |
GREATER | = | > |
GTEQ | = | >= |
LESS | = | < |
LTEQ | = | <= |
NOTEQUAL | = | <> |
TRUE | = | TRUE |
FALSE | = | FALSE |
STRING_CONSTANT | = | (letter | digit)* |
ID | = | letter (letter | digit)* |
NUM | = | 0 | (pdigit digit*) |
REALNUM | = | NUM . digit digit* |
where
letter | = | a | b | c | | z | A | B | C | | Z | ||
digit | = | 0 | 1 | | | 2 | | 9 |
pdigit | = | 1 | 2 | 3 | | 9 |
2. Language Semantics
2.1. Scoping Rules
Static scoping is used. I have already covered static scoping in class, so you should know the semantics and how references should be resolved. Every scope defines a scope.
2.2. Types
The language has four basic types: INT , REAL , BOOLEAN , and STRING .
2.3. Variables
id list : type name |
Variables are declared explicitly in var decl of the form. The names of the declared variables appear in the id list and their type is the type name . Example Consider the following program written in our language:
{x : INT; y : INT; y = x;} |
This program has two declared variables: x and y . Both have type INT .
2.4. Declaration and Reference
Any occurrence of a name in the program is either a declaration or a reference. Any occurrence of a name in an id list that is part of a var decl is a declaration of the name. Any other occurrence of a name is considered a reference to that name. Note that the above definitions exclude the built-in type names, which are predeclared as part of the language definition.
Given the following example (the line numbers are not part of the input):
01 {02 | a : INT; |
03 | b : REAL; |
0405 } | b = x; |
We can categorize all occurrences of names as declaration or reference:
Line 2, the occurrence of name a is a declaration
- Line 3, the occurrence of name b is a declaration
- Line 4, the occurrence of name b is a reference
- Line 4, the occurrence of name x is a reference
2.5. Type System
We describe which assignments are valid (type compatibility) and how to determine the types of expressions (type inference).
2.5.1. Type Compatibility Rules
Type compatibility rules specify which assignments are valid. The Rules are the following.
- C1: If the types of the lefthand side of an assignment is INT , BOOLEAN , or STRING , the righthand side of the assignment should have the same type.
- C2: If the type of the lefthand side of an assignment is REAL , the type of the righthand side of the assignment should be INT or REAL .
- M1: Assignments that do not satisfy C1 or C2 are not valid. In this case, we say that there is a type mismatch.
2.5.2. Type Inference Rules
- C3: The types of the operands of an arithmetic operator should be REAL or INT
binary boolean operator |
- C4: The type of the operands of ashould be BOOLEAN .
- C5: If neither operand of a relational operator has type INT or REAL , then the operands should have the same type. In this case, both types can be STRING or both types can be BOOLEAN .
- C6: If one of the operands of a relational operator has type INT or REAL , then the other operand should have type INT or REAL . In this case, the two operands can have different types.
- C7: The type of a condition should be BOOLEAN .
- C8: The type of the operand of the NOT operator should be boolean.
- I1: If C3 is satisfied, and the type of one of the operands of an arithmetic operator is REAL , the type of the resulting expression is REAL .
- I2: For the PLUS , MINUS , and MULT operators, if the types of the two operands are INT , the type of the resulting expression is INT .
- I3: For the DIV operator, if the types of the two operands are INT , the type of the resulting expression is REAL .
- I4: If the operands of a binary boolean operator are BOOLEAN , the type of the resulting expression is BOOLEAN .
- I5: If C5 and C6 are satisfied, the type of a relational operator expr expr is BOOLEAN .
- I6: The type of NUM constants is INT
- I7: The type of REALNUM constants is REAL
bool const |
- I8: The type ofconstants is BOOLEAN
I9: The type of STRING CONSTANT constants is STRING
- M2: If C3, C4, C5, C6, C7 or C8 are not satisfied, the type of the expressions is ERROR. In this case, we say that there is a type mismatch.
3. Parser
You must write a parser for the grammar, If your parser detects a syntax error in the input, it should output the following message and exit:
Syntax Error
You should start coding by writing the parser first and make sure that your parser generates a syntax error message if the input program is syntactically incorrect. There will be no partial credit given for syntax checking
The grammar of the language is not LL(1) i.e. it does not satisfy the conditions for predictive parser with one token lookahead, however, it is still possible to write a recursive descent parser with no backtracking. We can do this by looking ahead at more than one token in some cases and left-factoring some rules.
Two rules like
stmt list | stmt |
stmt list | stmt stmt list |
stmt list |
stmt | |
stmt list |
can be parsed by first parsing and then checking if the next token is the beginning of
or in the FOLLOW of . In fact the two rules are equivalent to
stmt list | stmt stmt list1 |
stmt list1 | stmt |
but you do not need to explicitly left-factor the rules by introducing stmt list1 to parse stmt list .
4. Semantic Checking
Your program will check for the following semantic errors and output the apropriate message when it encounters that error.
4.1. Declaration Errors
- Variable declared more than once (error code 1.1)
id list |
A variable is declared more than once if appears more than once in the same id list of a var decl or if it appears in twoof two different var decl and locally looking up the
name that appears in one of the two declarations returns the other declaration.
- Undeclared variable (error code 2)
If resolving a variable name that appears in a statement other than a declaration returns no declaration, the variable is undeclared.
- Variable Declared but not used (error code 3)
If a variable is declared but is not referenced, we say that the variable is declared but not used. We say that a declaration is referenced if some reference to the variable resolves to the declaration. If a declaration is not referenced, we have error code 1.3.
id list |
Redeclaration or reference as a variable for of a built-in type name If a built-in type name appears inof a variable declaration or appears in a statement other than a declaration statement, your parser should output syntax error.
For each error involving declarations, your type checker should output one line in the following format:
ERROR CODE <code> <symbol_name>
in which <code> should be replaced with the proper code (see the error codes listed above) and <symbol name> should be replaced with the name of the variable that caused the error. Since the test cases will have at most one error each, the order in which these error messages are printed does not matter.
Note that there will only be at most one declaration error per test case.
4.2. Type Mismatch
If there are no declaration errors and any of the type constraints (listed in the Type System section above) is violated in the input program, then the output of your program should be:
TYPE MISMATCH <line_number> <constraint>
Where <line number> is replaced with the line number of the line in which the violation occurs and
<constraint> should be replaced with the label of the violated type constraint (possible values are C1, C2, C3, C4, C5, C6, C7 or C8. (see section on Type System for details of each constraint). Note that you can assume that anywhere a violation can occur it will be on a single line.
4.2.1. Type Mistmatch and Other Errors
- Type mismatch and C1 and C2 violations You should not declare C1 or C2 violation if the expression on the RHS has a type mismatch error. If there is a type mismatch error in the expression on the RHS of an assignment (C3, C4, C5, C6, C8), you should not also declare C1 or C2 errors. I illustrate this with an example
1: { | x : INT; |
2: | y : STRING; |
3: | z : INT; |
4:5: } | z = + x y; |
Here, on line 4, we have a violation of C3 because the type of y is STRING which violates C3. Given that C3 is violated, the type of + x y is really not defined and we could conclude that C1 is violated, but you should not declare a C1 violation in this case. Similar examples can be made for other combinations.
- Type mismatch and C7 Violations You should not declare C7 violation if the condition has a type mismatch error.
- Type Mismatch and Declaration Errors If there are declaration errors and type mismatch errors, only the output for the declaration errors should be produced.
Note that there will only be at most one type mismatch error per test case.
4.3. Use of Uninitialized Variables
For this part, your program should identify variables that are used before they are defined. We say that a variable is used if it appears in an expression. We say that a variable is defined if it appears on the lefthand side of an assignment. A variable is used before it is defined if there is an execution path in which there is a use of the variable that is not preceded by a definition of the variable. Compilers typically call this use of an uninitialized variable. In general determining use of uninitialized variables is involved and requires building a control flow graph of the program. Fortunately, for the language of this project, determining uninitialized uses is relatively easy. First, I will describe the output format, then I will give a more detailed description of how determine uninitialized uses.
The output format for use of uninitialized variables is the following
UNINITIALIZED <name_reference_1> <line_no_reference_1>
UNINITIALIZED <name_reference_2> <line_no_reference_2>
<name reference i> | is the name of ith uninitialized variable and | <line no reference i> |
Where is the line number in which the ith uninitialized reference appears.
In what follows, I will explain how to determine uses of uninitialized variables. Consider the program below (the line numbers are not part of the input but are added for ease of reference):
01 {02 | x1, x2, x3, x4 , x5 : INT; |
0304 | x3 = 2; |
0506 | WHILE ( < x1 100 ) { |
07 | x1 = + x1 1; |
08 | x2 = 1; |
09 | x4 = + x2 1; |
10 | x1 = + x3 1; |
11 | WHILE ( < x1 10 ) { |
12 | x5 = + x1 + x3 x5 ; |
13 | } |
14 | x1 = x5; |
15 | x5 = 1; |
16 | } |
171819 } | x2 = + x2 1 ; |
- The use of x1 in the expression < x1 100 is a use of an uninitialized variable. The first time the expression is evaluated, there is no previous definition (assignment) to x1. Note that later evaluations of < x1 10 happen after the assignment x1 = + x1 1; so the use of x1 in < x1 10 counts as initialized.
- The use of x1 in x1 = + x1 1; on line 07 is uninitialized because the first time + x1 1 is evaluated, x1 has no initial value.
- The use of x2 on line 09 is initialized because it is always preceded by the definition (assignment to) of x2 on line 08.
- The use of x3 on line 10 is initialized because it is always preceded by the definition (assignment to) x3 on line 04.
- The use of x1 on line 11 is initialized because it is always preceded by the definition (assignment to) x1 on line 10 (also on line 07).
- The use of x1 on line 12 is initialized because it is always preceded by the definition (assignment to) x1 on line 10 (also on line 07).
- The use of x3 on line 12 is initialized because it is always preceded by the definition (assignment to) x3 on line 04.
- The use of x5 on line 12 is not initialized because it is not preceded by a definition (assignment to) x5.
- The use of x5 on line 14 is not initialized because it is not preceded by a definition (assignment to) x5. Even though there is a definition (assignment to) of x5 on line 12, we cannot determine (in general) if the body of the loop will be executed, so we assume that the preceding loop did not execute. Note that this situation is different from the use of x1 on line 12 which is guaranteed to be preceded by the definition (assignment to) of x1 on line 10.
- The use of x2 on line 18 is not initialized because it is not preceded by a definition (assignment to) x2. Even though x2 is defined (assigned a value) on line 08, we cannot determine (in general) if the body of the loop is executed, so we have to assume that it is not executed.
It should be clear from the example, that to determine uninitialized variables, you can treat while stmt
while stmt |
as some kind of a scope. If a use is inside a, then all definitions preceding it in the while stmt together with all definitions preceding it in enclosing while stmt should be taken into consideration.
4.4. No Semantic Errors
If there are no declaration errors, no type mismatch errors and no use of uninitialized variables, then your program should output for each reference of a declared variable name, in the order in which the reference appears in the program, the line number in which the reference appears and the line number of the declaration to which the reference of the name resolves. For this part, the format should be the following
<name_reference_1> <line_no_reference_1> <line_no_declared_1>
<name_reference_2> <line_no_reference_2> <line_no_declared_2>
Where <name reference i> is the name of a variable and corresponds to ith name reference in the program. <line no reference i> is the line number in which the ith reference appears and <line no declared i> is the line number of the declaration corresponding to the ith reference.
5. More Examples
Example 1. Given the following example (the line numbers are not part of the input):
01 {02 | x : INT; |
03 | y : BOOLEAN; |
0405 } | y = x; |
The output will be the following:
TYPE MISMATCH 4 C1
Example 2. Given the following example (the line numbers are not part of the input):
01 {02 | x : BOOLEAN; |
03 | y : REAL; |
0405 } | y = x; |
The output will be the following:
TYPE MISMATCH 4 C2
Example 3. Given the following example (the line numbers are not part of the input):
01 {02 | x : BOOLEAN; |
03 | x : BOOLEAN; |
0405 } | x = x; |
The output will be the following:
ERROR CODE 1.1 x
Example 4. Given the following example (the line numbers are not part of the input):
01 {02 | x : INT; |
03 | y, x : STRING; |
04 | y = 10; |
0505 } | x = 10; |
The output will be the following:
ERROR CODE 1.1 x
Example 5. Given the following example (the line numbers are not part of the input):
01 {02 | x : INT; |
03 | y : STRING; |
0405 } | y = abc; |
The output will be the following:
ERROR CODE 1.3 x
Example 6. Given the following example (the line numbers are not part of the input):
01 {02 | x1 , x2 : INT; |
03 | y : INT; |
04 | x1 = y; |
0506 } | y = + x1 x2; |
The output will be the following:
UNINITIALIZED y 4
UNINITIALIZED x2 5
Example 7. Given the following example (the line numbers are not part of the input):
01 {02 | a : INT; |
03 | b : INT; |
04 | a = 1; |
05 | b = 1; |
06 | { a : INT; |
07 | a = 1; |
08 | WHILE ( > a b ) |
09 | { |
10 | a = a b; |
11 | } |
1213 } | } |
The output will be the following:
- 4 2b 5 3 a 7 6 a 8 6 b 8 3 a 10 6 a 10 6
- 10 3
6. Summary of Requirements
- If there is a syntax error, the program should output syntax error and exit.
- If there is no syntax error and there is a declaration error, the program should output the error message for the declaration error. You can assume that there can be no more than one declaration error in the test case. If the program outputs a declaration error, the program should output no other error message. Specifically, it should not output any type mismatch or uninitialized variable error message.
- If there is no syntax error and no declaration error and there is a type mismatch, the program should output the error message for the type mismatch. You can assume that there can be no more than one type mismatch per test case. If the program outputs a type mismatch error message, the program should output no other error message. Specifically, it should not output any uninitialized variable error message.
- If there is no syntax error, no declaration error and no type mismatch error, the program should output the list of variables used but not initialized in the format specified in Section 4.3.
- If there is no error of any kind, the program should output for each reference, the line number of the reference and the line number of the declaration to which the reference resolves as explained in Section 4.4 above.
7. Evaluation
Your submission will be graded on passing the automated test cases.
The test cases (there will be multiple test cases in each category, each with equal weight) will be broken down in the following way (out of 100 points):
- Parsing: 35 points
- Errors involving declarations: 20 points
- Type mismatch errors and no semantic error cases: 35 points
- Use of uninitialized variables: 10 points
There is no partial credit for the parsing category, you need to pass all test cases in that category to get the 35 points. All other categories have partial credit.
8. Implementation Suggestions
- the provided code has the main() function in lexer. It is better to have a new file for parser and move the main() function to the parser.
- You should first write the parser using the approach we studied in class.
- The parser will require more than one lookahead in some places.
- Do not start on any other part before finishing the parser.
- If you do not follow the approach we studied in class methodically, you can end up with a parser that passes almost all cases and for which it is hard to find what is wrong.
- In the past, some students could only fix their parser errors by rewriting it completely from scratch.
- Semantic checking might affect parsing.
- If there is syntax error, only the syntax error message should be produced and no other error messages should be produced.
- If you produce semantic error messages while parsing, you might end up generating a semantic error message before encountering the syntax error message. That is why you should save your semantic error message and wait until parsing is done to print it.
- If your program crashes or has segmentation fault, that can affect the output of syntax checking.
- Type checking for expressions.
- As described in class, you should have parse expr() return a type (am integer value denoting the type).
- For the base case expr -> primary , the type can be determined immediately for constants and by looking up the variable type in the case primary is ID .
- For the general case, it will be helpful to have a function in which all the type checking rules are implemented. The function takes as input an operator and two types (one is ignored in the case of unary operator) and returns the type of the resulting expressing or detects semantic error.
Reviews
There are no reviews yet.