# CSE340S21_Project2

# Introduction

I will start with a high-level description of the project in this section. In subsequent sections, I will go into a detailed description of the requirements and how to go about implementing a solution that satisfies them.

The goal of this project is to implement a lexical analyzer automatically for any list of tokens that are specified using regular expressions. The input to your program will have two parts:

- The first part of the input is a list of tokens separated by commas and terminated with the
_{# }(hash) symbol. Each token in the list consists of a token name and a token description. The token description is a regular expression for the token. The list has the following form:

t1_name t1_description , t2_name t2_description , , tk_name tk_description #

- The second part of the input is a
of letters and digits and space characters._{input string }

Your program will read the list of tokens, represent them internally in appropriate data structures, and then do lexical analysis on the * _{input string }*to break it down into a sequence of tokens and lexeme pairs from the provided list of tokens. The output of the program will be this sequence of tokens and lexemes. If during the processing of the input string, your program cannot identify a token to match from the list, it outputs ERROR and stops.

If the input to the program has a syntax error, then your program should not do any lexical analysis of the input string and instead it should output a syntax error message and exits.

More details about the input format and the expected output of your program are given in what follows.

The remainder of this document is organized as follows.

- The second section describes the input format.
- The third section describes the expected output.
- The fourth section describes the requirements on your solution and the grading criteria.
- The fifth and largest section is a detailed explanation how to go about implementing a solution. This section also includes a description of regular expressions.

# 3 Input Format

The input of your program is specified by the following context-free grammar:

input | tokens sectionINPUT TEXT | |

tokens section | token listHASH | |

token list | token | |

token list | token COMMA token list | |

token | ID expr | |

expr | CHAR | |

expr | LPAREN expr RPAREN DOT LPAREN expr RPAREN | |

expr | LPAREN expr RPAREN OR LPAREN expr RPAREN | |

expr | LPAREN expr RPAREN STAR | |

expr | UNDERSCORE |

Where

CHAR = a | b | | z | A | B | | Z | 0 | 1 | | 9

LETTER = a | b | | z | A | B | | Z

SPACE = |

| t

INPUT_TEXT = (CHAR | SPACE)*

COMMA = ,

LPAREN = (

RPAREN = )

STAR = *

DOT = .

OR = |

UNDERSCORE = _

ID = LETTER . CHAR*

Like the first project, you are provided with a lexer to read the input, but you are asked to write the parser. Compared to the first project, writing the parser should be easy.

In the description of regular expressions, UNDERSCORE represents epsilon (more about that later).

## 3.1 Examples

The following are examples of input.

- t1 (a)|(b) , t2 (a).((a)*) , t3 (((a)|(b))*).(c) #

a aa bb aab

This input specifies three tokens _{t1}, _{t2}, and _{t3 }and an INPUT TEXT a aa bb aab.

- t1 (a)|(b) , t2 ((c)*).(b) #

a aa bb aad aa

This input specifies two tokens _{t1}, _{t2}, and an INPUT TEXT a aa bb aad aa. 3. t1 (a)|(b) , t2 (c).((a)*) , t3 (((a)|(b))*).(((c)|(d))*)#

aaabbcaaaa

This input specifies three tokens _{t1}, _{t2 }and textttt3 and an INPUT TEXT aaabbcaaaa.

- tok (a).((b)|(_)) , toktok (a)|(_), tiktok ((a).(a)).(a) # aaabbcaaaa

This input specifies three tokens whose names are _{tok}, _{toktok}, and _{tiktok }and an INPUT TEXT aaabbcaaaa. Recall that in the description of regular expressions, underscore represents epsilon, so the regular expressions for the token _{tok }is equivalent to and the regular expressions for the token _{toktok }is equivalent to

**Note 1 **The code we provided breaks down the input to the program into tokens like ID, LPAREN, RPAREN and so on. Like the first project, to read the input, the code we provide has an object called _{lexer }and a function _{GetToken() }used in reading the input according to the fixed list of tokens for the input to the program. Your program will then have to break down the INPUT TEXT string into a sequence of tokens according to the list of token in the input to the program. In order not to confuse the function that breaks down the INPUT TEXT from the function _{GetToken() }in the code we provided, you should call your function something else like _{my }_{GetToken()}

# 4 Output Format

The output will be either SYNTAX ERROR if the input has a syntax error or a message indicating that one or more of the tokens have expressions that are not valid (see below) or a sequence of tokens and their corresponding lexemes according to the list of tokens provided if there are no errors. More specifically, the following are the output requirements.

- if the input to your program is not in the correct format (not according to the grammar in Section 2), your parser should output SYNTAX ERROR and nothing else, so you should make sure not to print anything before the complete parsing of the input is completed.
- if the input to your program is syntactically correct, then there are two cases to consider:
- If any of the regular expressions of the tokens in the list of tokens in the input to your program can generate the empty string, then your program should output

EPSILON IS NOOOOOT A TOKEN !!! tok_1 tok_2 tok_k

where _{tok 1}, _{tok 2}, _{}, _{tok k }is the list of tokens whose regular expressions can generate the empty string.

- If there is no syntax error and none of the expressions of the tokens can generate the empty string, your program should do lexical analysis on
_{INPUT TEXT }and produce a sequence of tokens and lexemes in_{INPUT TEXT }according to the list of tokens specified

in the input to your program. Each token and lexeme should be printed on a separate line. The output on a given line will be of the form

t , lexeme

where t is the name of a token and lexeme is the actual lexeme for the token t. If during lexical analysis of _{INPUT TEXT}, a syntax error is encountered then ERROR is printed on a separate line and the program exits.

In doing lexical analysis for _{INPUT TEXT}, _{SPACE }is treated as a separator and is otherwise ignored.

**Note 2 **The _{my }_{GetToken() }that you will write is a general function that takes a list of token representations and does lexical analysis according to those representations. In later sections, I explain how that can be done, so do not worry about it yet, but keep in mind that you will be writing a general _{my GetToken() }function.

**Examples **Each of the following examples gives an input and the corresponding expected output.

- t1 (a)|(b) , t2 ((a)*).(a) , t3 (((a)|(b))*).(((c)*).(c)) #

a aac bbc aabc

This input specifies three tokens _{t1}, _{t2}, and _{t3 }and an INPUT TEXT a aac bbc aabc. Since the input is in the correct format and none of the regular expressions generates epsilon, the output of your program should be the list tokens in the INPUT TEXT:

t1 , a t3 , aac t3 , bbc t3 , aabc

- t1 (a)|(b) , t2 ((a)*).(a) , t3 (((a)|(b))*).(c) #

a aa bbc aad aa

This input specifies three tokens _{t1}, _{t2}, and _{t3 }and an INPUT TEXT a aa bbc aad aa. Since the input is in the correct format and none of the regular expressions generates epsilon, the output of your program should be the list tokens in the INPUT TEXT the output of the program should be

t1 , a t2 , aa t3 , bbc t2 , aa

ERROR

Note that doing lexical analysis for INPUT TEXT according to the list of tokens produces ERROR after the second _{t2 }token because there is no token that starts with _{d}.

- t1a (a)|(b) , t2bc (a).((a)*) , t34 (((a)|(b))*).((c)|(d))# aaabbcaaaa

This input specifies three tokens whose names are _{t1a}, _{t2bc}, and _{t34 }and an input text aaabbcaaaa. Since the input is in the correct format, the output of your program should be the list tokens in the INPUT TEXT:

t34 , aaabbc t2bc , aaaa

- t1 (a)|(b) , t2 ((a)*).(a) , t3 (a)*, t4 b , t5 ((a)|(b))* #

a aac bbc aabc

This input specifies five tokens and an INPUT TEXT a aac bbc aabc. Since some of the regular expressions can generate epsilon, the output:

EPSILON IS NOOOOOT A TOKEN !!! t3 t5

# 5 Requirements and Grading

You should write a program to produce the correct output for a given input as described above. You will be provided with a number of test cases. Since this is the second project, the number of test cases provided with the project will be small relative to the number of test cases provided for project 1. **In your solution, you are not allowed to use any built-in or library support for regular expressions in C/C++. This requirement will be enforced by checking your code.**

The grade is broken down as follows

- Submission compiles and every function has comments and every file has your name:
_{10 }points - Submission does not compile or some functions have no comments or some submitted file does not have your name:
**no credit for the submission**. - Syntax checking:
(no partial credit for this)_{10 points } - EPSILON IS NOOOOOT A TOKEN !!! error:
(grade is strictly proportional to the number of test cases that your program successfully passes)_{15 points } - Lexical analysis of
_{INPUT }_{TEXT}:(grade is strictly proportional to the number of test cases that your program successfully passes)_{65 points }

The compiler and environment used is the same as for project 1, so refer to project 1 document for the information.

**Note 3 **If your code does not compile on the submission website, you will not receive any points, not even for documentation.

# 6 How to Implement a Solution

The main difficulty in coming up with a solution is to transform a given list of token names and their regular expression descriptions into a _{my }_{GetToken() }function for the given list of tokens. This transformation will be done in three high-level steps:

- Transform regular expressions into REGs. The goal here is to parse a regular expression description and generate a graph that represents the regular expression
^{[1]}. The generated graph will have a specific format and I will describe below how to generate it. I will call it a*regular expression graph*, or REG for short. - Write a function
_{match(r,s,p)}, where_{r }is a REG ,_{s }is a string and_{p }is a position in the string_{s}. The function match will return the longest possible lexeme starting from position_{p }in the string_{s }that matches the regular expression of the graph_{r}. - Write a class my LexicalAnalyzer(list,s) where list is a list of structures of the form {token name
*,reg**pointer*_{} }and s is an input string. my LexicalAnalyzer stores the list of structures and keeps track of the part of the input string that has been processed. The class myLexicalAnalyzer has a method my GetToken(). For every call of my GetToken(), match(r,s,p) is called for every REGin the list starting from the current position_{r }_{p }maintained in my LexicalAnalyzer. my GetToken() returns the token with the longest matching prefix together with its lexeme and updates the current position. If the longest matching prefix matches more than one token, the matched token that is listed first in the list of tokens is returned.

In what follows I describe how a regular expression description can be transformed into a REG and how to implement the function _{match(r,s,p)}. But first, I will give an overview of regular expressions and the sets of strings they represent.

## 6.1 Set of Strings Represented by Regular Expressions

A regular expression is a compact representation of a set, possibly infinite, of strings. For a given regular expression, we say that expression can * _{generate }*a string if the string is in set that is represented by the regular expression. We start with a general description, then we give examples.

### 6.1.1 General description

We start with the simple expressions (the base cases)

- (
) The regular expression_{One-character strings}_{a }represents the set of strings {*a*_{}}, that is the set consisting of only the string_{a}. - (
) The regular expression represents the set of strings {_{Epsilon}_{}}, that is the set consisting of only the string (which is the empty string).

For the inductive step (recursion of your parser), there are four cases:

- (
) If_{Parentheses}_{R }is a regular expression, the regular expression_{(R) }represents the same set of strings that_{R }The parentheses are used for grouping and to facilitate parsing and do not have a meaning otherwise. - (
) If_{Concatenation}_{R1 }and_{R2 }are regular expressions that represents sets of strings_{S1 }and_{S2 }respectively, then_{(R1).(R2) }represents a new set of strings that can be obtained by concatenating one string from_{S1 }with one string from_{S2 }(order matters). - (
) If_{Union}_{R1 }and_{R2 }are regular expressions that represents sets of strings_{S1 }and_{S2 }respectively, then_{(R1)|(R2) }represents the union of the two sets of strings_{S1 }and_{S2}. () The last case is the most interesting because it allows us unlimited number of repetition. If_{Kleene star}_{R }is a regular expression that represents the set of strings_{S}, then_{(R)* }represents the set of strings that can be obtained by concatenating any number of strings from_{S}, including zero strings (which gives us epsilon).

### 6.1.2 Examples

- The set of strings represented by
_{a }is

{*a*}

- The set of strings represented by
_{b }is

{*b*}

- The set of strings represented by
_{(a)|(b) }is

{*a, b*}

- The set of strings represented by
_{((a)|(b)).(c) }is

{*ac, bc*}

- The set of strings represented by ((a)|(b)).((c)|(d)) is

{*ac, ad, bc, bd*}

- The set of strings represented by ((c)|(d)).((a)|(b)) is

{*ca, cb, da, db*}

- The set of strings represented by
_{(a)* }is

{*, a, aa, aaa, aaaa, *}

- The set of strings represented by
_{(b)* }is

{*, b, bb, bbb, bbbb, *}

Figure 1: Regular expressions graphs for the base cases

Figure 2: Regular expression graph for the an expression obtained using the dot operator

- The set of strings represented by
_{(a)|((b)*) }is

{*a, , b, bb, bbb, bbbb, *}

- The set of strings represented by
_{((a)*)|((b)*) }is

{*,a,b,aa,bb,aaa,bbb,aaaa,bbbb,*}

- The set of strings represented by
_{((a)|(b))* }is

{*, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, *}

## 6.2 Constructing REGs

The construction of REGs is done recursively. The construction we use is called Thompsons construction. Each REG has a one _{start }node and one _{accept }node. For the base cases of epsilon and _{a}, where _{a }is a character of the alphabet, the REGs are shown in Figure 1. For the recursive cases, the constructions are shown in Figures 2, 3, and 4. An example REG for the regular expression _{((a)*).((b)*) }is shown in Figure 5.

Figure 3: Regular expression graph for the an expression obtained using the or operator

Figure 4: Regular expression graph for the an expression obtained using the star operator

### 6.2.1 Data Structures and Code for REGs

In the construction of REGs, every node has at most two outgoing arrows. This will allow us to use a simple representation of a REG node.

Figure 5: Regular expression graph for the an expression obtained using concatenation and star operators

struct REG_node { struct REG_node * first_neighbor; char first_label; struct REG_node * second_neighbor; char second_label;

}

In the representation, first neighbor is the first node pointed to by a node and second neighbor is the second node pointed to by a node. _{first label }and _{second label }are the labels of the arrows from the node to its neighbors. If a node has only one neighbor, then _{second neighbor }will be NULL. If a node has no neighbors, then both first neighbor and second neighbor will be NULL.

Figure 6: Data structure representation for the REG of _{((a)*).((b*))}

struct REG { struct REG_node * start; struct REG_node * accept;

}

In the your parser, you should write a function _{parse expr() }that parses the regular expressions returns the REG of the regular expression that is parsed. The construction of REGs is done recursively. An outline of the process is shown on the next page.

struct REG * parse_expr() {

// if expression is UNDERSCORE or a CHAR, say a for example

// create a REG for the expression and return a pointer to it

// (see Figure 1, for how the REG looks like)

// if expression is (R1).(R2)

//

// the program will call parse_expr() twice, once

// to parse R1 and once to parse R2

//

// Each of the two calls will return a REG, say they are

// r1 and r2

//

// construct a new REG r for (R1).(R2) using the

// two REGs r1 and r2

// (see Figure 2 for how the two REGs are combined)

//

// return r

//

// the cases for (R1)|(R2) and (R)* are similar and

// are omitted from the description

}

### 6.2.2 Detailed Examples for REG Construction

I consider the regular expression _{((a)*).((b)*) }and explain step by step how its REG is constructed (Figure 5).

When parsing _{((a)*).((b)*)}, the first expression to be fully parsed and its REG is constructed is _{a }(Figure 1). In Figure 5, the nodes for the REG of the regular expression _{a }have numbers 1 and 2 to indicate that they are the first two nodes to be created.

The second expression to be fully parsed and its REG constructed when parsing _{((a)*).((b)*) }is _{(a)*}. The REG for _{(a)* }is obtained from the REG for the regular expression _{a }by adding two more nodes (3 and 4) and adding the appropriate arrows as described in the general case in Figure 4.

The starting node for the REG of _{(a)* }is the newly created node 3 and the accepting node is the newly created node 4.

The third regular expression to be fully parsed while parsing _{((a)*).((b)*) }is the regular expression _{b}. The REG for regular expression _{b }is constructed as shown in Figure 1. The nodes for this REG are numbered 5 and 6.

The fourth regular expression to be fully parsed while parsing _{((a)*).((b)*) }is _{(b)*}. The REG for _{(b)* }is obtained from the REG for the regular expression _{b }by adding two more nodes (7 and 8) and adding the appropriate arrows as described in the general case in Figure 4. The starting node for the REG of _{(b)* }is the newly created node 7 and the accepting node is the newly created node 8.

Finally, the last regular expression to be fully parsed is the regular expression _{((a)*).((b)*)}.

The REG of _{((a)*).((b)*) }is obtained from the REGs of _{(a)* }and _{(b)* }by creating a new REG whose initial node is node 3 and whose accepting node is node 8 and adding an arrow from node 4 (the accepting node of the REG of _{(a)*}) to node 7 (the initial node for the REG of _{(b)*}).

Another example for the REG of _{(((a)*).((b).(b)))|((a)*) }is shown in Figures 8 and 9. In the next section, I will use REG of (((a)*).((b).(b)))|((a)*) to illustrate how match(r,s,p) can be implemented.

## 6.3 Implementing match(r,s,p)

Given an REG _{r}, a string _{s }and a position _{p }in the string _{s}, we would like to determine the longest possible lexeme that matches the regular expression for _{r}.

As you will see in CSE355, a string _{w }is in L(_{R}) for a regular expression _{R }with REG _{r }if and only if there is a path from the starting node of _{r }to the accepting node of _{r }such that _{w }is equal to the concatenation of all labels of the edges along the path. I will not go into the details of the equivalence in this document. I will describe how to find the longest possible substring _{w }of _{s }starting at position _{p }such that there is a path from the starting node of _{r }to the accepting node of _{r }that can be labeled with _{w}.

To implement _{match(r,s,p)}, we need to be able to determine for a given input character _{a }and a set of nodes _{S }the set of nodes that can be reached from nodes in _{S }by _{consuming }_{a}. To consume _{a }we can traverse any number of edges labeled , traverse one edge labeled _{a}, then traverse any number of edges labeled . To match one character, you will implement a function called _{Match }_{One Char() }shown in Figure 7. For a given character _{c }and a given set of nodes

Match_One_Char(set_of_nodes S, char c) returns set_of_nodes

{

// 1. find all nodes that can be reached from S by consuming c

//

// S = empty set

// for every node n in S

// if ( (there is an edge from n to m labeled with c) &&

// ( m is not in S) ) {

// add m to S

// }

//

// if (S is empty)

// return empty set

//

// At this point, S is not empty and it contains the nodes that

// can be reached from S by consuming the character c directly

//

// 2. find all nodes that can be reached from the resulting

// set S by consuming no input

//

// changed = true

// S = empty set

// while (changed) {

// changed = false

// for every node n in S {

// add n to S

// for ever neighbor m of n {

// if ( (the edge from n to m labeled with _) &&

// ( m is not in S) )

// add m to S

// }

// }

// if (S not equal to S) {

// changed = true;

// S = S

// S = empty set

// }

// }

//

// at this point the set S contains all nodes that can be reached

// from S by first consuming C, then traversing 0 or more epsilon

// edges

//

// return S

}

Figure 7: Pseudocode for matching one character

.

Figure 8: Regular expression graph ((a)*).((b).(b))

S, _{Match }_{One Char() }will find all the nodes that can be reached from _{S }by consuming the single character _{c}.

In order to match a whole string, we need to match the characters of the strings one after another. At each step, the solution will keep track of the set of nodes _{S }that can be reached by consuming the prefix of the input string that has been processed so far.

To implement _{match(r,s,p)}, we start with the set of nodes that can be reached from the starting node of _{r }by consuming no input. Then we repeatedly call _{Match One }_{Char() }for successive characters of the string _{s }starting from position _{p }until the returned set of nodes _{S }is empty or we run out of input. If at any point during the repeated calls to _{Match }_{One Char() }the set _{S }of nodes contains the accepting node, we note the fact that the prefix of string _{s }starting from position _{p }up to the current position is matching. At the end of the calls to _{Match One }_{Char() }when _{S }is empty or the end of input is reached, the last matched prefix is the one returned by _{match(r,s,p)}. If none of the prefixes are matched, then there is no match for _{r }in _{s }starting at _{p}.

**Note 4**

- The algorithms given above are not the most efficient, but they are probably the simplest to implement the matching functions.
- The algorithm uses sets, so you need to have a representation for a set of nodes and to do operations on sets of nodes.

## 6.4 Detailed Example for Implementing match(r,s,p)

In this section, I illustrate the steps of an execution of _{match(r,s,p) }on the REG of

(((a)*).((b).(b)))|((a)*) shown in Figure 9. The input string we will consider is the string _{s = }_{a}_{a}_{ba }and the initial position is _{p = 0}.

Figure 9: Regular expression graph _{(((a)*).((b).(b)))|((a)*)}

- Initially, the set of states that can reached by consuming no input starting from node 17 is

*S*_{0 }= _{{}17*,*3*,*1*,*4*,*9*,*15*,*13*,*16*,***18**_{}}

Note that _{S}_{0 }contains node ** _{18 }**which means that the empty string is a matching prefix. This means that this expression should result in a

_{EPSILON IS NOOOOOT A TOKEN !!!}error if it is used in a token specification.

- Consuming
_{a}.

The set of states that can be reached by consuming _{a }starting from _{S}_{0 }is

*S*_{1 }= _{{}2*,*14}

The set of states that can be reached by consuming no input starting from _{S}_{1 }is

*S*_{1 } = _{{}2*,*1*,*4*,*9*,*14*,*13*,*16*,***18**_{}}

Note that _{S}_{1 } contains node ** _{18}**, which means that the prefix

_{}

_{a}

_{ }is a matching prefix.

- Consuming
_{a}

The set of states that can be reached by consuming _{a }starting from _{S}_{1 } is

*S*_{2 }= _{{}2*,*14}

The set of states that can be reached by consuming no input starting from _{S}_{2 }is

*S*_{2 } = _{{}2*,*1*,*4*,*9*,*14*,*13*,*16*,***18**_{}}

Note that _{S}_{2 } contains node ** _{18}**, which means that the prefix

_{}

_{a}

_{a}

_{ }is a matching prefix.

- Consuming
_{b}

The set of states that can be reached by consuming _{b }starting from _{S}_{2 } is

*S*_{3 }= _{{}10}

The set of states that can be reached by consuming no input starting from _{S}_{3 }is

*S*_{3 } = _{{}10*,*11}

Note that _{S}_{3 }does not contain node ** _{18 }**which means that

_{}

_{a}

_{a}

_{b }is not a matching prefix, but is still a viable prefix, which means that there is hope we can read more characters that will turn it into a matching prefix.

- Consuming
_{a}

The set of states that can be reached by consuming _{a }starting from _{S}_{3 } is

*S*4 = {}

Since _{S}_{4 }is empty, _{}_{a}_{a}_{ba }is not viable and we stop.

The longest matching prefix is _{a}_{a}. This is the lexeme that is returned. Note that the second call to _{match(r,s,p) }starting after _{}_{a}_{a}_{ }will return ERROR.

[1] The graph is a representation of a non-deterministic finite state automaton

## Reviews

There are no reviews yet.