CSE340S21_Project2

Name: [SOLVED] CSE340 Project 2: Generating a lexical analyzer automatically! - Assignment Chef
SKU: [Solved] CSE340 Project 2: Generating a lexical analyzer automatically!
Price: 30 USD
Availability: InStock

Introduction

I will start with a high-level description of the project in this section. In subsequent sections, I will go into a detailed description of the requirements and how to go about implementing a solution that satisfies them.

The goal of this project is to implement a lexical analyzer automatically for any list of tokens that are specified using regular expressions. The input to your program will have two parts:

The first part of the input is a list of tokens separated by commas and terminated with the _#(hash) symbol. Each token in the list consists of a token name and a token description. The token description is a regular expression for the token. The list has the following form:

t1_name t1_description , t2_name t2_description , … , tk_name tk_description #

The second part of the input is a _{input string}of letters and digits and space characters.

Your program will read the list of tokens, represent them internally in appropriate data structures, and then do lexical analysis on the _{input string}to break it down into a sequence of tokens and lexeme pairs from the provided list of tokens. The output of the program will be this sequence of tokens and lexemes. If during the processing of the input string, your program cannot identify a token to match from the list, it outputs ERROR and stops.

If the input to the program has a syntax error, then your program should not do any lexical analysis of the input string and instead it should output a syntax error message and exits.

More details about the input format and the expected output of your program are given in what follows.

The remainder of this document is organized as follows.

The second section describes the input format.
The third section describes the expected output.
The fourth section describes the requirements on your solution and the grading criteria.
The fifth and largest section is a detailed explanation how to go about implementing a solution. This section also includes a description of regular expressions.

3 Input Format

The input of your program is specified by the following context-free grammar:

input	→	tokens sectionINPUT TEXT
tokens section	→	token listHASH
token list	→	token
token list	→	token COMMA token list
token	→	ID expr
expr	→	CHAR
expr	→	LPAREN expr RPAREN DOT LPAREN expr RPAREN
expr	→	LPAREN expr RPAREN OR LPAREN expr RPAREN
expr	→	LPAREN expr RPAREN STAR
expr	→	UNDERSCORE

Where

CHAR = a | b | … | z | A | B | … | Z | 0 | 1 | … | 9

LETTER = a | b | … | z | A | B | … | Z

SPACE = ’ ’ |
| t

INPUT_TEXT = ” (CHAR | SPACE)* ”

COMMA = ’,’

LPAREN = ’(’

RPAREN = ’)’

STAR = ’*’

DOT = ’.’

OR = ’|’

UNDERSCORE = ’_’

ID = LETTER . CHAR*

Like the first project, you are provided with a lexer to read the input, but you are asked to write the parser. Compared to the first project, writing the parser should be easy.

In the description of regular expressions, UNDERSCORE represents epsilon (more about that later).

3.1 Examples

The following are examples of input.

t1 (a)|(b) , t2 (a).((a)*) , t3 (((a)|(b))*).(c) #

“a aa bb aab”

This input specifies three tokens _t1, _t2, and _t3and an INPUT TEXT “a aa bb aab”.

t1 (a)|(b) , t2 ((c)*).(b) #

“a aa bb aad aa”

This input specifies two tokens _t1, _t2, and an INPUT TEXT “a aa bb aad aa”. 3. t1 (a)|(b) , t2 (c).((a)*) , t3 (((a)|(b))*).(((c)|(d))*)#

“aaabbcaaaa”

This input specifies three tokens _t1, _t2and textttt3 and an INPUT TEXT “aaabbcaaaa”.

tok (a).((b)|(_)) , toktok (a)|(_), tiktok ((a).(a)).(a) # “aaabbcaaaa”

This input specifies three tokens whose names are _tok, _toktok, and _tiktokand an INPUT TEXT “aaabbcaaaa”. Recall that in the description of regular expressions, underscore represents epsilon, so the regular expressions for the token _tokis equivalent to and the regular expressions for the token _toktokis equivalent to

Note 1 The code we provided breaks down the input to the program into tokens like ID, LPAREN, RPAREN and so on. Like the first project, to read the input, the code we provide has an object called _lexerand a function _GetToken()used in reading the input according to the fixed list of tokens for the input to the program. Your program will then have to break down the INPUT TEXT string into a sequence of tokens according to the list of token in the input to the program. In order not to confuse the function that breaks down the INPUT TEXT from the function _GetToken()in the code we provided, you should call your function something else like _my_GetToken()

4 Output Format

The output will be either SYNTAX ERROR if the input has a syntax error or a message indicating that one or more of the tokens have expressions that are not valid (see below) or a sequence of tokens and their corresponding lexemes according to the list of tokens provided if there are no errors. More specifically, the following are the output requirements.

if the input to your program is not in the correct format (not according to the grammar in Section 2), your parser should output SYNTAX ERROR and nothing else, so you should make sure not to print anything before the complete parsing of the input is completed.
if the input to your program is syntactically correct, then there are two cases to consider:
- If any of the regular expressions of the tokens in the list of tokens in the input to your program can generate the empty string, then your program should output

EPSILON IS NOOOOOT A TOKEN !!! tok_1 tok_2 … tok_k

where _{tok 1}, _{tok 2}, _…, _{tok k}is the list of tokens whose regular expressions can generate the empty string.

If there is no syntax error and none of the expressions of the tokens can generate the empty string, your program should do lexical analysis on _{INPUT TEXT}and produce a sequence of tokens and lexemes in _{INPUT TEXT}according to the list of tokens specified

in the input to your program. Each token and lexeme should be printed on a separate line. The output on a given line will be of the form

t , “lexeme”

where t is the name of a token and lexeme is the actual lexeme for the token t. If during lexical analysis of _{INPUT TEXT}, a syntax error is encountered then ERROR is printed on a separate line and the program exits.

In doing lexical analysis for _{INPUT TEXT}, _SPACEis treated as a separator and is otherwise ignored.

Note 2 The _my_GetToken()that you will write is a general function that takes a list of token representations and does lexical analysis according to those representations. In later sections, I explain how that can be done, so do not worry about it yet, but keep in mind that you will be writing a general _{my GetToken()}function.

Examples Each of the following examples gives an input and the corresponding expected output.

t1 (a)|(b) , t2 ((a)*).(a) , t3 (((a)|(b))*).(((c)*).(c)) #

“a aac bbc aabc”

This input specifies three tokens _t1, _t2, and _t3and an INPUT TEXT “a aac bbc aabc”. Since the input is in the correct format and none of the regular expressions generates epsilon, the output of your program should be the list tokens in the INPUT TEXT:

t1 , “a” t3 , “aac” t3 , “bbc” t3 , “aabc”

t1 (a)|(b) , t2 ((a)*).(a) , t3 (((a)|(b))*).(c) #

“a aa bbc aad aa”

This input specifies three tokens _t1, _t2, and _t3and an INPUT TEXT “a aa bbc aad aa”. Since the input is in the correct format and none of the regular expressions generates epsilon, the output of your program should be the list tokens in the INPUT TEXT the output of the program should be

t1 , “a” t2 , “aa” t3 , “bbc” t2 , “aa”

ERROR

Note that doing lexical analysis for INPUT TEXT according to the list of tokens produces ERROR after the second _t2token because there is no token that starts with _’d’.

t1a (a)|(b) , t2bc (a).((a)*) , t34 (((a)|(b))*).((c)|(d))# “aaabbcaaaa”

This input specifies three tokens whose names are _t1a, _t2bc, and _t34and an input text “aaabbcaaaa”. Since the input is in the correct format, the output of your program should be the list tokens in the INPUT TEXT:

t34 , “aaabbc” t2bc , “aaaa”

t1 (a)|(b) , t2 ((a)*).(a) , t3 (a)*, t4 b , t5 ((a)|(b))* #

“a aac bbc aabc”

This input specifies five tokens and an INPUT TEXT “a aac bbc aabc”. Since some of the regular expressions can generate epsilon, the output:

EPSILON IS NOOOOOT A TOKEN !!! t3 t5

5 Requirements and Grading

You should write a program to produce the correct output for a given input as described above. You will be provided with a number of test cases. Since this is the second project, the number of test cases provided with the project will be small relative to the number of test cases provided for project 1. In your solution, you are not allowed to use any built-in or library support for regular expressions in C/C++. This requirement will be enforced by checking your code.

The grade is broken down as follows

Submission compiles and every function has comments and every file has your name: ₁₀points
Submission does not compile or some functions have no comments or some submitted file does not have your name: no credit for the submission.
Syntax checking: _{10 points}(no partial credit for this)
EPSILON IS NOOOOOT A TOKEN !!! error: _{15 points}(grade is strictly proportional to the number of test cases that your program successfully passes)
Lexical analysis of _INPUT_TEXT: _{65 points}(grade is strictly proportional to the number of test cases that your program successfully passes)

The compiler and environment used is the same as for project 1, so refer to project 1 document for the information.

Note 3 If your code does not compile on the submission website, you will not receive any points, not even for documentation.

6 How to Implement a Solution

The main difficulty in coming up with a solution is to transform a given list of token names and their regular expression descriptions into a _my_GetToken()function for the given list of tokens. This transformation will be done in three high-level steps:

Transform regular expressions into REGs. The goal here is to parse a regular expression description and generate a graph that represents the regular expression^[1]. The generated graph will have a specific format and I will describe below how to generate it. I will call it a regular expression graph, or REG for short.
Write a function _match(r,s,p), where _ris a REG , _sis a string and _pis a position in the string _s. The function match will return the longest possible lexeme starting from position _pin the string _sthat matches the regular expression of the graph _r.
Write a class my LexicalAnalyzer(list,s) where list is a list of structures of the form {token name,reg pointer_}and s is an input string. my LexicalAnalyzer stores the list of structures and keeps track of the part of the input string that has been processed. The class myLexicalAnalyzer has a method my GetToken(). For every call of my GetToken(), match(r,s,p) is called for every REG _rin the list starting from the current position _pmaintained in my LexicalAnalyzer. my GetToken() returns the token with the longest matching prefix together with its lexeme and updates the current position. If the longest matching prefix matches more than one token, the matched token that is listed first in the list of tokens is returned.

In what follows I describe how a regular expression description can be transformed into a REG and how to implement the function _match(r,s,p). But first, I will give an overview of regular expressions and the sets of strings they represent.

6.1 Set of Strings Represented by Regular Expressions

A regular expression is a compact representation of a set, possibly infinite, of strings. For a given regular expression, we say that expression can _generatea string if the string is in set that is represented by the regular expression. We start with a general description, then we give examples.

6.1.1 General description

We start with the simple expressions (the base cases)

(_{One-character strings}) The regular expression _arepresents the set of strings {a_}, that is the set consisting of only the string _“a”.
(_Epsilon) The regular expression represents the set of strings {_}, that is the set consisting of only the string (which is the empty string).

For the inductive step (recursion of your parser), there are four cases:

(_Parentheses) If _Ris a regular expression, the regular expression _(R)represents the same set of strings that _R The parentheses are used for grouping and to facilitate parsing and do not have a meaning otherwise.
(_{Concatenation}) If _R1and _R2are regular expressions that represents sets of strings _S1and _S2respectively, then _(R1).(R2)represents a new set of strings that can be obtained by concatenating one string from _S1with one string from _S2(order matters).
(_Union) If _R1and _R2are regular expressions that represents sets of strings _S1and _S2respectively, then _(R1)|(R2)represents the union of the two sets of strings _S1and _S2. • (_{Kleene star}) The last case is the most interesting because it allows us unlimited number of repetition. If _Ris a regular expression that represents the set of strings _S, then _(R)*represents the set of strings that can be obtained by concatenating any number of strings from _S, including zero strings (which gives us epsilon).

6.1.2 Examples

The set of strings represented by _ais

{a}

The set of strings represented by _bis

{b}

The set of strings represented by _(a)|(b)is

{a, b}

The set of strings represented by _{((a)|(b)).(c)}is

{ac, bc}

The set of strings represented by ((a)|(b)).((c)|(d)) is

{ac, ad, bc, bd}

The set of strings represented by ((c)|(d)).((a)|(b)) is

{ca, cb, da, db}

The set of strings represented by _(a)*is

{, a, aa, aaa, aaaa, …}

The set of strings represented by _(b)*is

{, b, bb, bbb, bbbb, …}

Figure 1: Regular expressions graphs for the base cases

Figure 2: Regular expression graph for the an expression obtained using the dot operator

The set of strings represented by _(a)|((b)*)is

{a, , b, bb, bbb, bbbb, …}

The set of strings represented by _{((a)*)|((b)*)}is

{,a,b,aa,bb,aaa,bbb,aaaa,bbbb,…}

The set of strings represented by _((a)|(b))*is

{, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, …}

6.2 Constructing REGs

The construction of REGs is done recursively. The construction we use is called Thompson’s construction. Each REG has a one _startnode and one _acceptnode. For the base cases of epsilon and _a, where _ais a character of the alphabet, the REGs are shown in Figure 1. For the recursive cases, the constructions are shown in Figures 2, 3, and 4. An example REG for the regular expression _{((a)*).((b)*)}is shown in Figure 5.

Figure 3: Regular expression graph for the an expression obtained using the or operator

Figure 4: Regular expression graph for the an expression obtained using the star operator

6.2.1 Data Structures and Code for REGs

In the construction of REGs, every node has at most two outgoing arrows. This will allow us to use a simple representation of a REG node.

Figure 5: Regular expression graph for the an expression obtained using concatenation and star operators

struct REG_node { struct REG_node * first_neighbor; char first_label; struct REG_node * second_neighbor; char second_label;

}

In the representation, first neighbor is the first node pointed to by a node and second neighbor is the second node pointed to by a node. _{first label}and _{second label}are the labels of the arrows from the node to its neighbors. If a node has only one neighbor, then _{second neighbor}will be NULL. If a node has no neighbors, then both first neighbor and second neighbor will be NULL.

Figure 6: Data structure representation for the REG of _{((a)*).((b*))}

struct REG { struct REG_node * start; struct REG_node * accept;

}

In the your parser, you should write a function _{parse expr()}that parses the regular expressions returns the REG of the regular expression that is parsed. The construction of REGs is done recursively. An outline of the process is shown on the next page.

struct REG * parse_expr() {

// if expression is UNDERSCORE or a CHAR, say ’a’ for example

// create a REG for the expression and return a pointer to it

// (see Figure 1, for how the REG looks like)

// if expression is (R1).(R2)

// the program will call parse_expr() twice, once

// to parse R1 and once to parse R2

// Each of the two calls will return a REG, say they are

// r1 and r2

// construct a new REG r for (R1).(R2) using the

// two REGs r1 and r2

// (see Figure 2 for how the two REGs are combined)

// return r

// the cases for (R1)|(R2) and (R)* are similar and

// are omitted from the description

}

6.2.2 Detailed Examples for REG Construction

I consider the regular expression _{((a)*).((b)*)}and explain step by step how its REG is constructed (Figure 5).

When parsing _{((a)*).((b)*)}, the first expression to be fully parsed and its REG is constructed is _a(Figure 1). In Figure 5, the nodes for the REG of the regular expression _ahave numbers 1 and 2 to indicate that they are the first two nodes to be created.

The second expression to be fully parsed and its REG constructed when parsing _{((a)*).((b)*)}is _(a)*. The REG for _(a)*is obtained from the REG for the regular expression _aby adding two more nodes (3 and 4) and adding the appropriate arrows as described in the general case in Figure 4.

The starting node for the REG of _(a)*is the newly created node 3 and the accepting node is the newly created node 4.

The third regular expression to be fully parsed while parsing _{((a)*).((b)*)}is the regular expression _b. The REG for regular expression _bis constructed as shown in Figure 1. The nodes for this REG are numbered 5 and 6.

The fourth regular expression to be fully parsed while parsing _{((a)*).((b)*)}is _(b)*. The REG for _(b)*is obtained from the REG for the regular expression _bby adding two more nodes (7 and 8) and adding the appropriate arrows as described in the general case in Figure 4. The starting node for the REG of _(b)*is the newly created node 7 and the accepting node is the newly created node 8.

Finally, the last regular expression to be fully parsed is the regular expression _{((a)*).((b)*)}.

The REG of _{((a)*).((b)*)}is obtained from the REGs of _(a)*and _(b)*by creating a new REG whose initial node is node 3 and whose accepting node is node 8 and adding an arrow from node 4 (the accepting node of the REG of _(a)*) to node 7 (the initial node for the REG of _(b)*).

Another example for the REG of _{(((a)*).((b).(b)))|((a)*)}is shown in Figures 8 and 9. In the next section, I will use REG of (((a)*).((b).(b)))|((a)*) to illustrate how match(r,s,p) can be implemented.

6.3 Implementing match(r,s,p)

Given an REG _r, a string _sand a position _pin the string _s, we would like to determine the longest possible lexeme that matches the regular expression for _r.

As you will see in CSE355, a string _wis in L(_R) for a regular expression _Rwith REG _rif and only if there is a path from the starting node of _rto the accepting node of _rsuch that _wis equal to the concatenation of all labels of the edges along the path. I will not go into the details of the equivalence in this document. I will describe how to find the longest possible substring _wof _sstarting at position _psuch that there is a path from the starting node of _rto the accepting node of _rthat can be labeled with _w.

To implement _match(r,s,p), we need to be able to determine for a given input character _aand a set of nodes _Sthe set of nodes that can be reached from nodes in _Sby _consuming_a. To consume _awe can traverse any number of edges labeled ’ ’, traverse one edge labeled _a, then traverse any number of edges labeled ’ ’. To match one character, you will implement a function called _Match_{One Char()}shown in Figure 7. For a given character _cand a given set of nodes

Match_One_Char(set_of_nodes S, char c) returns set_of_nodes

{

// 1. find all nodes that can be reached from S by consuming c

// S’ = empty set

// for every node n in S

// if ( (there is an edge from n to m labeled with c) &&

// ( m is not in S’) ) {

// add m to S’

// }

// if (S’ is empty)

// return empty set

// At this point, S’ is not empty and it contains the nodes that

// can be reached from S by consuming the character c directly

// 2. find all nodes that can be reached from the resulting

// set S’ by consuming no input

// changed = true

// S’’ = empty set

// while (changed) {

// changed = false

// for every node n in S’ {

// add n to S’’

// for ever neighbor m of n {

// if ( (the edge from n to m labeled with ’_’) &&

// ( m is not in S’’) )

// add m to S’’

// }

// if (S’ not equal to S’’) {

// changed = true;

// S’ = S’’

// S’’ = empty set

// }

// at this point the set S’ contains all nodes that can be reached

// from S by first consuming C, then traversing 0 or more epsilon

// edges

// return S’

}

Figure 7: Pseudocode for matching one character

Figure 8: Regular expression graph ((a)*).((b).(b))

S, _Match_{One Char()}will find all the nodes that can be reached from _Sby consuming the single character _c.

In order to match a whole string, we need to match the characters of the strings one after another. At each step, the solution will keep track of the set of nodes _Sthat can be reached by consuming the prefix of the input string that has been processed so far.

To implement _match(r,s,p), we start with the set of nodes that can be reached from the starting node of _rby consuming no input. Then we repeatedly call _{Match One}_Char()for successive characters of the string _sstarting from position _puntil the returned set of nodes _Sis empty or we run out of input. If at any point during the repeated calls to _Match_{One Char()}the set _Sof nodes contains the accepting node, we note the fact that the prefix of string _sstarting from position _pup to the current position is matching. At the end of the calls to _{Match One}_Char()when _Sis empty or the end of input is reached, the last matched prefix is the one returned by _match(r,s,p). If none of the prefixes are matched, then there is no match for _rin _sstarting at _p.

Note 4

The algorithms given above are not the most efficient, but they are probably the simplest to implement the matching functions.
The algorithm uses sets, so you need to have a representation for a set of nodes and to do operations on sets of nodes.

6.4 Detailed Example for Implementing match(r,s,p)

In this section, I illustrate the steps of an execution of _match(r,s,p)on the REG of

(((a)*).((b).(b)))|((a)*) shown in Figure 9. The input string we will consider is the string _{s = “}_a_a_ba”and the initial position is _{p = 0}.

Figure 9: Regular expression graph _{(((a)*).((b).(b)))|((a)*)}

Initially, the set of states that can reached by consuming no input starting from node 17 is

S₀= _{17,3,1,4,9,15,13,16,18_}

Note that _S₀contains node ₁₈which means that the empty string is a matching prefix. This means that this expression should result in a “_{EPSILON IS NOOOOOT A TOKEN !!!}” error if it is used in a token specification.

Consuming _a.

The set of states that can be reached by consuming _astarting from _S₀is

S₁= _{2,14}

The set of states that can be reached by consuming no input starting from _S₁is

S₁ = _{2,1,4,9,14,13,16,18_}

Note that _S₁ contains node ₁₈, which means that the prefix _“_a_”is a matching prefix.

Consuming _a

The set of states that can be reached by consuming _astarting from _S₁ is

S₂= _{2,14}

The set of states that can be reached by consuming no input starting from _S₂is

S₂ = _{2,1,4,9,14,13,16,18_}

Note that _S₂ contains node ₁₈, which means that the prefix _“_a_a_”is a matching prefix.

Consuming _b

The set of states that can be reached by consuming _bstarting from _S₂ is

S₃= _{10}

The set of states that can be reached by consuming no input starting from _S₃is

S₃ = _{10,11}

Note that _S₃does not contain node ₁₈which means that _“_a_a_b”is not a matching prefix, but is still a viable prefix, which means that there is hope we can read more characters that will turn it into a matching prefix.

Consuming _a

The set of states that can be reached by consuming _astarting from _S₃ is

S4 = {}

Since _S₄is empty, _“_a_a_ba”is not viable and we stop.

The longest matching prefix is _a_a. This is the lexeme that is returned. Note that the second call to _match(r,s,p)starting after _“_a_a_”will return ERROR.

[1] The graph is a representation of a non-deterministic finite state automaton

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

[SOLVED] CSE340 Project 2: Generating a lexical analyzer automatically!

CSE340S21_Project2

Introduction

3 Input Format

3.1 Examples

4 Output Format

5 Requirements and Grading

6 How to Implement a Solution

6.1 Set of Strings Represented by Regular Expressions

6.1.1 General description

6.1.2 Examples

6.2 Constructing REGs

6.2.1 Data Structures and Code for REGs

6.2.2 Detailed Examples for REG Construction

6.3 Implementing match(r,s,p)

6.4 Detailed Example for Implementing match(r,s,p)

Reviews

Related products

[Solved] CSE340 Project 1: A Simple Compiler!

[SOLVED] CSE340 Project 1: Lexical Analysis

[Solved] SOLVED:CSE340 Project 2: FIRST & FOLLOW

[SOLVED] CSE340 HOMEWORK 1

[Solved] SOLVED:a small compiler that will read an input program and represents it in an internal data structure

[Solved] CSE340 Project 1: A Simple Compiler!!