, , , , ,

[SOLVED] CSE340 Project 1: Lexical Analysis Solution

$25

File Name: CSE340_Project_1:_Lexical_Analysis_Solution.zip
File Size: 405.06 KB

5/5 - (1 vote)

The goal of this project is to show you how a description of a list of tokens can be used to automatically generate a lexical analyzer. The input to your program will be a description of tokens using regular expressions together with a text. Your program will generate data structures that will allow it to do lexical analysis on the text according to the tokens. The output of the program will be a list of tokens or an error message if the input does not have the correct format.The input of your program is specified by the following context-free grammar:WhereIn the description of regular expressions, UNDERSCORE represents epsilon.The following are examples of input.a aa bb aabThis input specifies three tokens t1, t2, and t3 and an INPUT_TEXT a aa bb aab.a aa bb aad aaThis input specifies two tokens t1, t2, and an INPUT_TEXT a aa bb aad aa.aaabbcaaaaThis input specifies three tokens whose names are t1, t2, and t3 and an INPUT_TEXT aaabbcaaaa.The input format can be confusing because there are two kinds of tokens.You should not confuse the two kinds of tokens.The output will be a sequence of tokens and their corresponding lexemes according to the list of tokens provided in the input or SYNTAX ERROR:where t is the name of a token and lexeme is the actual lexeme for the token t. If during lexical analysis of INPUT_TEXT, a syntax error is encountered then ERROR is printed on a separate line and the program exits.In doing lexical analysis for INPUT_TEXT, SPACE is treated as a separator and is otherwise ignored.Each of the following examples gives an input and the corresponding expected output.a aa bb aabThis input specifies three tokens t1, t2, and t3 and an INPUT_TEXT a aa bb aab. Since the input is in the correct format, the output of your program should be the list tokens in the INPUT_TEXT: t1 , a t2 , aa t3 , bb t3 , aaba aa bb aad aaSince the input is in the correct format, the output of your program should be the list tokens in the INPUT_TEXT the output of the program should bet1 , a t2 , aa t3 , bb t2 , aaERRORNote that the input is in the correct format, but doing lexical analysis for INPUT_TEXT according to the list of tokens produces ERROR after the second t2 token because there is no token that starts with d.This input specifies three tokens whose names are t1a, t2bc, and t34 and an input text aaabbcaaaa. Since the input is in the correct format, the output of your program should be the list tokens in the INPUT_TEXT:t34 , aaabbc t2bc , aaaaYou should write a program to generate the correct output for a given input as described above.You will be provided with a number of test cases. Since this is the first project, the number of test cases provided with the project will be relatively large.In your solution, you are not allowed to use any built-in or library support for regular expressions in C/C++. This requirement will be enforced by checking your code.The main difficulty in coming up with a solution is to transform a given list of token names and their regular expression descriptions into a getToken() function for the given list of tokens. This transformation will be done in three high-level steps:graph, or REG for short.Figure 1: Regular expressions graphs for the base casesFigure 2: Regular expression graph for the an expression obtained using the dot operatorThe construction of REGs is done recursively. The construction we use is called Thompsons construction. Each REG has a one starting node and one final node. For the base cases of epsilon and a, the REGs are shown in Figure 1. For the recursive cases, the constructions are shown in Figures 2, 3, and 4. An example REG for the regular expression ((a)*).((b)*) is shown in Figure 5.In the construction of REGs, every node has at most two outgoing arrows. This will allow us to use a simple representation of a REG node.Figure 3: Regular expression graph for the an expression obtained using the or operatorFigure 4: Regular expression graph for the an expression obtained using the star operatorstruct REG_node { struct REG_node * first-neighbor; char first_label;struct REG_node * second_neighbor; char second_label; }In the representation, first_neighbor is the first node pointed to by a node and second_neighbor is the second node pointed to by a node. first_label and second_label are the labels of the arrows from the node to its neighbors. If a node has only one neighbor, then second_neighbor will be NULL. If a node has no neighbors, the both first_neighbor and second_neighbor will be NULL.struct REG { struct REG_node * starting;struct REG_node * accepting; // note that I changed final to accepting because final // is reserved in C++ }Figure 5: Regular expression graph for the an expression obtained using concatenation and star operatorsThe function parse_regular_expr() should not only parse a regular expression, but should also return the REG of the regular expression that is parsed. The construction of REGs is done recursively. An outline of the process is shown on the next page.struct REG * parse_regular_expression(){// if expression is UNDERSCORE or a CHAR, say a for example// create a REG for the expression and return it// (see Figure 1, for how the REG looks like)// if expression is (R1).(R2)// the program will call parse_regular_expr() twice, once// to parse R1 and once to parse R2//// Each of the two calls will return a REG, say they are// r1 and r2//// construct a new REG r for (R1).(R2) using the// two REGs r1 and r2// (see Figure 2 for how the two REGs are combined)//// return r//// the cases for (R1)|(R2) and (R)* are similar}I consider the regular expression ((a)*).((b)*) and explain step by step how its REG is constructed (Figure 5).When parsing the ((a)*).((b)*), the first expression to be parsed is a and its REG is constructed according to Figure 1. In Figure 5, the nodes for the REG of the regular expression a have numbers 1 and 2 to indicate that they are the first two nodes to be created.The second expression to be parsed when parsing ((a)*).((b)*) is (a)*. The REG for (a)* is obtained from the REG for the regular expression a by adding two more nodes (3 and 4) and adding the appropriate arrows as described in the general case in Figure 4. The starting node for the REG of (a)* is the newly created node 3 and the accepting node is the newly created node 4.The third regular expression to be parsed while parsing ((a)*).((b)*) is the regular expression b. The REG for regular expression b is constructed as shown in Figure 1. The nodes for this REG are numbered 5 and 6. The fourth regular expression to be parsed while parsing ((a)*).((b)*) is (b)*. The REG for (b)* is obtained from the REG for the regular expression b by adding two more nodes (7 and 8) and adding the appropriate arrows as described in the general case in Figure 4. The starting node for the REG of (b)* is the newly created node 7 and the accepting node is the newly created node 8.Finally, the last regular expression to be parsed is the regular expression ((a)*).((b)*). The REG of ((a)*).((b)*) is obtained from the REGs of (a)* and (b)* by creating a new REG whose initial node is node 3 and whose accepting node is node 8 and adding an arrow from node 4 (the accepting node of the REG of (a)*) to node 7 (the initial node for the REG of (b)*).Another example for the REG of (((a)*).((b).(b)))|((a)*) is shown in Figures 7 and 8. In the next section, I will use this example to illustrate how match(r,s,p) can be implemented.Given an REG r, a string s and a position p in the string s, we would like to determine the longest possible lexeme that matches the regular expression for r.As you will see in CSE355, a string w is in L(R) for a regular expression R with REG r if and only if there is a path from the starting node of r to the accepting node of r such that w is equal to the concatenation of all labels of the edges along the path. I will not into the details of the equivalence in this document. I will describe how to find the longest possible substring w of s starting at position p such that there is a path from the starting node of r to the accepting node of r that can be labeled with w.To implement match(r,s,p), we need to be able to determine for a given input character a and a set of nodes S the set of nodes that can be reached from nodes in S by consuming a. To consume a we can traverse any number of edges labeled _, traverse one edge labeled a, then traverse any number of edges labeled _.At each step, the solution will keep track of the set of nodes S that can be reached by consuming a prefix of the input string so far. Initially S is the set of nodes that can be reached from the starting node by not consuming any input (empty prefix).For a given character a and a given set S, the solution will find all the nodes that can be reached from S by consuming a. This will be implemented in a function that can be called match_one_char() shown in Figure 6.To implement match(r,s,p), we start with the set of nodes that can be reached from the starting node of r by consuming no input. Then we repeatedly call match_one_char() for successive characters of the string s starting from position p until the returned set of nodes S is empty or we run out of input. If at any point during the repeated calls to match_one_char() the set S of nodes contains the accepting node, we note the fact that the prefix of string s starting from position p up to the current position is matching. At the end of the calls to match_one_char() when S is empty or the end of input is reached, the last matched prefix is the one returned by match(r,s,p). If none of the prefixes are matched, then there is no match for r in s starting at p.set_of_nodes match_one_char(set_of_nodes S, char c){// 1. find all nodes that can be reached from S by consuming c//// S = empty set// for every node n in S// if ( (there is an edge from n to m labeled with c) &&// ( m is not in S) ) {// add m to S// }//// if (S is empty)// return empty set//// At this point, S is not empty and it contains the nodes that// can be reached from S by consuming the character c directly////// 2. find all all nodes that can be reached from the resulting// set S by consuming no input//// changed = true// while (changed) { // changed = false// for every node n in S// if ( (there is an edge from n to m labeled with _) &&// ( m is not in S) ) {// add m to S// changed = true// }// }//// at this point the set S contains all nodes that can be reached // from S by first consuming C, then traversing 0 or more epsilon// edges//// return S}Figure 6: Pseudocode for matching one character.Note. The algorithms given above are not the most efficient, but they are probably the simplest to implement the matching functions.In this section, I illustrate the steps of an execution of match(r,s,p) on the REG of (((a)*).((b).(b)))|((a)*) shown in Figure 8. The input string we will consider is the string s = aaba and the initial position is p =The set of states that can reached by consuming no input starting from node 17 is S0 = {17,3,1,4,9,15,13,16,18} Note that S0 contains node 18 which means that the empty string is a matching prefix.The set of states that can be reached by consuming a starting from S0 is S1 = {2,14}Figure 7: Regular expression graph ((a)*).((b).(b))Figure 8: Regular expression graph (((a)*).((b).(b)))|((a)*)The set of states that can be reached by consuming no input starting from S1 is S1_ = {2,1,4,9,14,13,16,18} Note that S1_ contains node 18, which means that the prefix a is a matching prefix. 3. Consuming aThe set of states that can be reached by consuming a starting from S1_ is S2 = {2,14}The set of states that can be reached by consuming no input starting from S2 is S2_ = {2,1,4,9,14,13,16,18} Note that S2_ contains node 18, which means that the prefix aa is a matching prefix.The set of states that can be reached by consuming b starting from S2_ is S3 = {10}The set of states that can be reached by consuming no input starting from S3 is S3_ = {10,11}Note that S3_ does not contain node 18 which means that aab is not a matching prefix, but is still a viable prefix.The set of states that can be reached by consuming a starting from S3_ is S4 = {} Since S4 is empty, aaba is not viable and we stop.The longest matching prefix is aa. This is the lexeme that is returned. Note that the second call to match(r,s,p) starting after aa will return ERROR.Follow these steps:Note that you are required to compile and test your code on CentOS 7 using the GCC compiler version 4.8.5. You are free to use any IDE or text editor on any platform, however, using tools available in CentOS (or tools that you could install on CentOS) could save time in the development/compile/test cycle.The submissions are evaluated based on the automated test cases on the submission website. Your grade will be proportional to the number of test cases passing. If your code does not compile on the submission website, you will not receive any points.NOTE: The next two sections apply to all programming assignments.You should use the instructions in the following sections to compile and test your programs for all programming assignments in this course.You should compile your programs with the GCC compilers which are available in CentOS 7. GCC is a collection of compilers for many programming languages. There are separate commands for compiling C and C++ programs:$ g++ test_program.cppIf the compilation is successful, it will generate an executable file named a.out in the same folder as the program. You can change the output file name by specifying the -o option:$ g++ test_program.cpp -o hello.outTo enable C++11 with g++, use the -std=c++11 option:$ g++ -std=c++11 test_program.cpp -o hello.outThe following table summarizes some useful GCC compiler options:You can find a comprehensive list of GCC options in the following page:https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/If your program is written in multiple source files that should be linked together, you can compile and link all files together with one command:$ g++ file1.cpp file2.cpp file3.cppOr you can compile them separately and then link:$ g++ -c file1.cpp $ g++ -c file2.cpp $ g++ -c file3.cpp$ g++ file1.o file2.o file3.oThe files with the .o extension are object files but are not executable. They are linked together with the last statement and the final executable will be a.out.You can replace g++ with gcc in all examples listed above to compile C programs.Your programs should not explicitly open any file. You can only use the standard input e.g. std::cin in C++, getchar(), scanf() in C and standard output e.g. std::cout in C++, putchar(), printf() in C for input/output.However, this restriction does not limit our ability to feed input to the program from files nor does it mean that we cannot save the output of the program in a file. We use a technique called standard IO redirection to achieve this.Suppose we have an executable program a.out, we can run it by issuing the following command in a terminal (the dollar sign is not part of the command):$ ./a.outIf the program expects any input, it waits for it to be typed on the keyboard and any output generated by the program will be displayed on the terminal screen.To feed input to the program from a file, we can redirect the standard input to a file:$ ./a.out < input_data.txtNow, the program will not wait for keyboard input, but rather read its input from the specified file. We can redirect the output of the program as well:$ ./a.out > output_file.txtIn this way, no output will be shown in the terminal window, but rather it will be saved to the specified file.Note that programs have access to another standard stream which is called standard error e.g. std::cerr in C++, fprintf(stderr, ) in C. Any such output is still displayed on the terminal screen. It is possible to redirect standard error to a file as well, but we will not discuss that here. Finally, its possible to mix both into one command:$ ./a.out < input_data.txt > output_file.txtWhich will redirect standard input and standard output to input_data.txt and output_file.txt respectively.Now that we know how to use standard IO redirection, we are ready to test the program with test cases.A test case is an input and output specification. For a given input there is an expected output. A test case for our purposes is usually represented by two files:The input is given in test_name.txt and the expected output is given in test_name.txt.expected. To test a program against a single test case, first we execute the program with the test input data:$ ./a.out < test_name.txt > program_output.txtThe output generated by the program will be stored in program_output.txt. To see if the program generated the expected output, we need to compare program_output.txt and test_name.txt.expected. We do that using a general purpose tool called diff:$ diff -Bw program_output.txt test_name.txt.expectedThe options -Bw tell diff to ignore whitespace differences between the two files. If the files are the same(ignoring the whitespace differences), we should see no output from diff, otherwise, diff will produce a report showing the differences between the two files.We would simply consider the test passed if diff could not find any differences, otherwise we consider the test failed.Our grading system uses this method to test your submissions against multiple test cases. There is also a test script accompanying this project test1.sh which will make your life easier by testing your code against multiple test cases with one command.Here is how to use test1.sh to test your program:NOTE: the actual file name is probably different, you should replace test_cases.zip with the correct file name.The output of the script should be self explanatory. To test your code after you make changes, you will just perform the last two steps (compile and run test1.sh).[1] The graph is a representation of a non-deterministic finite state automaton

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] CSE340 Project 1: Lexical Analysis Solution
$25