5/5 – (2 votes)
In this programming assignment, you will gain experience with an advanced tree data structure that is used to store strings, and which allows for efficient insertion and lookup: a trie!By completing this assignment and reflecting upon the awesomeness of tries, you will fortify your knowledge of algorithms and data structures and solidify your mastery of many C programming topics you have been practicing all semester: dynamic memory allocation, file I/O, processing command line arguments, dealing with structs and pointers to structs, and so much more.In the end, you will have implemented a tremendously useful data structure that has many applications in text processing and corpus linguistics.
AttachmentsTrie.h, corpus{01-05}.txt, input{01-05}.txt, output{01-05}.txt, printTrie.txtDeliverablesTrie.c1. Overview: TriesWe have seen in class that the trie data structure can be used to store strings. It provides for efficient string insertion and lookup; insertion into a trie is O(k) (where k is the length of the string being inserted), and searching for a string is an O(k) operation (worst-case). In a trie, a node does not store the string it represents; rather, the edges taken to reach that node from the root indicate the string it represents. Each node contains a flag (or count) variable that indicates whether the string represented by that node has been inserted into the trie (or how many times it has been inserted). For example:
Figure 1:This is a trie that codifies the words and, cherry, chocolate, like, and pie. The string pie is represented (or counted) twice. All other strings are counted only once.00 0 1 0 00010201ahdpiei likec0n0 0e o0 00 01 0rrycol001aterootpie (2)like (1)chocolate (1)cherry (1)and (1)1.1. TrieNode Struct (Trie.h)In this assignment, you will insert words from a corpus (that is, a body of text from an input file) into a trie. The struct you will use for your trie nodes is as follows:typedef struct TrieNode{// number of times this string occurs in the corpusint count;// 26 TrieNode pointers, one for each letter of the alphabetstruct TrieNode *children[26];// the co-occurrence subtrie for this stringstruct TrieNode *subtrie;} TrieNode;You must use this trie node struct, which is specified in Trie.h without any modifications. You should#include the header file from Trie.c like so:#include Trie.hNotice that the trie node, because it only has 26 children, represents strings in a case insensitive way(i.e., apple and AppLE are treated the same in this trie).1.2. Co-occurrence SubtriesWords that appear together in the same sentence are said to co-occur. Consider, for example, the following sentence:I like cherry pie and chocolate pie .In the example sentence, the word chocolate co-occurs with the following terms (with counts in parentheses): and (1), cherry (1), i (1), like (1), pie (2).For the purposes of this assignment, we will never consider a word to co-occur with itself. So, in the example sentence above, pie co-occurs with: and (1), cherry (1), chocolate (1), i (1), like (1). Notice that pie itself is not in the list of words co-occurring with pie.If we place these terms (and their associated counts) into their own tries, those tries will look like this:Figure 2:Co-occcurrence subtries for chocolate (left) and pie (right), based on the sentence, I like cherry pie and chocolate pie.In Figure 2, the trie on the left is what we will call the co-occurrence subtrie for the word chocolate, based on the example sentence above (I like cherry pie and chocolate pie). Its root should be stored in the subtrie pointer field of the node marked chocolate in the original trie diagram in Figure 1 (see page 2 above).Similarly, the trie on the right in Figure 2 is the co-occurrence subtrie for the word pie. Its root should be stored in the subtrie pointer field of the node marked pie in the original trie diagram in Figure 1 (see page 2, above).Within these subtries, all subtrie pointers should be initialized to NULL, because we will NOT produce sub-subtries in this assignment!00 0 1 0 00010201ahdpiei likec0n0e001rryroot 00 0 1 000101ahdi likec0n0e001rryroot00c000olao01et2. Input Files and Output Format2.1. Command Line ArgumentsYour program will take two command line arguments, specifying two files to be read at runtime:./a.out corpus01.txt input01.txtThe first filename specifies a corpus that will be used to construct your trie and subtries. The second filename specifies an input file with strings to be processed based on the contents of your trie.For more information on processing command line arguments with argc and argv, see the instructions from the Program #3 PDF. You can assume that I will always specify valid input files when I run your program.2.2. Corpus File2.2.1 FormatThe corpus file contains a series of sentences. Each line contains a single sentence with at least 1 word and no more than 20 words. Each word contains fewer than 1024 characters. Each sentence is terminated with a single period, and that period is always preceded by a space. For example:corpus01.txt:I like cherry pie and chocolate pie .2.2.2 Building the Main TrieFirst and foremost, each word from the corpus file should be inserted into your trie. If a word occurs multiple times in the corpus, you should increment count variables accordingly. For example, the trie inFigure 1 (see page 2, above) corresponds to the text given in the corpus01.txt file above.
Reviews
There are no reviews yet.