Part 1: Becoming familiar with the code
HashTable code has been given to you. No testing program has been provided. To become familiar with how the code works, try reading in a small input file and make sure you can create a hash table of those entries.
Testing: Make sure the following works:
- Insert values
- Delete values
- Find values
- Printing the contents of the hash table.
- Control the size of the hash table.
What happens if you attempt to delete an item that isnt there?
What if you add more things than can fit into the hash table?
Modification: You have been given the hash table code from your text for doing quadratic probing. Modify the code so it takes the key and the associated object as two separate parameters.
Please retain the generic structure of the hash table. Be careful not to add anything to the hash table that only makes sense for this specific problem.
Part 1 is for your benefit. The code you turn in does not need to show the results of this experimentation.
Part 2 Using the code to solve a bigger problem
This assignment is designed to give you experience with hash tables. In this assignment, you will randomly create a new verse of poetry (based on the pattern seen in the existing poem).
You will fit a bi-word language model to English and then use it to generate a poem. A uni-word model of English consists of a single probability distribution P(W) over the set of all words.
A bi-word model of English consists of two probability distributions: P(W) and P(Wi+1 | Wi). The first distribution is just the probability a word in a document. The second distribution is the probability of seeing word Wi+1 given that the previous word was Wi.
Given a set of documents (in our case, various poems), your job in this assignment is to generate a poem from a given word. The method is simple. You just need to decide which word should follow a given word. The word selected is based on the probability in which it occurs (in the file) after the last word used. So for example, in the text below, if I ask, What should follow sam?, you pick I most of the time (7 out of 8 times) and sam the rest of the time. No other word follows sam.
I am Sam. I am Sam. Sam I Am.
That Sam I Am! That Sam I Am! I do not like that Sam I Am!
Do you like green eggs and ham?
I do not like them, Sam I Am.
I do not like green eggs and ham.
Would you like them here or there?
I would not like them here or there.
I would not like them anywhere.
I do not like green eggs and ham.
I do not like them, Sam I Am.
Thus you need to keep track of the number of times that each word follows every other word (so you can figure probabilities).
Keep track of all words (and their probabilities) using a hash table in which you hash on word W. For the input above, partial contents of the hash table are shown below. Thus, do occurs 6 times in the file. Five times do is followed by not and one time it is followed by you
Similarly, for the input file, when you see the word sam, seven times out of 8 i is the next word. One time out of 8, sam is the next work.
Given the information in this data structure, we can compute the conditional probability that i occurs once you are looking at word sam P(i|sam) as 7/8. Similarly, we can compute the probability P(ham|and) as 3/3 = 1. When I ask what should follow sam, you will pick the next word given the probabilities seen. In other words, 87.5% of the time you will want i to come next (after sam).
Given a specific input file, you are to generate a poem of a given length from a specific starting word. So for example, poem(sam, 20) is to generate a poem of 20 words that begins with the word sam. Your poem might look like: sam I am sam I am I am I would not like them here or there I am do not like or sam I would you like green eggs and ham I am do you like them sam I am I am I
Specifics
For the data, remove all special characters and punctuation and change everything to lower case.
Output
There are several data files associated with the test. You will generate poems from the following specification:
File | Beginning word | Length of Poem | Print Hash Table? |
green.txt | sam | 20 | Yes |
test.txt | that | 20 | Yes |
clown.txt | go | 20 | Yes |
inch.txt | computer | 50 | no |
Poe.txt | nevermore | 50 | no |
Seuss.txt | mordecai | 50 | no |
For grading purposes, you are asked to print the contents of the hash table for some of the test cases. This will be a huge benefit to you in debugging. Because the next word is generated via a random number generator, you should not expect your output to match that of others.
Hints
In order to generate which word follows word WORD with the correct probability, I used the following strategy. I kept track of how many times WORD occurred (call it occurCt) and how many times each word followed WORD (followCt). I generated a random number between 0 and occurCt. Then I considered each word which followed WORD in order. I added up the followCts of each following word until the sum of the followCts passed the random number. When the sum passed the random number, that is the word I selected to be the next word in the poem.
What to Turn In
Turn in your complete project file. In a readMe file, please write a one-paragraph analysis of the generated texts addressing the following points:
- What knowledge do human beings have that is NOT captured by this language model and that therefore causes the model to generate nonsense. Write down at least three facts that, if the computer could somehow know them, would allow it to generate more sensible language.
- Compare the different poems. If your experience is like mine, some will be more coherent than others. Speculate about why this is true.
Reviews
There are no reviews yet.