Youve been hired to answer the centuries-old debate on who wrote Shakespeares plays. Was it Shakespeare or Sir Francis Bacon? To solve this question, you dont need to be a historian, but a computer scientist.For this project, you will:Be introduced to a DataCounter ADT, and understand the strengths and limitations of differentDataCounter implementations. AVL Trees HashTable Heap Learn about word stemmingHere is some background information which will help you understand the project:Word Frequency AnalysisAuthors tend to use some words more often than others. For example, Shakespeare used thou more often than Bacon. The professor believes that a signature can be found for each author, based on frequencies of words found in the authors works, and that this signature should be consistent across the works of a particular author but vary greatly between authors.Your goal in this project is to come up with a way quantifying the difference between two written works, and to use your technique on several of Shakespeares and Bacons works to settle the ancient debate!You are provided with copies of Shakespeares (Hamlet) and Bacons writing (The New Atlantis). It is impossible to draw strong conclusions based on so little data. So you will have several more works, as many works as you feel is necessary to support your conclusion.Word StemmingWhen dealing with document correlations, it is often desirable to work only with the roots of words. That way, sleeps, sleeping, and sleep are all considered to be the same word. This process is called word stemming, and is used in most real-world search engines.For this assignment, you only need to follow two simple rules:1) Convert all the words to lowercase (An and an are the same word)2) Remove all punctuation (end. and end are the same word)The supplied class FileWordReader includes code to do this processing for you.Word stemming is a fairly complex topic. What these rules do is not so much word stemming as input normalization; you do not try to undo conjugations or other morphology. Fancier wordstemming such as removing s from the end of a word can lead to erroneous results (such as bu from bus) and require special logic. Even our simple rules cause problems; for instance, its and its are now the same word. Implementing a better stemming algorithm(like Porter Stemming) is above and beyond work.
Signature GenerationA fundamental part of your work lies in computing the signature of a document. You are given a sample WordCount program that reads in a document and counts the number of times that a stemmed word appears, assuming that the documents words are already stemmed. The output of this program looks like this:970 the708 and666 of632 to521 at521 i521 into466 a444 my391 in383 buffalowhere the number in column 1 is the frequency that the corresponding string in column 2 occurs in the text. Note that the WordCount program sorts its output primarily in decreasing order by frequency count, secondarily by alphabetical order. The ordering would be identical if it had sorted by frequency fraction first (i.e. frequency_count/num_total_words).Document CorrelationDocument correlation is a reasonably large area of study. Perhaps its most visible application is in search engines which rank the correlation of webpages to a set of keywords that you provide. One model often used for correlating documents is Latent Semantic Indexing (LSI) where a collection of documents is considered together (rather than independently) and a words usefulness is determined by how frequently it appears in the documents (for instance,the isnt very useful because it appears in most documents).We will not be doing LSI. We will do a simpler document comparison: Calculate word counts for the two documents and normalize the frequencies so that they can be meaningfully compared between different documents (hint: use frequency percentages or fractions.) As in LSI, remove words whose relative frequencies are too high or too low to be useful to your study. A good starting point is to remove words with normalized frequencies above 0.01 (1%) and below 0.0001 (0.01%), but feel free to play around with these numbers. You should, however, report results with these frequencies so we cancompare outputs with a standard set of numbers. For every word that occurs in both documents, take the difference between the normalized frequencies, square that difference, and add the result to a running sum. The final value of this running sum will be your difference metric. This metric corresponds to the square of the Euclidean distance between the two vectors in the space of shared words in the document. Note that this metric assumes that words not appearing in both documents do not affect the correlation.
RequirementsThere are five steps in this project:1. Write two DataCounter dictionary implementations (AVL, Hash)2. Modify WordCount to be able use your DataCounter implementations, and to select theimplementation at runtime.3. Write a document correlator that will print a difference score between two documents4. Study the time differences of the data structures and correlator.5. Analyze and write up your conclusionsDataCounter ImplementationsFor this assignment, you must implement two data structures (you may reuse previousImplementations, use book or slides as resources):AVL tree Your implementation MUST extend BinarySearchTree included in the zip filesregardless if you choose to modify the existing code for BinarySearchTree.Hash table You must implement your own hash code for the hash table. Do not useString.hashCode()Both of these data structures must implement the DataCounter interface. You do not needto implement remove in any of thesestructures. You can implement any hash tables discussed in class; the only restriction is thatit should not restrict the size of the input domain or the number of inputs (i.e. your hashtable must grow).WordCountThe WordCount program will read in a text file and tally up all the words that appear in it.The WordCount program given to you currently uses an unbalanced binary search tree as itsbacking DataCounter implementation and contains an insertion sort.Your turned in implementation of WordCount must use heapsort. Therefore, you will needto implement a heap data structure (as usual, no Java libraries for implementing your heap).You will replace the insertion sort with a heapsort.You may base your WordCount program on it, or write your own. You need to addadditional DataCounter implementations.The commandline form for WordCount will be as follows:java WordCount [ -b | -a | -h ] [ -frequency | -num_unique ] <filename> -b Use an Unbalanced BST to implement the DataCounter -a Use an AVL Tree -h Use a Hashtable -frequency Print all the word/frequency pairs, ordered by frequency, and then by thewords in lexicographic order -num_unique Print the number of unique words in the document. This is the totalnumber of distinct (different) words in the document. Words that appear more than onceare only counted as a single word for this statistic.It is fine to require that one of -b, -a, or -h must be specified for your program to run. Yourprogram should not crash, however, if given an invalid command line. Note that for the frequency option, you need to produce words ordered primarily by frequency andsecondarily by lexicographic (i.e., alphabetical) order. For example:43 these42 upon42 your41 after41 into40 said39 many39 more38 anThe sample WordCount program does this sorting using Insertion Sort. However, you will beupdating the sorting to use heapsort.Document CorrelatorThe Document Correlator should take in two documents and return the correlation(difference metric calculation) between them. You may want to use the WordCount classgiven to you to implement the backend of the Correlator, though doing so is not necessary.Document Correlation. This program should also take command line flags to specify whichbacking data structure to use.The commandline structure should be:Usage: java Correlator [ -b | -a | -h ] <filename1> <filename2> -b Use an Unbalanced BST in the backend -a Use an AVL Tree in the backend -h Use a Hashtable in the backendBenchmarksSince we are implementing two DataCounter dictionaries in this project, it is natural to askwhich is faster. Benchmarking and profiling are really the only reliable ways to judge thissince there are many many hidden assumptions in the way you write your code that willadd unexpected constants to your program. Hopefully you will do some exploration in thisassignment and prove to yourself that you really cant predict what will affect programruntime too much (go through and try to optimize away little things like how manyassignments you do, how many if statements you execute, etc. and see how much or littlethis affects your program).When generating (or reading) benchmarks, you must ask yourself the following questions: What am I measuring? Speed is too vague. Does it mean full program runtime? What ifmy program waits for user input? Does it matter? Why am I measuring this and why should anyone be interested in it? Full program runtimeof an interactive user app where the users fall asleep while running the code isnt reallyinteresting data. What methodology will I use to measure my program? Does it actually measure what Iwant? What are the sources of error? Is the error big enough to matter? Are my results stillreliable?This set of questions actually applies to any analysis.You are required to design benchmarks that measure the attributes listed below. You mayalso include any other data that you feel necessary to draw conclusions from thebenchmarks in your analysis.How fast is java WordCount -frequency compared to count.sh and count.pl?How much difference in speed do your different DataCounters make in the correlatorand/or the wordcount?There are a few tools available to you for benchmarking. The simplest ones are: The Unix time command. System.nanoTime() or System.currentTimeMillis() (record the time at two different placesin your program and subtract to get running time).Both methods have their strengths and weaknesses (for instance, time must measure yourprocess creation times, and JVM startup times). These strengths and weaknesses willexhibit themselves differently depending on where and how these tools are used. In youranalysis, you will need to cite the known sources for errors in your benchmarks and justifywhy they dont matter for your measurements, or somehow create a correction for yourmeasurement.Essentially, you must convince us that your benchmark is measuring something that makessense and that your analysis can be based off the collected data.For example, to time count.sh or count.pl, you can do the following:time ./count.sh your-fileThe syntax is the same for count.pl.README.txt QuestionsYour README.txt file needs to answer the following questions:1. Who is in your group?2. How long did the project take?3. Before you started, which data structure did you expect would be the fastest?4. Which data structure is the fastest? Why were you right or wrong?5. In general, which DataCounter dictionary implementation was better: trees orhash tables? Note that you will need to define better (ease of coding, ease ofdebugging, memory usage, disk access patterns, runtime for average input, runtimefor all input, etc).6. Are there cases in which a particular data structure performs really well or badly inthe correlator? Enumerate the cases for each data structure.7. Give a one to two paragraph explanation of whether or not you think Bacon wroteShakespeares plays based on the data you collected. No fancy statistical analysishere.8. What did you enjoy about this assignment? What did you hate? Could we have doneanything better?Files and Sample CodeSample texts and starter code are available on the course web site in file project3files.zip.You can also get texts of your own at Project Gutenberg, which has thousands of books asplain text files. Their mission is to provide electronic versions of many popular publicdomain texts. Check it out!Note that these files contain some header text that is not part of the actual play or book.For accurate results you should remove everything that is not part of the work. Extra textoccurs at the beginning and end or each file, and sometimes at the beginning of each act ofa play. Try running your word-counting program on the King James Bible. (Guess whichword comes up more frequently in the Bible: he or she? and by a factor of what?).Also, if you have any special requests for texts or other cool files youd like to have added tothe test files, email me.In addition, I provide some code which you may use it if you wish, although your code mustfollow the provided DataCounter interface: DataCountero Specification of an interface for a DataCounter. Your classes must implementthis interface. (Note that DataCounter is a dictionary that is specialized forthis particular task, so it isnt as generalized as some of the ADTs weve seenin the past. This is primarily for performance reasons.) BinarySearchTreeo Specification and implementation of an unbalanced binary search tree class.Use of the provided BinarySearchTree implementation is optional (you maychoose to implement your own), but your AVL tree class must inherit fromyour BinarySearchTree implementation. BinarySearchTree.BSTNodeo Specification and implementation of a binary search tree node (used in theBinarySearchTree class). FileWordReader A class that reads in a file and does simple word stemming. WordCount A simple program that reads words from a FileWordReader and talliestheir frequency in a DataCounter. count.sh A Unix shell script to compute word counts. Usage: ./count.sh your-file count.pl Similar to count.sh, except in Perl.Writeup turn-inYour README.txt.This assignment has become a fixture in my old UW data structures class, and has evolveda little or a lot over the years!
Reviews
There are no reviews yet.