PIC 10A Section 2 Homework #9 (due Wednesday, December 4, by 6 pm)
You should upload each .cpp (and .h file) separately and submit them to CCLE before the due date/time! Your work will otherwise not be considered for grading. Do not submit a zipped up folder or any other type of file besides .h and .cpp.
Be sure you upload files with the precise names that you used in your editor environment; otherwise there may be linker and other errors when your homework is compiled on a different machine. Also be sure your code compiles and runs on Visual Studio 2019.
VOCABULARY COMPARISON
This assignment is focused on building familiarity with streams and data structures. In this assignment, you are to write a simple vocabulary comparison tool. You will submit 3 files in the end: Vocabulary.h (providing declarations of constructors/member functions/functions), Vocabulary.cpp (providing definitions), and Compare.cpp (providing the main routine).
Here is what the program will do:
The user will be prompted to list as many files as they like (could be 2, 3, 4, 10, etc.), all separated by spaces. These files will then be compared against each other in all possible pairwise comparisons for their vocabulary. The similarity between each pair of files will be computed, and all similarities will be printed to the console. In addition, if N files were compared, a file named Results_Compare_N.txt, where N appears directly as a number, will be generated to store output identical to what was printed on the console.
In this homework, you may not use std::to_string, std::stoi, and the like. Anywhere std::stringstreams could be used, you should use them as practice.
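As a rough illustration of the stringstream practice mentioned above, here is a minimal sketch of two helpers. The names split_names and results_filename are hypothetical (they are not required by the assignment), and the underscores in the results file name are an assumption about the intended name.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split one line of user input into file names.
std::vector<std::string> split_names(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> names;
    std::string name;
    while (in >> name) {  // operator>> skips any run of whitespace
        names.push_back(name);
    }
    return names;
}

// Hypothetical helper: build the results file name without std::to_string.
// The "Results_Compare_N.txt" pattern (with underscores) is an assumption.
std::string results_filename(std::size_t file_count) {
    std::ostringstream out;
    out << "Results_Compare_" << file_count << ".txt";
    return out.str();
}
```

For example, split_names applied to a line read with std::getline would give one entry per file name, and results_filename(names.size()) would give the name of the output file.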
A little more about the similarity score. Let A and B be two files. We define the similarity score S as:
$$S = \frac{\text{number of words common to both } A \text{ and } B}{\sqrt{N_A N_B}}$$

where $N_A$ is the number of unique words appearing in file A and $N_B$ is the number of unique words in file B after all capitalization has been removed. The number S is always in the interval [0, 1] and the larger it is, the more similar two files are in their vocabulary.
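A small worked example may help. The numbers here are made up for illustration: suppose file A has 4 unique words, file B has 9 unique words, and 3 words appear in both.

```latex
\[
N_A = 4, \quad N_B = 9, \quad \text{3 words in common}
\quad\Longrightarrow\quad
S = \frac{3}{\sqrt{4 \cdot 9}} = \frac{3}{6} = 0.5
\]
```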
The desired format of the running program is below:
Enter all file names for comparison separated by spaces: [USER ENTERS ALL THE FILE NAMES]
Comparison of [FILE NAME] and [OTHER FILE NAME]: [VALUE]
Results have been written to: Results_Compare_[NUMBER OF FILES].txt
Also see the screen shot.
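One possible way to produce identical output on the console and in the results file is to format each line once and send it to both streams. The sketch below is only an illustration: report_all_pairs is a hypothetical helper, the score is passed in as a callback so the block stands alone, and visiting each unordered pair once in the order (i, j) with i < j is an assumption about what "all possible pairwise comparisons" means.

```cpp
#include <cstddef>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: report every unordered pair of files exactly once.
// score(i, j) is assumed to return the similarity of files i and j.
void report_all_pairs(const std::vector<std::string>& names,
                      const std::function<double(std::size_t, std::size_t)>& score,
                      std::ostream& console, std::ostream& file) {
    for (std::size_t i = 0; i < names.size(); ++i) {
        for (std::size_t j = i + 1; j < names.size(); ++j) {
            // Format the line once so both destinations get the same text.
            std::ostringstream line;
            line << "Comparison of " << names[i] << " and " << names[j]
                 << ": " << score(i, j);
            console << line.str() << '\n';
            file << line.str() << '\n';
        }
    }
}
```

In a real program, console would typically be std::cout and file an std::ofstream opened on the results file.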
To manage the comparisons, you should write a class VocabWrapper and a function similarity. The VocabWrapper class should:
store the list of words in an appropriate structure and the name of the associated file;
have a constructor that accepts the name of a file, initializing the filename;
have a get_filename function returning the filename;
have a read_vocab function that reads in all of the words from the file of the given name, turning all capital letters to lowercase!
have a word_count function returning how many unique words there were in the associated file; and
have an overlap_count function, accepting another VocabWrapper object, returning a count of how many words their two associated files have in common (after capitalization has been removed).
The function similarity should compute the similarity between two inputs of type VocabWrapper.
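To make the interface above concrete, here is a minimal single-block sketch, not a reference solution. It assumes the function names are spelled with underscores, picks std::set as the "appropriate structure", uses hypothetical member names filename_ and words_, and returns 0 when either file has no words (a case the assignment does not specify). The real submission would split this across Vocabulary.h and Vocabulary.cpp.

```cpp
#include <algorithm>
#include <cctype>
#include <cmath>
#include <cstddef>
#include <fstream>
#include <set>
#include <string>

class VocabWrapper {
public:
    explicit VocabWrapper(const std::string& filename) : filename_(filename) {}

    std::string get_filename() const { return filename_; }

    // Reads whitespace-separated words, lowercasing each before storing it.
    void read_vocab() {
        std::ifstream in(filename_);
        std::string word;
        while (in >> word) {
            std::transform(word.begin(), word.end(), word.begin(),
                           [](unsigned char c) {
                               return static_cast<char>(std::tolower(c));
                           });
            words_.insert(word);  // std::set silently ignores duplicates
        }
    }

    // Number of unique (lowercased) words read so far.
    std::size_t word_count() const { return words_.size(); }

    // Number of words present in both this vocabulary and the other one.
    std::size_t overlap_count(const VocabWrapper& other) const {
        std::size_t common = 0;
        for (const std::string& w : words_) {
            if (other.words_.count(w) > 0) {
                ++common;
            }
        }
        return common;
    }

private:
    std::string filename_;
    std::set<std::string> words_;  // assumption: a set is the chosen structure
};

// Similarity score S = overlap / sqrt(N_A * N_B).
double similarity(const VocabWrapper& a, const VocabWrapper& b) {
    const double denom = std::sqrt(static_cast<double>(a.word_count()) *
                                   static_cast<double>(b.word_count()));
    if (denom == 0.0) {
        return 0.0;  // assumption: an empty vocabulary gives similarity 0
    }
    return static_cast<double>(a.overlap_count(b)) / denom;
}
```

Storing the words in a set means word multiplicity is dropped at insertion time, so word_count and overlap_count need no extra deduplication step.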
Note that the files will be given to you without any punctuation. But you must ensure all words are represented without capital letters.
Word multiplicity is to be neglected in this homework. Whether a file has the word gravitation appearing once or three hundred times, as far as our simple similarity score is concerned, it happened once!
Some important details:
1. The user's files will be in the same folder as the .cpp, .h, or .exe files, and the results file must be saved to the same folder.
2. You may assume the files contain no punctuation marks.
You can test your code against the sample input files and output file provided (yes, there are typos in the files). A sample output is provided in this document.
The texts provided are samples of 250 words sourced from the following links:
https://www.gutenberg.org/files/30155/30155-0.txt The Special and General Theory, by Albert Einstein
http://www.gutenberg.org/cache/epub/60271/pg60271.txt From Newton to Einstein, by Benjamin Harrow
https://www.gutenberg.org/files/52521/52521-0.txt Grimms' Fairy Tales
Remarks: there are much better text comparison algorithms out there, but to avoid going too heavy into machine learning and algorithms, this is sufficient. Two very natural improvements would be (i) to remove stop words, i.e., words like a, the, and, etc., that appear in almost all text and (ii) to consider word frequency. Considering word semantics and topic representations of the documents would be huge improvements.
If you're wondering about the math behind the similarity score S: imagine encoding all words in the English language in $\{0,1\}^N$ where N is the number of possible words. We treat each word as being orthogonal to every other distinct word. This means we could imagine encoding physics as $(1,0,0,0,\ldots)^T$ and bear as $(0,1,0,0,\ldots)^T$, etc. We neglect the word multiplicity in each document and then compute the cosine similarity score between two documents to find S.
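The step from cosine similarity to the formula for S can be sketched as follows. The symbols a, b, and θ are introduced here only for illustration: a and b are the 0/1 vectors of files A and B, with entry i equal to 1 exactly when word i appears in the file, and θ is the angle between them.

```latex
\begin{align*}
a \cdot b &= \sum_{i=1}^{N} a_i b_i
  = \#\{\text{words common to both } A \text{ and } B\}, \\
\lVert a \rVert &= \sqrt{\sum_{i=1}^{N} a_i^2}
  = \sqrt{\sum_{i=1}^{N} a_i} = \sqrt{N_A},
  \qquad \lVert b \rVert = \sqrt{N_B}, \\
S &= \cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\#\{\text{words common to both } A \text{ and } B\}}{\sqrt{N_A N_B}}.
\end{align*}
```

The second line uses the fact that each entry is 0 or 1, so squaring it changes nothing. Since all entries are nonnegative, S is at least 0, and the Cauchy-Schwarz inequality gives S at most 1.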