Summary
In this assignment you’ll be tackling two situations:
- Probability simulation (code in a file called envelope_sim.py)
- Text classification (code in a file called classify.py)
Please submit your files in a zip file called p4_<netid>.zip, where you replace <netid> with your netID (your wisc.edu login).
We’ve discussed both of these problems in lecture, so hopefully you have some understanding of them as you begin your implementation.
Part 1: Probability Simulation
We introduced the idea of conditional probability in class with a situation: choose between two envelopes, each containing two balls. If you select the envelope with only black, you get nothing; if you select the envelope with one red and one black, you get a payout.
Once you select an envelope, you blindly draw one ball; it is black. Do you switch envelopes?
The answer is yes; this is a variant of the classic Monty Hall problem, which uses counter-intuitive conditional probabilities to trick unsuspecting people into worse odds. In class we worked through why you should switch according to probabilities, but let’s actually run through a simulation and see if our math was valid.
For this part of the program, you must write two (2) Python functions:
- pick_envelope(switch, verbose) — this function expects two boolean parameters (whether you switch envelopes, and whether you want to see the printed explanation of the simulation) and returns True or False based on whether you selected the correct envelope
- run_simulation(n) — this function runs n simulations of envelope picking under both strategies (switch n times, don’t switch n times) and prints the percent of times the correct envelope was chosen for each
Write other functions as necessary, but you must include these two.
Pick Envelope
We’ll leave the details of this implementation up to you, but be sure to go through the entire simulation process:
- Randomly distribute the four balls (three black, one red) into two envelopes, two balls each
- Randomly select one envelope
- Randomly select a ball from the envelope:
  - if red, you picked the right envelope; return True
  - if black, switch or don’t switch according to the value of the argument
- Determine whether you picked the payout envelope and return True if you did; False otherwise
If the verbose parameter is set to False, this function should not display intermediate output and only return True or False.
If the verbose parameter is set to True, format your printed output as follows:
>>> pick_envelope(True, verbose=True)
Envelope 0: b b
Envelope 1: r b
I picked envelope 0 and drew a b
Switch to envelope 1
=> True
>>> pick_envelope(True, verbose=True)
Envelope 0: b r
Envelope 1: b b
I picked envelope 0 and drew a b
Switch to envelope 1
=> False
>>> pick_envelope(False, verbose=True)
Envelope 0: r b
Envelope 1: b b
I picked envelope 1 and drew a b
=> False
>>> pick_envelope(False, verbose=True)
Envelope 0: b b
Envelope 1: b r
I picked envelope 1 and drew a r
=> True
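The simulation steps above can be sketched as follows. This is one possible structure, not the required implementation; the details (how you shuffle, how you print) are up to you as long as the output format matches.

```python
import random

def pick_envelope(switch, verbose=False):
    """One round of the envelope game; returns True if the payout
    envelope is ultimately chosen."""
    balls = ['b', 'b', 'b', 'r']
    random.shuffle(balls)                       # random distribution of the balls
    envelopes = [balls[:2], balls[2:]]          # two balls per envelope
    choice = random.randrange(2)                # randomly select an envelope
    if verbose:
        print('Envelope 0:', *envelopes[0])
        print('Envelope 1:', *envelopes[1])
    draw = random.choice(envelopes[choice])     # blind draw from that envelope
    if verbose:
        print('I picked envelope', choice, 'and drew a', draw)
    if draw == 'r':                             # red ball: payout envelope found
        return True
    if switch:                                  # black ball: maybe switch
        choice = 1 - choice
        if verbose:
            print('Switch to envelope', choice)
    return 'r' in envelopes[choice]             # did we end on the payout envelope?
```

Note that with three black balls and one red ball split two-and-two, exactly one envelope always contains the red ball, so the structure of the game is unchanged.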
Run Simulation
Given a value for n, run the simulation you just wrote n times under each strategy (switch/don’t) with verbose set to False (please). Track how many times you choose the correct envelope, and print the results to the console:
>>> run_simulation(10000)
After 10000 simulations:
Switch successful: 75.25 %
No-switch successful: 50.160000000000004 %
Worth thinking about: why are these numbers roughly 75/50% and not 66/33% as proved in class? What did we change about our setup and/or assumptions?
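A minimal sketch of the driver, assuming a pick_envelope implementation like the one described above (the stand-in here is a compressed, non-verbose version; in your submission, call your own):

```python
import random

def pick_envelope(switch, verbose=False):
    # minimal stand-in so this sketch runs on its own;
    # replace with your full implementation
    balls = ['b', 'b', 'b', 'r']
    random.shuffle(balls)
    envelopes = [balls[:2], balls[2:]]
    choice = random.randrange(2)
    draw = random.choice(envelopes[choice])
    if draw == 'r':
        return True
    if switch:
        choice = 1 - choice
    return 'r' in envelopes[choice]

def run_simulation(n):
    """Run n trials under each strategy and print success rates."""
    switch_wins = sum(pick_envelope(True) for _ in range(n))
    stay_wins = sum(pick_envelope(False) for _ in range(n))
    print('After', n, 'simulations:')
    print('Switch successful:', 100 * switch_wins / n, '%')
    print('No-switch successful:', 100 * stay_wins / n, '%')
```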
OPTIONAL: String Formatting in Python
If you feel like being fancy and forcing decimal-place precision:
>>> run_simulation(1000)
After 1000 simulations:
Switch successful: 75.80%
No-switch successful: 48.90%
You can use Python’s str.format() method as follows:
"this is a string {:.2%} of the time".format(1)
which outputs
'this is a string 100.00% of the time'
You do not need to return anything from this function. Note that your results will not (and should not!) exactly match ours here due to random variation.
Part 2: Document Classification
This next part is where things will get interesting: we’ll be reading in a corpus (a collection of documents) with two possible true labels and training a classifier to determine which label a query document is more likely to have.
Here’s the twist: the corpus is created from your essays about AI100 and the essays on the same topic from 2016, and based on training data from each, you’ll be predicting whether an essay was written in 2020 or 2016. (Your classifier will probably be bad at this! It’s okay, we’re looking for a very subtle difference here.)
- You will need: corpus.tar.gz
For this program, you must write at least seven (7) Python functions:
- train(training_directory, cutoff) — loads the training data, estimates the prior distribution P(label) and the class conditional distributions P(word|label), and returns the trained model
- create_vocabulary(training_directory, cutoff) — create and return a vocabulary as a list of word types with counts >= cutoff in the training directory
- create_bow(vocab, filepath) — create and return a bag of words Python dictionary from a single document
- load_training_data(vocab, directory) — create and return training set (bag of words Python dictionary + label) from the files in a training directory
- prior(training_data, label_list) — given a training set, estimate and return the prior probability p(label) of each label
- p_word_given_label(vocab, training_data, label) — given a training set and a vocabulary, estimate and return the class conditional distribution P(word|label) over all words for the given label using smoothing
- classify(model, filepath) — given a trained model, predict the label for the test document (see below for implementation details including return value)
  - this high-level function should also use create_bow(vocab, filepath)
Your submitted code should not produce any additional printed output.
Our Toy Example
For some of the smaller helper functions, we’ll be using a very simple version of a training data directory. The top level directory is called EasyFiles/, and it contains two subdirectories (like your actual training data directory will), called 2020/ and 2016/. EasyFiles/2016/ contains two files, 0.txt (hello world) and 1.txt (a dog chases a cat.), and EasyFiles/2020/ contains one file, 2.txt (it is february 19, 2020.). Each of these files has been pre-processed like the corpus, so all words are in lower case and all tokens (including punctuation) are on different lines:
it
is
february
19
,
2020
.
You may wish to create a similar directory structure on your computer.
Create Vocabulary
Our training directory structure as provided in the corpus is very intentional: under training/ are two subdirectories, 2016/ and 2020/, each of which function as the labels for the files they contain. We’ve pre-processed the corpus, so that every line of the provided files contains a single token.
To create the vocabulary for your classifier, traverse BOTH of these subdirectories under training/ (note: do not include test/) and count the number of times a word type appears in any file in either directory.
As a design choice, we will exclude any word types which appear at a frequency strictly less than the cutoff argument (cutoff = 1 means retain all word types you encounter). Return a sorted list of these word types.
>>> create_vocabulary('./EasyFiles/', 1)
=> [',', '.', '19', '2020', 'a', 'cat', 'chases', 'dog', 'february', 'hello', 'is', 'it', 'world']
>>> create_vocabulary('./EasyFiles/', 2)
=> ['.', 'a']
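One way to sketch this traversal, assuming the one-token-per-line file layout described above; this is an illustration, not the only valid structure:

```python
import os

def create_vocabulary(training_directory, cutoff):
    """Count word types across all label subdirectories and return a
    sorted list of those appearing at least cutoff times."""
    counts = {}
    for label in os.listdir(training_directory):
        subdir = os.path.join(training_directory, label)
        if not os.path.isdir(subdir):
            continue                            # skip stray files at the top level
        for fname in os.listdir(subdir):
            with open(os.path.join(subdir, fname), encoding='utf-8') as f:
                for line in f:
                    word = line.rstrip('\n')    # one token per line
                    counts[word] = counts.get(word, 0) + 1
    # keep only word types meeting the cutoff, in sorted order
    return sorted(w for w in counts if counts[w] >= cutoff)
```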
Create Bag of Words
This function takes a path to a text file (assume valid, one token per line), reads the file in, creates a bag-of-words representation based on the vocabulary, and returns the bag-of-words in dictionary format. Give all counts of word types not in the vocabulary to OOV (see below).
>>> vocab = create_vocabulary('./EasyFiles/', 1)
>>> create_bow(vocab, './EasyFiles/2016/1.txt')
=> {'a': 2, 'dog': 1, 'chases': 1, 'cat': 1, '.': 1}
>>> create_bow(vocab, './EasyFiles/2020/2.txt')
=> {'it': 1, 'is': 1, 'february': 1, '19': 1, ',': 1, '2020': 1, '.': 1}
If you encounter a word type which does not appear in the provided vocabulary, add the non-string value None as a special key to represent OOV (out-of-vocabulary) words, and collect the counts of all such words under that key.
>>> vocab = create_vocabulary('./EasyFiles/', 2)
>>> create_bow(vocab, './EasyFiles/2016/1.txt')
=> {'a': 2, '.': 1, None: 3}
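A minimal sketch of the bag-of-words builder under the same one-token-per-line assumption:

```python
def create_bow(vocab, filepath):
    """Bag-of-words for one file; out-of-vocabulary tokens are pooled
    under the special key None."""
    vocab_set = set(vocab)                  # O(1) membership tests
    bow = {}
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word = line.rstrip('\n')
            key = word if word in vocab_set else None
            bow[key] = bow.get(key, 0) + 1
    return bow
```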
Load Training Data
Once you can create a bag-of-words representation for a single text document, load the entire contents of the training directory into such Python dictionaries, label them with their corresponding subdirectory label (‘2016’ or ‘2020’) and return them in a list of length n=number of training documents.
Python’s os module will be helpful here, particularly its listdir() function.
>>> vocab = create_vocabulary('./EasyFiles/', 1)
>>> load_training_data(vocab, './EasyFiles/')
=> [{'label': '2016', 'bow': {'a': 2, 'dog': 1, 'chases': 1, 'cat': 1, '.': 1}},
    {'label': '2016', 'bow': {'hello': 1, 'world': 1}},
    {'label': '2020', 'bow': {'it': 1, 'is': 1, 'february': 1, '19': 1, ',': 1, '2020': 1, '.': 1}}]
The dictionaries in this list do not need to be in any particular order. You may assume that the directory string will include a trailing ‘/’ as shown here.
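A sketch of the loader, assuming a create_bow implementation as specified above (a compressed stand-in is included so the sketch runs on its own):

```python
import os

def create_bow(vocab, filepath):
    # minimal stand-in so this sketch runs on its own;
    # replace with your full implementation
    vocab_set, bow = set(vocab), {}
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word = line.rstrip('\n')
            key = word if word in vocab_set else None
            bow[key] = bow.get(key, 0) + 1
    return bow

def load_training_data(vocab, directory):
    """Return a list of {'label': ..., 'bow': ...} dictionaries, one per
    training document, labeled by its subdirectory name."""
    dataset = []
    for label in ['2016', '2020']:
        subdir = os.path.join(directory, label)
        for fname in os.listdir(subdir):
            dataset.append({'label': label,
                            'bow': create_bow(vocab, os.path.join(subdir, fname))})
    return dataset
```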
NOTE: All subsequent functions which have a training_data
parameter should expect it to be in the format of the output of this function, a list of two-element dictionaries with a label and a bag-of-words.
Prior Log Probability
This method should return the log probability of each label in the training set. To calculate these, you will need to count the number of documents with each label in the training data, found in the training/ subdirectory.
>>> vocab = create_vocabulary('./corpus/training/', 2)
>>> training_data = load_training_data(vocab, './corpus/training/')
>>> prior(training_data, ['2020', '2016'])
=> {'2020': -0.31939049933692143, '2016': -1.2967892172518587}
Because we usually have enough training documents in each class, we do not use add-1 smoothing here; instead we use the maximum likelihood estimate (MLE). Note that the return values are the natural log of the probability. In a Naive Bayes implementation, we must contend with the possibility of underflow: this can occur when we take the product of very small floating point values. As such, all our probabilities in this program will be log probabilities, to avoid this issue.
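The MLE prior described above can be sketched in a few lines (assuming every label in label_list occurs at least once in the training data, so the log is defined):

```python
import math

def prior(training_data, label_list):
    """MLE prior: log(#documents with label / #documents)."""
    total = len(training_data)
    logprior = {}
    for label in label_list:
        count = sum(1 for doc in training_data if doc['label'] == label)
        logprior[label] = math.log(count / total)
    return logprior
```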
Log Probability of a word, Given a label
This function returns a dictionary consisting of the log conditional probability log P(w|y=label) of every word type in the vocabulary (plus OOV) given a particular class label. To compute this probability, you must use add-1 smoothing, rather than the MLE (this is different from the prior), to avoid zero probabilities.
>>> vocab = create_vocabulary('./EasyFiles/', 1)
>>> training_data = load_training_data(vocab, './EasyFiles/')
>>> p_word_given_label(vocab, training_data, '2020')
=> {'a': -3.04, 'dog': -3.04, 'chases': -3.04, 'cat': -3.04, '.': -2.35, 'hello': -3.04, 'world': -3.04, 'it': -2.35, 'is': -2.35, 'february': -2.35, '19': -2.35, ',': -2.35, '2020': -2.35, None: -3.04}
>>> p_word_given_label(vocab, training_data, '2016')
=> {'a': -1.99, 'dog': -2.4, 'chases': -2.4, 'cat': -2.4, '.': -2.4, 'hello': -2.4, 'world': -2.4, 'it': -3.09, 'is': -3.09, 'february': -3.09, '19': -3.09, ',': -3.09, '2020': -3.09, None: -3.09}
(I’ve rounded the float values to two decimal places for readability here; in reality they are much longer.)
Note: In this simple case, we have no words in our training set which are out-of-vocabulary. With the cutoff of 2 in the real set, we will see a number of words in the training set which are out-of-vocabulary and map to None. These counts should be used when calculating P(None|label), and the existence of the “out-of-vocabulary” word type should be accounted for when calculating all probabilities (it contributes one extra word type to the smoothing denominator).
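A sketch of the smoothed estimate, assuming training_data is in the load_training_data() format and the smoothing denominator is the label’s total token count plus the vocabulary size plus one for the OOV type (this matches the example values above):

```python
import math

def p_word_given_label(vocab, training_data, label):
    """Add-1-smoothed log P(w|y=label) for every vocabulary word type
    plus the OOV key None."""
    counts = {w: 0 for w in vocab}
    counts[None] = 0                            # OOV bucket
    total = 0                                   # total tokens seen under this label
    for doc in training_data:
        if doc['label'] != label:
            continue
        for word, n in doc['bow'].items():
            counts[word] += n
            total += n
    denom = total + len(vocab) + 1              # +1 word type for OOV
    return {w: math.log((c + 1) / denom) for w, c in counts.items()}
```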
Train
Given the location of the training directory and a cutoff value for the training set vocabulary, use the previous set of helper functions to create the following trained model structure in a Python dictionary:
{
    'vocabulary': <the training set vocabulary>,
    'log prior': <the output of prior()>,
    'log p(w|y=2016)': <the output of p_word_given_label() for 2016>,
    'log p(w|y=2020)': <the output of p_word_given_label() for 2020>
}
For the EasyFiles data and a cutoff of 2, this would give (formatted for readability):
>>> train('./EasyFiles/', 2)
=> {
    'vocabulary': ['.', 'a'],
    'log prior': {'2016': -0.41, '2020': -1.10},
    'log p(w|y=2016)': {'a': -1.30, '.': -1.70, None: -0.61},
    'log p(w|y=2020)': {'a': -2.30, '.': -1.61, None: -0.36}
}
The values for None are so high in this case because the majority of our training words fall below the cutoff and are therefore out-of-vocabulary.
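Once the helpers exist, train() is mostly composition. The sketch below bundles compressed stand-ins for the earlier helpers so it runs on its own; in your submission, train() should simply call your full implementations.

```python
import math
import os

# --- compressed stand-ins for the helpers specified earlier ---
def create_vocabulary(training_directory, cutoff):
    counts = {}
    for label in sorted(os.listdir(training_directory)):
        subdir = os.path.join(training_directory, label)
        if not os.path.isdir(subdir):
            continue
        for fname in sorted(os.listdir(subdir)):
            with open(os.path.join(subdir, fname), encoding='utf-8') as f:
                for line in f:
                    w = line.rstrip('\n')
                    counts[w] = counts.get(w, 0) + 1
    return sorted(w for w in counts if counts[w] >= cutoff)

def create_bow(vocab, filepath):
    vocab_set, bow = set(vocab), {}
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            w = line.rstrip('\n')
            key = w if w in vocab_set else None
            bow[key] = bow.get(key, 0) + 1
    return bow

def load_training_data(vocab, directory):
    data = []
    for label in ['2016', '2020']:
        subdir = os.path.join(directory, label)
        for fname in sorted(os.listdir(subdir)):
            data.append({'label': label,
                         'bow': create_bow(vocab, os.path.join(subdir, fname))})
    return data

def prior(training_data, label_list):
    total = len(training_data)
    return {y: math.log(sum(d['label'] == y for d in training_data) / total)
            for y in label_list}

def p_word_given_label(vocab, training_data, label):
    counts = {w: 0 for w in vocab}
    counts[None] = 0
    total = 0
    for doc in (d for d in training_data if d['label'] == label):
        for w, n in doc['bow'].items():
            counts[w] += n
            total += n
    denom = total + len(vocab) + 1
    return {w: math.log((c + 1) / denom) for w, c in counts.items()}

# --- the actual composition ---
def train(training_directory, cutoff):
    """Build the trained-model dictionary from the helper outputs."""
    vocab = create_vocabulary(training_directory, cutoff)
    training_data = load_training_data(vocab, training_directory)
    return {'vocabulary': vocab,
            'log prior': prior(training_data, ['2020', '2016']),
            'log p(w|y=2016)': p_word_given_label(vocab, training_data, '2016'),
            'log p(w|y=2020)': p_word_given_label(vocab, training_data, '2020')}
```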
Classify
Given a trained model, this function will analyze a single test document and give its prediction as to the label for the document. The return value for the function must have the Python dictionary format
{
    'predicted y': <'2016' or '2020'>,
    'log p(y=2016|x)': <log probability of 2016 label for the document>,
    'log p(y=2020|x)': <log probability of 2020 label for the document>
}
Recall that the label for a test document x is the argmax of the following Naive Bayes estimate, as defined in lecture:

predicted y = argmax over labels y of [ log P(y) + Σ_w count(w, x) · log P(w | y) ]

where the sum runs over every word type w in the bag-of-words for x (including None), and count(w, x) is its count in the document.
(Recall: use log probabilities! When you take the log of a product, you should sum the logs of the operands.)
>>> model = train('./corpus/training/', 2)
>>> classify(model, './corpus/test/2016/0.txt')
=> {'log p(y=2020|x)': -3906.35, 'log p(y=2016|x)': -3916.46, 'predicted y': '2020'}
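The scoring step can be sketched as follows, assuming the model dictionary layout shown under Train (the bag-of-words construction is inlined here; your version should call create_bow):

```python
import math

def classify(model, filepath):
    """Score a test document under each label: up to a constant,
    log P(y|x) = log P(y) + sum over tokens of log P(w|y)."""
    # bag-of-words over the model's vocabulary (inline create_bow)
    vocab_set = set(model['vocabulary'])
    bow = {}
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word = line.rstrip('\n')
            key = word if word in vocab_set else None
            bow[key] = bow.get(key, 0) + 1
    result = {}
    for y in ['2016', '2020']:
        score = model['log prior'][y]
        for word, n in bow.items():
            # sum of logs replaces the product of probabilities
            score += n * model['log p(w|y={})'.format(y)][word]
        result['log p(y={}|x)'.format(y)] = score
    # tie-break toward '2016' is arbitrary here
    result['predicted y'] = ('2016' if result['log p(y=2016|x)'] >= result['log p(y=2020|x)']
                             else '2020')
    return result
```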
The documents in the test/ subdirectory are for testing your classifier. How many are classified correctly?