Assignment 5: Sentiment Analysis
In this assignment you will use a data set of movie reviews and experiment with the Naive Bayes and decision tree classifiers in scikit-learn.
Deadline: December 2nd.
Put your code in a file named main_a5.py. You could put all your code in there, but you may also use as many auxiliary files as you want.
We should be able to use your code to train classifiers as follows:
python3 main_a5.py train
Your code should create a set of classifiers and write them to the classifiers directory, again using Python's pickle module or the joblib module (the latter is better with the large numpy arrays that sklearn makes heavy use of). Write progress notes and the time elapsed to generate each model. Also write for each classifier what accuracy it operates at. So we would see something like:
python3 main_a5.py train
Creating Bayes classifier in classifiers/bayes_all_words.jbl
Elapsed time: 4s
Accuracy: 0.89
Creating Bayes classifier in classifiers/bayes_sentiwordnet.jbl
Elapsed time: 6s
Accuracy: 0.84
...
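As a sketch of what the training step could look like (the function name and the use of MultinomialNB are illustrative choices, not requirements):

import time
from joblib import dump
from sklearn.naive_bayes import MultinomialNB

def train_and_save(X_train, y_train, X_test, y_test, path):
    # Train a Naive Bayes classifier, report timing and accuracy,
    # and persist the fitted model with joblib.
    print('Creating Bayes classifier in %s' % path)
    start = time.time()
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    print('Elapsed time: %ds' % (time.time() - start))
    print('Accuracy: %.2f' % clf.score(X_test, y_test))
    dump(clf, path)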
We should also be able to run your code on a file that we provide:
python3 main_a5.py run (bayes|tree) filename
Here, your code should ask what classifier model to use and then run the classifier over the given file, which will be a file in the same format as the data used to train your model. We do not care that much whether your code gives the right answer; we just want to see it run, and we want it to print pos or neg to the standard output. For example:
python3 main_a5.py run bayes data/review_example.txt
pos
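A minimal sketch of the run path, assuming the model and its vectorizer were both saved with joblib during training (the two-file naming scheme is hypothetical; you must encode the new review with the same vectorizer that was fit on the training data):

from joblib import load

def run_classifier(model_path, vectorizer_path, filename):
    # Load the saved model and vectorizer, encode the review file
    # the same way the training data was encoded, and print the label.
    clf = load(model_path)
    vectorizer = load(vectorizer_path)
    with open(filename) as f:
        text = f.read()
    print(clf.predict(vectorizer.transform([text]))[0])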
Data Handling
As our sentiment data we use version 2.0 of the polarity dataset compiled by Pang and Lee, available at http://www.cs.cornell.edu/people/pabo/movie-review-data/. This data set contains 1000 positive reviews and 1000 negative reviews that are tokenized and lower-cased. Follow the link to the readme file for more information. You should use NLTK to get access to those data (see chapter 6 for how to do this).
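Following the corpus-loading pattern from chapter 6 of the NLTK book, getting at the reviews could look like this (the variable name documents is just illustrative):

import nltk
from nltk.corpus import movie_reviews

nltk.download('movie_reviews')  # one-time download

# Each review becomes a (list of words, label) pair; labels are 'pos' or 'neg'.
documents = [(movie_reviews.words(fileid), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]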
Once you have the data, you first have to prepare it for the classifiers. Data preparation includes two steps:
1. generating the features
2. encoding the features
You should experiment with the following feature sets:
1. All words with raw counts or tf-idf scores (see the vectorizer sketch after this list)
2. All words but just as binary features
3. Only the words from SentiWordNet with positive or negative scores over 0.5
4. Only the words from the MPQA Subjectivity Lexicon
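For feature sets 1 and 2, scikit-learn's vectorizers do the counting and encoding in one step; a sketch, assuming train_texts is a list of review strings:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

count_vec = CountVectorizer()              # raw counts (feature set 1)
tfidf_vec = TfidfVectorizer()              # tf-idf scores (feature set 1)
binary_vec = CountVectorizer(binary=True)  # word presence only (feature set 2)

X = binary_vec.fit_transform(train_texts)  # fit on the training data only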
SentiWordNet is available through NLTK and can be used as follows:
>>> from nltk.corpus import sentiwordnet as swn
>>> happy = list(swn.senti_synsets('happy', 'a'))[0]
>>> print(happy)
<happy.a.01: PosScore=0.875 NegScore=0.0>
>>> print('pos: %s neg: %s obj: %s' %
...       (happy.pos_score(), happy.neg_score(), happy.obj_score()))
pos: 0.875 neg: 0.0 obj: 0.125
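A sketch of how one might test whether a word qualifies for feature set 3 (checking all synsets of the word is one reasonable choice; you could also look only at the first synset):

from nltk.corpus import sentiwordnet as swn

def is_strongly_polar(word):
    # True if any synset of the word scores above 0.5 for
    # positivity or negativity in SentiWordNet.
    for synset in swn.senti_synsets(word):
        if synset.pos_score() > 0.5 or synset.neg_score() > 0.5:
            return True
    return False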
The MPQA Subjectivity Lexicon will be part of the repository. It has about 8000 words, each associated with a part of speech and a subjectivity score. The lexicon is stored in data/subjectivity_clues_hltemnlp05, which has both a readme file and a data file. Lines in the data file look like:
type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandonment pos1=noun stemmed1=n priorpolarity=negative
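A sketch of a parser for these lines (the path argument is a placeholder for the data file in the directory above; splitting on '=' assumes the format shown):

def load_mpqa(path):
    # Map each word in the lexicon to its prior polarity.
    lexicon = {}
    with open(path) as f:
        for line in f:
            fields = dict(field.split('=', 1)
                          for field in line.split() if '=' in field)
            lexicon[fields['word1']] = fields['priorpolarity']
    return lexicon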
Negation
For the first of the above feature sets you should also experiment with a version using the simple negation heuristic shown in class (slides to be posted). Say we have a vector
door:12 window:6 liked:4
and liked was negated in one of those 4 cases, then the vector should be
door:12 window:6 liked:3 NOT_liked:1
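A sketch of the heuristic (the trigger-word list, the NOT_ prefix, and ending the negation scope at punctuation are assumptions based on the class slides):

NEGATION_WORDS = {'not', 'no', 'never', "n't"}

def mark_negation(tokens):
    # Prefix tokens that follow a negation word with NOT_, until
    # the next punctuation mark ends the negation scope.
    marked, negated = [], False
    for token in tokens:
        if token in NEGATION_WORDS:
            negated = True
            marked.append(token)
        elif token in {'.', ',', ';', '!', '?'}:
            negated = False
            marked.append(token)
        else:
            marked.append('NOT_' + token if negated else token)
    return marked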
Encoding
It is up to you how to encode this, but chances are that you want to use either the OrdinalEncoder or the OneHotEncoder. Encoding examples that were shown in class will be posted.
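As a quick reminder of the interface of these two encoders (the toy feature rows are illustrative):

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

features = [['pos', 'adj'], ['neg', 'noun'], ['pos', 'noun']]

onehot = OneHotEncoder().fit(features)
ordinal = OrdinalEncoder().fit(features)

print(onehot.transform([['neg', 'adj']]).toarray())  # one binary column per value
print(ordinal.transform([['neg', 'adj']]))           # one integer per feature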
Classifier
When you have created your feature sets, you should partition them into a training set and a test set. Create the training sets and train the classifiers when we do
python3 main_a5.py train
Load and use the saved model when we do
python3 main_a5.py run (bayes|tree) filename
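A sketch of the partitioning step (the 80/20 split and the fixed random seed are illustrative choices):

from sklearn.model_selection import train_test_split

# X: encoded feature matrix, y: the corresponding 'pos'/'neg' labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)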
How will this be graded?
There are no unit tests. But we will run the code and see whether your classifiers reach a minimum level of accuracy (level to be determined). This will be 50% of your grade. In addition, your code will be checked for PEP 8 compliance and graded on a scale from 0 to 5.