[Solved] Sentiment Analysis CS221 HW2

Advice for this homework:

  1. Words are simply strings separated by whitespace. Note that words which differ only in capitalization are considered separate (e.g. great and Great are distinct words).
  2. You might find some useful functions in util.py. Have a look around in there before you start coding.

Here are two reviews of Frozen, courtesy of Rotten Tomatoes (no spoilers!):

Rotten Tomatoes has classified these reviews as positive and negative, respectively, as indicated by the intact tomato on the left and the splattered tomato on the right. In this assignment, you will create a simple text classification system that can perform this task automatically.

Problem 1: Warmup

We'll warm up with the following set of four mini-reviews, each labeled positive (+1) or negative (-1):

  1. (-1) pretty bad
  2. (+1) good plot
  3. (-1) not good
  4. (+1) pretty scenery

Each review x is mapped onto a feature vector φ(x), which maps each word to the number of occurrences of that word in the review. For example, the first review maps to the (sparse) feature vector φ(x) = {pretty: 1, bad: 1}. Recall the definition of the hinge loss:

Loss_hinge(x, y, w) = max{0, 1 - (w · φ(x)) y},

where y is the correct label.

  1. [2 points] Suppose we run stochastic gradient descent, updating the weights according to w ← w - η ∇_w Loss_hinge(x, y, w), once for each of the four examples in order. After the classifier is trained on the given four data points, what are the weights of the six words (pretty, good, bad, plot, not, scenery) that appear in the above reviews? Use η = 1 as the step size and initialize w = [0, ..., 0]. Assume that ∇_w Loss_hinge(x, y, w) = 0 when the margin is exactly 1. (A short code sketch for checking this calculation appears after this list.)
  2. [2 points] Create a small labeled dataset of four mini-reviews using the words not, good, and bad, where the labels make intuitive sense. Each review should contain one or two words, and no repeated words. Prove that no linear classifier using word features can get zero error on your dataset. Remember that this is a question about classifiers, not optimization algorithms; your proof should hold for any linear classifier, regardless of how the weights are learned. After providing such a dataset, propose a single additional feature that we could augment the feature vector with that would fix this problem. (Hint: think about the linear effect that each feature has on the classification score.)
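The following is a minimal, hedged sketch for hand-checking part 1; it is not the assignment's starter code or the graded solution, and the helper names featurize and dot are made up for illustration. It simply runs one pass of SGD with the hinge-loss gradient over the four mini-reviews with η = 1.

from collections import Counter

def featurize(review):
    # Map a review string to a sparse word-count feature vector (a dict).
    return Counter(review.split())

def dot(w, phi):
    # Sparse dot product between the weight dict and a feature dict.
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

examples = [("pretty bad", -1), ("good plot", +1), ("not good", -1), ("pretty scenery", +1)]
w = {}      # all weights start at 0
eta = 1.0   # step size

for review, y in examples:
    phi = featurize(review)
    margin = dot(w, phi) * y
    # Hinge-loss (sub)gradient is -y * phi(x) when the margin is below 1,
    # and 0 otherwise (including when the margin is exactly 1, per the problem).
    if margin < 1:
        for f, v in phi.items():
            w[f] = w.get(f, 0.0) + eta * y * v   # w <- w - eta * gradient

print(w)   # final weights for pretty, good, bad, plot, not, scenery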

Problem 2: Predicting Movie Ratings

Suppose that we are now interested in predicting a numeric rating for movie reviews. We will use a non-linear predictor that takes a movie review x and returns σ(w · φ(x)), where σ(z) = (1 + e^{-z})^{-1} is the logistic function that squashes a real number to the range [0, 1]. Suppose that we wish to use the squared loss. For this problem, assume that the movie rating y is a real-valued variable in the range (0, 1).

  1. [2 points] Write out the expression for Loss(x,y,w).
  2. [3 points] Compute the gradient of the loss. Hint: you can write the answer in terms of the predicted value p = σ(w · φ(x)). (A numeric finite-difference check you can compare your expression against is sketched after this list.)
  3. [3 points] Assuming y = 1, what is the smallest magnitude that the gradient can take? That is, find a way to set w to make Loss(x, y, w) as small as possible. You are allowed to let the magnitude of w go to infinity. Hint: try to understand intuitively what is going on and the contribution of each part of the expression. If you find yourself doing too much algebra, you're probably doing something suboptimal. Motivation: the reason why we're interested in the magnitude of the gradient is that it governs how far gradient descent will step. For example, if the gradient is close to zero when w is very far from the origin, then it could take a long time for gradient descent to reach the optimum (if at all). This is known as the vanishing gradient problem when training neural networks.
  4. [3 points] Assuming y = 1, what is the largest magnitude that the gradient can take? Leave your answer in terms of φ(x).
  5. [3 points] The problem with the loss function we have defined so far is that it is non-convex, which means that gradient descent is not guaranteed to find the global minimum, and in general these types of problems can be difficult to solve. So let us try to reformulate the problem as plain old linear regression. Suppose you have a dataset D consisting of (x, y) pairs, and that there exists a weight vector w that yields zero loss on this dataset. Show that there is an easy transformation to a modified dataset D' of (x, y') pairs such that performing least squares regression (using a linear predictor and the squared loss) on D' converges to a vector w that yields zero loss on D'. Concretely, write an expression for y' in terms of y and justify this choice. This expression should not be a function of w.
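If you want to sanity-check the loss and gradient expressions you derive for parts 1-4, a finite-difference check like the one below can help. This is only an illustrative sketch: the names squared_loss and numeric_gradient are not part of the starter code, and it treats w and φ(x) as small dense lists for simplicity.

import math

def squared_loss(w, phi, y):
    # (sigma(w . phi(x)) - y)^2 with sigma(z) = 1 / (1 + exp(-z)).
    score = sum(wi * pi for wi, pi in zip(w, phi))
    p = 1.0 / (1.0 + math.exp(-score))
    return (p - y) ** 2

def numeric_gradient(w, phi, y, eps=1e-6):
    # Central-difference estimate of the gradient of the loss with respect to w.
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((squared_loss(w_plus, phi, y) - squared_loss(w_minus, phi, y)) / (2 * eps))
    return grad

w, phi, y = [0.3, -0.2], [1.0, 2.0], 1.0
print(squared_loss(w, phi, y))
print(numeric_gradient(w, phi, y))   # compare against your hand-derived expression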

Problem 3: Sentiment Classification

In this problem, we will build a binary linear classifier that reads movie reviews and guesses whether they are positive or negative. You must implement the functions yourself, without using libraries like Scikit-learn.

  1. [2 points] Implement the function extractWordFeatures, which takes a review (string) as input and returns a feature vector φ(x) (you should represent the vector φ(x) as a dict in Python). A minimal sketch of this extractor and the character-based extractor from part 5 appears after this list.
  2. [4 points] Implement the function learnPredictor using stochastic gradient descent and minimize the hinge loss. Print the training error and test error after each iteration to make sure your code is working. You must get less than 4% error rate on the training set and less than 30% error rate on the dev set to get full credit.
  3. [2 points] Create an artificial dataset for your learnPredictor function by writing the generateExample function (nested in the generateDataset function). Use this to double check that your learnPredictor works!
  4. [2 points] When you run grader.py on test case 3b-2, it should output a weights file and an error-analysis file. Look through some example incorrect predictions, and for five of them, give a one-sentence explanation of why the classification was incorrect. What information would the classifier need to get these correct? In some sense, there's not one correct answer, so don't overthink this problem. The main point is to convey intuition about the problem.
  5. [2 points] Now we will try a crazier feature extractor. Some languages are written without spaces between words. So is splitting the words really necessary or can we just naively consider strings of characters that stretch across words? Implement the function extractCharacterFeatures (by filling in the extract function), which maps each string of n characters to the number of times it occurs, ignoring whitespace (spaces and tabs).
  6. [3 points] Run your linear predictor with feature extractor extractCharacterFeatures. Experiment with different values of n to see which one produces the smallest test error. You should observe that this error is nearly as small as that produced by word features. How do you explain this? Construct a review (one sentence max) in which character n-grams probably outperform word features, and briefly explain why this is so.
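For reference, here is one minimal way the two extractors described above could look, assuming the assignment's convention that sparse feature vectors are Python dicts mapping feature names to counts. This is a sketch rather than the graded submission.py solution; check the starter code for the exact required signatures.

from collections import Counter

def extractWordFeatures(x):
    # Word-count features, e.g. "pretty bad" -> {'pretty': 1, 'bad': 1}.
    return dict(Counter(x.split()))

def extractCharacterFeatures(n):
    # Return an extractor that counts character n-grams with whitespace removed.
    def extract(x):
        s = ''.join(x.split())   # drop spaces and tabs
        return dict(Counter(s[i:i + n] for i in range(len(s) - n + 1)))
    return extract

print(extractWordFeatures("not good not bad"))   # {'not': 2, 'good': 1, 'bad': 1}
print(extractCharacterFeatures(3)("not good"))   # {'not': 1, 'otg': 1, 'tgo': 1, 'goo': 1, 'ood': 1}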

Problem 4: K-means

Suppose we have a feature extractor φ that produces 2-dimensional feature vectors, and a toy dataset Dtrain = {x1, x2, x3, x4} with

  1. φ(x1) = [0, 0]
  2. φ(x2) = [0, 2]
  3. φ(x3) = [2, 0]
  4. φ(x4) = [1, 2]
  1. [2 points] Run 2-means on this dataset until convergence. Please show your work. What are the final cluster assignments z and cluster centers μ? Run this algorithm twice with the following initial centers:
    1. μ1 = [1, 3] and μ2 = [1, 1]
    2. μ1 = [1, 1] and μ2 = [2, 2]
  2. [5 points] Implement the kmeans function. You should initialize your k cluster centers to random elements of examples. After a few iterations of k-means, your centers will be very dense vectors. In order for your code to run efficiently and to obtain full credit, you will need to precompute certain quantities. As a reference, our code runs in under a second on Myth, on all test cases. You might find generateClusteringExamples in util.py useful for testing your code. In this problem, you are not allowed to use libraries like Scikit-learn. (A compact dense-vector sketch of the basic algorithm appears after this list.)
  3. [5 points] Sometimes, we have prior knowledge about which points should belong in the same cluster. Suppose we are given a set S of example pairs (i, j) which must be assigned to the same cluster. For example, suppose we have 5 examples; then S = {(1, 5), (2, 3), (3, 4)} says that examples 2, 3, and 4 must be in the same cluster and that examples 1 and 5 must be in the same cluster. Provide the modified k-means algorithm that performs alternating minimization on the reconstruction loss. Recall that alternating minimization is when we are optimizing two variables jointly by alternating which variable we keep constant.
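As a point of reference for part 2, here is a compact sketch of the basic k-means (Lloyd's) alternation on small dense points. It is not the graded kmeans implementation: the assignment's examples are sparse dicts and require the precomputation mentioned above for efficiency, and the function name kmeans_sketch is purely illustrative.

import random

def kmeans_sketch(examples, K, maxIters):
    # Plain Lloyd's algorithm on small dense points (lists of floats).
    centers = [list(c) for c in random.sample(examples, K)]
    assignments = None
    for _ in range(maxIters):
        # Assignment step: attach each point to its nearest center (squared distance).
        new_assignments = [
            min(range(K), key=lambda k: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[k])))
            for x in examples
        ]
        if new_assignments == assignments:
            break                      # converged: assignments stopped changing
        assignments = new_assignments
        # Update step: move each center to the mean of its assigned points.
        for k in range(K):
            members = [x for x, z in zip(examples, assignments) if z == k]
            if members:
                centers[k] = [sum(coord) / len(members) for coord in zip(*members)]
    return centers, assignments

# The toy dataset from part 1 of this problem:
points = [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [1.0, 2.0]]
print(kmeans_sketch(points, K=2, maxIters=100))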

General Instructions

This (and every) assignment has a written part and a programming part.

The full assignment with our supporting code and scripts can be downloaded as sentiment.zip.

  1. Written parts: a written answer is expected in sentiment.pdf.
  2. Coding parts: you should write your code in submission.py.

You should modify the code in submission.py between

# BEGIN_YOUR_CODE

and

# END_YOUR_CODE

but you can add other helper functions outside this block if you want. Do not make changes to files other than submission.py.

Your code will be evaluated on two types of test cases, basic and hidden, which you can see in grader.py. Basic tests, which are fully provided to you, do not stress your code with large inputs or tricky corner cases. Hidden tests are more complex and do stress your code. The inputs of hidden tests are provided in grader.py, but the correct outputs are not. To run the tests, you will need to have graderUtil.py in the same directory as your code and grader.py. Then, you can run all the tests by typing

python grader.py

This will tell you only whether you passed the basic tests. On the hidden tests, the script will alert you if your code takes too long or crashes, but does not say whether you got the correct output. You can also run a single test (e.g., 3a-0-basic) by typing

python grader.py 3a-0-basic

We strongly encourage you to read and understand the test cases, create your own test cases, and not just blindly run grader.py.


