CS584 Assignment 2: Word Vectors


Homework assignments will be done individually: each student must hand in their own answers. Use of partial or entire solutions obtained from others or online is strictly prohibited. Electronic submission on Canvas is mandatory.

  1. Basics (15 points)
    • (5pts) Prove that softmax is invariant to a constant offset in the input, that is, for any input vector x and any constant c,

softmax(x) = softmax(x + c)

where x + c means adding the constant c to every dimension of x. Remember that

softmax(x)_i = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)}

Note: In practice, we make use of this property and choose c = -\max_i x_i when computing softmax probabilities for numerical stability (i.e., subtracting its maximum element from all elements of x).
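For illustration, a minimal NumPy sketch of this max-subtraction trick applied along the last axis, so the same code handles both a single vector and an N x D matrix of rows (the name stable_softmax is illustrative, not the interface required by softmax.py):

import numpy as np

def stable_softmax(x):
    """Softmax with the max-subtraction trick, applied along the last axis.

    Accepts a vector of shape (D,) or a matrix of shape (N, D) and returns
    an array of the same shape whose rows sum to 1.
    """
    shifted = x - np.max(x, axis=-1, keepdims=True)  # uses softmax(x) == softmax(x + c)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

For example, stable_softmax(np.array([1001.0, 1002.0])) returns approximately [0.269, 0.731], whereas exponentiating the raw inputs would overflow.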

  • (5pts) Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in part (a). Write your implementation in softmax.py; you may test it by executing python softmax.py.
  • (5pts) Derive the gradient of the sigmoid function and show that it can be rewritten as a function of the function value alone (i.e., as an expression in which only \sigma(x), not x, appears). Assume that the input x is a scalar for this question. Recall that the sigmoid function is:

\sigma(x) = \frac{1}{1 + e^{-x}}

Implement the sigmoid function in sigmoid.py and test your code.
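A minimal sketch of what sigmoid.py might contain (the helper names are illustrative); note that the gradient is written purely in terms of the function value s = \sigma(x), which is exactly the property the derivation above asks you to show:

import numpy as np

def sigmoid(x):
    """Element-wise sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    """Derivative of the sigmoid written in terms of its output s = sigmoid(x):
    d(sigma)/dx = s * (1 - s)."""
    return s * (1.0 - s)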

  2. Word2vec (85 points)
    • (5pts) Assume you are given a predicted word vector v_c corresponding to the center word c for skip-gram, and the word prediction is made with the softmax function:

\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{W} \exp(u_w^\top v_c)}

where o is the expected word, w denotes the w-th word, and u_w (w = 1, \dots, W) are the output word vectors for all words in the vocabulary. The cross-entropy function is defined as:

J_{CE}(o, v_c, U) = CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)

where the gold vector y is a one-hot vector, the softmax prediction vector \hat{y} is a probability distribution over the output space, and U = [u_1, u_2, \dots, u_W] is the matrix of all the output vectors. Assuming the cross-entropy cost is applied to this prediction, derive the gradient with respect to v_c.
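As a sanity check on the shape of the result (under the assumption, not stated above, that the u_w form the columns of U, so that \hat{y} = softmax(U^\top v_c)): because y is one-hot at the expected word o, the cost reduces to J_{CE} = -\log \hat{y}_o, and differentiating gives the familiar "prediction minus target" form

\frac{\partial J_{CE}}{\partial v_c} = \sum_{w=1}^{W} \hat{y}_w \, u_w - u_o = U(\hat{y} - y).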

  • (5pts) As in the previous part, derive the gradients for the output word vectors u_w (including u_o).
  • (10pts) Repeat parts (a) and (b) assuming we are using the negative sampling loss for the predicted vector v_c. Assume that K negative samples (words) are drawn, indexed 1, \dots, K respectively. For simplicity of notation, assume o \notin \{1, \dots, K\}. Again, for a given word o, use u_o to denote its output vector. The negative sampling loss function in this case is:

J_{neg-sample}(o, v_c, U) = -\log\big(\sigma(u_o^\top v_c)\big) - \sum_{k=1}^{K} \log\big(\sigma(-u_k^\top v_c)\big)
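To make the notation concrete, a minimal sketch of how this loss might be evaluated, reusing the sigmoid helper from the Basics part; here the rows of U are taken to be the output vectors u_w and negative_indices holds the K sampled word indices (these names and conventions are illustrative, not a required interface):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, o, U, negative_indices):
    """Negative-sampling loss for center vector v_c and expected word index o."""
    u_o = U[o]                                           # output vector of the expected word
    u_neg = U[negative_indices]                          # (K, d) output vectors of the samples
    loss = -np.log(sigmoid(u_o.dot(v_c)))                # -log sigma(u_o^T v_c)
    loss -= np.sum(np.log(sigmoid(-u_neg.dot(v_c))))     # -sum_k log sigma(-u_k^T v_c)
    return loss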

  • (5pts) Derive gradients for all of the word vectors for skip-gram, given the previous parts and a set of context words [word_{c-m}, \dots, word_{c-1}, word_c, word_{c+1}, \dots, word_{c+m}], where m is the context size. Denote the input and output word vectors for word k as v_k and u_k, respectively.

Hint: feel free to use F(o, v_c) (where o is the expected word) as a placeholder for the J_{CE}(o, v_c) or J_{neg-sample}(o, v_c) cost functions in this part; you'll see that this is a useful abstraction for the coding part. That is, your solution may contain terms of the form \partial F(w_{c+j}, v_c) / \partial \dots. Recall that for skip-gram, the cost for a context window centered around c is:

J_{skip-gram}(word_{c-m \dots c+m}) = \sum_{-m \le j \le m,\ j \ne 0} F(w_{c+j}, v_c)
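A minimal sketch of how that sum might look in code, with F left as a placeholder exactly as the hint suggests (the function and argument names are illustrative, not the word2vec.py interface):

def skipgram_cost(center_idx, context_indices, input_vectors, output_vectors, F):
    """Skip-gram cost for one window: the sum of F(w_{c+j}, v_c) over the context words.

    F stands in for either the cross-entropy or the negative-sampling cost;
    it receives the expected word index, the center vector, and the output-vector matrix.
    """
    v_c = input_vectors[center_idx]
    return sum(F(o, v_c, output_vectors) for o in context_indices)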

  • (15pts) In this part you will implement the word2vec models and train your own word vectors with stochastic gradient descent (SGD). First, write a helper function that normalizes the rows of a matrix in word2vec.py. In the same file, fill in the implementations of the softmax and negative-sampling cost and gradient functions. Then, fill in the implementation of the cost and gradient functions for the skip-gram model. When you are done, test your implementation by running python word2vec.py.
  • (15pts) Complete the implementation of your SGD optimizer in sgd.py and test it by running python sgd.py (a minimal sketch of the update loop appears at the end of this section).
  • (15pts) In this part you will implement the k-nearest-neighbors algorithm, which will be used for analysis. The algorithm receives a vector, a matrix, and an integer k, and returns the indices of the k rows of the matrix that are closest to the vector. Use cosine similarity as the distance metric (https://en.wikipedia.org/wiki/Cosine_similarity). Implement the algorithm in Python (a sketch appears at the end of this section).
  • (15pts) Showtime! Now we are going to load some real data and train word vectors with everything you just implemented!

We are going to use the Stanford Sentiment Treebank data to train word vectors. You need to process the dataset and use the sgd function and word2vec to generate word vectors, then visualize a few example words. There is no additional code to write for this part; just run python run.py.
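For the SGD part above, a minimal sketch of the plain update rule x <- x - lr * grad; the callable is assumed to return both the cost and the gradient, the names cost_and_grad and learning_rate are illustrative, and the real sgd.py may also need extras such as learning-rate annealing or periodic checkpointing:

def sgd(cost_and_grad, x0, learning_rate=0.3, iterations=1000):
    """Minimal gradient descent loop: repeatedly step against the gradient.

    cost_and_grad(x) is assumed to return a (cost, gradient) pair for the
    current parameters x (a NumPy array).
    """
    x = x0.copy()
    for _ in range(iterations):
        cost, grad = cost_and_grad(x)
        x = x - learning_rate * grad
    return x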
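For the k-nearest-neighbors part, a minimal sketch using cosine similarity; since a larger cosine similarity means the vectors are closer, the indices of the k largest similarities are returned (the name knn and the epsilon guard are illustrative choices):

import numpy as np

def knn(vector, matrix, k):
    """Return the indices of the k rows of `matrix` closest to `vector`
    under cosine similarity, most similar first."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    sims = matrix.dot(vector) / (norms + 1e-12)   # epsilon guards against zero-norm rows
    return np.argsort(-sims)[:k]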
