In this homework, you need to classify text paragraphs into
three categories: Fyodor Dostoyevsky, Arthur Conan Doyle, and Jane Austen by building your own
classifiers. The data provided is from Project Gutenberg.
Please follow the steps below:
• (5pts) Preprocess data: Remove punctuation, irrelevant symbols, URLs, and numbers. You may
remove the unrelated text at the beginning of each file.
• (5pts) Construct examples: Divide each document into multiple paragraphs. Each paragraph
will be one example. Text that is not part of a paragraph can be discarded or preprocessed. Report
the total number of examples for each category.
• (5pts) Data split: Sample these paragraphs into training and testing data.
• (5pts) Feature extraction: Build a vocabulary to represent each paragraph using only training
data. Consider TF-IDF features for each input example.
1. Implement a Logistic Regression (LR) model with L2 regularization from scratch:
J = −
1
N
X
N
i=1
X
K
k=1
yik log exp fk
PK
c=1 exp fc
+ λ
X
d
j=1
w
2
kj
– (10 pts) Given this formula, show the steps to derive the gradient of J with respect to w_k.
– (20 pts) Write a function for mini-batch gradient descent.
– (20 pts) Write a function for stochastic gradient descent.
Report the results and plots for both mini-batch and stochastic gradient descent (a minimal from-scratch sketch of both variants appears after this list).
2. (10 pts) Build and train a Multilayer Perceptron (MLP) model (i.e., a two-layer neural network)
using backpropagation. Please specify the settings of the model, such as the network structure,
the optimizer, the initial learning rate, and the loss function.
• (5pts) Plot training loss and validation loss every 100 epochs.
• (5pts) Use cross-validation on the training data. Report the recall and precision for each category
on the test and validation sets, and choose the best λ (in LR) and the number of neurons in the hidden
layer (in MLP) using the validation set.
• (10pts) Compare both classifiers and provide an analysis for the results.
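For the from-scratch requirement in item 1, here is a minimal sketch of multinomial (softmax) logistic regression with an L2 penalty trained by mini-batch gradient descent; setting batch_size=1 turns it into stochastic gradient descent. It assumes TF-IDF features in a numpy array X (a TF-IDF package is allowed per the submission instructions) and integer labels y in {0, 1, 2}; all names here (fit_logreg, batch_size, and so on) are illustrative and not part of any starter code.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over rows: subtract the row-wise max.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_logreg(X, y, num_classes=3, lr=0.1, lam=1e-3, epochs=50, batch_size=32, seed=0):
    """Mini-batch gradient descent for softmax regression with an L2 penalty.
    batch_size=1 reduces this to stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, num_classes))          # weights, one column per class
    Y = np.eye(num_classes)[y]              # one-hot labels, shape (n, K)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, Yb = X[idx], Y[idx]
            P = softmax(Xb @ W)             # predicted class probabilities
            # Averaged cross-entropy gradient plus 2*lam*W from the L2 term.
            grad = Xb.T @ (P - Yb) / len(idx) + 2 * lam * W
            W -= lr * grad
    return W

def predict(X, W):
    return np.argmax(X @ W, axis=1)
```

The 2λW term comes from differentiating the λ Σ w² penalty; the learning rate, λ, and epoch counts above are placeholders meant to be tuned on the validation set.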
Please follow the instructions below when you submit the assignment.
1. You are allowed to use packages for reading text, TF-IDF, plotting, and MLP, but you are not
allowed to use packages for mini-batch gradient descent or stochastic gradient descent.
2. Your submission should consist of a zip file named Assignment1 LastName FirstName.zip which
contains:
• a Jupyter notebook file (.ipynb). The file should contain the code and the output after execution.
You should also include detailed comments and analysis (plots, result tables, etc.).
• a pdf (or jpg) file to show the derivation steps of the gradient of J with respect to w_k.
(a) Softmax (5pts) Prove that softmax is invariant to constant offset in the input, that is, for any
input vector x and any constant c,
softmax(x) = softmax(x + c)
where x + c means adding the constant c to every dimension of x. Remember that
\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}
Note: In practice, we make use of this property and choose c = -\max_i x_i when computing softmax
probabilities for numerical stability (i.e., subtracting its maximum element from all elements of x).
(b) Sigmoid (5pts) Derive the gradient of the sigmoid function and show that it can be rewritten as
a function of the function value (i.e., as an expression in which only σ(x), but not x, is present).
Assume that the input x is a scalar for this question. Recall that the sigmoid function is:
\sigma(x) = \frac{1}{1 + e^{-x}}
(c) Word2vec
i. (5pts) Assume you are given a predicted word vector v_c corresponding to the center word c for
skip-gram, and that the word prediction is made with the softmax function
\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{W} \exp(u_w^\top v_c)}
where o is the expected word, w denotes the w-th word, and u_w (w = 1, …, W) are the “output”
word vectors for all words in the vocabulary.
The cross entropy function is defined as:
J_{CE}(o, v_c, U) = CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)
where the gold vector y is a one-hot vector, the softmax prediction vector \hat{y} is a probability
distribution over the output space, and U = [u_1, u_2, …, u_W] is the matrix of all the output
vectors.
Assuming the cross-entropy cost is applied to this prediction, derive the gradients with
respect to v_c.
ii. (5pts) As in the previous part, derive the gradients for the “output” word vectors u_w (including
u_o).
iii. (5pts) Repeat parts i and ii assuming we are using the negative sampling loss for the predicted vector
v_c. Assume that K negative samples (words) are drawn and that they are indexed 1, …, K. For
simplicity of notation, assume o ∉ {1, …, K}. Again, for a given word o, use u_o to denote its
output vector.
The negative sampling loss function in this case is:
J_{\text{neg-sample}}(o, v_c, U) = -\log(\sigma(u_o^\top v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c))
iv. (5pts) Derive gradients for all of the word vectors for skip-gram given the previous parts and
given a set of context words [word_{c-m}, …, word_c, …, word_{c+m}], where m is the context size.
Denote the “input” and “output” word vectors for word k as v_k and u_k, respectively.
Hint: feel free to use F(o, v_c) (where o is the expected word) as a placeholder for the J_{CE}(o, v_c, …)
or J_{\text{neg-sample}}(o, v_c, …) cost functions in this part; you’ll see that this is a useful abstraction for
the coding part. That is, your solution may contain terms of the form \frac{\partial F(o, v_c)}{\partial \dots}. Recall that for
skip-gram, the cost for a context centered around c is:

\sum_{-m \le j \le m,\, j \ne 0} F(w_{c+j}, v_c)
(a) (5pts) Given an input matrix of N rows and D columns, compute the softmax prediction for each
row using the optimization in 1.(a). Write your implementation in utils.py.
(b) (5pts) Implement the sigmoid function in word2vec.py and test your code.
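As a reference for coding parts (a) and (b), here is a minimal numpy sketch of the two helpers; the exact signatures expected by utils.py and word2vec.py may differ, and sigmoid_grad is included only to echo the written question about expressing the gradient through σ(x).

```python
import numpy as np

def softmax(x):
    """Row-wise softmax for an (N, D) array, shifted by each row's maximum
    (c = -max_i x_i) for numerical stability."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(x):
    """Elementwise sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    """Sigmoid gradient written only in terms of the function value s = sigmoid(x)."""
    return s * (1.0 - s)
```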
(c) (45pts) Implement the word2vec models with stochastic gradient descent (SGD).
i. (15pts) Write a helper function normalizeRows in utils.py to normalize the rows of a matrix; it will
be used in word2vec.py. In word2vec.py, fill in the implementation for the softmax and negative sampling
cost and gradient functions. Then, fill in the implementation of the cost and gradient functions
for the skip-gram model. When you are done, test your implementation by running python
word2vec.py.
ii. (15pts) Complete the implementation for your SGD optimizer in sgd.py. Test your implementation by running python sgd.py.
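The SGD optimizer of part ii might look roughly like the loop below; the signature, the default hyperparameters, and the annealing schedule are assumptions rather than the actual sgd.py interface.

```python
def sgd(grad_fn, x0, lr=0.3, iterations=10000, anneal_every=5000, postprocess=None):
    """Plain SGD loop. grad_fn(x) returns (cost, gradient) for the current
    parameters x (assumed to be a numpy array); postprocess (e.g. row
    normalization) is applied after every update when provided."""
    x = x0.copy()
    for it in range(1, iterations + 1):
        cost, grad = grad_fn(x)
        x -= lr * grad
        if postprocess is not None:
            x = postprocess(x)
        if it % 1000 == 0:
            print(f"iter {it}: cost {cost:.4f}")   # monitor convergence
        if it % anneal_every == 0:
            lr *= 0.5                               # simple learning-rate annealing
    return x
```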
iii. (15pts) Implement the k-nearest neighbors algorithm, which will be used for analysis. The
algorithm receives a vector, a matrix and an integer k, and returns k indices of the matrix’s
rows that are closest to the vector.
Use cosine similarity as the distance metric (https://en.wikipedia.org/wiki/Cosine_similarity).
Implement the algorithm in knn.py. Print
out 10 examples: each example is one word and its neighbors.
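One possible shape for the knn.py function, assuming numpy inputs; the argument order follows the description above (“a vector, a matrix and an integer k”), and everything else is illustrative.

```python
import numpy as np

def knn(vector, matrix, k=10):
    """Return the indices of the k rows of `matrix` most similar to `vector`
    under cosine similarity (higher similarity = closer)."""
    eps = 1e-12                                       # avoid division by zero
    sims = matrix @ vector
    sims /= (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector) + eps)
    return np.argsort(-sims)[:k]                      # indices by decreasing similarity
```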
(d) (15pts) Load some real data and train your own word vectors.
Use the StanfordSentimentTreeBank data to train word vectors. Process the dataset and use the
sgd function and word2vec to generate word vectors. Visualize a few word examples. There is no
additional code to write for this part; just run python run.py.
Note: The training process may take a long time depending on the efficiency of your implementation.
Plan accordingly! When the script finishes, a visualization for your word vectors will appear. It will
also be saved as word_vectors.png in your project directory.
In addition, the script should print the nearest neighbors of a few words (using the knn function you
implemented in part (c) iii). Include the plot and the nearest neighbors lists in your homework write-up,
and briefly explain those results.
3. Submission Instructions
You shall submit a zip file named Assignment2 LastName FirstName.zip which contains:
• a pdf (or jpg) file containing all your solutions for the Written part
• a pdf (or jpg) file containing the word vector plot (vector.png) and a brief report of your knn results
• python files (sgd.py, word2vec.py, run.py, knn.py, utils.py)
You will need to use the Penn Treebank corpus for this assignment. Four data files are provided: train.txt,
train.5k.txt, valid.txt, and input.txt.
You can use train.txt to train your models and use valid.txt for testing.
The file input.txt can be used as a sanity check of whether the model produces coherent sequences of words for
unseen data with no next word given.
(a) (10 pts) Preprocess the train and validation data, build the vocabulary, tokenize, etc.
(b) (10 pts) Implement an N-gram model (bigram or trigram) for language modeling.
(c) (10 pts) Implement Good Turing smoothing.
(d) (10 pts) Implement Kneser-Ney Smoothing (a minimal bigram sketch appears after this list) using:
P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) P_{\text{CONTINUATION}}(w_i)

where

\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \left|\{w : c(w_{i-1}, w) > 0\}\right|

P_{\text{CONTINUATION}}(w) = \frac{\left|\{w_{i-1} : c(w_{i-1}, w) > 0\}\right|}{\sum_{w'} \left|\{w'_{i-1} : c(w'_{i-1}, w') > 0\}\right|}
(e) (5 pts) Predict the next word in the valid set using a sliding window. Report the perplexity scores
of N-gram, Good Turing, and Kneser-Ney on the test set.
(f) (10 pts) There are 3124 examples in input.txt. Choose the first 30 lines and print the predictions
of next words using your N-gram model.
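As referenced in part (d), here is a minimal bigram Kneser-Ney sketch that follows the formula above; it assumes the corpus is already tokenized into a list of word lists and uses a fixed discount d, and all names (train_counts, kneser_ney_prob) are illustrative.

```python
from collections import Counter, defaultdict

def train_counts(sentences):
    """Collect the bigram statistics needed by the Kneser-Ney formula."""
    unigrams, bigrams = Counter(), Counter()
    followers = defaultdict(set)   # w_{i-1} -> set of observed next words
    histories = defaultdict(set)   # w -> set of observed previous words
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
            followers[prev].add(cur)
            histories[cur].add(prev)
    return unigrams, bigrams, followers, histories

def kneser_ney_prob(prev, word, unigrams, bigrams, followers, histories, d=0.75):
    """P_KN(word | prev) with absolute discount d, matching the handout formula."""
    # Denominator of P_CONTINUATION: total number of distinct bigram types
    # (precompute this once in practice).
    total_bigram_types = sum(len(h) for h in histories.values())
    p_cont = len(histories[word]) / total_bigram_types
    if unigrams[prev] == 0:
        return p_cont                                   # unseen history: back off fully
    discounted = max(bigrams[(prev, word)] - d, 0.0) / unigrams[prev]
    lam = d / unigrams[prev] * len(followers[prev])     # normalizing back-off weight
    return discounted + lam * p_cont
```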
(a) (5 pts) Initialize parameters for the model.
(b) (10 pts) Implement the forward pass for the model. Use an embedding layer as the first layer of your
network (e.g. tf.nn.embedding_lookup). Use a recurrent neural network cell (GRU or LSTM) as
the next layer. Given a sequence of words, predict the next word (a minimal model sketch appears after part (f)).
(c) (5 pts) Calculate the loss of the model (sequence cross-entropy loss is suggested),
e.g. tf.contrib.seq2seq.sequence_loss.
(d) (5 pts) Set up the training step: use a learning rate of 1e-3 and the Adam optimizer. Set the window
size to 20 and the batch size to about 50.
(e) (10 pts) Train your RNN model. Calculate the model’s perplexity on the test set. Prove that
perplexity is \exp\left(\frac{\text{total loss}}{\text{number of predictions}}\right).
(f) (10 pts) Print the predictions of next words in the same 30 lines of input.txt as in the N-gram part.
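A minimal model sketch covering parts (a)–(e); the assignment references TF1-era APIs (tf.nn.embedding_lookup, tf.contrib.seq2seq.sequence_loss), but the same structure is shown here in tf.keras as an assumption, with the stated window size of 20, batch size of about 50, and learning rate of 1e-3.

```python
import tensorflow as tf

def build_rnn_lm(vocab_size, embed_dim=128, hidden_dim=256, window=20):
    """Embedding -> GRU -> per-timestep softmax over the vocabulary.
    Given a window of words, the model predicts the next word at every position."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim, input_length=window),
        tf.keras.layers.GRU(hidden_dim, return_sequences=True),
        tf.keras.layers.Dense(vocab_size),      # logits; softmax is applied in the loss
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    return model

# Training on (inputs, targets) pairs of shape (num_examples, window), where
# targets are the inputs shifted by one token; batch size ~50 as specified:
#   model.fit(inputs, targets, batch_size=50, epochs=...)
# Perplexity on held-out data is then exp(mean cross-entropy loss), i.e.
# exp(total loss / number of predictions).
```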
You shall submit a zip file named Assignment3 LastName FirstName.zip
which contains:
• python files (.ipynb or .py) including all the code, plots and results. You need to provide detailed
comments in English.
This assignment focuses on convolutional neural networks. You will need to implement convolutional neural network models for two tasks: document classification and sentiment analysis.
Use the same datasets as Assignment 1. Classify text paragraphs into three categories: Fyodor Dostoyevsky, Arthur Conan Doyle, and Jane Austen by building your own classifiers.
The data provided is from Project Gutenberg.
(a) (10 pts) Preprocess the data: build the vocabulary, tokenize, etc. Divide the data into train, validation, and test.
(b) (10 pts) Initialize parameters for the model. Implement the forward pass for the model. Use an embedding layer as the first layer of your network (e.g. tf.nn.embedding_lookup). Apply zero padding to the input matrix. Use at least two convolutional layers (each layer includes convolution, activation, and max-pooling). A shared architecture sketch appears after task 2.
(c) (10 pts) Choose and report the number of filters and the filter size for your CNN.
(d) (10 pts) Calculate the loss of the model (cross-entropy loss is suggested). Set up the training step: use a learning rate of 1e-3 and the Adam optimizer.
(e) (10 pts) Train your model and report the recall and precision of each class on the test data. Tune the parameters to achieve the best performance you can.
This is a multi-domain sentiment dataset with positive or negative sentiment annotations. We only use the book reviews for this assignment.
There are 1000 positive book reviews and 1000 negative book reviews.
(a) (10 pts) Preprocess the data: extract the review text, build the vocabulary, tokenize, etc. Divide the data into train, validation, and test.
(b) (10 pts) Initialize parameters for the model. Implement the forward pass for the model. Use an embedding layer as the first layer of your network (e.g. tf.nn.embedding_lookup). Apply zero padding to the input matrix. Use at least two convolutional layers (each layer includes convolution, activation, and max-pooling).
(c) (10 pts) Choose and report the number of filters and the filter size for your CNN.
(d) (10 pts) Calculate the loss of the model (binary cross-entropy loss is suggested). Choose an appropriate output function. Set up the training step, including the learning rate and optimizer.
(e) (10 pts) Train your model and report the accuracy of each class and the total accuracy on the test data. Tune the parameters to achieve the best performance you can.
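A minimal tf.keras sketch of an architecture that fits both tasks (the handout’s TF1-style tf.nn.embedding_lookup approach would work equally well); the embedding size, number of filters, and filter size here are placeholders that you are expected to choose, tune, and report, and the binary variant for task 2 is noted in a comment.

```python
import tensorflow as tf

def build_text_cnn(vocab_size, max_len, num_classes=3,
                   embed_dim=128, num_filters=64, filter_size=5):
    """Embedding -> two blocks of (Conv1D + ReLU + max-pooling) -> softmax.
    Inputs are zero-padded word-index sequences of length max_len."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim, input_length=max_len),
        tf.keras.layers.Conv1D(num_filters, filter_size, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        tf.keras.layers.Conv1D(num_filters, filter_size, activation="relu", padding="same"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
        # Task 2 (binary sentiment): use Dense(1, activation="sigmoid")
        # with a binary cross-entropy loss instead.
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```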
You shall submit a zip file named Assignment4 LastName FirstName.zip which contains (those who do not follow this naming policy will receive penalty points):
• python files (.py) including all the code, comments, and results. You need to provide detailed comments in English.
• a report (.pdf) for each task. Describe your model: size of the training set and validation set, parameters for your model, number of filters, filter size for your CNN model, loss function, learning rate, optimizer, etc. Include plots of the training and validation loss.
Report recall and precision for task 1, and the accuracy score for task 2, on the test data.
Further Reading:
• Yoon Kim. Convolutional Neural Networks for Sentence Classification. EMNLP 2014. arXiv:1408.5882
• Ye Zhang, Byron Wallace. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. arXiv:1510.03820