1. The Kneser-Ney smoothing method approximates the probability of an n-gram that has not been seen using the likelihood of the (n−1)-gram to occur in diverse contexts. If we want to finish the sentence “I want to go to the movie ___” with the word “theatre” but we have not observed “theatre” very often, then we want to make sure that we can guess “theatre” based on its likelihood of combining with other words. The full formula for the bigram version of Kneser-Ney smoothing follows:
P(bigram) = discounted bigram probability + λ-weighted continuation probability    (1)

P(w_i | w_{i−1}) = max(C(w_{i−1}, w_i) − d, 0) / C(w_{i−1})
                 + λ(w_{i−1}) · |{v : C(v, w_i) > 0}| / Σ_{w′} |{v : C(v, w′) > 0}|    (2)

λ(w_{i−1}) = ( d / Σ_v C(w_{i−1}, v) ) · |{w : C(w_{i−1}, w) > 0}|    (3)

Assume that you have collected the data in the following tables, and assume that all other observed counts are 0. In the bigram table, rows represent w_{i−1} and columns represent w_i: e.g. C(computer, keyboard) = 2.

C(w_{i−1}, w_i)  computer  keyboard  monitor  store
computer 0 2 4 4
keyboard 1 0 0 1
monitor 0 1 1 1
store 2 0 0 0
Table 1: Bigram frequency. Rows = w_{i−1}, columns = w_i.
computer 10
keyboard 3
monitor 6
store 5
Table 2: Unigram frequency.
Consider the following sentence fragment S: “I shopped at the computer ___”. You need to determine whether the sentence is more likely to end with “computer store” or “computer monitor.”
(a) Compute the raw bigram probabilities for the candidate words {store, monitor} to complete the sentence S, i.e. P(store|computer) and P(monitor|computer). Is one word more likely than the other, and if so which one? [2 pts]

(b) Compute the Kneser-Ney smoothed bigram probability of the candidate words {store, monitor} to complete the sentence. Use d = 0.5 as the discount term. Is one word more likely than the other, and if so which one? If the result has changed, why do you think it changed? [5 pts]

(c) Change the discount term to d = 0.1 and re-compute the Kneser-Ney smoothed bigram probability of the candidate words {store, monitor} to complete the sentence. Is one word more likely than the other, and if so which one? If the result has changed, why do you think it changed? [3 pts]
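As a sanity check on parts (a)–(c), the quantities in equations (2) and (3) can be computed directly from the counts in Tables 1 and 2. The sketch below is illustrative only (the function names are our own, not part of the assignment):

```python
# Counts from Table 1 (only nonzero bigrams) and Table 2 (unigrams).
bigram = {
    ("computer", "keyboard"): 2, ("computer", "monitor"): 4, ("computer", "store"): 4,
    ("keyboard", "computer"): 1, ("keyboard", "store"): 1,
    ("monitor", "keyboard"): 1, ("monitor", "monitor"): 1, ("monitor", "store"): 1,
    ("store", "computer"): 2,
}
unigram = {"computer": 10, "keyboard": 3, "monitor": 6, "store": 5}

def raw_bigram_prob(prev, word):
    """Unsmoothed P(word | prev) = C(prev, word) / C(prev)."""
    return bigram.get((prev, word), 0) / unigram[prev]

def kneser_ney_prob(prev, word, d):
    """Equation (2): discounted bigram term + lambda-weighted continuation term."""
    # Discounted bigram probability: max(C(prev, word) - d, 0) / C(prev)
    discounted = max(bigram.get((prev, word), 0) - d, 0) / unigram[prev]
    # lambda(prev), equation (3): (d / sum_v C(prev, v)) * |{w : C(prev, w) > 0}|
    row_sum = sum(c for (p, _), c in bigram.items() if p == prev)
    n_types_after_prev = sum(1 for (p, _) in bigram if p == prev)
    lam = (d / row_sum) * n_types_after_prev
    # Continuation probability: fraction of distinct bigram types ending in `word`
    n_contexts = sum(1 for (_, w) in bigram if w == word)
    n_all_types = len(bigram)  # dict holds only nonzero counts
    return discounted + lam * n_contexts / n_all_types

print(raw_bigram_prob("computer", "store"), raw_bigram_prob("computer", "monitor"))
print(kneser_ney_prob("computer", "store", 0.5), kneser_ney_prob("computer", "monitor", 0.5))
print(kneser_ney_prob("computer", "store", 0.1), kneser_ney_prob("computer", "monitor", 0.1))
```

Note that “store” appears after three distinct contexts in Table 1 while “monitor” appears after only two, which is exactly what the continuation term rewards.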
Solution.
Solution goes here.

2. Suppose we have a term-document matrix for four words in three documents, shown in Table 3. The whole document set has N = 20 documents, and for each of the four words the document frequency df_t is shown in Table 4.
term-document Doc1 Doc2 Doc3
car 27 4 24
insurance 3 18 0
auto 0 33 29
best 14 0 17
Table 3: Term-document Matrix
df_t
car 12
insurance 6
auto 10
best 16
Table 4: Document Frequency

(a) Compute the tf-idf weights for each of the words car, auto, insurance, and best in Doc1, Doc2, and Doc3. [6 pts]
(b) Use the tf-idf weights you get from (a) to represent each document as a vector, and calculate the cosine similarities between these three documents. [4 pts]
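The computations in (a) and (b) can be sketched in Python. Note that the exact weighting scheme is an assumption here: this sketch uses w = tf · log10(N/df) with raw term frequency; if your course uses the log-scaled variant (1 + log10 tf) for nonzero tf, substitute accordingly.

```python
import math

# Term counts from Table 3 (columns: Doc1, Doc2, Doc3) and df values from Table 4.
tf = {
    "car":       [27, 4, 24],
    "insurance": [3, 18, 0],
    "auto":      [0, 33, 29],
    "best":      [14, 0, 17],
}
df = {"car": 12, "insurance": 6, "auto": 10, "best": 16}
N = 20

# tf-idf weight: w = tf * log10(N / df). A zero count yields weight 0.
tfidf = {t: [c * math.log10(N / df[t]) for c in counts] for t, counts in tf.items()}

def doc_vector(j):
    """Represent document j (0-indexed) as a vector of tf-idf weights."""
    return [tfidf[t][j] for t in tf]

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(f"cos(Doc{i+1}, Doc{j+1}) = {cosine(doc_vector(i), doc_vector(j)):.3f}")
```

Since all weights are non-negative, the cosine similarities fall in [0, 1].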
Solution.
Solution goes here.

3. The distributional hypothesis suggests that the more similar two words are in meaning, the more distributionally similar they are, where a word's distribution refers to the contexts in which it appears. This motivated the work by Mikolov et al. on the skip-gram model, which is an efficient way of learning high-quality dense vector representations of words from unstructured text. The objective of the skip-gram model is to learn the probability distribution P(O|I): given an inside word w_I, we intend to estimate the probability that an outside word w_O lies in the context window of w_I. The basic formulation of the skip-gram model defines this using the softmax function:
P(O = w_O | I = w_I) = exp(u_{w_O}^T v_{w_I}) / Σ_{w ∈ Vocab} exp(u_w^T v_{w_I})    (4)

Here, u_{w_O} is the word vector representing the outside word w_O, and v_{w_I} is the word vector representing the inside word w_I. To update these parameters continually during training, we store them in two matrices, U and V. The columns of V are all of the inside word vectors v_{w_I}, while the columns of U are all the outside word vectors u_{w_O}; both matrices contain a vector for each word in the vocabulary.

(a) The cross entropy loss between two probability distributions p and q is expressed as:
CE(p, q) = − Σ_i p_i log(q_i)    (5)

For a given inside word w_I = w_k, consider the ground truth distribution y to be a one-hot vector (of the same length as the vocabulary) with a 1 only for the true outside word w_O and 0 everywhere else. The predicted distribution ŷ (of the same length) is the probability distribution P(w_O | w_I = w_k). The i-th entry in these vectors is the probability of the i-th word being an outside word. Write down and simplify the expression for the cross entropy loss, CE(y, ŷ), for the skip-gram model described above for a single pair of words w_O and w_I. (Note: your answer should be in terms of P(O = w_O | I = w_I).) [2 pts]

(b) Find the partial derivative of the cross entropy loss calculated in part (a) with respect to the inside word vector v_{w_I}. (Note: your answer should be in terms of y, ŷ, and U.) [5 pts]

(c) Find the partial derivative of the cross entropy loss calculated in part (a) with respect to each of the outside word vectors u_{w_O}. (Note: do this for both cases w_O = O (the true outside word) and w_O ≠ O (all other words). Your answer should be in terms of y, ŷ, and v_{w_I}.) [5 pts]

(d) Explain the idea of negative sampling and the use of the parameter K. Write down the loss function for this case. (Note: your answer should be in terms of u_{w_O}, v_{w_I}, and the parameter K.) [3 pts]

4. In the textbook, language modeling was defined as the task of predicting the next
word in a sequence given the previous words. In this assignment, you will implement character-level and word-level N-gram language models and a character-level RNN language model. You need to both answer the questions and submit your code. You need to submit a zip file to Canvas Homework 3 Programming. This file should contain ngram.ipynb, rnn.ipynb, hw3_skeleton_char.py, and hw3_skeleton_word.py. You should also submit a zip file to the Gradescope assignment HW3 Language Models. This file should contain hw3_skeleton_char.py and hw3_skeleton_word.py.
(a) For N-gram language models, you should complete two scripts, hw3_skeleton_char.py and hw3_skeleton_word.py. Detailed instructions can be found in ngram.ipynb. You should also use the test cases in ngram.ipynb, and use ngram.ipynb to get development results for (c). You need to submit a zip file to Gradescope HW3 Language Models. This file should contain hw3_skeleton_char.py and hw3_skeleton_word.py. You can see the scores for your code there. The character-level N-gram language model accounts for 20 points. The word-level N-gram language model accounts for 10 points, which are bonus for CS 4650. [30 pts for CS 7650, 20 pts + bonus 10 pts for CS 4650]

(b) See the generation results of your character-level and word-level N-gram language
models, respectively (n ≥ 1). The paragraphs that the character-level N-gram language models generate all start with “F”. The paragraphs that the word-level N-gram language models generate all start with “In”. Did you get such results? Explain what is going on. (CS 4650 need only focus on the character-level N-gram
language model.) [2 pts]

(c) (Bonus for CS 4650) Compare the generation results of character-level and word-level
N-gram language models. Which do you think is better? Compare the perplexity
of nytimes_article.txt when using character-level and word-level N-gram language
models. Explain what you found. [2 pts, bonus for CS 4650]

(d) When you compute perplexity, you can play with different sets of hyper-parameters
in both character-level and word-level N-gram language models. You can tune n, k, and λ. Please report here the best results and the corresponding hyper-parameters on the development sets. For character-level N-gram language models, the development set is shakespeare_sonnets.txt. For word-level N-gram language models, the development sets are shakespeare_sonnets.txt and val_e.txt. (CS 4650 should only focus on the character-level N-gram language model.) [6 pts for CS 7650, 2 pts + bonus 4 pts for CS 4650]

(e) For RNN language models, you should complete the forward method of Class
RNN in rnn.ipynb. You need to figure out the code and tune the hyperparameters. You should also copy a paragraph generated by your model and report the perplexity on the development set shakespeare_sonnets.txt. Compare the results of the character-level RNN language model and the character-level N-gram language model. [10 pts]
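For intuition, a minimal character-level N-gram language model with add-k smoothing and perplexity might look like the sketch below. This is illustrative only, assuming a simple add-k estimator; it is not the interface defined in hw3_skeleton_char.py:

```python
import math
from collections import defaultdict

class CharNgramLM:
    """Toy character-level n-gram LM with add-k smoothing (illustrative sketch)."""

    def __init__(self, n=3, k=1.0):
        self.n, self.k = n, k
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count
        self.vocab = set()

    def train(self, text):
        # Pad the front so the first n-1 characters also have a context.
        padded = "~" * (self.n - 1) + text
        for i in range(self.n - 1, len(padded)):
            ctx, ch = padded[i - self.n + 1:i], padded[i]
            self.counts[ctx][ch] += 1
            self.vocab.add(ch)

    def prob(self, ctx, ch):
        # Add-k smoothed P(ch | ctx); unseen contexts back off to uniform.
        c = self.counts[ctx]
        return (c[ch] + self.k) / (sum(c.values()) + self.k * len(self.vocab))

    def perplexity(self, text):
        padded = "~" * (self.n - 1) + text
        logp = sum(math.log(self.prob(padded[i - self.n + 1:i], padded[i]))
                   for i in range(self.n - 1, len(padded)))
        return math.exp(-logp / len(text))

lm = CharNgramLM(n=2, k=1.0)
lm.train("the cat sat on the mat")
print(lm.perplexity("the cat"))
```

Tuning n and k trades off context specificity against data sparsity, which is exactly what part (d) asks you to explore on the development sets.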