LECTURE 6
Vector Representation and Models for Word Embeddings
Arkaitz Zubiaga, 24th January 2018
2
Vector space models for language representation.
Word embeddings.
SVD: Singular Value Decomposition.
Iteration-based models.
CBOW and skip-gram models.
Word2Vec and GloVe.
LECTURE 6: CONTENTS
VECTOR SPACE MODELS
4
Goal: compute the probability of a sequence of words:
P(W) = P(w_1, w_2, w_3, w_4, w_5, …, w_n)
Related task: probability of an upcoming word:
P(w_5 | w_1, w_2, w_3, w_4)
Both of the above are language models.
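For reference, the joint probability above factorises via the chain rule of probability:
P(w_1, w_2, …, w_n) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) · … · P(w_n | w_1, …, w_(n-1))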
RECAP: STATISTICAL LANGUAGE MODELS
5
So far, we have viewed words as (sequences of) atomic symbols.
We have used edit distance to compute similarity.
N-grams & LMs: what may follow/precede a word?
But this doesn't tell us anything about semantic similarity, e.g.:
Is "Chinese" closer to "Asian" or to "English"?
Are "king" & "queen" more related than "doctor" & "mountain"?
WORDS AS ATOMIC SYMBOLS
6
We may identify significant similarity based on word overlap between:
"Facebook to fight fake news by asking users to rank trust in media outlets"
"Facebook's latest fix for fake news: ask users what they trust"
But we'll fail when there isn't an overlap:
"Zuckerberg announces new feature that crowdsources trustworthiness of news organisations"
WORDS AS ATOMIC SYMBOLS
(No word overlap with the previous headlines, even when using a stemmer/lemmatiser.)
7
Likewise for text classification, e.g.:
If a classifier learns that:
"Leicester will welcome back Jamie Vardy for their Premier League clash with Watford"
belongs to the class/topic "sport",
we'll fail to classify the following also as "sport":
"Blind Cricket World Cup: India beat Pakistan by two wickets in thrilling final to retain title"
WORDS AS ATOMIC SYMBOLS
(Again, no word overlap between the two headlines.)
8
Assumptions:
We can represent words as vectors of some dimension.
There is some N-dimensional space that is enough for encoding language semantics.
Each dimension has some semantic meaning, unknown a priori, but could be e.g.:
Whether it is an object/concept/person.
Gender of a person.
WORD VECTORS OR EMBEDDINGS
9
Word represented as a |V|-dimensional vector, where |V| = vocabulary size: a single 1 in the word's position and 0 elsewhere.
WORD VECTORS: ONE-HOT OR BINARY MODEL
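A minimal sketch of the idea in Python (the toy vocabulary here is hypothetical, chosen only for illustration):

import numpy as np

vocabulary = ['chocolate', 'hello', 'i', 'like', 'world']   # toy vocabulary, |V| = 5
word_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    # |V|-dimensional vector with a single 1 at the word's index
    v = np.zeros(len(vocabulary))
    v[word_index[word]] = 1.0
    return v

print(one_hot('hello'))   # [0. 1. 0. 0. 0.]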
10
Word represented as a |V|-dimensional vector, where |V| = vocabulary size.
Still no notion of similarity, e.g. the dot product of any two different one-hot vectors is 0.
Solution: reduce the dimensionality of the vector space.
WORD VECTORS: ONE-HOT OR BINARY MODEL
11
Bag-of-words: v = {|w_1|, |w_2|, …, |w_n|}, where |w_i| is the count of word w_i.
Toy example: "hello world hello world hello I like chocolate"
v = {3, 2, 1, 1, 1} (counts of hello, world, I, like, chocolate)
Widely used, but largely being replaced by word embeddings.
Con: inefficient for large vocabularies.
Con: doesn't capture semantics (each word is an unrelated token).
BAG-OF-WORDS MODEL
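A minimal sketch of the toy example above (vocabulary ordered by first occurrence, which is an arbitrary choice):

from collections import Counter

tokens = 'hello world hello world hello I like chocolate'.lower().split()
counts = Counter(tokens)

vocabulary = list(dict.fromkeys(tokens))   # first-occurrence order
v = [counts[w] for w in vocabulary]
print(vocabulary)   # ['hello', 'world', 'i', 'like', 'chocolate']
print(v)            # [3, 2, 1, 1, 1]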
12
Given as input:
A text/corpus.
An offset (e.g. 5 words).
In a co-occurrence matrix with |V| rows and |V| columns:
The (i, j)-th value indicates the number of times words i and j co-occur within the given offset.
BUILDING A CO-OCCURRENCE MATRIX
13
Examples (offset = 2 words):
"We need to tackle fake news to keep society informed."
"How can we build a classifier to deal with fake news?"
"Fake" co-occurs with: to (2), news (2), deal (1), tackle (1), with (1).
"Deal (with)" and "tackle" are different tokens for us.
Frequent occurrence in similar contexts will indicate similarity.
BUILDING A CO-OCCURRENCE MATRIX
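A minimal sketch of building such a matrix for the first example sentence (a symmetric window and no sentence boundaries; both are simplifying assumptions):

import numpy as np

corpus = 'we need to tackle fake news to keep society informed'.split()
offset = 2                                        # 2 words to each side

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)), dtype=int)

for i, w in enumerate(corpus):
    for j in range(max(0, i - offset), min(len(corpus), i + offset + 1)):
        if j != i:
            X[idx[w], idx[corpus[j]]] += 1

print(X[idx['fake'], idx['to']])   # 2: 'to' occurs twice within the window of 'fake'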
14
The table will be huge (and sparse) for large vocabularies |V|.
We need to reduce the dimensionality.
WORD EMBEDDINGS: WORD-WORD MATRIX
15
SVD: Singular Value Decomposition.
We build the co-occurrence matrix X (|V| x |V|) with a given offset.
We use SVD to decompose X as X = U S V^T, where:
U (|V| x r) and V (|V| x r) are unitary matrices, and
S (r x r) is a diagonal matrix.
Each row of U, truncated to its first k columns (U[i, 0:k]), is then the word embedding of the corresponding vocabulary word.
WORD EMBEDDINGS: SVD METHODS
16
WORD EMBEDDINGS: SVD METHODS
We get |V| vectors of k dimensions each: word embeddings.
e.g. the word embedding of word w:
WE(w) = {v_1, v_2, …, v_k}
We've reduced w's dimensionality from |V| to k.
17
SVD EXAMPLE IN PYTHON
(offset = 1; note that "like" & "I" co-occur twice in this corpus)
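The code from the original slide is not reproduced here; a minimal sketch in the same spirit, with the co-occurrence matrix written out by hand for the corpus "I like NLP. I like deep learning. I enjoy flying." (offset = 1, sentences treated separately), could be:

import numpy as np

words = ['I', 'like', 'enjoy', 'deep', 'learning', 'NLP', 'flying', '.']
# Hand-built co-occurrence counts, offset = 1, one row/column per word
X = np.array([[0, 2, 1, 0, 0, 0, 0, 0],
              [2, 0, 0, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1, 1, 0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U[:, :2])   # first two dimensions of every word's embedding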
18
PLOTTING SVD EXAMPLE IN PYTHON
Corpus: "I like NLP. I like deep learning. I enjoy flying."
19
PLOTTING SVD EXAMPLE IN PYTHON
Corpus: "I like NLP. I like deep learning. I enjoy flying."
"NLP" and "deep" aren't directly connected in the corpus (offset is 1), but they share a common context ("like").
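A minimal plotting sketch, assuming U and words from the SVD sketch above are still in scope:

import matplotlib.pyplot as plt

# place each word at its coordinates in the first two singular dimensions
for i, word in enumerate(words):
    plt.scatter(U[i, 0], U[i, 1])
    plt.annotate(word, (U[i, 0], U[i, 1]))
plt.show()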
20
COMPUTING WORD SIMILARITY
Corpus: "I like NLP. I like deep learning. I enjoy flying."
We can compute the similarity between w_i and w_j by comparing:
U[i, 0:k] and U[j, 0:k]
21
COMPUTING WORD SIMILARITY
Given two words w_1 and w_2, similarity is computed as:
the dot/inner product, which equates to:
|w_1| * |w_2| * cos(θ)
where |w_1| and |w_2| are the lengths of the two vectors and cos(θ) is the cosine of the angle between them.
22
COMPUTING WORD SIMILARITY
Given two words w_1 and w_2, similarity is computed as:
the dot/inner product, which equates to:
|w_1| * |w_2| * cos(θ)
High similarity for:
near-parallel vectors with high values in the same dimensions.
Low similarity for:
orthogonal vectors, or vectors with low values.
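A minimal sketch of this computation in Python (reusing U and words from the SVD sketch; k = 2 is an arbitrary choice):

import numpy as np

def similarity(a, b):
    # dot product = |a| * |b| * cos(theta); divide by the lengths to get cos(theta)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

k = 2
print(similarity(U[words.index('NLP'), :k], U[words.index('deep'), :k]))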
23
PROS AND CONS OF SVD
Pro: has been shown to perform well in a number of tasks.
Con: dimensions need to change as new words are added to the corpus, which is costly.
Con: the resulting vectors can still be high-dimensional and sparse.
Con: quadratic cost to perform SVD.
ALTERNATIVES TO SVD
25
ALTERNATIVE: ITERATION BASED METHODS
Low-dimensional, dense vectors instead of high-dimensional, sparse vectors.
Instead of computing co-occurrences from the entire corpus, predict the surrounding words in a window of length c around every word.
Rely on a rule that can be updated iteratively.
This will be faster and can easily incorporate a new sentence/document or add a word to the vocabulary.
This is the idea behind word2vec (Mikolov et al. 2013).
26
WORD2VEC: CBOW AND SKIP-GRAM MODELS
Continuous bag-of-words model (CBOW): a language model where we approximate a word from its left and right context within a window of size c.
i.e. from the context to the word.
Skip-gram model: a language model where we approximate the words surrounding a given word within a window of size c to the left and right of that word.
i.e. from the word to the context.
CBOW and skip-gram are the reverse of each other (see the sketch of training pairs below).
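A minimal sketch of the training pairs the two models are built on (a toy sentence and window size; this shows the data layout only, not the neural network):

sentence = 'we need to tackle fake news'.split()
c = 2   # window size

for i, target in enumerate(sentence):
    context = sentence[max(0, i - c):i] + sentence[i + 1:i + c + 1]
    # CBOW: predict the target word from its context
    print('CBOW     ', context, '->', target)
    # skip-gram: predict each context word from the target word
    for ctx in context:
        print('skip-gram', target, '->', ctx)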
27
WORD2VEC: WHY IS IT COOL?
They are very good for encoding similarity.
28
WORD2VEC: WHY IS IT COOL?
They are very good for inferring word relations:
v(Paris) - v(France) + v(Italy) = v(Rome)
v(king) - v(man) + v(woman) = v(queen)
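With a trained gensim model (the variable name model is an assumption; newer gensim versions expose the same call as model.wv.most_similar), these relations can be queried directly:

# king - man + woman ~= queen
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
# a well-trained model returns 'queen' as the top hit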
29
PROS AND CONS: ITERATION BASED METHODS
Pro: do not need to operate on the entire corpus, which involves very sparse matrices.
Pro: can capture semantic properties of words as linear relationships between word vectors.
Pro: fast, and can be easily updated with new sentences (complexity in the order of O(|C|)).
Con: can't take into account the vast amount of repetition in the data.
30
ANOTHER ALTERNATIVE: GLOVE
GloVe (Pennington et al. 2014) is a count-based method that performs dimensionality reduction and has performance similar to word2vec.
It does matrix factorisation.
It can leverage repetitions in the corpus, as it uses the entire word co-occurrence matrix.
How? Train on the non-zero entries of a global word co-occurrence matrix built from a corpus, rather than on the entire sparse matrix or on local word contexts.
31
GLOVE
Computationally expensive the first time, then much faster, as the number of non-zero entries is much smaller than the number of words in the corpus.
The intuition is that relationships between words should be explored in terms of the ratios of their co-occurrence probabilities with some probe words k.
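For reference, a sketch of the weighted least-squares objective from the GloVe paper, fitted only over the non-zero entries X_ij of the co-occurrence matrix:

J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where f is a weighting function that limits the influence of very frequent co-occurrences.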
32-35
GLOVE: VISUALISATION
[Figures: visualisations of GloVe word vector relationships, not reproduced here]
Want to play around?
https://lamyiowce.github.io/word2viz/
36
PYTHON: USING WORD2VEC
Preparing the input:
Word2Vec takes lists of lists of words (lists of sentences) as input.
e.g.:
sentences = [['this', 'is', 'my', 'first', 'sentence'],
             ['a', 'short', 'sentence'],
             ['another', 'sentence'],
             ['and', 'this', 'is', 'the', 'last', 'one']]
37
PYTHON: USING WORD2VEC
Training the model:
model = Word2Vec(sentences, min_count=10, size=300, workers=4)
min_count=10: we will only train vectors for words occurring 10+ times in the corpus.
size=300: we want to produce word vectors of 300 dimensions.
workers=4: we want to parallelise the task across 4 processes.
38
PYTHON: USING WORD2VEC
It's memory-intensive!
It stores matrices of #vocabulary (dependent on min_count) x #size (the size parameter) floats (single precision, i.e. 4 bytes each).
Three such matrices are held in RAM. If you have:
100,000 unique words and size=200, the model will require approximately:
100,000 * 200 * 4 * 3 bytes = ~229MB.
39
PYTHON: USING WORD2VEC
Evaluation:
It's unsupervised, so there is no intrinsic way of evaluating it.
Extrinsic evaluation: test your model in a text classification, sentiment analysis, machine translation, … task!
Does it outperform other methods (e.g. bag-of-words)?
Compare two models A and B: which one's better?
40
PYTHON: USING WORD2VEC
Loading a stored model:
model = Word2Vec.load_word2vec_format('mymodel.txt', binary=False)
or
model = Word2Vec.load_word2vec_format('mymodel.bin.gz', binary=True)
Resuming training:
model = gensim.models.Word2Vec.load('mymodel.bin.gz')
model.train(more_sentences)
41
PYTHON: USING WORD2VEC
42
PYTHON: USING WORD2VEC
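A minimal sketch, assuming the trained gensim model above is stored in a variable named model (newer gensim versions use model.wv['computer']):

v_computer = model['computer']   # 300-dimensional vector if trained with size=300
print(v_computer)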
This will give us the vector representation of 'computer':
v(computer) = {-0.00449447, -0.00310097, …}
How do we then get the vector representations for sentences, e.g.:
"I have installed Ubuntu on my computer"?
43
PYTHON: USING WORD2VEC
Vector representations for sentences, e.g.:
"I have installed Ubuntu on my computer"
Standard practice is either of:
Summing the word vectors (they all have the same dimensionality):
v(I) + v(have) + v(installed) + v(Ubuntu) + …
Taking the average of the word vectors:
(v(I) + v(have) + v(installed) + …) / 7
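A minimal sketch of both options, assuming every token is in the model's vocabulary (out-of-vocabulary words would need to be skipped):

import numpy as np

tokens = 'I have installed Ubuntu on my computer'.split()
vectors = [model[w] for w in tokens]

sentence_sum = np.sum(vectors, axis=0)      # summing the word vectors
sentence_avg = np.mean(vectors, axis=0)     # averaging them (sum / 7 here)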
44
PRE-TRAINED WORD VECTORS
One can train a model from a large corpus (millions, if not billions, of sentences). This can be time-consuming and memory-intensive.
Pre-trained models are available.
Remember to choose a suitable pre-trained model.
Don't use word vectors pre-trained on news articles when you're working with social media!
45
PRE-TRAINED WORD VECTORS
GloVe's pre-trained vectors:
https://nlp.stanford.edu/projects/glove/
46
PRE-TRAINED WORD VECTORS
Pre-trained word vectors for 30+ languages (from Wikipedia):
htps://github.com/Kyubyong/wordvectors
47
PRE-TRAINED WORD VECTORS
UK Twitter word embeddings:
https://figshare.com/articles/UK_Twitter_word_embeddings_II_/5791650
48
REFERENCES
Gensim (word2vec):
https://radimrehurek.com/gensim/
Word2vec tutorial:
https://rare-technologies.com/word2vec-tutorial/
FastText:
https://github.com/facebookresearch/fastText/
GloVe: Global Vectors for Word Representation:
https://nlp.stanford.edu/projects/glove/
49
ASSOCIATED READING
Not yet part of Jurafsky's book.
See "Deep Learning for NLP" (CS224d), lectures 1 and 2:
http://cs224d.stanford.edu/syllabus.html
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013).
Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing
systems (pp. 3111-3119).
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global
vectors for word representation. In Proceedings of the 2014
conference on empirical methods in natural language processing
(EMNLP) (pp. 1532-1543).