
LECTURE 6

Vector Representation and Models for Word Embeddings

Arkaitz Zubiaga, 24th January, 2018

2

Vector space models for language representation.

Word embeddings.

SVD: Singular Value Decomposition.

Iteration-based models.

CBOW and skip-gram models.

Word2Vec and GloVe.

LECTURE 6: CONTENTS

VECTOR SPACE MODELS

4

Goal: compute the probability of a sequence of words:

P(W) = P(w_1, w_2, w_3, w_4, w_5, …, w_n)

Related task: probability of an upcoming word:

P(w_5 | w_1, w_2, w_3, w_4)

Both of the above are language models.

RECAP: STATISTICAL LANGUAGE MODELS

5

So far, we have viewed words as (sequences of) atomic symbols.

We have used edit distance to compute similarity.

N-grams & LMs: what may follow/precede the word?

But this doesn't tell us anything about semantic similarity, e.g.:

Is Chinese closer to Asian or to English?

Are king & queen more related than doctor & mountain?

WORDS AS ATOMIC SYMBOLS

6

We may identify significant similarity based on word overlap between:

Facebook to fight fake news by asking users to rank trust in media outlets

Facebook's latest fix for fake news: ask users what they trust

But we'll fail when there isn't an overlap:

Zuckerberg announces new feature that crowdsources trustworthiness of news organisations

WORDS AS ATOMIC SYMBOLS

NO OVERLAP

Using stemmer/lemmatiser

7

Likewise for text classification, e.g.:

If a classifier learns that:

Leicester will welcome back Jamie Vardy for their Premier League clash with Watford

belongs to the class/topic sport,

we'll fail to classify the following also as sport:

Blind Cricket World Cup: India beat Pakistan by two wickets in thrilling final to retain title

WORDS AS ATOMIC SYMBOLS

NO OVERLAP

8

Assumptions:

We can represent words as vectors of some dimension.

There is some N-dimensional space that is enough for encoding language semantics.

Each dimension has some semantic meaning, unknown a priori, but could be e.g.:

Whether it is an object/concept/person.

Gender of person.

WORD VECTORS OR EMBEDDINGS

9

Word represented as a one-hot vector of dimension |V| (|V| = vocabulary size).

WORD VECTORS: ONE-HOT OR BINARY MODEL

10

Word represented as a one-hot vector of dimension |V| (|V| = vocabulary size).

Still no notion of similarity, e.g. the dot product of any two different one-hot vectors is 0.

Solution: reduce the dimensionality of the vector space.

WORD VECTORS: ONE-HOT OR BINARY MODEL
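A minimal one-hot sketch (not from the slides; the toy vocabulary is made up for illustration): every word gets a |V|-dimensional vector with a single 1, and the dot product of any two different words is 0, so one-hot vectors carry no similarity information.

import numpy as np

vocab = ["hotel", "motel", "king", "queen", "chocolate"]   # toy vocabulary, |V| = 5

def one_hot(word):
    # |V|-dimensional vector with a 1 in the word's position and 0 elsewhere
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1
    return v

print(one_hot("hotel"))                            # [1. 0. 0. 0. 0.]
print(np.dot(one_hot("hotel"), one_hot("motel")))  # 0.0: no notion of similarity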

11

Bag-of-words: v = {|w_1|, |w_2|, …, |w_n|}

Toy example: hello world hello world hello I like chocolate
v = {3, 2, 1, 1, 1} (counts for hello, world, I, like, chocolate)

Widely used, but largely being replaced by word embeddings.

Con: inefficient for large vocabularies.

Con: doesn't capture semantics (each word is an unrelated token).

BAG-OF-WORDS MODEL
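A minimal bag-of-words sketch (not from the slides), using only the standard library to count each vocabulary word in the toy example above; the order of the vector is simply whatever order we fix for the vocabulary.

from collections import Counter

doc = "hello world hello world hello I like chocolate"
counts = Counter(doc.split())                          # {'hello': 3, 'world': 2, ...}

vocab = ["hello", "world", "I", "like", "chocolate"]   # first-occurrence order
v = [counts[w] for w in vocab]
print(v)                                               # [3, 2, 1, 1, 1]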

12

Given as input:

A text/corpus.

An offset (e.g. 5 words).

In a co-occurrence matrix with |V| rows and |V| columns:

The (i, j)-th value indicates the number of times words i and j co-occur within the given offset.

BUILDING A CO-OCCURRENCE MATRIX

13

Examples (offset = 2 words):

We need to tackle fake news to keep society informed.

How can we build a classifier to deal with fake news?

Fake co-occurs with: to(2), news(2), deal(1), tackle(1), with(1)

Deal (with) and tackle are different tokens for us.
Frequent occurrence in similar contexts will indicate similarity.

BUILDING A CO-OCCURRENCE MATRIX
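A minimal co-occurrence sketch (not from the slides), assuming lower-cased tokens, punctuation stripped, and a symmetric window of offset words on each side of every word.

from collections import defaultdict
import re

corpus = ["We need to tackle fake news to keep society informed.",
          "How can we build a classifier to deal with fake news?"]
offset = 2

cooc = defaultdict(int)                  # (word_i, word_j) -> co-occurrence count
for sentence in corpus:
    tokens = re.findall(r"\w+", sentence.lower())
    for i, w in enumerate(tokens):
        for j in range(max(0, i - offset), min(len(tokens), i + offset + 1)):
            if j != i:
                cooc[(w, tokens[j])] += 1

print(cooc[("fake", "to")], cooc[("fake", "news")], cooc[("fake", "tackle")])   # 2 2 1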

14

The table will be huge (and sparse) for large |V| (vocabularies).

We need to reduce the dimensionality.

WORD EMBEDDINGS: WORD-WORD MATRIX

15

SVD: Singular Value Decomposition

We build a co-occurrence matrix X (|V| x |V|) with a given offset.
We use SVD to decompose X as X = U S V^T, where:

U (|V| x r) and V (|V| x r) are unitary matrices, and

S (r x r) is a diagonal matrix.

The rows of U, truncated to the first k columns (the top left singular vectors), are then the word embeddings of the vocabulary.

WORD EMBEDDINGS: SVD METHODS

16

WORD EMBEDDINGS: SVD METHODS

We get |V| vectors of k dimensions each: word embeddings.
e.g. word embedding of word w:
WE(w) = {v_1, v_2, …, v_k}

We've reduced w's dimensionality from |V| to k.

17

SVD EXAMPLE IN PYTHON

Corpus: I like NLP. I like deep learning. I enjoy flying.

(offset = 1; like & I co-occur twice)
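A sketch of the example, following the CS224d material the lecture points to: build the offset-1 co-occurrence matrix for the corpus above and decompose it with SVD to get low-dimensional word vectors.

import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
# offset-1 co-occurrence counts for: I like NLP. I like deep learning. I enjoy flying.
X = np.array([[0, 2, 1, 0, 0, 0, 0, 0],
              [2, 0, 0, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1, 1, 0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T

k = 2                                 # keep the first k columns of U as word embeddings
print(U[words.index("like"), :k])     # 2-dimensional embedding of "like"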

18

PLOTTING SVD EXAMPLE IN PYTHON

Corpus: I like NLP. I like deep learning. I enjoy flying.
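A minimal plotting sketch (assuming the words list and U from the SVD sketch above): scatter each word on the first two SVD dimensions and label the points.

import matplotlib.pyplot as plt

for i, word in enumerate(words):
    plt.scatter(U[i, 0], U[i, 1])
    plt.annotate(word, (U[i, 0], U[i, 1]))
plt.xlabel("SVD dimension 1")
plt.ylabel("SVD dimension 2")
plt.show()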

19

PLOTTING SVD EXAMPLE IN PYTHON

Corpus: I like NLP. I like deep learning. I enjoy flying.

NLP and deep aren't directly connected in the corpus (offset is 1), but have a common context (like).

20

COMPUTING WORD SIMILARITY

Corpus: I like NLP. I like deep learning. I enjoy flying.

We can compute similarity between w_i and w_j by comparing:

U[i, 0:k] and U[j, 0:k]

21

COMPUTING WORD SIMILARITY

Given 2 words w_1 and w_2, similarity is computed as:

Dot/inner product, which equates to:

w_1 · w_2 = |w_1| * |w_2| * cos(θ)

where cos(θ) is the cosine of the angle between the vectors, and |w_1| and |w_2| are the lengths of w_1 and w_2.

22

COMPUTING WORD SIMILARITY

Given 2 words w_1 and w_2, similarity is computed as:

Dot/inner product, which equates to:

w_1 · w_2 = |w_1| * |w_2| * cos(θ)

High similarity for:

near-parallel vectors with high values in the same dimensions.

Low similarity for:

orthogonal vectors, or vectors with low values.
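A small cosine-similarity sketch (not from the slides): dividing the dot product by the two vector lengths leaves cos(θ), which is high for near-parallel vectors and 0 for orthogonal ones.

import numpy as np

def cosine_similarity(w1, w2):
    # w1 . w2 = |w1| * |w2| * cos(theta)  =>  cos(theta) = (w1 . w2) / (|w1| * |w2|)
    return np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))   # ~1.0 (parallel)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))   # 0.0 (orthogonal)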

23

PROS AND CONS OF SVD

Pro: has been shown to perform well in a number of tasks.

Con: dimensions need to change as new words are added to the corpus, which is costly.

Con: resulting vectors can still be high dimensional and sparse.

Con: quadratic cost to perform SVD.

ALTERNATIVES TO SVD

25

ALTERNATIVE: ITERATION BASED METHODS

Low dimensional, dense vectors instead of high dimensional, sparse vectors.

Instead of computing co-occurrences from the entire corpus, predict the surrounding words in a window of length c around every word.

Rely on a rule that can be updated.

This will be faster and can easily incorporate a new sentence/document or add a word to the vocabulary.

This is the idea behind word2vec (Mikolov et al., 2013).

26

WORD2VEC: CBOW AND SKIPGRAM MODELS

Continuous bag of words model (CBOW): a language model where we approximate a word from its left and right context within a window of size c.

i.e. from the context to the word.

Skip-gram model: a language model where we approximate the words surrounding a given word within a window of size c to the left and right of the word.

i.e. from the word to the context.

CBOW and skip-gram: the reverse of each other.
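A hedged gensim sketch (the toy sentences are made up for illustration): the sg flag switches between CBOW (sg=0, the default) and skip-gram (sg=1), and window sets the context size c.

from gensim.models import Word2Vec

sentences = [["we", "need", "to", "tackle", "fake", "news"],
             ["how", "can", "we", "build", "a", "classifier", "to", "deal", "with", "fake", "news"]]

cbow = Word2Vec(sentences, sg=0, window=2, min_count=1)       # CBOW: context -> word
skipgram = Word2Vec(sentences, sg=1, window=2, min_count=1)   # skip-gram: word -> context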

27

WORD2VEC: WHY IS IT COOL?

They are very good for encoding similarity.

28

WORD2VEC: WHY IS IT COOL?

They are very good for inferring word relations:

v(Paris) - v(France) + v(Italy) = v(Rome)

v(king) - v(man) + v(woman) = v(queen)
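A hedged sketch of testing such relations with gensim (assuming a trained or pre-trained model; with a tiny corpus the result will not be meaningful): most_similar adds the positive vectors and subtracts the negative ones.

# v(Paris) - v(France) + v(Italy) should land near v(Rome)
result = model.wv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1)
print(result)   # ideally something like [('Rome', 0.7...)]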

29

PROS AND CONS: ITERATION BASED METHODS

Pro: do not need to operate on the entire corpus, which involves very sparse matrices.

Pro: can capture semantic properties of words as linear relationships between word vectors.

Pro: fast and can easily be updated with new sentences (complexity in the order of O(|C|)).

Con: can't take into account the vast amount of repetition in the data.

30

ANOTHER ALTERNATIVE: GLOVE

GloVe (Pennington et al., 2014) is a count-based method, does dimensionality reduction, and has similar performance to word2vec.

Does matrix factorisation.

Can leverage repetitions in the corpus by using the entire word co-occurrence matrix.

How? Train on the non-zero entries of a global word co-occurrence matrix from a corpus, rather than on the entire sparse matrix or on local word contexts.

31

GLOVE

Computationally expensive the first time, then much faster, as the number of non-zero entries is much smaller than the number of words in the corpus.

The intuition is that relationships between words should be explored in terms of the ratios of their co-occurrence probabilities with some probe words k.

32

GLOVE: VISUALISATION


Want to play around?

https://lamyiowce.github.io/word2viz/

36

PYTHON: USING WORD2VEC

Preparing the input:
Word2Vec takes lists of lists of words (lists of sentences) as input.

e.g.:

sentences = [['this', 'is', 'my', 'first', 'sentence'],
             ['a', 'short', 'sentence'],
             ['another', 'sentence'],
             ['and', 'this', 'is', 'the', 'last', 'one']]

37

PYTHON: USING WORD2VEC

Training the model:

from gensim.models import Word2Vec
model = Word2Vec(sentences, min_count=10, size=300, workers=4)

min_count=10: we will only train vectors for words occurring 10+ times in the corpus.

size=300: we want to produce word vectors of 300 dimensions (in gensim >= 4.0 this parameter is called vector_size).

workers=4: we want to parallelise the task, running 4 worker processes.

38

PYTHON: USING WORD2VEC

It's memory intensive!

It stores matrices of #vocabulary (dependent on min_count) x #size (the size parameter) floats (single precision, i.e. 4 bytes).

Three such matrices are held in RAM. If you have 100,000 unique words and size=200, the model will require approximately:

100,000 * 200 * 4 * 3 bytes = ~229 MB.

39

PYTHON: USING WORD2VEC

Evaluation:

It's unsupervised, so there is no intrinsic way of evaluating.

Extrinsic evaluation: test your model in a text classification, sentiment analysis or machine translation task!

Does it outperform other methods (e.g. bag-of-words)?
Compare two models A and B: which one is better?

40

PYTHON: USING WORD2VEC

Storing/loading a model:

model = Word2Vec.load_word2vec_format('mymodel.txt', binary=False)

or

model = Word2Vec.load_word2vec_format('mymodel.bin.gz', binary=True)

(in newer gensim versions this loader lives on KeyedVectors: gensim.models.KeyedVectors.load_word2vec_format)

Resuming training:

model = gensim.models.Word2Vec.load('mymodel.bin.gz')

model.train(more_sentences)

41

PYTHON: USING WORD2VEC
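A minimal sketch (assuming a model trained as above) of looking up the vector learned for a word:

vector = model.wv["computer"]     # older gensim versions also allowed model["computer"]
print(len(vector))                # 300, i.e. the size/vector_size chosen at training time
print(vector[:2])                 # first two dimensions, e.g. [-0.00449447, -0.00310097]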

42

PYTHON: USING WORD2VEC

This will give us the vector representation of computer:

v(computer) = {-0.00449447, -0.00310097, …}

How do we then get the vector representations for sentences, e.g.:

I have installed Ubuntu on my computer

43

PYTHON: USING WORD2VEC

Vector representations for sentences, e.g.:

I have installed Ubuntu on my computer

Standard practice is either of:

Summing word vectors (they have the same dimensionality):
v(I) + v(have) + v(installed) + v(Ubuntu) + …

Getting the average of word vectors:
(v(I) + v(have) + v(installed) + …) / 7
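A minimal sketch (assuming a trained gensim model) of averaging word vectors to get a sentence vector; words missing from the vocabulary are skipped.

import numpy as np

sentence = "I have installed Ubuntu on my computer".split()
word_vectors = [model.wv[w] for w in sentence if w in model.wv]   # skip out-of-vocabulary words
sentence_vector = np.mean(word_vectors, axis=0)   # average; np.sum(..., axis=0) would give the sum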

44

PRE-TRAINED WORD VECTORS

One can train a model from a large corpus (millions, if not billions, of sentences). This can be time-consuming and memory-intensive.

Pre-trained models are available.

Remember to choose a suitable pre-trained model.

Don't use word vectors pre-trained from news articles when you're working with social media!

45

PRE-TRAINED WORD VECTORS

GloVe's pre-trained vectors:
https://nlp.stanford.edu/projects/glove/

46

PRE-TRAINED WORD VECTORS

Pre-trained word vectors for 30+ languages (from Wikipedia):
https://github.com/Kyubyong/wordvectors

47

PRE-TRAINED WORD VECTORS

UK Twitter word embeddings:
https://figshare.com/articles/UK_Twitter_word_embeddings_II_/5791650

48

REFERENCES

Gensim (word2vec):
https://radimrehurek.com/gensim/

Word2vec tutorial:
https://rare-technologies.com/word2vec-tutorial/

FastText:
https://github.com/facebookresearch/fastText/

GloVe: Global Vectors for Word Representation:
https://nlp.stanford.edu/projects/glove/

49

ASSOCIATED READING

Not yet part of Jurafsky's book.
See Deep Learning for NLP (CS224d), lectures 1 and 2:

http://cs224d.stanford.edu/syllabus.html
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013).

Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing
systems (pp. 3111-3119).

Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global
vectors for word representation. In Proceedings of the 2014
conference on empirical methods in natural language processing
(EMNLP) (pp. 1532-1543).
