
LECTURE 11

Word Senses and Similarity

Arkaitz Zubiaga, 14th February, 2018

2

 Word Senses: Concepts.

 Thesauri: WordNet.

 Thesaurus Methods.

 Distributional Models of Similarity.

 Evaluation.

LECTURE 11: CONTENTS

WORD SENSES: CONCEPTS

4

 Homonymy: the same word can have different, unrelated meanings:

 I put my money in the bank₁.

 We took the picture on the east bank₂ of the Nile river.

 bank₁: financial institution.

 bank₂: sloping land.

HOMONYMY

5

 Polysemy: same word, related meanings:

 The bank₁ was constructed in 1875 out of local red brick.

 I withdrew the money from the bank₂.

 bank₁: the building belonging to a financial institution.

 bank₂: a financial institution.

POLYSEMY

6

 Polysemy is often systematic:

 Building & organisation:

 I’m heading to the university.

 The university staff has gone on strike.

 Author & work:

 Shakespeare wrote nice stuff.

 I love reading Shakespeare.

SYSTEMATIC POLYSEMY

7

 Synonyms: words with same meaning in some or all contexts.

 filbert / hazelnut

 couch / sofa

 big / large

SYNONYMY

8

 But there are few (or no) examples of perfect synonymy.

 e.g. big/large:

 My big sister… [= older sister]

 My large sister… [≠ older sister]

SYNONYMY

9

 Antonyms: senses that are opposites with respect to one
feature of meaning, very similar otherwise:

 dark/light.

 short/long.

 fast/slow.

 hot/cold.

 up/down.

ANTONYMY

10

 One sense is a hyponym of another if the first sense is more
specific, denoting a subclass of the other:

 car is a hyponym of vehicle.

 apple is a hyponym of fruit.

 Conversely, hypernym:

 vehicle is a hypernym of car.

 fruit is a hypernym of apple.

HYPONYMY AND HYPERNYMY

11

 An instance is an individual, a proper noun that is a unique
entity:

 London is an instance of city.

 But city is a class:

 city is a hyponym of municipality or location.

INSTANCES VS HYPONYMS

THESAURI

13

THESAURI

 A thesaurus (plural thesauri) is a reference work that lists words
grouped together according to similarity of meaning.

 Useful for different tasks (which we’ll see in the next lectures):
 Information Extraction.
 Information Retrieval.
 Question Answering.
 Machine Translation.

14

WORDNET 3.1

 WordNet is a popular lexical database (a thesaurus plus aspects of a dictionary):
https://wordnet.princeton.edu/

 It’s integrated in NLTK:
http://www.nltk.org/howto/wordnet.html

 Structured into 117,000 synsets (synonym sets).
 Linked to other synsets through “conceptual relations”.
 Synsets have brief definition (“gloss”) and example sentences.
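
As a quick illustration of that NLTK integration, here is a minimal sketch (assuming nltk is installed and the WordNet corpus has been fetched, e.g. via nltk.download('wordnet')) that lists the senses of “bass”, previewing the next slide:

```python
from nltk.corpus import wordnet as wn

# Each synset (sense) of "bass" with its gloss (definition).
for synset in wn.synsets('bass'):
    print(synset.name(), '-', synset.definition())
```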


15

SENSES OF “BASS” IN WORDNET

16

SYNSETS: SYNONYM SETS

 Example: chump as a noun with the gloss:
 “a person who is gullible and easy to take advantage of”

 This sense of “chump” is shared by 9 words:

 chump₁, fool₂, gull₁, mark₉, patsy₁, fall guy₁, sucker₁, soft touch₁, mug₂

 Note: only those specific senses of the words belong to this synset; e.g. mug₁
is a “cup” and belongs to a different synset.
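
A minimal sketch of retrieving this synset with NLTK (the sense identifier chump.n.01 is an assumption; check the actual sense inventory of your WordNet version):

```python
from nltk.corpus import wordnet as wn

# The gullible-person sense of "chump"; the sense number is assumed.
chump = wn.synset('chump.n.01')
print(chump.definition())    # the gloss
print(chump.lemma_names())   # the words sharing this sense
```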

17

WORDNET HYPERNYM HIERARCHY FOR “BASS”

18

WORDNET NOUN RELATIONS

THESAURUS METHODS

20

WORD SIMILARITY AND RELATEDNESS

 Synonymy is binary.
 Similarity is a looser metric: two words are similar when they share
features of meaning.

 bank₁ is similar to fund₃.

 bank₂ is similar to slope₅.

 Relatedness measures possible associations.
 motorbike and bike are similar.
 motorbike and fuel are related, not similar.

21

TWO TYPES OF SIMILARITY ALGORITHMS

 Thesaurus-based algorithms:
 Are words “nearby” in hypernym hierarchy?
 Do words have similar glosses (definitions)?

 Distributional algorithms:
 Do words have similar distributional contexts?

22

PATH-BASED SIMILARITY

 Two senses are similar if there is
a short path between them in the
thesaurus hierarchy.

 pathlen(c1,c2) = 1 + number of edges in the shortest path between c1 and c2

 simpath(c1,c2) = 1 / pathlen(c1,c2)

 simpath(nickel, coin) = 1/2 = .5
 simpath(nickel, Richter scale) = 1/8 = .125
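
Path similarity is available directly in NLTK; a minimal sketch (the sense identifiers below are assumptions, and the exact values depend on the WordNet version):

```python
from nltk.corpus import wordnet as wn

nickel = wn.synset('nickel.n.02')          # coin sense (assumed sense number)
coin = wn.synset('coin.n.01')
richter = wn.synset('richter_scale.n.01')

print(nickel.path_similarity(coin))     # short path -> higher similarity
print(nickel.path_similarity(richter))  # long path  -> lower similarity
```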

23

PATH-BASED SIMILARITY


 Problem: this assumes a uniform distance for all links in the hierarchy:

 pathlen(nickel, money) = 6
 pathlen(nickel, standard) = 6

 yet intuitively nickel is much closer in meaning to money than to standard.

24

ALTERNATIVE THESAURUS-BASED SIMILARITY

 Thesaurus-based similarity algorithms:
 Information content for similarity measurement (Resnik).
 Lin similarity function.
 The Lesk algorithm.

25

INFORMATION CONTENT

 Train by counting in a corpus.
 Each instance of hill counts towards the
frequency of its hypernyms:
natural elevation, geological formation, etc.

 words(c): the set of words that are children of node c.

words(“geological formation”) = {hill, ridge, grotto, coast, cave, shore, natural elevation}
words(“natural elevation”) = {hill, ridge}

P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N

i.e. the probability that a random word
in the corpus is an instance of c.

[Figure: fragment of the WordNet hierarchy under entity: geological formation,
with children natural elevation (hill, ridge), shore (coast), and cave (grotto).]

(Resnik 1995)

26

INFORMATION CONTENT

 Information content:
● IC(c) = -log P(c)

 Lowest common subsumer:
● LCS(c1,c2)

 The most informative (lowest) node
in the hierarchy subsuming both c1 and c2

e.g. LCS(hill, coast) = geological-formation

27

INFORMATION CONTENT FOR SIMILARITY

 Intuition: the more two words have in common, the more similar they
are.

 Resnik: similarity = information content of the lowest common
subsumer (LCS) of the two nodes.

 simresnik(c1,c2) = -log P(LCS(c1,c2))

(Resnik 1995, 1999)
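
NLTK exposes Resnik similarity given an information-content file; a minimal sketch (assumes nltk.download('wordnet_ic') has fetched the Brown-corpus counts):

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # IC estimated from the Brown corpus

hill = wn.synset('hill.n.01')
coast = wn.synset('coast.n.01')
# -log P(LCS(hill, coast)), with P estimated from the Brown counts.
print(hill.res_similarity(coast, brown_ic))
```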

28

DEKANG LIN METHOD

 Intuition: look at both commonalities and differences to measure
similarity of words A and B.

 Commonality: the more A and B have in common, the more
similar they are.

IC(common(A,B))

 Difference: the more differences between A and B, the less
similar they are.

IC(description(A,B)) - IC(common(A,B))
(Lin 1998)

29

DEKANG LIN SIMILARITY THEOREM

 Similarity between A and B is measured by the ratio between:
 the amount of information needed to state the commonality of A and B.
 the information needed to fully describe what A and B are.

 Lin defines IC(common(A,B)) as 2 × the information content of the LCS:

simLin(c1,c2) = 2 log P(LCS(c1,c2)) / (log P(c1) + log P(c2))

simLin(A,B) ∝ IC(common(A,B)) / IC(description(A,B))

30

EXAMPLE OF LIN SIMILARITY FUNCTION

simLin(c1,c2) = 2 log P(LCS(c1,c2)) / (log P(c1) + log P(c2))

simLin(hill, coast) = 2 log P(geological-formation) / (log P(hill) + log P(coast))

= 2 ln 0.00176 / (ln 0.0000189 + ln 0.0000216) = .59
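
The arithmetic above can be checked directly; a minimal sketch using only the probabilities given on this slide:

```python
import math

p_lcs = 0.00176      # P(geological-formation)
p_hill = 0.0000189   # P(hill)
p_coast = 0.0000216  # P(coast)

sim_lin = 2 * math.log(p_lcs) / (math.log(p_hill) + math.log(p_coast))
print(round(sim_lin, 2))  # 0.59
```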

31

THE LESK ALGORITHM

 Intuition: A and B are similar if their glosses contain similar words.

 Drawing paper: paper that is specially prepared for use in drafting.

 Decal: the art of transferring designs from specially prepared paper to a wood or
glass or metal surface.

 For each word phrase of length n that appears in both glosses:

 Add a score of n² (paper and specially prepared: 1² + 2² = 5).
 Compute the overlap also for the glosses of hypernyms and hyponyms.

(Lesk 1986)
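
A minimal sketch of this overlap scoring (a simplified reading of Lesk: maximal shared phrases of n words score n², matching longer phrases first):

```python
def overlap_score(gloss_a, gloss_b):
    """Score maximal word phrases shared by two glosses with n**2."""
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    used = [False] * len(a)                 # positions in a already matched
    score = 0
    for n in range(len(a), 0, -1):          # try longer phrases first
        b_ngrams = {tuple(b[j:j + n]) for j in range(len(b) - n + 1)}
        for i in range(len(a) - n + 1):
            if not any(used[i:i + n]) and tuple(a[i:i + n]) in b_ngrams:
                score += n * n
                used[i:i + n] = [True] * n  # block overlapping rematches
    return score

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared "
         "paper to a wood or glass or metal surface")
print(overlap_score(drawing_paper, decal))  # 1**2 + 2**2 = 5
```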

DISTRIBUTIONAL MODELS OF
SIMILARITY

33

THESAURUS-BASED APPROACHES: LIMITATIONS

 We don’t have a thesaurus for every language.
 Even if we do, they have problems with coverage:

 Many words are missing.
 Most (if not all) phrases are missing.
 Some connections between senses are missing.
 Thesauri work less well for verbs and adjectives, which have
less structured hyponymy relations.

34

DISTRIBUTIONAL MODELS OF MEANING

 Also called vector-space models of meaning.
 Offer much higher recall than hand-built thesauri.

 Although they tend to have lower precision.

 Intuition of distributional models of meaning:
 If A and B have almost identical environments, we say that
they are synonyms.
 Remember Lecture 6, word embeddings?

35

SYNONYMS IN SIMILAR ENVIRONMENTS

 tackle the task, tackle fake news.
deal with the task, deal with fake news.

tackle = deal with

 I like chocolate, chocolate is tasty, recipe for chocolate.
I like NLP, NLP is hard, I’m in an NLP lecture.

NLP ≠ chocolate

 How do we achieve this?

36

WORD-WORD CO-OCCURRENCE MATRIX

 Again, we build a word-word co-occurrence matrix: each cell counts how often
two words occur near each other in a corpus (a minimal sketch follows).
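
A minimal sketch of building such counts from a toy corpus (the window size and whitespace tokenization are simplifying assumptions):

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count pairs of words occurring within `window` positions of each other."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

corpus = ['I like chocolate', 'chocolate is tasty', 'recipe for chocolate']
counts = cooccurrence_counts(corpus)
print(counts[('chocolate', 'like')])  # 1
```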

37

SAMPLE CONTEXTS: BROWN CORPUS

 equal amount of sugar, a sliced lemon, a tablespoonful of apricot preserve or
jam, a pinch each of clove and nutmeg,

 on board for their enjoyment. Cautiously she sampled her first pineapple and
another fruit whose taste she likened to that of

 of a recursive type well suited to programming on the digital computer. In
finding the optimal R-stage policy from that of

 substantially affect commerce, for the purpose of gathering data and
information necessary for the study authorized in the first section of this

38

TERM-CONTEXT MATRIX FOR SIMILARITY

 Two words are similar in meaning if their context vectors are similar.

              aardvark  computer  data  pinch  result  sugar  …
apricot              0         0     0      1       0      1
pineapple            0         0     0      1       0      1
digital              0         2     1      0       1      0
information          0         1     6      0       4      0
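
For instance, taking the rows above as context vectors, cosine similarity captures that apricot patterns with pineapple but not with digital (a minimal sketch, assuming numpy; the aardvark column is dropped since it is all zeros):

```python
import numpy as np

# Context counts (computer, data, pinch, result, sugar) from the table above.
apricot   = np.array([0, 0, 1, 0, 1])
pineapple = np.array([0, 0, 1, 0, 1])
digital   = np.array([2, 1, 0, 1, 0])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(apricot, pineapple))  # 1.0: identical contexts
print(cosine(apricot, digital))    # 0.0: no shared contexts
```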

39

POINTWISE MUTUAL INFORMATION

 Instead of raw counts, Pointwise Mutual Information (PMI) is often used.
 Do events x and y co-occur more than if they were independent?

 PMI between two words: (Church & Hanks 1989)
 Do words x and y co-occur more than if they were independent?

 Positive PMI between two words (Niwa & Nitta 1994)
 Replace all PMI values less than 0 with zero

PMI(x, y) = log2 ( P(x,y) / (P(x) P(y)) )

PMI(word1, word2) = log2 ( P(word1,word2) / (P(word1) P(word2)) )

40

COMPUTING PMI

 P(w=information,c=data) = 6/19 = .32
 P(w=information) = 11/19 = .58
 P(c=data) = 7/19 = .37

p(w, context) and p(w):

              computer  data  pinch  result  sugar   p(w)
apricot           0.00  0.00   0.05    0.00   0.05   0.11
pineapple         0.00  0.00   0.05    0.00   0.05   0.11
digital           0.11  0.05   0.00    0.05   0.00   0.21
information       0.05  0.32   0.00    0.21   0.00   0.58
p(context)        0.16  0.37   0.11    0.26   0.11

41

COMPUTING PMI

PMI(information, data) = log2(.32 / (.37 * .58)) = .57

p(w, context) and p(w):

              computer  data  pinch  result  sugar   p(w)
apricot           0.00  0.00   0.05    0.00   0.05   0.11
pineapple         0.00  0.00   0.05    0.00   0.05   0.11
digital           0.11  0.05   0.00    0.05   0.00   0.21
information       0.05  0.32   0.00    0.21   0.00   0.58
p(context)        0.16  0.37   0.11    0.26   0.11

PPMI(w, context):

              computer  data  pinch  result  sugar
apricot              –     –   2.25       –   2.25
pineapple            –     –   2.25       –   2.25
digital           1.66  0.00      –    0.00      –
information       0.00  0.57      –    0.47      –

pmi_ij = log2( p_ij / (p_i* * p_*j) )
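
The tables above can be reproduced from the raw counts; a minimal sketch, assuming numpy (the all-zero aardvark column is dropped):

```python
import numpy as np

# Counts (computer, data, pinch, result, sugar) from the term-context matrix.
counts = np.array([[0, 0, 1, 0, 1],    # apricot
                   [0, 0, 1, 0, 1],    # pineapple
                   [2, 1, 0, 1, 0],    # digital
                   [1, 6, 0, 4, 0]],   # information
                  dtype=float)

p_wc = counts / counts.sum()               # joint probabilities, N = 19
p_w = p_wc.sum(axis=1, keepdims=True)      # row marginals p(w)
p_c = p_wc.sum(axis=0, keepdims=True)      # column marginals p(context)

with np.errstate(divide='ignore'):         # log2(0) -> -inf, clipped below
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                  # replace negative PMI with zero
print(ppmi.round(2))  # e.g. PPMI(information, data) ~ 0.57
```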

42

WEIGHING PMI

 PMI is biased toward infrequent events.
 Various weighting schemes help alleviate this; see e.g. Turney and Pantel (2010).
 Add-one smoothing can also help.

43

STATE-OF-THE-ART WORD SIMILARITY

 Currently, the state-of-the-art approach to measuring word similarity is to use
word embeddings (see the sketch below).

 See Lecture 6!
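
A minimal sketch with pretrained vectors, assuming gensim and its downloader are available (the model name below is an assumption; any pretrained embedding model would do):

```python
import gensim.downloader as api

vectors = api.load('glove-wiki-gigaword-50')    # small pretrained GloVe model
print(vectors.similarity('motorbike', 'bike'))  # similar words: high score
print(vectors.similarity('motorbike', 'fuel'))  # related, less similar
```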

44

RELATED TASK: WSD

 Word Sense Disambiguation (WSD).
 A related task, in which we aim to identify which specific sense of a word
is being used in a particular instance in text:

 I put my money in the bank → bank₁

 We took the picture on the east bank of the Nile river → bank₂

 As with the computation of similarity, context can help WSD.

https://www.cs.york.ac.uk/semeval-2013/task12/
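
NLTK ships a simplified Lesk implementation that can be tried on the examples above (a minimal sketch; Lesk is noisy, so the chosen senses may not match intuition):

```python
from nltk.wsd import lesk

sent1 = 'I put my money in the bank'.split()
sent2 = 'We took the picture on the east bank of the Nile river'.split()

# Pick the synset whose gloss overlaps the context words most.
print(lesk(sent1, 'bank'))
print(lesk(sent2, 'bank'))
```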

EVALUATION

46

TWO TYPES OF EVALUATION

 Evaluation can be:
 Intrinsic (in-vitro) evaluation.
 Extrinsic (in-vivo) evaluation.

47

INTRINSIC EVALUATION

 Need a corpus with human-annotated similarity scores.

 Correlation between algorithm and human word similarity ratings.

 The WordSimilarity-353 Test Collection:
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
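
A minimal sketch of the correlation step, assuming scipy (the ratings below are made-up stand-ins for WordSim-353 entries):

```python
from scipy.stats import spearmanr

# Hypothetical human ratings and model scores for the same word pairs.
human = [9.8, 8.1, 7.4, 3.2, 0.9]
model = [0.95, 0.71, 0.80, 0.30, 0.12]

rho, _ = spearmanr(human, model)
print(rho)  # rank correlation with the human judgements
```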


48

EXTRINSIC EVALUATION

 A number of tasks to test with:
 Spelling error detection.
 Word sense disambiguation.
 Taking multiple-choice vocabulary tests (TOEFL/Cambridge).

49

RESOURCES

 WordNet web interface:
http://wordnetweb.princeton.edu/perl/webwn

 MeSH (Medical Subject Headings) thesaurus:
https://www.ncbi.nlm.nih.gov/mesh


50

REFERENCES

 Resnik, P. (1995). Using Information Content to Evaluate Semantic Similarity
in a Taxonomy. In Proceedings of the 14th International Joint Conference on
Artificial Intelligence (IJCAI-95).

 Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based
measure and its application to problems of ambiguity in natural language.
Journal of Artificial Intelligence Research (JAIR), 11, 95-130.

 Lin, D. (1998). An information-theoretic definition of similarity. In ICML, pp.
296-304.

 Lesk, M. (1986). Automatic sense disambiguation using machine readable
dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of
the 5th Annual International Conference on Systems Documentation (pp. 24-26). ACM.

51

REFERENCES

 Church, K., Gale, W., Hanks, P., & Hindle, D. (1989). Word associations and
typical predicate-argument relations. In Proceedings of the International
Workshop on Parsing Technologies.

 Niwa, Y., & Nitta, Y. (1994, August). Co-occurrence vectors from corpora vs.
distance vectors from dictionaries. In Proceedings of the 15th conference on
Computational linguistics-Volume 1 (pp. 304-309). Association for
Computational Linguistics.

 Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space
models of semantics. Journal of artificial intelligence research, 37, 141-188.

52

ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. 3rd edition. Chapter 17.

