Text Pre-Processing 2
Faculty of Information Technology, Monash University, Australia
FIT5196 week 5
Outline
1 Inverted Index
2 Vector Space Model
3 TF-IDF
4 Collocations
Inverted Index
Figure: adapted from Introduction to Information Retrieval, http://nlp.stanford.edu/IR-book/
A dictionary of terms (referred to as a vocabulary or lexicon)
Each term is associated with a postings list that records which documents the term occurs in.
Inverted Index
To build an inverted index:
1 Collect the documents to be indexed.
2 Tokenize the text, turning each document into a list of tokens.
3 Do linguistic preprocessing, producing a list of normalised tokens, which become the indexing terms.
4 Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings (a code sketch follows this list).
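A minimal Python sketch of these four steps (not from the original slides; naive whitespace tokenization, punctuation stripping and lowercasing stand in for real linguistic preprocessing):

```python
from collections import defaultdict

def build_inverted_index(documents):
    """documents: dict mapping a document id to its raw text."""
    index = defaultdict(set)  # term -> set of document ids (the postings)
    for doc_id, text in documents.items():                  # 1. collect the documents
        tokens = text.split()                               # 2. tokenize (naive whitespace split)
        terms = [t.strip('.,!?').lower() for t in tokens]   # 3. normalise the tokens
        for term in terms:                                   # 4. record which documents each term occurs in
            index[term].add(doc_id)
    # store each postings list as a sorted list, as in the textbook presentation
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "Data analysis is important.",
    2: "Data wrangling is as important as data analysis.",
    3: "Data science contains data analysis and data wrangling.",
}
index = build_inverted_index(docs)
print(index["data"])       # [1, 2, 3]
print(index["wrangling"])  # [2, 3]
```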
Inverted Index
Figure: adapted from Introduction to Information Retrieval, http://nlp.stanford.edu/IR-book/
Vector Space Model
VSM: each text document is represented as a vector where the elements of
the vector indicate the occurrence of words within the text
$V(d) = [w_{1,d}, w_{2,d}, w_{3,d}, \ldots, w_{N,d}]$
where
$N$: the size of the vocabulary
$w_{n,d}$: the weight of the n-th term (in the vocabulary) in document d
Examples:
document_1: Data analysis is important.
document_2: Data wrangling is as important as data analysis.
document_3: Data science contains data analysis and data wrangling.
Vector representation (vocabulary in alphabetical order):

             analysis   and    as   contains  data  important   is   science  wrangling
Binary occurrence weights:
document_1       1       0      0      0        1       1        1      0         0
document_2       1       0      1      0        1       1        1      0         1
document_3       1       1      0      1        1       0        0      1         1
Raw term counts:
document_1       1       0      0      0        1       1        1      0         0
document_2       1       0      2      0        2       1        1      0         1
document_3       1       1      0      1        3       0        0      1         1
TF-IDF weights (raw count × idf, with idf = log10(D/n_t)):
document_1       0       0      0      0        0     0.176    0.176    0         0
document_2       0       0    0.954    0        0     0.176    0.176    0       0.176
document_3       0     0.477    0    0.477      0       0        0    0.477     0.176
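The binary and raw-count rows of this table can be reproduced with a few lines of Python (an illustrative sketch, not a particular library's API; the tokenizer is a stand-in for proper preprocessing):

```python
from collections import Counter

docs = {
    "document_1": "Data analysis is important.",
    "document_2": "Data wrangling is as important as data analysis.",
    "document_3": "Data science contains data analysis and data wrangling.",
}

def tokenize(text):
    return [t.strip('.,').lower() for t in text.split()]

# vocabulary: the sorted set of all terms occurring in the collection
vocab = sorted({term for text in docs.values() for term in tokenize(text)})
# ['analysis', 'and', 'as', 'contains', 'data', 'important', 'is', 'science', 'wrangling']

for name, text in docs.items():
    counts = Counter(tokenize(text))
    raw = [counts[term] for term in vocab]       # raw term counts
    binary = [1 if c > 0 else 0 for c in raw]    # binary occurrence weights
    print(name, binary, raw)
# document_3 raw counts: [1, 1, 0, 1, 3, 0, 0, 1, 1]
```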
Vector Space Model: Euclidean Distance
Figure: adapted from the lecture slides at https://www.cl.cam.ac.uk/teaching/1314/InfoRtrv/lecture4.pdf
Vector Space Model: Cosine Similarity
Figure: adapted from Introduction to Information Retrieval, http://nlp.stanford.edu/IR-book/
Vector Space Model: Cosine Similarity
$\mathrm{sim}(d_1, d_2) = \dfrac{V(d_1) \cdot V(d_2)}{|V(d_1)|\,|V(d_2)|} = \dfrac{\sum_{n=1}^{N} V_n(d_1)\, V_n(d_2)}{\sqrt{\sum_{n=1}^{N} V_n^2(d_1)}\,\sqrt{\sum_{n=1}^{N} V_n^2(d_2)}}$
Notation:
$V_n(d)$: the weight of the n-th vocabulary term in document d
$\cdot$: the inner (dot) product
$|V(d)|$: the Euclidean length of $V(d)$
The cosine similarity of d1 and d2 is the cosine of the angle between $V(d_1)$ and $V(d_2)$.
Cosine is a monotonically decreasing function of the angle over the interval [0°, 180°].
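A direct translation of the formula into Python (a sketch rather than a library call), applied to the raw-count vectors of document_1 and document_2 from the earlier table:

```python
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))    # inner product V(d1) . V(d2)
    norm1 = math.sqrt(sum(a * a for a in v1))   # Euclidean length |V(d1)|
    norm2 = math.sqrt(sum(b * b for b in v2))   # Euclidean length |V(d2)|
    return dot / (norm1 * norm2)

# raw-count vectors for document_1 and document_2 (vocabulary order as before)
d1 = [1, 0, 0, 0, 1, 1, 1, 0, 0]
d2 = [1, 0, 2, 0, 2, 1, 1, 0, 1]
print(round(cosine_similarity(d1, d2), 3))  # 0.722
```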
Vector Space Model: Bag of Words
The Bag-of-Words assumption:
The order of the words in a document does not matter.
Not a problem for many text analysis tasks, such as document classification
and clustering
A collection of words is usually sufficient to differentiate between semantic
concepts of documents.
Not a universal solution: for information extraction, POS tagging and many other NLP tasks, the order of words does matter.
Considerations when using the VSM:
The dimensionality of the vectors in the VSM.
How to compute the weights of words in each document:
binary values
counts
TF-IDF weights
Term Frequency
TF: the number of occurrences of a word in a document.
How well does a word represent its document?
frequent (non-stop) words are thematic
if a word t appears often in a document, then another document containing t should be similar to that document.
Different ways of computing TF:

weighting scheme          TF
binary                    0, 1
raw frequency             $f_{t,d}$
log normalisation         $1 + \log(f_{t,d})$
double normalisation K    $K + (1 - K)\,\dfrac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$
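The four schemes written out as small Python functions (a sketch; f_td is the raw count of term t in document d, max_f the largest raw count of any term in that document, and K = 0.5 is a common default rather than something stated on the slide):

```python
import math

def tf_binary(f_td):
    return 1 if f_td > 0 else 0

def tf_raw(f_td):
    return f_td

def tf_log(f_td):
    # log normalisation: 1 + log(f_t,d); taken as 0 when the term is absent
    return 1 + math.log(f_td) if f_td > 0 else 0

def tf_double_norm(f_td, max_f, K=0.5):
    # double normalisation K: K + (1 - K) * f_t,d / max_{t' in d} f_t',d
    return K + (1 - K) * f_td / max_f
```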
Problems:
All words are considered equally important.
Certain terms have little or no discriminating power in, for example, text
classification.
All medical documents contain the words patient and disease
All patents contain words like claim, embodiment, etc.
Can we scale down the weights of such terms?
Inverse Document Frequency
IDF: Inverse document frequency
The idf of a rare term is high, whereas the idf of a frequent term is likely to
be low
$\mathrm{idf}_t = \log \dfrac{D}{n_t}$
where $D$ is the total number of documents in the corpus, and $n_t = |\{d : t \in d\}|$ is the number of documents that contain term t
if a word/term appears in every document, $n_t = D$, so $\mathrm{idf}_t = 0$
if a word/term appears in only one document, $n_t = 1$, so $\mathrm{idf}_t = \log(D)$
Different ways of computing IDF:

weighting scheme     IDF weight
IDF                  $\log \dfrac{D}{n_t}$
IDF smoothed         $\log\left(1 + \dfrac{D}{n_t}\right)$
IDF max              $\log\left(1 + \dfrac{\max_{t' \in d} n_{t'}}{n_t}\right)$
IDF probabilistic    $\log \dfrac{D - n_t}{n_t}$
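The IDF variants in the table as Python functions (a sketch; D is the number of documents in the corpus, n_t the document frequency of term t, and max_n the largest document frequency among the terms of the document under consideration):

```python
import math

def idf(D, n_t):
    # plain IDF: log(D / n_t)
    return math.log(D / n_t)

def idf_smooth(D, n_t):
    # smoothed IDF: log(1 + D / n_t)
    return math.log(1 + D / n_t)

def idf_max(max_n, n_t):
    # max IDF: log(1 + max_{t' in d} n_{t'} / n_t)
    return math.log(1 + max_n / n_t)

def idf_prob(D, n_t):
    # probabilistic IDF: log((D - n_t) / n_t)
    return math.log((D - n_t) / n_t)
```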
TF-IDF
The TF-IDF weighting scheme:
$\mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t = n_{d,t} \times \log \dfrac{D}{n_t}$
where $n_{d,t}$ is the number of occurrences of term t in document d, and $n_t$ is the document frequency of term t.
Figure: most frequent words vs. less frequent words.
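Putting TF and IDF together for the three example documents from the Vector Space Model slides (a sketch using raw counts and base-10 logarithms, which reproduces the TF-IDF values shown in that earlier table):

```python
import math
from collections import Counter

docs = {
    "document_1": "data analysis is important",
    "document_2": "data wrangling is as important as data analysis",
    "document_3": "data science contains data analysis and data wrangling",
}

tokenized = {name: text.split() for name, text in docs.items()}
D = len(tokenized)
vocab = sorted({t for toks in tokenized.values() for t in toks})

# document frequency n_t: the number of documents containing term t
n = {t: sum(1 for toks in tokenized.values() if t in toks) for t in vocab}

for name, toks in tokenized.items():
    counts = Counter(toks)
    # tfidf_{t,d} = n_{d,t} * log10(D / n_t)
    weights = [round(counts[t] * math.log10(D / n[t]), 3) for t in vocab]
    print(name, weights)
# document_2 -> [0.0, 0.0, 0.954, 0.0, 0.0, 0.176, 0.176, 0.0, 0.176]
```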
Collocations
Collocations: multi-word expressions consisting of two or more words that
occur more frequently than by chance, and correspond to some conventional
way of saying things.
Noun phrases: strong tea and weapons of mass destruction
Verb phrases: make up, wind up
Collocations are characterised by limited compositionality.
Non-compositional phrases:
It is difficult to predict the meaning of a collocation from the meanings of its parts
An element of meaning is added to the combination
Example:
He is known for his fair and square dealings and everybody trusts his work. [1]
I had a shooting pain near the heart.
[1] Example from http://language.worldofcomputing.net/nlp-overview/collocations.html
Compositional phrases:
The meaning of the expression can be predicted from the meanings of the parts.
Example:
I had a stomach pain yesterday.
How to extract collocations:
NLTK Collocation package
Simple tutorial: http://www.nltk.org/howto/collocations.html
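A short example in the spirit of the linked tutorial, using NLTK's collocation finder on a toy token list (the token list here is made up for illustration; on a real corpus you would normally filter out rare bigrams first):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Any list of tokens works; a toy token list stands in for a real corpus here.
tokens = ("he is known for his fair and square dealings and everybody trusts "
          "his work i had a shooting pain near the heart").split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# On a real corpus, drop bigrams that occur fewer than, say, 3 times:
# finder.apply_freq_filter(3)

# Rank candidate bigrams by pointwise mutual information (PMI).
print(finder.nbest(bigram_measures.pmi, 5))
```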
Summary: what do you need to do this week?
What we have covered:
Vector Space Model
Term weighting
Collocations
Count vocabulary
What you need to do:
Download and read the chapters provided in Moodle
Attend tutorial 5 next week.
Remember: assessment 1 is due next week.