Text Pre-Processing 2


Faculty of Information Technology, Monash University, Australia

FIT5196 week 5


Outline

1. Inverted Index

2. Vector Space Model

3. TF-IDF

4. Collocations


Inverted Index

Figure: adapted from the book Introduction to Information Retrieval, available at http://nlp.stanford.edu/IR-book/

- A dictionary of terms (also referred to as a vocabulary or lexicon).
- Each term is associated with a postings list that records which documents the term occurs in.


Inverted Index

To build an inverted index (a minimal Python sketch follows the steps):
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which become the indexing terms.
4. Index the documents in which each term occurs, creating an inverted index that consists of a dictionary and postings.
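A minimal Python sketch of the four steps above; the toy corpus and the crude strip/lowercase normalisation are illustrative assumptions, not the slides' method:

```python
from collections import defaultdict

# Step 1: a toy corpus; document IDs map to raw text.
docs = {
    1: "Data analysis is important.",
    2: "Data wrangling is as important as data analysis.",
    3: "Data science contains data analysis and data wrangling.",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    # Steps 2-3: tokenise and normalise (here just lowercasing and
    # stripping punctuation; real preprocessing would do more).
    tokens = [w.strip(".,").lower() for w in text.split()]
    for term in tokens:
        index[term].add(doc_id)  # Step 4: record the posting

# Dictionary of terms plus sorted postings lists.
postings = {term: sorted(ids) for term, ids in index.items()}
print(postings["data"])     # [1, 2, 3]
print(postings["science"])  # [3]
```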


Inverted Index

Figure: adapted from the book Introduction to Information Retrieval, available at http://nlp.stanford.edu/IR-book/

Vector Space Model

VSM: each text document is represented as a vector whose elements indicate the occurrence of words within the text:

$\vec{V}(d) = [w_{1,d}, w_{2,d}, w_{3,d}, \ldots, w_{N,d}]$

where
- $N$: the size of the vocabulary
- $w_{n,d}$: the weight of the $n$-th term (in the vocabulary) in document $d$

Examples:
document_1: Data analysis is important.
document_2: Data wrangling is as important as data analysis.
document_3: Data science contains data analysis and data wrangling.


Vector representation of the three example documents (vocabulary, in order: analysis, and, as, contains, data, important, is, science, wrangling), shown with three weighting schemes; a sketch reproducing the raw counts follows.

Binary weights:
document_1: 1 0 0 0 1 1 1 0 0
document_2: 1 0 1 0 1 1 1 0 1
document_3: 1 1 0 1 1 0 0 1 1

Raw counts:
document_1: 1 0 0 0 1 1 1 0 0
document_2: 1 0 2 0 2 1 1 0 1
document_3: 1 1 0 1 3 0 0 1 1

TF-IDF weights (base-10 log):
document_1: 0 0 0 0 0 0.176 0.176 0 0
document_2: 0 0 0.954 0 0 0.176 0.176 0 0.176
document_3: 0 0.477 0 0.477 0 0 0 0.477 0.176
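A small illustrative sketch, not from the slides, that reproduces the raw-count rows above in plain Python (the simple strip/lowercase tokenisation is an assumption):

```python
from collections import Counter

docs = [
    "Data analysis is important.",
    "Data wrangling is as important as data analysis.",
    "Data science contains data analysis and data wrangling.",
]

# Tokenise, then fix the vocabulary (sorted, so columns are alphabetical).
tokenised = [[w.strip(".,").lower() for w in d.split()] for d in docs]
vocab = sorted({t for doc in tokenised for t in doc})

for doc in tokenised:
    counts = Counter(doc)
    print([counts[term] for term in vocab])
# [1, 0, 0, 0, 1, 1, 1, 0, 0]
# [1, 0, 2, 0, 2, 1, 1, 0, 1]
# [1, 1, 0, 1, 3, 0, 0, 1, 1]
```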


Vector Space Model: Euclidean Distance

Figure: adapted from the lecture slides at https://www.cl.cam.ac.uk/teaching/1314/InfoRtrv/lecture4.pdf


Vector Space Model: Cosine Similarity

Figure: adapted from the book Introduction to Information Retrieval, available at http://nlp.stanford.edu/IR-book/




\[
\mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|}
= \frac{\sum_{n=1}^{N} V_n(d_1)\, V_n(d_2)}{\sqrt{\sum_{n=1}^{N} V_n^2(d_1)}\,\sqrt{\sum_{n=1}^{N} V_n^2(d_2)}}
\]

Notation:
- $V_n(d)$: the weight of the $n$-th vocabulary term in document $d$
- $\cdot$: inner product
- $|\vec{V}(d)|$: Euclidean length

The cosine similarity of $d_1$ and $d_2$ is the cosine of the angle between $\vec{V}(d_1)$ and $\vec{V}(d_2)$.
Cosine is a monotonically decreasing function of the angle over the interval $[0^\circ, 180^\circ]$, so a larger cosine similarity means a smaller angle, i.e. more similar documents.
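A direct Python translation of the formula (the function name cosine_similarity is ours), applied to the raw-count vectors of document_1 and document_2 from the earlier example:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))          # inner product
    norm_u = math.sqrt(sum(a * a for a in u))       # Euclidean length of u
    norm_v = math.sqrt(sum(b * b for b in v))       # Euclidean length of v
    return dot / (norm_u * norm_v)

# Raw-count vectors for document_1 and document_2 from the earlier table.
d1 = [1, 0, 0, 0, 1, 1, 1, 0, 0]
d2 = [1, 0, 2, 0, 2, 1, 1, 0, 1]
print(round(cosine_similarity(d1, d2), 3))  # 0.722
```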


Vector Space Model: Bag of Words

The Bag-of-Words assumption: the order of the words in a document does not matter.
- Not a problem for many text analysis tasks, such as document classification and clustering: a collection of words is usually sufficient to differentiate between the semantic concepts of documents.
- Not suitable for information extraction, POS tagging and many other NLP tasks, where the order of words does matter.

Considerations when using the VSM:
- The dimensionality of the vectors in the VSM.
- How to compute the weights of words in each document:
  - binary values
  - counts
  - TF-IDF weights


Term Frequency

TF: the number of occurrences of a word in a document.
How well does a word represent its document?
- Frequent (non-stop) words are thematic.
- If a word t appears often in a document, then another document containing t is likely to be similar to that document.

Different ways of computing TF (a small sketch follows the table):

weighting scheme | TF
binary | $0, 1$
raw frequency | $f_{t,d}$
log normalisation | $1 + \log(f_{t,d})$
double normalisation $K$ | $K + (1-K)\,\dfrac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$
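A sketch of the four schemes in the table; the function name, the natural log, and $K = 0.5$ are illustrative assumptions (the slides fix neither the log base nor $K$):

```python
import math
from collections import Counter

def tf_variants(tokens, k=0.5):
    # f_{t,d}: raw frequency of each term in the document.
    f = Counter(tokens)
    max_f = max(f.values())  # max_{t' in d} f_{t',d}
    return {
        t: {
            "binary": 1,
            "raw": f[t],
            "log_norm": 1 + math.log(f[t]),          # natural log assumed
            "double_norm": k + (1 - k) * f[t] / max_f,
        }
        for t in f
    }

tokens = [w.strip(".,").lower() for w in
          "Data wrangling is as important as data analysis.".split()]
print(tf_variants(tokens)["data"])
# {'binary': 1, 'raw': 2, 'log_norm': 1.693..., 'double_norm': 1.0}
```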



Problems:
- All words are considered equally important, yet certain terms have little or no discriminating power in, for example, text classification:
  - all medical documents contain the words patient and disease;
  - all patents contain words like claim, embodiment, etc.

Can we scale down the weights of such terms?


Inverse Document Frequency

IDF: inverse document frequency.
The idf of a rare term is high, whereas the idf of a frequent term is likely to be low:

$\mathrm{idf}_t = \log \dfrac{D}{n_t}$

where $D$ is the total number of documents in the corpus, and $n_t = |\{d \in D : t \in d\}|$ is the number of documents containing term $t$.
- If a word/term appears in every document, $n_t = D$ and $\mathrm{idf}_t = 0$.
- If a word/term appears in only one document, $n_t = 1$ and $\mathrm{idf}_t = \log(D)$.

Different ways of computing IDF (a quick numeric check follows the table):

weighting scheme | IDF weight
IDF | $\log \dfrac{D}{n_t}$
IDF smoothed | $\log\left(1 + \dfrac{D}{n_t}\right)$
IDF max | $\log\left(1 + \dfrac{\max_{t' \in d} n_{t'}}{n_t}\right)$
IDF probabilistic | $\log \dfrac{D - n_t}{n_t}$
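A quick numeric check of the basic $\mathrm{idf}_t$ formula against the running three-document example, using the base-10 log that matches the TF-IDF table shown earlier:

```python
import math

# Tokenised running example, as sets (only membership matters for IDF).
docs = [
    {"data", "analysis", "is", "important"},
    {"data", "wrangling", "is", "as", "important", "analysis"},
    {"data", "science", "contains", "analysis", "and", "wrangling"},
]
D = len(docs)

def idf(term):
    n_t = sum(1 for d in docs if term in d)  # document frequency of term
    return math.log10(D / n_t)               # base-10 log, as in the example

for term in ("data", "important", "science"):
    print(term, round(idf(term), 3))
# data 0.0        (n_t = D: appears in every document)
# important 0.176 (n_t = 2)
# science 0.477   (n_t = 1: idf = log10(3))
```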


TF-IDF

The TF-IDF weighting scheme:

$\mathrm{tfidf}_{d,t} = \mathrm{tf}_{d,t} \times \mathrm{idf}_t = n_{d,t} \cdot \log \dfrac{D}{n_t}$

where $n_{d,t}$ is the number of occurrences of word $t$ in document $d$, and $n_t$ is the document frequency of term $t$.
The weight is highest for words that occur frequently within a small number of documents, and lowest for words that occur in almost all documents. A sketch reproducing the earlier example follows.
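A sketch, not from the slides, that recomputes the TF-IDF vector of document_2 from the earlier vector-representation table; the strip/lowercase tokenisation is the same illustrative assumption used above:

```python
import math
from collections import Counter

docs = [
    "Data analysis is important.",
    "Data wrangling is as important as data analysis.",
    "Data science contains data analysis and data wrangling.",
]
tokenised = [[w.strip(".,").lower() for w in d.split()] for d in docs]
vocab = sorted({t for doc in tokenised for t in doc})
D = len(tokenised)

def tfidf(doc_tokens):
    counts = Counter(doc_tokens)                     # n_{d,t}
    weights = []
    for t in vocab:
        n_t = sum(1 for d in tokenised if t in d)    # document frequency
        weights.append(round(counts[t] * math.log10(D / n_t), 3))
    return weights

print(tfidf(tokenised[1]))
# [0.0, 0.0, 0.954, 0.0, 0.0, 0.176, 0.176, 0.0, 0.176]  (document_2 row)
```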


Collocations

Collocations: multi-word expressions consisting of two or more words that occur together more frequently than by chance, and that correspond to some conventional way of saying things.
- Noun phrases: strong tea, weapons of mass destruction
- Verb phrases: make up, wind up

Collocations are characterised by limited compositionality.
Non-compositional phrases:
- It is difficult to predict the meaning of the collocation from the meanings of its parts.
- The combination adds an element of meaning beyond its parts.
- Examples:
  - He is known for his fair and square dealings and everybody trusts his work. [1]
  - I had a shooting pain near the heart.

[1] Example from http://language.worldofcomputing.net/nlp-overview/collocations.html

Compositional phrases:
- The meaning of the expression can be predicted from the meaning of its parts.
- Example: I had a stomach pain yesterday.


How to extract collocations (a short sketch follows):
- NLTK collocations package
- Simple tutorial: http://www.nltk.org/howto/collocations.html
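A short sketch following the linked NLTK tutorial; the Genesis sample corpus and the frequency filter of 3 are choices from that tutorial, not from the slides:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("genesis")  # sample corpus used in the linked tutorial
words = nltk.corpus.genesis.words("english-web.txt")

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # drop bigrams occurring fewer than 3 times
# Top 10 bigrams ranked by pointwise mutual information (PMI).
print(finder.nbest(bigram_measures.pmi, 10))
```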



Summary: what do you need to do this week?

What we have covered:
- Vector Space Model
- Term weighting
- Collocations
- Count vocabulary

What you need to do:
- Download and read the chapters provided on Moodle.
- Attend tutorial 5 next week.
- Remember: assessment 1 is due next week.

