Text preprocessing
School of Computing and Information Systems
@University of Melbourne 2022
Text processing for machine learning
Text is a sequence of characters.
ML algorithms understand numeric vectors, so text must be converted into numeric features. The aims:
Features retain as much meaning/information as possible
Reduce the sparsity of the feature vector
Example: Cars are driven on the road.
COMP20008 Elements of Data Processing
Preprocessing Tokenisation
Granularity of a token, e.g. a sentence
Token separators
Example: The speaker did her Ph.D. in Germany. She now works at UniMelb. The issue, and there are many, is that text is not consistent.
Split continuous text into a list of individual tokens
English words are often separated by white spaces, but not always
Tokens can be words, numbers, hashtags, etc.
Can use regular expressions
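A minimal sketch of regular-expression tokenisation. The pattern below is an assumption made for illustration (hashtags, dotted abbreviations such as "Ph.D.", words, numbers); it is not a pattern prescribed by the slides, and real text will need more cases.

```python
import re

# Hypothetical pattern, tried left to right:
#   1. hashtags                      (#nlp)
#   2. abbreviations with inner dots (Ph.D.)
#   3. words, optionally with an apostrophe part (don't)
#   4. integers or decimals          (2, 2.5)
TOKEN_PATTERN = re.compile(
    r"#\w+|(?:[A-Za-z]+\.){2,}|[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?"
)

def tokenise(text):
    """Return the tokens matched by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenise("The speaker did her Ph.D. in Germany. She now works at UniMelb.")
print(tokens)
```

Note that "Ph.D." survives as one token while the sentence-final full stops are dropped; splitting on white space alone would have produced "Germany." and "UniMelb." instead.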
Preprocessing Case folding
Convert text to a consistent case
Simple and effective for many tasks
Reduce sparsity (many forms map to the same lower-case form)
Good for search
I had an AMAZING trip to Italy, Coffee is only 2 bucks, sometimes three! The coffee is so nice.
i had an amazing trip to italy, coffee is only 2 bucks, sometimes three! the coffee is so nice.
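A small sketch of case folding on the example above, using Python's built-in `str.lower()`. The distinct-form counts are only there to illustrate the sparsity reduction; splitting on white space is a simplification.

```python
def case_fold(text):
    """Lower-case the text so that 'Coffee' and 'coffee' share one form."""
    return text.lower()

original = ("I had an AMAZING trip to Italy, Coffee is only 2 bucks, "
            "sometimes three! The coffee is so nice.")
folded = case_fold(original)

# Count distinct white-space separated forms before and after folding.
distinct_before = len(set(original.split()))
distinct_after = len(set(folded.split()))
print(folded)
print(distinct_before, distinct_after)
```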
Preprocessing Stemming
Words in English are derived from a root or stem, e.g. inexpensive: in+expense+ive
Driving my car
Taking a drive in my car
He drives my car
Stemming attempts to undo the processes that lead to word formation
Remove and replace word suffixes to arrive at a common root form
Result does not necessarily look like a proper word
Porter stemmer: one of the most widely used stemming algorithms
Suffix stripping (Porter stemmer):
sses → ss
tional → tion
tion → t
https://text-processing.com/demo/stem/
troubles → troubl
troubled → troubl
trouble → troubl
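To make suffix stripping concrete, here is a toy stemmer. It is not the Porter algorithm: the rule list, the rule order and the minimum stem length are assumptions chosen so that the examples on these slides come out as shown.

```python
# Ordered (suffix, replacement) rules; only the first matching rule is applied.
# This is a simplified illustration, NOT the Porter stemmer.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),      # keep a final double s (e.g. "caress") untouched
    ("tional", "tion"),
    ("ing", ""),
    ("ed", ""),
    ("es", ""),
    ("s", ""),
    ("e", ""),
]

def toy_stem(word, min_stem=3):
    """Strip one suffix if at least min_stem characters would remain."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= min_stem:
                return stem + replacement
            return word
    return word

for w in ["troubles", "troubled", "trouble", "driving", "drives", "drive"]:
    print(w, "->", toy_stem(w))
```

As on the slide, the result ("troubl", "driv") need not look like a proper word; it only has to be the same for related forms.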
Preprocessing Lemmatization
To remove inflections and map a word to its proper root form (lemma)
It does not just strip suffixes, it transforms words to valid roots:
running → run
runs → run
ran → run
Python NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up lemmas of words.
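The contrast with stemming can be sketched with a lookup table. The table below is a tiny hand-made stand-in for a lexical database such as WordNet (an assumption for illustration only); it shows why an irregular form like "ran" needs a lookup rather than suffix stripping.

```python
# Hypothetical, hand-made lemma table standing in for a real lexical database.
LEMMA_TABLE = {
    "running": "run",
    "runs": "run",
    "ran": "run",   # irregular form: no suffix rule would recover "run"
}

def toy_lemmatize(word):
    """Return the lemma from the table, or the lower-cased word if unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

print([toy_lemmatize(w) for w in ["running", "runs", "ran", "Road"]])
```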
Preprocessing Stop Word Removal
Stop words are function words that structure sentences; they are low information words and some of them are very common
e.g. the, a, is, ...
Exclude them from being processed; helps to reduce the number of features/words
Commonly applied for search, text classification, topic modelling, topic extraction, etc.
A stopword list can be custom-made for a specific context/domain
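A minimal sketch of stop word removal with a custom list. The list itself is an assumption for this example; in practice it would be chosen for the context or domain.

```python
# Small custom stop word list (assumed for this example).
STOP_WORDS = {"the", "a", "is", "are", "on", "in", "of", "and", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lower-cased form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

kept = remove_stop_words("Cars are driven on the road".split())
print(kept)
```

Six tokens shrink to three, which is the reduction in features/words that the slide refers to.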
Text Normalisation
Transforming a text into a canonical (standard) form
Important for noisy text, e.g., social media comments, text messages
Used when there are many abbreviations, misspellings and out-of-vocabulary (OOV) words
2moro → tomorrow
2mrw → tomorrow
tomrw → tomorrow
B4 → before
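One simple way to sketch this is a dictionary from noisy forms to canonical forms. The table holds only the mappings listed above; the function name and the lower-casing of keys are assumptions for illustration.

```python
# Mapping from noisy forms to canonical forms (entries taken from the slide).
NORMALISATION_TABLE = {
    "2moro": "tomorrow",
    "2mrw": "tomorrow",
    "tomrw": "tomorrow",
    "b4": "before",
}

def normalise(tokens, table=NORMALISATION_TABLE):
    """Replace known noisy tokens; leave every other token unchanged."""
    return [table.get(tok.lower(), tok) for tok in tokens]

print(normalise("see you 2moro B4 lunch".split()))
```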
Noise Removal
Remove unnecessary spacing
Remove punctuation and special characters (regular expressions)
Unify numbers
Highly domain dependent
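The three steps above can be sketched with regular expressions. The `#num#` placeholder and the decision to keep `#` are assumptions for this example; as the slide says, the right choices are highly domain dependent.

```python
import re

def remove_noise(text):
    """Unify numbers, drop punctuation, and collapse extra spacing."""
    # 1. Unify numbers: every integer or decimal becomes the same placeholder.
    text = re.sub(r"\d+(?:\.\d+)?", "#num#", text)
    # 2. Remove punctuation and special characters (keep '#' for the placeholder).
    text = re.sub(r"[^\w\s#]", " ", text)
    # 3. Remove unnecessary spacing.
    return re.sub(r"\s+", " ", text).strip()

print(remove_noise("Coffee   is only 2 bucks,  sometimes 3.50!!"))
```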
So far: Unstructured Text Data
Text search: approximate string matching
Preprocessing
Regular expressions
Tokenisation
Case folding
Stemming
Lemmatization
Stop word removal
Text normalization
Noise removal
Text representation I
[Table: structured data example (Iris). Feature columns: Sepal length, Sepal width, Petal length, Petal width. Label column: Species (Iris setosa, Iris versicolor, Iris virginica).]
How To Represent Text?
https://cis.unimelb.edu.au/about/school/
Text Representation BoW
Bag-of-words: simplest vector space representational model for text
Disregards word order and grammatical features such as POS
Each text document is represented as a numeric vector
each dimension is a specific word from the corpus
the value is the word's frequency in the document, or its occurrence (denoted by 1 or 0)
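A minimal bag-of-words sketch using only the standard library. The function name is hypothetical and tokenisation is reduced to lower-casing and splitting on white space, with no stop word removal or stemming.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count vectors), one vector per document."""
    tokenised = [doc.lower().split() for doc in documents]
    # One dimension per distinct word in the corpus, in alphabetical order.
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the car is driven on the road", "the truck is driven on the highway"]
vocab, count_vectors = bag_of_words(docs)
# Occurrence variant: 1 if the word appears in the document, else 0.
binary_vectors = [[1 if c > 0 else 0 for c in vec] for vec in count_vectors]

print(vocab)
print(count_vectors)
print(binary_vectors)
```

Word order is discarded: only the per-word counts (or 0/1 occurrences) remain.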
Prepare for BoW
Word tokenisation
Case-folding
Abstraction of number (#num#, #year#)
Stop word removal
Stemming
How would this look different if we lemmatised instead?
Removed punctuation
Word counts
Bag of Words
Term Frequency
What if a rare word occurs in a document?
e.g. catarrh is less common than mucus or stuffiness
What if a word occurs in many documents?
Maybe we want to avoid raw counts?
Raw frequencies vary with document length
They don't capture important (rare) features that would be telling of a type of document
Text representation II
Raw Frequencies
What are the problems?
What are the alternatives?
Example term lists: play, grace, crowd vs. play, grace, audience
Example corpus:
Discourse on Floating Bodies (Galileo Galilei)
Treatise on Light (Christiaan Huygens)
Experiments with Alternate Currents of High Potential and High Frequency (Nikola Tesla)
Relativity: The Special and General Theory (Albert Einstein)
TF-IDF stands for Term Frequency-Inverse Document Frequency
Each text document is represented as a numeric vector
each dimension is a specific word from the corpus
A combination of two metrics to weight a term (word)
term frequency (tf): how often a given word appears within a document
inverse document frequency (idf): down-weights words that appear in many documents.
Main idea: reduce the weight of frequent terms and increase the weight of rare and indicative ones.
Term frequency (TF):
$\mathrm{tf}(t, d)$ = the raw count of term $t$ in document $d$.
Inverse Document Frequency (IDF):
$$\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1 \quad\text{or}\quad \mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}_t} + 1$$
$N$ is the number of documents in the collection,
$\mathrm{df}_t$ is the document frequency, the number of documents containing the term $t$.
TF-IDF (L2 normalised):
$$\mathrm{tf\_idf}(t, d) = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}$$
where $w_{t,d} = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$
Example TF-IDF
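Two documents:
A: the car is driven on the road
B: the truck is driven on the highway
* stop words removed
$$\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1, \qquad w_{t,d} = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{tf\_idf}(t, d) = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}$$
car, road, truck, highway (each in one document): $\ln\frac{3}{2} + 1 = 1.405$
driven (in both documents): $\ln\frac{3}{3} + 1 = 1$
L2 norm of each document vector: $\sqrt{1.405^{2} + 1^{2} + 1.405^{2}} = 2.225$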
Example TF-IDF cont.
Two documents, A and B.
A. the car is driven on the road
B. the truck is driven on the highway
* stop words removed
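The calculation for documents A and B can be sketched as follows, using the smoothed IDF ln((1 + N) / (1 + df)) + 1 and L2 normalisation. The function name and the three-word stop list are assumptions chosen to match this example.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "on"}  # assumed stop list for this example

def tfidf_vectors(documents, stop_words=STOP_WORDS):
    """Return (idf per term, one L2-normalised TF-IDF dict per document)."""
    tokenised = [
        [tok for tok in doc.lower().split() if tok not in stop_words]
        for doc in documents
    ]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenised for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}
    vectors = []
    for doc in tokenised:
        tf = Counter(doc)  # raw counts
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return idf, vectors

idf, (vec_a, vec_b) = tfidf_vectors([
    "the car is driven on the road",
    "the truck is driven on the highway",
])
print({t: round(v, 3) for t, v in idf.items()})
print({t: round(v, 3) for t, v in vec_a.items()})
print({t: round(v, 3) for t, v in vec_b.items()})
```

The shared term "driven" ends up with a lower weight than the terms that appear in only one document, which is the intended effect of the IDF factor.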
Text features for machine learning
3 documents:
A: the car is driven on roads
B: the truck is driven on a highway
C: a bike can not be ridden on a highway
* stop words removed
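$$\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1, \qquad w_{t,d} = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$
Terms in one document (e.g. car): $\ln\frac{4}{2} + 1 = 1.693$
Terms in two documents (e.g. driven, highway): $\ln\frac{4}{3} + 1 = 1.288$
$\mathrm{tf\_idf}(t, d) = {?}$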
Features from unstructured text
Features for structured data
TF-IDF features for unstructured text
Revision: Finding similar texts
Edit distance
N-gram distance
Jaccard similarity
Sørensen-Dice similarity
Cosine similarity
$$\cos(d_A, d_B) = \frac{\vec{d_A} \cdot \vec{d_B}}{\lVert \vec{d_A} \rVert \, \lVert \vec{d_B} \rVert} \approx 0.202$$
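A short sketch of cosine similarity on sparse term-weight vectors, applied to documents A and B from the TF-IDF example. The dict-based representation is an assumption; the weights are tf × idf with the smoothed IDF, and cosine similarity is unaffected by whether the vectors have already been L2 normalised.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# idf for a term found in one of two documents: ln(3/2) + 1
idf_rare = math.log(3 / 2) + 1
doc_a = {"car": idf_rare, "driven": 1.0, "road": idf_rare}
doc_b = {"truck": idf_rare, "driven": 1.0, "highway": idf_rare}

similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 3))
```

Only the shared term "driven" contributes to the dot product, which is why the similarity is low.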
So far: Unstructured Text Data
Text search: approximate string matching
Preprocessing
Regular expressions
Tokenisation
Case folding
Stemming
Lemmatization
Stop word removal
Text normalization
Noise removal
Text representation
Other Text Features
Part-of-speech tagging
She saw a bear. bear: NOUN
Your efforts will bear fruit. bear: VERB
Features: bear_NN; bear_VB
N-grams (bag of n-grams)
Example bigram features (adjacent tokens joined by "_"):
value_curiosity
curiosity_,
passion_and
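Bag-of-n-grams features can be sketched by joining each run of n adjacent tokens. The function name and the "_" separator are assumptions that mirror the feature names shown above; the token list is only a fragment for illustration.

```python
def ngrams(tokens, n=2):
    """Return every run of n adjacent tokens, joined by '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["value", "curiosity", ",", "passion", "and"]
bigrams = ngrams(tokens, n=2)
print(bigrams)
```

Unlike single words, these features keep some local word order, so "A is better than B" and "B is better than A" no longer produce identical feature sets.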
Advanced text representation
Surface value vs Semantics
A is better than B
B is better than A
Distributed Representations of Sentences and Documents, Quoc Le and Tomas Mikolov (Doc2Vec)
Text search: approximate string matching
Preprocessing
Regular expressions
Tokenisation
Case folding
Stemming
Lemmatization
Stop word removal
Text normalization
Noise removal
Text representation
Part of Speech Tagging
Bag-of-n-grams
Distributed representation learning