Text preprocessing
School of Computing and Information Systems
@University of Melbourne 2022

Text processing for machine learning
• Text is a sequence of characters.
• ML algorithms understand numeric vectors.
• The aim: turn text into features that
– retain as much meaning/information as possible
– reduce the sparsity of the feature vector
“Cars are driven on the road.”
Preprocessing – Tokenisation
• Granularity of a token
– Sentence
• Token separators
• “The speaker did her Ph.D. in Germany. She now works at UniMelb.” • “The issue—and there are many—is that text is not consistent.”
Preprocessing – Tokenisation
• Split continuous text into a list of individual tokens
• English words are often separated by white space, but not always
• Tokens can be words, numbers, hashtags, etc.
• Regular expressions can be used to define tokens
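A minimal sketch of regular-expression tokenisation, using only Python's standard `re` module. The pattern `TOKEN_PATTERN` and the function name `tokenise` are assumptions made for illustration; a real tokeniser would need more rules (abbreviations such as "Ph.D.", URLs, emoticons).

```python
import re

# Hypothetical pattern, tried left to right:
#   hashtags/mentions | numbers (optional decimals) | words (with internal ' or -) | one punctuation mark
TOKEN_PATTERN = re.compile(r"[#@]\w+|\d+(?:\.\d+)?|\w+(?:['’-]\w+)*|[^\w\s]")

def tokenise(text):
    """Split continuous text into a list of tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("Coffee is only 2 bucks #cheap, isn't it?"))
# ['Coffee', 'is', 'only', '2', 'bucks', '#cheap', ',', "isn't", 'it', '?']

# Full stops inside an abbreviation are split off, which shows why separators are hard:
print(tokenise("Ph.D."))
# ['Ph', '.', 'D', '.']
```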
Preprocessing – Case folding
• Convert text to consistent cases
• Simple and effective for many tasks
• Reduces sparsity (many forms map to the same lower-case form)
• Good for search
I had an AMAZING trip to Italy, Coffee is only 2 bucks, sometimes three! The coffee is so nice.
i had an amazing trip to italy, coffee is only 2 bucks, sometimes three! the coffee is so nice.
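The example above can be reproduced with Python's built-in `str.lower()`. The whitespace split used to count distinct forms is a simplification for illustration only.

```python
text = ("I had an AMAZING trip to Italy, Coffee is only 2 bucks, "
        "sometimes three! The coffee is so nice.")

# Case folding: map every character to its lower-case form
folded = text.lower()
print(folded)

# Number of distinct whitespace-separated forms before and after folding
distinct_before = len(set(text.split()))
distinct_after = len(set(folded.split()))
print(distinct_before, distinct_after)
```

After folding, 'Coffee' and 'coffee' collapse into one form, so the vocabulary shrinks. Python also offers `str.casefold()`, a more aggressive variant intended for caseless matching of Unicode text.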
Preprocessing – Stemming
• Words in English are derived from a root or stem
inexpensive → in+expense+ive
Driving my car
Taking a drive in my car
He drives my car
Preprocessing – Stemming
• Stemming attempts to undo the processes that lead to word formation
• Remove and replace word suffixes to arrive at a common root form
• Result does not necessarily look like a proper ‘word’
• Porter stemmer: one of the most widely used stemming algorithms
• Suffix stripping (Porter stemmer)
– sses → ss
– tional → tion
– tion → t
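The three rewrite rules listed above can be sketched as a toy suffix stripper. This is not the Porter stemmer itself, which applies many more rules in ordered steps and with conditions on the remaining stem; the names `RULES` and `strip_suffix` are hypothetical.

```python
# Illustrative rewrite rules from the list above: (suffix, replacement).
# "tional" is listed before "tion" so that the longer suffix is tried first.
RULES = [("sses", "ss"), ("tional", "tion"), ("tion", "t")]

def strip_suffix(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["caresses", "conditional", "adoption", "car"]:
    print(w, "->", strip_suffix(w))
# caresses -> caress
# conditional -> condition
# adoption -> adopt
# car -> car
```

For real use, NLTK ships a full implementation as `nltk.stem.PorterStemmer`.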
Preprocessing – Stemming
troubles → troubl
troubled → troubl
trouble → troubl
Preprocessing – Lemmatization
• To remove inflections and map a word to its proper root form (lemma)
• It does not just strip suffixes; it transforms words to valid roots:
running → run
runs → run
ran → run
• Python NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up lemmas of words.
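To show how lemmatisation differs from suffix stripping, here is a toy dictionary-lookup sketch. The mini-lexicon `LEMMAS` is a hypothetical stand-in for a real resource such as WordNet, and `lemmatise` is an illustrative name, not an NLTK function.

```python
# Hypothetical mini-lexicon mapping inflected forms to lemmas.
# A real lemmatiser consults a full dictionary (and usually the part of speech).
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}

def lemmatise(word):
    """Return the lemma if the word is in the lexicon, else the lower-cased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

print([lemmatise(w) for w in ["running", "runs", "ran", "car"]])
# ['run', 'run', 'run', 'car']
```

Note that 'ran → run' cannot be produced by removing a suffix, which is why a lookup is needed.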
Preprocessing – Stop Word Removal
• Stop words are ‘function’ words that structure sentences; they are low information words and some of them are very common
• ‘the’, ‘a’, ‘is’,…
• Exclude them from being processed; helps to reduce the number of features/words
• Commonly applied for search, text classification, topic modelling, topic extraction, etc.
• A stopword list can be custom-made for a specific context/domain
Stop Word Removal
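A minimal sketch of stop word removal on an already tokenised sentence. The list `STOP_WORDS` is a small hypothetical example; in practice it would come from a library or be custom-made for the domain.

```python
# Hypothetical stop word list; real lists contain a few hundred entries
STOP_WORDS = {"the", "a", "is", "are", "on"}

def remove_stop_words(tokens):
    """Keep only tokens whose lower-cased form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["Cars", "are", "driven", "on", "the", "road"]
print(remove_stop_words(tokens))
# ['Cars', 'driven', 'road']
```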
Text Normalisation
• Transforming a text into a canonical (standard) form
• Important for noisy text, e.g., social media comments, text
• Used when there are many abbreviations, misspellings and out-of-vocabulary (OOV) words
2moro → tomorrow
2mrw → tomorrow
tomrw → tomorrow
B4 → before
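One simple way to implement the mappings above is a lookup table. This is a sketch under the assumption that variants are known in advance; the dictionary `NORMALISATION` contains only the four examples from the slide, and real systems often add spelling correction for unseen variants.

```python
# Lookup table built from the examples above (keys are lower-cased)
NORMALISATION = {
    "2moro": "tomorrow",
    "2mrw": "tomorrow",
    "tomrw": "tomorrow",
    "b4": "before",
}

def normalise(tokens):
    """Replace known noisy variants with their canonical form; leave other tokens as they are."""
    return [NORMALISATION.get(t.lower(), t) for t in tokens]

print(normalise(["See", "you", "2moro", "B4", "lunch"]))
# ['See', 'you', 'tomorrow', 'before', 'lunch']
```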
Noise Removal
• Remove unnecessary spacing
• Remove punctuation and special characters (regular expressions)
• Unify numbers
• Highly domain dependent
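The three steps above can be sketched with regular expressions. The placeholder `#num#` and the decision to keep the `#` character are assumptions for this example; as noted, the right choices depend on the domain.

```python
import re

def remove_noise(text):
    """Unify numbers, drop punctuation/special characters, and collapse spacing."""
    text = re.sub(r"\d+(?:\.\d+)?", " #num# ", text)  # unify numbers into one placeholder
    text = re.sub(r"[^\w\s#]", " ", text)             # remove punctuation, but keep '#'
    text = re.sub(r"\s+", " ", text)                  # collapse runs of whitespace
    return text.strip()

print(remove_noise("Coffee   is only 2 bucks,  sometimes 3.50!!"))
# Coffee is only #num# bucks sometimes #num#
```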
So far… Unstructured Text Data
• Text search – approximate string matching
• Preprocessing
– Regular expressions
– Tokenisation
– Case folding
– Stemming
– Lemmatization
– Stop word removal
– Text normalization
– Noise removal
Text representation – I

Structured data example (Iris): the features are sepal length, sepal width, petal length and petal width; the label is the species (Iris setosa, Iris versicolor or Iris virginica).
How To Represent Text?
Text Representation – BoW
• Bag-of-words: simplest vector space representational model for text
• Disregards word order and grammatical features such as POS
• Each text document is represented as a numeric vector
– each dimension is a specific word from the corpus
– the value is either the word's frequency in the document or its occurrence (denoted by 1 or 0).
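A minimal bag-of-words sketch using `collections.Counter`. The two token lists are hypothetical, already preprocessed documents; the vocabulary is simply the sorted set of all words in the corpus.

```python
from collections import Counter

# Hypothetical preprocessed documents (tokenised, case-folded, stop words removed)
docs = {
    "A": ["car", "driven", "road"],
    "B": ["truck", "driven", "highway", "highway"],
}

# One dimension per distinct word in the corpus
vocabulary = sorted({t for tokens in docs.values() for t in tokens})

def bow_vector(tokens, binary=False):
    """Return frequency counts (or 1/0 occurrence flags) in vocabulary order."""
    counts = Counter(tokens)
    if binary:
        return [1 if counts[t] > 0 else 0 for t in vocabulary]
    return [counts[t] for t in vocabulary]

print(vocabulary)                          # ['car', 'driven', 'highway', 'road', 'truck']
print(bow_vector(docs["A"]))               # [1, 1, 0, 1, 0]
print(bow_vector(docs["B"]))               # [0, 1, 2, 0, 1]
print(bow_vector(docs["B"], binary=True))  # [0, 1, 1, 0, 1]
```

Word order is lost: any reordering of a document's tokens gives the same vector.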
Prepare for BoW
• Word tokenisation
• Case-folding
• Abstraction of numbers (#num#, #year#)
• Stop word removal
• Stemming
– How would this look different if we lemmatised instead?
• Punctuation removal
• Word counts
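The preparation steps can be chained into one small function. This is a sketch only: stemming is left out, the stop word list is hypothetical, and treating four-digit numbers starting with 19 or 20 as years is a simplifying assumption.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "in", "and", "of", "to"}  # hypothetical list

def abstract_number(token):
    """Map numeric tokens to the placeholders #year# or #num#."""
    if re.fullmatch(r"(?:19|20)\d{2}", token):   # simplistic rule for years
        return "#year#"
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        return "#num#"
    return token

def prepare(text):
    """Tokenise, case-fold, abstract numbers, drop stop words, then count words."""
    # Keeping only numbers and word characters also removes punctuation
    tokens = re.findall(r"\d+(?:\.\d+)?|\w+", text.lower())
    tokens = [abstract_number(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

counts = prepare("In 2022 the cars and the 3 trucks: cars, cars!")
print(counts)
```

For this input the counts are 3 for 'cars' and 1 each for '#year#', '#num#' and 'trucks'.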
Bag of Words
Term Frequency
• What if a rare word occurs in a document?
• e.g. ‘catarrh’ is less common than ‘mucus’ or ‘stuffiness’
• What if a word occurs in many documents?
• Maybe we want to avoid raw counts?
– Raw frequencies vary with document length
– They don’t capture important (rare) features that would be telling of a type of document
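A small illustration of the document-length problem, using hypothetical token lists. Dividing each count by the document length is one simple alternative to raw counts; it is shown here only to make the issue concrete.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Divide each raw count by the total number of tokens in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

short_doc = ["catarrh", "cough"]   # hypothetical short document
long_doc = short_doc * 10          # the same content repeated ten times

# Raw counts differ by a factor of ten...
print(Counter(short_doc)["catarrh"], Counter(long_doc)["catarrh"])   # 1 10
# ...while relative frequencies are identical
print(relative_frequencies(short_doc)["catarrh"],
      relative_frequencies(long_doc)["catarrh"])                     # 0.5 0.5
```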
Text representation – II

Raw Frequencies
• What are the problems?
• What are the alternatives?
COMP20008 Elements of Data Processing

play grace crowd
play grace audience

• Discourse on Floating Bodies – Galileo Galilei
• Treatise on Light – Christiaan Huygens
• Experiments with Alternate Currents of High Potential and High Frequency – Nikola Tesla
• Relativity: The Special and General Theory – Albert Einstein

• TF-IDF stands for Term Frequency-Inverse Document Frequency
• Each text document is represented as a numeric vector
– each dimension is a specific word from the corpus
• A combination of two metrics to weight a term (word)
• term frequency (tf): how often a given word appears within a document
• inverse document frequency (idf): down-weights words that appear in many documents.
• Main idea: reduce the weight of frequent terms and increase the weight of rare and indicative ones.
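Term frequency (TF):
• $\mathrm{tf}(t, d)$ = the raw count of term $t$ in document $d$.
Inverse Document Frequency (IDF):
• $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$ or $\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}_t} + 1$
• $N$ is the number of documents in the collection,
• $\mathrm{df}_t$ is the document frequency, the number of documents containing the term $t$.
TF-IDF (L2 normalised):
• $\mathrm{tf\_idf}(t, d) = \frac{v_t}{\sqrt{\sum_{t' \in d} v_{t'}^2}}$, where $v_t = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$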
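The definitions above can be sketched in a few lines of Python, using the smoothed IDF variant $\ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$. The function name `tf_idf` and the two token lists are assumptions for illustration (short documents with stop words already removed), and documents are assumed to be non-empty.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict mapping name -> list of tokens.
    Returns (idf, vectors), where each vector holds L2-normalised tf-idf weights."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)                           # raw counts
        v = {t: tf[t] * idf[t] for t in tf}            # v_t = tf(t, d) * idf(t)
        norm = math.sqrt(sum(w * w for w in v.values()))
        vectors[name] = {t: w / norm for t, w in v.items()}
    return idf, vectors

docs = {
    "A": ["car", "driven", "road"],
    "B": ["truck", "driven", "highway"],
}
idf, vectors = tf_idf(docs)
print(round(idf["car"], 3), round(idf["driven"], 3))   # 1.405 1.0
print({t: round(w, 3) for t, w in vectors["A"].items()})
```

A term found in only one of the two documents ('car') gets a higher weight than a term found in both ('driven').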
Example TF-IDF
Two documents: A – ‘the car is driven on the road’
B – ‘the truck is driven on the highway’
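Columns of the worked table: $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$; $v_t = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$; $\mathrm{tf\_idf}(t, d)$.
With $N = 2$, each term that occurs in only one document has $\mathrm{idf}(t) = \ln\frac{3}{2} + 1 = 1.405$.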
Example TF-IDF – cont.
• Two documents, A and B.
A. ‘the car is driven on the road’
B. ‘the truck is driven on the highway’
* stop words removed
• Text features for machine learning
• 3 documents:
A: ‘the car is driven on roads’
B: ‘the truck is driven on a highway’
C: ‘a bike can not be ridden on a highway’
* stop words removed
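Columns of the worked table: $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$; $v_t = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$; $\mathrm{tf\_idf}(t, d)$.
With $N = 3$: $\ln\frac{4}{2} + 1 = 1.693$ for a term found in one document, and $\ln\frac{4}{3} + 1 = 1.288$ for a term found in two documents.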
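As a worked check for document A of the three-document example, assume that 'the', 'is' and 'on' are treated as stop words, so A keeps the terms car, driven and roads, each with a raw count of 1. 'driven' also appears in B, so its document frequency is 2; 'car' and 'roads' appear only in A. Values are rounded to three decimals.

```latex
\begin{align*}
\mathrm{idf}(\text{car}) = \mathrm{idf}(\text{roads}) &= \ln\frac{1 + 3}{1 + 1} + 1 \approx 1.693 \\
\mathrm{idf}(\text{driven}) &= \ln\frac{1 + 3}{1 + 2} + 1 \approx 1.288 \\
\sqrt{\textstyle\sum_{t' \in A} v_{t'}^2} &= \sqrt{1.693^2 + 1.288^2 + 1.693^2} \approx 2.719 \\
\mathrm{tf\_idf}(\text{car}, A) = \mathrm{tf\_idf}(\text{roads}, A) &\approx \frac{1.693}{2.719} \approx 0.623 \\
\mathrm{tf\_idf}(\text{driven}, A) &\approx \frac{1.288}{2.719} \approx 0.474
\end{align*}
```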

Features from unstructured text
Features for structured data
TF-IDF features for unstructured text
Revision – Finding similar texts
• Edit distance
• N-gram distance
• Jaccard similarity
• Sørensen-Dice similarity
• Cosine similarity
$\cos(d_A, d_B) = \frac{\mathbf{v}_A \cdot \mathbf{v}_B}{\lVert \mathbf{v}_A \rVert \times \lVert \mathbf{v}_B \rVert} \approx 0.202$
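A sketch of cosine similarity over sparse term-weight vectors. It assumes that the value of about 0.202 refers to documents A and B of the two-document example with stop words removed, weighted by TF-IDF with the smoothed IDF; under that assumption the computation below reproduces it.

```python
import math

def cosine_similarity(u, v):
    """u, v: dicts mapping term -> weight; a missing term counts as weight 0."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

# idf of a term found in one of two documents; 'driven' is in both, so its idf is 1
w = math.log(3 / 2) + 1
d_a = {"car": w, "driven": 1.0, "road": w}
d_b = {"truck": w, "driven": 1.0, "highway": w}

print(round(cosine_similarity(d_a, d_b), 3))   # 0.202
```

Only the shared term 'driven' contributes to the dot product, which is why the similarity is low.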
So far… Unstructured Text Data
• Text search – approximate string matching
• Preprocessing
– Regular expressions
– Tokenisation
– Case folding
– Stemming
– Lemmatization
– Stop word removal
– Text normalization
– Noise removal
• Text representation
Other Text Features
• Part-of-speech tagging
– She saw a bear. (bear: NOUN)
– Your efforts will bear fruit. (bear: VERB)
– Resulting features: bear_NN; bear_VB
• N-grams (bag of n-grams)
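A short sketch of why a bag of n-grams keeps some word order that a bag of words discards. The helper name `ngrams` and the two example sentences are chosen for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "a is better than b".split()
b = "b is better than a".split()

# Same bag of words...
print(Counter(a) == Counter(b))                          # True
# ...but different bags of bigrams
print(Counter(ngrams(a, 2)) == Counter(ngrams(b, 2)))    # False
print(ngrams(a, 2))
# [('a', 'is'), ('is', 'better'), ('better', 'than'), ('than', 'b')]
```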
Advanced text representation
• Surface value vs semantics
– ‘A is better than B’
– ‘B is better than A’
• ‘Distributed Representations of Sentences and Documents’, Quoc Le and Tomas Mikolov (Doc2Vec)
• Text search – approximate string matching
• Preprocessing
– Regular expressions
– Tokenisation
– Case folding
– Stemming
– Lemmatization
– Stop word removal
– Text normalization
– Noise removal
• Text representation
– Part of Speech Tagging
– Bag-of-n-grams
– Distributed representation learning
