CE306/CE706 Information Retrieval
Ad-hoc Search & Processing Pipeline
Dr Alba Garcia Seco de Herrera
Brief Module Outline (Reminder)
Motivation + introduction
Processing pipeline / indexing + query processing
Large-scale open-source search tools
Information Retrieval models
Evaluation
Log analysis
User profiles / personalisation / contextualisation
IR applications, e.g. enterprise search
Main Information Retrieval Tasks
Ad-hoc retrieval:
relatively static document collection
different user queries
examples: library catalogues, Web search engines
Filtering/Routing:
relatively static user requests
constantly updated document collection
examples: news filtering, classification, spam filter
Ad-hoc Retrieval Text-based retrieval
Given a query and a corpus, find relevant items
query: textual description of information need corpus: a collection of textual documents
relevance: satisfaction of the users information need
Reminder
Or
Ad-hoc Retrieval
Start with traditional document collection, i.e. treat document as plain text and not HTML document
Look at how we match a query against a collection of documents
To do so we need to process the document collection into an index structure that allows fast access and search
So its actually more like this:
or as Moodle would put it:
Why start with plain Documents?
Main processing/indexing pipeline is the same
Transparent process:
not constantly tweaked (e.g., links) or tuned using user-interaction data (e.g., clicks) as in Web search
Very common:
government and corporate intranets (enterprise search), social media, your own personal computers (e.g., Word and PDF documents)
Basic IR Process
Basic IR Process
Document representation
1
Most Basic View of a Search
Engine
A search engines does not scan each document to see if it satisfies the query
It uses an index to quickly locate the relevant documents
Index: a list of concepts with pointers to documents that discuss them
Index from Manning et al., 2008
Most Basic View of a Search Engine
So, what goes in the index is important! How might we combine concepts, e.g.:
patent search + A/B test ? A/B test or user test ?
Most Basic View of a Search Engine
Most Basic View of a Search Engine
Most Basic View of a Search Engine
Document Representation
Document representation: deciding what concepts should go in the index
Option 1 (controlled vocabulary): a set a manually constructed concepts that describe the major topics covered in the collection
Option 2 (free-text indexing): the set of individual terms that occur in the collection
Document representation
Index from Manning et al., 2008
If we view option 1 and option 2 as two extremes, where does this particular index fit in?
Document Representation: (option 1: controlled vocabulary)
Think of examples in which controlled vocabularies are being used
Why would this be done?
What are the pros and cons compared to indexing?
Document = Bag of Words
(i.e., option 2: free-text indexing)
Need to locate important information in a document
Deep natural language understanding too expensive
Common assumption: document is a list of words
Documents represented by a set of index terms
The search engine will determine which terms are important (depends on the retrieval model applied)
Free-text Indexing:
Retrieval Process
1. Define text database, i.e. data sources, operations on text and text structure
2. Process document collection into index database for fast access
3. Parse user query using same text processing operations
4. Apply query operations to processed query
5. Generate system query to retrieve matching
documents
6. Rank and display results
Free-text Indexing
Free-text Indexing what you see
Free-text Indexing what your computer sees
Gerard Salton(8 March 1927 in Nuremberg 28 August 1995), also known as Gerry Salton, was a Professor of Computer Scienceat Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrievalduring his time. His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard.
Free-text Indexing mark-up vs. content
Gerard Salton(8 March 1927 in Nuremberg 28 August 1995), also known as Gerry Salton, was a Professor of Computer Scienceat Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrievalduring his time. His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard.
Free-text Indexing
mark-up
Describes how the content should be presented
e.g., your browser interprets html mark-up and presents the page as intended by the author
Can also define relationships with other documents (e.g., hyperlinks)
Can provide evidence of what text is important for search
It may also provide useful, unseen information!
Remember: we ignore mark-up for now
Free-text Indexing Step 1: mark-up removal
Gerard Salton(8 March 1927 in Nuremberg 28 August 1995), also known as Gerry Salton, was a Professor of Computer Scienceat Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrievalduring his time. His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard.
Remove HTML mark-up
Free-text Indexing Step 1: mark-up removal
Gerard Salton(8 March 1927 in Nuremberg 28 August 1995), also known as Gerry Salton, was a Professor of Computer Scienceat Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrievalduring his time. His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard.
Result is plain text
Free-text Indexing Step 2: tokenization
Gerard Salton( 8 March 1927 in Nuremberg 28 August 1995 ) , also known as Gerry Salton , was a Professor of Computer Scienceat Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrievalduring his time . His group at Cornell developed the SMART Information Retrieval System, which he initiated when he was at Harvard .
Splitting text into white-space separated `words (Possibly) remove punctuation altogether
Not that simple (e.g., ambiguity of punctuation)! Canada++ > Canada? Ph.D > Ph D ?
Free-text Indexing Step 3: case-folding
gerard salton( 8 March 1927 in Nuremberg 28 August 1995 ) , also known as gerry salton , was a professor of computer scienceat cornell university. salton was perhaps the leading computer scientist working in the field of information retrievalduring his time . his group at cornell developed the smart information retrieval system, which he initiated when he was at harvard .
Typical representation: lower case
London -> london LONDON > london
But: SMART -> smart March > march
Free-text Indexing Step 4: stop word removal
gerard salton 8 march 1978 nuremberg 28 august 1995 known gerry salton professor computer science cornell university salton leading computer scientist working field information retrieval time group cornell developed smart information retrieval system initiated harvard
Stopwords: words that we choose to ignore because we expect them to not be useful in distinguishing between relevant/non-relevant documents for any query
Very frequent words for example (see Zipfs law later on)
Closed class vs. open class words
(e.g. determiners and prepositions vs. nouns and verbs)
But: `To be or not to be
Free-text Indexing Final step
Apply the pipeline of processing steps to every document in the collection and create an index using the union of all remaining terms
We will look at the structure of the index another week
As we will see, different structures are possible
Different models to match a query against the index, e.g. Boolean, Vector Space, Probabilistic (next time)
Note: the same processing steps need to be applied to the query that gets submitted!
Free-text Indexing Other possible steps
Morphological analysis e.g. initiated > initiate
Stemming
e.g. computer/computing > comput
Identification of phrases
e.g. leading computer scientist
Free-text Indexing Other possible steps
Named-entity recognition e.g. cornell university
Relation extraction
e.g. is(gerry salton, computer scientist)
lets look at some issues in more detail (more to come watch out for the assignments!)
Free-text Indexing Steps Expanded Zipfs Law
Free-text Indexing Steps Expanded Zipfs Law
Distribution of word frequencies very skewed
A few words occur very often, many words hardly ever occur,
e.g., two most common words (the, of) make up about 10% of all word occurrences in text documents
Rank (r) of a word times its frequency (f) is approximately a constant (k), i.e.
f*r=k
Free-text Indexing Steps Expanded Zipfs Law
Top 50 Words from a collection of news articles (AP89)
Free-text Indexing Steps Expanded Zipfs Law
AP89: distribution of words (rank-ordered) on a logarithmic scale
Free-text Indexing Steps Expanded Zipfs Law
Alice in Wonderland
(text available via Project Gutenberg)
Free-text Indexing Steps Expanded Zipfs Law
European Parliament: Hungarian (transcribed speech, see http://homepages.inf.ed.ac.uk/pkoehn/ )
Free-text Indexing Steps Expanded Zipfs Law
European Parliament: Hungarian (transcribed speech, see http://homepages.inf.ed.ac.uk/pkoehn/ )
Free-text Indexing Steps Expanded Morphology
Many morphological variations of words (with same or similar meaning)
inflectional: buy, bought, buying
derivational: computer, compute, computable But more complicated in languages other
than English, e.g.
Helsingin sanomat > Helsinki +GEN + sanoma
+PLURAL
Gaststattenneueroffnungsuntergangsgewissheit
(Restaurant+new+opening+failure+certainty)
(Ben Schott Schottenfreude: German Words for the Human Condition, John Murray (Publishers), 2013).
Free-text Indexing Steps Expanded Stemming
Generally a small but significant effectiveness improvement
Can be crucial for some languages, e.g., 5-10% improvement for English, up to 50% in Arabic
Words with the Arabic root ktb
Free-text Indexing Steps Expanded Extracting phrases
Many queries are 2-3 word phrases
Phrases are more meaningful, less ambiguous and more precise than single words
e.g. exam timetable vs. timetable
Often via part-of-speech (POS) tagging as
the initial step
Alternatively simply word n-grams
Free-text Indexing Steps Expanded Part-of-speech Tagging
POS taggers use statistical models of text to predict syntactic tags of words
Phrases can then be defined as simple noun groups, for example
Free-text Indexing Steps Expanded Part-of-speech Tagging
Examples Noun Phrases
Free-text Indexing
Some possible implementations
Mark-up removal
e.g., jsoup, Beautiful soup
Tokenization:
regular expression matching
Case-folding:
regular expression (matching+substitution)
Stopword removal: stopword lists
Free-text Indexing
Some possible implementations
Stemming:
e.g. Porter stemmer
Part-of-speech tagging, NER: e.g., Stanford tools
Phrase extraction:
POS tagger + rule-based POS patterns
Relation extraction:
e.g., Stanford tools, GATE
Document Representation controlled vocabulary vs. free-text indexing
Cost of assigning index terms
Ambiguity of index terms
Detail of
representation
Controlled Vocabularies
High/Low?
Ambiguous/ Unambiguous?
Can represent arbitrary level of detail?
Free- text Indexing
High/Low?
Ambiguous/ Unambiguous?
Can represent arbitrary level of detail?
Document Representation controlled vocabulary vs. free-text indexing
Cost of assigning index terms
Ambiguity of index terms
Detail of
representation
Controlled Vocabularies
High
Not ambiguous
Cant represent arbitrary detail
Free- text Indexing
Low
Can be ambiguous?
Any level of detail
Both are effective and used often
We will focus on free-text indexing
cheap and easy
most search engines use it (even those that adopt a
controlled vocabulary)
Document representation
1
Document representation
2
Document representation
3
Document representation
4
Summary
We focus on ad hoc retrieval
Documents need to be indexed before they can be searched
Two approaches to indexing:
controlled vocabulary vs. free-text indexing
Free-text indexing is composed of sequence of typical processing steps (same steps applied to query)
Then compare (processed) query with index database
Next: how do we match query against index? Ranking?
Summary
Here is a recent example that combines controlled vocabulary and free-text indexing (read up about it and find out exactly how the two approaches are combined and why):
Reading
Chapter 2
Chapter 4.1 and 4.3 CE706:
John Justeson and Slava Katz:Technical Terminology: some linguistic properties and an algorithm for identification in text, Natural Language Engineering, 1(1):927, 1995. http://journals.cambridge.org/action/displayAb stract?fromPage=online&aid=1313044
Acknowledgements
Thanks again to contain substantial material prepared by Udo Kruschwitz and Jaime Arguello
Additional material as provided by Croft et al. (2015)
Reviews
There are no reviews yet.