IR H/M Course
Quest for Efficiency
Remember the scale of Web-scale search engines
Microsoft Bing uses hundreds of thousands of servers in their search engine
Any new feature/retrieval technique should not increase the response time of the system as this might lead to a loss in revenues
Hence, deploying a retrieval technique that causes a 1% increase in response time implies 1000 additional servers must be activated in their data centre(s)
This has significant cost and Green impact
IR infrastructures are concerned with making retrieval effective yet efficient
Efficient infrastructure techniques
Static Pruning
Query Evaluation (TAAT & DAAT), and Dynamic Pruning
Index Compression
Distributed IR infrastructure
Distributed Retrieval
Query Efficiency Prediction
IR H/M Course
Increasing Efficiency
SMALLER, BETTER, FASTER
Recall: The Format of an Index
An index normally contains three sub-structures
Lexicon: Records the list of all unique terms and their statistics
Document Index: Records the list of all documents and their statistics
Inverted Index: Records the mapping between terms and documents
[Diagram: the Term Pipeline and the three index sub-structures]
Document Index: could also contain other document information, e.g. PageRank
Inverted Index: could also contain other occurrence information, e.g. term positions, fields (title, URL)
Lexicon entry: id df cf p (p points to the term's posting list)
Posting list: id tf  id tf  ...
each entry (posting) represents a document
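As a rough illustration only (a sketch with invented toy data, not the actual data structures of any particular system), the three sub-structures could be modelled in Python as:

# Toy illustration of the three index sub-structures.
lexicon = {
    # term -> (term id, document frequency, collection frequency, pointer into inverted index)
    "rain":  (0, 2, 3, 0),
    "spain": (1, 1, 1, 2),
}
document_index = {
    # docid -> (document length, other info, e.g. a PageRank score)
    0: (12, 0.25),
    1: (7, 0.10),
}
inverted_index = [
    # postings: (docid, term frequency); could also carry positions/fields
    (0, 2), (1, 1),   # postings for "rain"  (pointer 0, df 2, cf 3)
    (0, 1),           # postings for "spain" (pointer 2, df 1, cf 1)
]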
IR H/M Course
Recall: Search Efficiency
It is important to make retrieval as fast as possible
Research indicates that even slightly slower retrieval (0.2s-0.4s) can lead to a dramatic drop in the perceived quality of the results [1]
So what is the most costly part of a (classical) search system?
[Diagram: Term Pipeline -> Document Retrieval Model -> Re-Ranking -> Top Results]
The Document Retrieval Model: scoring each document for the user query
[1] Teevan et al. Slow Search: Information Retrieval without Time Constraints. HCIR13
Why is Document Scoring Expensive?
The largest reason for the expense of document scoring is that there are lots of documents:
A Web search index can contain billions of documents
Google currently indexes trillions of pages [1]
More specifically, the cost of a search is dependent on:
Query length (the number of search terms)
Posting list length for each query term
i.e. The number of documents containing each term
term1 id df cf p  id tf  id tf  id tf
term2 id df cf p  id tf  id tf
[1] http://www.statisticbrain.com/total-number-of-pages-indexed-by-google/
IR H/M Course
Strategies to Speed up Retrieval
There are several enhancements to the search architecture that can make search more efficient
Search result/Term caching
Where possible, avoid the scoring process altogether
Pruning
Static Pruning: Skip the indexing of documents that are not likely to make the first few search result pages
Dynamic Pruning: Skip the scoring of documents that will not make the first few search result pages
Index compression
Reduce the time it takes to read a posting list
Discuss a number of strategies to speed-up the retrieval process along with their pros and cons.
IR H/M Course
Search Result/Term Caching
Caching strategies are built on the idea that we should store answers to past queries
Past answers can be used to bypass the scoring process for subsequent queries
For popular queries, caching is very beneficial
There are two types of caching strategy:
Search Result Caching stores the final ranked list of documents for a query
Term Caching stores the posting lists for each of the query terms in memory
Caching is beneficial to efficiency [1]:
Search Result caching can avoid scoring for 50% of queries (so-called head queries)
Term caching can avoid loading one or more posting lists for 88% of queries
[1] Baeza-Yates et al. The impact of caching on search engines. SIGIR07
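As an illustrative sketch (not a production design), a search result cache can be as simple as a bounded map from the normalised query to its ranked results; the LRU eviction policy and capacity below are assumptions for the example:

from collections import OrderedDict

class ResultCache:
    # A tiny LRU cache of final ranked lists, keyed by the (normalised) query.
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.cache = OrderedDict()

    def _key(self, query):
        return " ".join(sorted(query.lower().split()))

    def get(self, query):
        key = self._key(query)
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as recently used
            return self.cache[key]           # cache hit: no scoring needed
        return None                          # cache miss: fall through to retrieval

    def put(self, query, ranked_results):
        key = self._key(query)
        self.cache[key] = ranked_results
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry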
Search Result/Term Caching
Search Result Caching Architecture
[Diagram: query -> Term Pipeline -> Result Cache; on a Cache Miss, the Document Retrieval Model and Re-Ranking are invoked, and the results for the query are then cached]
Term Caching Architecture
[Diagram: query -> Term Pipeline -> Posting Cache, serving Get Term Posting List(s) requests for the Document Retrieval Model and Re-Ranking; additional term posting lists are cached. Variants: all term posting lists are cached, or only some term posting lists are cached]
Discuss the pros & cons of the 2 caching techniques.
IR H/M Course
More on Caching
Memory is cheap
The logical consequence is that many search engines keep the entire index in memory
SSDs and hard drives offer slower storage tiers
For many queries, phrases can be used to help the ranking
If we had bigram posting lists, we could score these queries much quicker
But we cannot store postings for all bigrams
Instead, frequent bigrams from the query log can be selected, and then SSD and disk space can be used to cache and store these term pair posting lists [1]
Then decide on a per-query basis to use them or not [2]
[1] Yan et al. Efficient Term Proximity Search with Term-Pair Indexes. CIKM 2010
[2] Risvik et al. Maguro, a system for indexing and searching over very large text collections. WSDM 2013
Paired Posting Lists
Consider the example query white house from Lect. 8
We can retrieve for the phrase white house by intersecting the inverted index posting lists for white and house, calculating a pair frequency (pf) for each document
white: 0 <1,5>  3 <2,4,6>  6 <4>
house: 0 <2,6>  5 <3>  6 <5>
(each posting: docid <positions>)
In essence, we can simulate a posting list for white house, without indexing it as a bigram
Then if white house occurs frequently in the query stream, it can be cached, e.g. to SSD
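A minimal sketch of this intersection, using the toy positional posting lists above (the dict layout and function name are illustrative assumptions, not a real system's API):

# Intersect two positional posting lists and count adjacent (term1, term2)
# occurrences per document, simulating a "white house" bigram posting list.
def pair_frequency(postings1, postings2):
    # each postings argument is a dict: docid -> sorted list of positions
    paired = {}
    for docid in postings1.keys() & postings2.keys():
        pos2 = set(postings2[docid])
        pf = sum(1 for p in postings1[docid] if p + 1 in pos2)
        if pf > 0:
            paired[docid] = pf
    return paired

white = {0: [1, 5], 3: [2, 4, 6], 6: [4]}
house = {0: [2, 6], 5: [3], 6: [5]}
print(pair_frequency(white, house))   # e.g. {0: 2, 6: 1}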
IR H/M Course
STATIC PRUNING
K is usually very small vs. the size of the collection (N)
K << 1000, e.g. K=20 for the first page of results
Even if we are re-ranking results, K << N, e.g. K=5000
Question: do we really need to keep ALL of the information in the index for a good-quality top-K search for common queries?
There should be a way to remove some of the less important data, while (hopefully) retaining the quality of the top-K results!
What is (not) important?
Will some documents never be retrieved, for any query?
Will some terms never help retrieval?
IR H/M Course
Static Pruning Concepts
Document-based pruning: discards terms from a document that are less representative of the document's content
Can be applied on-the-fly during indexing, IF we have reasonable collection statistics
Term-based pruning: discards term occurrences that are less likely to affect the retrieval performance of a specific weighting model (e.g. BM25)
Done after the index is completed
Pruning Algorithm
For each common query:
  Run it against the full index
  Record the top-K matching documents
For each such document:
  Record the terms and term positions that contributed to the score
Finally: remove all non-recorded postings and terms
First proposed by D. Carmel (2001) for single-term queries
What is static pruning? Discuss 2 possible static pruning techniques.
IR H/M Course
Static Pruning Advantages & Disadvantages
Some results claim a modest negative impact on effectiveness when pruning up to 60% of postings
Static pruning is a lossy compression of the inverted index
Index is smaller; retrieval is faster; pruning is done offline
An application of static pruning is a multi-tier architecture:
Most common queries are handled by a heavily pruned 1st tier index that fits wholly in RAM
Less common queries handled by a 2nd tier index (on SSD?)
Remaining queries handled by a full index on disk
QUERY EVALUATION & DYNAMIC PRUNING
IR H/M Course
Query Evaluation
Even when using caching and/or static pruning, retrieval still needs to score many documents in the index
i.e. when the query/query terms are not in the cache
Normal strategies make a pass on the posting lists for each query term
This can be done Term-at-a-Time (TAAT): one query term at a time
Or Document-at-a-Time (DAAT): all query terms in parallel
I will explain these, before showing how we can improve them
Compare and contrast the TAAT and DAAT query evaluation techniques.
[Diagram: TAAT evaluation, traversing the term1 and term2 posting lists one term at a time]
IR H/M Course
[Diagram: TAAT evaluation, continued]
Advantages:
Simple
Disadvantages:
Requires lots of memory to contain partial scores for all documents
Difficult to do Boolean or phrase queries, as we don't have all postings for a given document at the same time
IR H/M Course
[Diagram: DAAT evaluation, traversing the term1 and term2 posting lists in parallel, document by document]
Advantages:
Reduced memory compared to TAAT (and hence faster)
Supports Boolean query operators, phrases, etc.
Disadvantages:
Slightly more complex to implement
Most commercial search engines are reported to use DAAT
IR H/M Course
Dynamic Pruning during Query Evaluation
Dynamic pruning strategies aim to make scoring faster by only scoring a subset of the documents
The core assumption of these approaches is that the user is only interested in the top K results, say K=20
During query scoring, it is possible to determine if a document cannot make the top K ranked results
Hence, the scoring of such documents can be terminated early, or skipped entirely, without damaging retrieval effectiveness to rank K
We call this safe-to-rank-K
Dynamic Pruning Techniques
The two most well-known methods for DAAT dynamic pruning are MaxScore and WAND
MaxScore [1]
Early termination: does not compute scores for documents that won't be retrieved, by comparing term upper bounds with a score threshold
WAND [2]
Approximate evaluation: does not consider documents with approximate scores (sum of upper bounds) lower than the threshold
Therefore, it focuses on the combinations of terms needed (wAND) for a document to be retrieved
[1] H. Turtle & J. Flood. Query Evaluation: Strategies and Optimisations. IPM 31(6). 1995.
[2] A. Broder et al. Efficient Query Evaluation using a Two-Level Retrieval Process. CIKM 2003.
IR H/M Course
[Diagram: MaxScore example. Require K=20 documents; weighting model BM25; term scores determined using upper bounds: could make top K]
[Diagram: MaxScore example. Require K=20 documents; weighting model BM25; term score calculated using BM25: PRUNE, can't make top K!]
IR H/M Course
[Diagram: MaxScore example. Require K=20 documents; weighting model BM25; term scores determined using upper bounds: PRUNE, can't make top K!]
WAND Example
Require K=20 documents; weighting model BM25
WAND focuses on the combinations of terms needed (c.f. Weighted AND) to reach the threshold
With threshold 4.5, any document without term1 cannot make the retrieved set. Hence, we can skip docid 25 in the term2 posting list
Hence, it will focus retrieval using term1, and only score term2 for documents that could exceed the threshold
For both MaxScore & WAND, smaller K => faster retrieval
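The sketch below shows the key idea shared by these techniques: compare the sum of the term upper bounds with the current top-K threshold, and skip full scoring when a document cannot make the top K. It is a simplified illustration assuming plain in-memory posting lists and a caller-supplied score function, not the exact MaxScore or WAND algorithms:

import heapq

def daat_with_pruning(postings, upper_bounds, score, K):
    # postings: term -> list of (docid, tf) sorted by docid
    # upper_bounds: term -> maximum score that term can contribute to any document
    # score: function(term, tf) -> partial score for one (term, document) pair
    top_k = []                                   # min-heap of (score, docid)
    threshold = 0.0                              # score of the current K-th best document
    cursors = {t: 0 for t in postings}
    while True:
        # DAAT: the next candidate is the smallest docid under the posting list cursors
        candidates = [postings[t][cursors[t]][0]
                      for t in postings if cursors[t] < len(postings[t])]
        if not candidates:
            break
        docid = min(candidates)
        present = [t for t in postings
                   if cursors[t] < len(postings[t]) and postings[t][cursors[t]][0] == docid]
        # WAND-style approximate check: the summed upper bounds of the query terms
        # present in this document must exceed the threshold, else skip full scoring
        if sum(upper_bounds[t] for t in present) > threshold:
            s = sum(score(t, postings[t][cursors[t]][1]) for t in present)
            if len(top_k) < K:
                heapq.heappush(top_k, (s, docid))
            elif s > top_k[0][0]:
                heapq.heapreplace(top_k, (s, docid))
            if len(top_k) == K:
                threshold = top_k[0][0]
        for t in present:                        # advance the cursors past this document
            cursors[t] += 1
    return sorted(top_k, reverse=True)

In the real MaxScore and WAND implementations the same idea is combined with skipping inside the (compressed) posting lists, so pruned documents need not even be decompressed.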
IR H/M Course
Some numbers
To demonstrate the benefit of dynamic pruning, we report experiments from [1]
Retrieve K=1000 documents with BM25 on the ClueWeb09 collection, using 1000 real search engine queries
This is for safe-to-rank 1000. Both WAND & MaxScore can be configured to be faster, but unsafe, i.e. permit losses in effectiveness above rank K
This is achieved by over-inflating the threshold
There are also unsafe dynamic pruning techniques for TAAT
Overall, dynamic pruning is an important component of modern search engine deployments
[Table: Mean Response Time and Postings Scored for each Query Evaluation strategy (e.g. Exhaustive DAAT vs. dynamic pruning); values not recovered]
[1] N Tonellotto, C Macdonald, and I Ounis. Effect of different docid orderings on dynamic pruning retrieval strategies. SIGIR 2011.
INDEX COMPRESSION
IR H/M Course
Index Compression
The previous approaches make retrieval faster by scoring fewer documents
However it is also possible to make the scoring of each document faster! This can be achieved by applying index compression [1]
Motivation: it physically takes time to read the term posting lists, particularly if they are stored on a (slow) hard disk
Using lossless compressed layouts for the term posting lists can save space (on disk or in memory) and reduce the amount of time spent doing IO
But decompression can also be expensive, so efficient decompression is key!
1 integer = 32 bits = 4 bytes
term2 21 2 25 2 26 5
total = 24 bytes
Do we need 32 bits?
[1] Witten et al. Managing Gigabytes: Compressing and Indexing Documents and Images.
Delta Gaps
One component of the information stored in a posting list is the docids, which are stored in ascending order!
term2 21 2 25 2 26 5
We can make smaller numbers by taking the differences
term2 21 2 4 2 1 5 (docid gaps 21, 4, 1; the tfs are unchanged)
So each docid in the posting lists could be represented using fewer bits
How to represent these numbers?
32 bits has a range -2147483648 .. 2147483647
Using a fixed number of bits is wasteful (192 bits = 24 bytes)
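A small sketch of delta-gap encoding and decoding for the docids above (function names are illustrative):

# Turn ascending docids into delta gaps, and back again.
def to_gaps(docids):
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(to_gaps([21, 25, 26]))    # [21, 4, 1]
print(from_gaps([21, 4, 1]))    # [21, 25, 26]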
IR H/M Course
Unary & Gamma Encoding
Unary: use as many 0s as the input value x, followed by a 1
E.g.: 5 is 000001
Gamma: let N = floor(log2 x) be the highest power of 2 that x contains;
Write N out in unary representation, followed by the remainder
(x - 2^N) in binary, using N bits
E.g.: 5 is represented as 00101
Let's represent the docid gaps as gamma, and the tfs as unary
(This is the default compression used by Terrier)
term2 21 2 4 2 1 5
Encoding the docid gaps with gamma and the tfs with unary:
21 -> 000010101, tf 2 -> 001, gap 4 -> 00100, tf 2 -> 001, gap 1 -> 1, tf 5 -> 000001
= 27 bits (< 4 bytes), down from 24 bytes!
Achieved Compression Rate = 4.5 bits/int
Exercise
Consider the following posting list:
term4 14 3 15 2 17 1
id=14 tf=3, id=15 tf=2, id=17 tf=1
Encode the posting list using Unary encoding, and calculate the achieved compression rate
Firstly take the delta-gaps: 14, 3, 1, 2, 2, 1
Encode each in unary:
000000000000001 0001 01 001 001 01
Encoding 6 integers as 32-bit integers = 192 bits (i.e. 32 bits per integer)
Encoding 6 integers using Unary = 29 bits =~ 4.8 bits per integer
IR H/M Course
Other Compression Schemes
Unary & Gamma are moderately expensive to decode: lots of bit twiddling!
Other schemes are byte-aligned, e.g.:
Variable byte [1]
Simple family [2]
Documents are often clustered in the inverted index (e.g. by URL ordering)
Compression can be more effective in blocks of numbers
List-adaptive techniques work on blocks of numbers:
Frame of reference (FOR) [3]
Patched frame of reference (PFOR) [4]
[1] H.E. Williams and J. Zobel. Compressing Integers for Fast File Access. Comput. J. 1999
[2] V. Anh & A. Moffat. Inverted Index Compression using Word-aligned Binary Codes. INRT. 2005
[3] J. Goldstein et al. Compressing relations and indexes. ICDE 1998.
[4] M. Zukowski et al. Super-scalar RAM-CPU cache compression. ICDE 2006.
Frame of Reference
Idea: pick the minimum m and the maximum M values of the block of numbers that you are compressing. Then, any value x can be represented using b bits, where b = log2(M-m+1).
Example: to compress numbers in the range {2000,...,2063}, log2(64) = 6, so we use 6 bits per value: store 2000 once, followed by the 6-bit offsets
IR H/M Course
Compression: Some numbers [1]
ClueWeb09 corpus: 50 million Web documents
12,725,738,385 postings => 94.8GB inverted file uncompressed
NO retrieval numbers: WAY TOO SLOW!
Terrier's standard Gamma/Unary compression = 15GB
Compression is essential for an efficient IR system
List adaptive compression: slightly larger indices, markedly faster
[1] M Catena, C Macdonald, and I Ounis. On Inverted Index Compression for Search Engine Efficiency. ECIR 2014. Award.
[Table: mean retrieval Time (s) and index Size for each compression scheme (e.g. Gamma/Unary, Variable Byte); values not recovered]
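To make the Unary and Gamma codes described above concrete, here is a small sketch (illustrative function names) that reproduces the 27-bit figure for term2's delta-gapped posting list:

def unary(x):
    return "0" * x + "1"

def gamma(x):
    # N = floor(log2 x): write N in unary, then (x - 2^N) in N binary bits
    n = x.bit_length() - 1
    if n == 0:
        return "1"
    return unary(n) + format(x - (1 << n), "0{}b".format(n))

gaps_and_tfs = [(21, 2), (4, 2), (1, 5)]      # delta-gapped docids with their tfs
bits = "".join(gamma(gap) + unary(tf) for gap, tf in gaps_and_tfs)
print(bits, len(bits))                         # 27 bits in total, i.e. 4.5 bits per integer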
Efficient Query Evaluation
Caching, Pruning & Compression all form essential aspects of an efficient IR system
Each provides important improvements to efficiency
Some techniques like static pruning or unsafe dynamic pruning can degrade effectiveness
A search engine implementer must be aware of the tradeoffs to achieve their desired effectiveness within cost constraints
Other state-of-the-art approaches I haven't included:
Impact-ordered posting lists: an alternative index layout; requires an efficiency/effectiveness tradeoff
Block-Max WAND: integrates WAND more tightly with the index compression format
IR H/M Course
Scaling Up
WIDE SYSTEMS
Scaling Up Information Retrieval
Even with the aforementioned efficiency improvements, indexing and search is computationally and (disk) IO intensive
To satisfy high query loads, the retrieval process needs to be spread over many CPUs and hard disks
There are 2 main paradigms to scale up:
Vertical: Buy a large mainframe machine with lots of CPU cores and storage
Horizontal: Buy many machines and distribute the search process over them
IR H/M Course
Vertical vs. Horizontal Scaling
Vertical Scaling
Advantages
All resources are local to the processing
Some applications do not lend themselves to a distributed computing model
Disadvantages
Expensive infrastructure
Fault tolerance is hard to achieve
Horizontal Scaling
Advantages
Nodes can be added in an ad-hoc manner as processing power is needed
Multi-core processing nodes are inexpensive
Disadvantages
Additional communication and coordination overheads are incurred
Parallelised Indexing and Retrieval
Horizontal scaling is used by large search engines to parallelise the indexing process and the index itself
Spread the index out into shards running on many machines
Replicate each shard multiple times to allow for multiple queries to be processed in parallel and for fault tolerance
[Diagram: the index is split into shards, with each shard replicated across multiple machines]
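As a sketch of how a broker could use such shards (assuming document-partitioned shards and a hypothetical shard.search API that returns (docid, score) pairs sorted by descending score), the query is scattered to every shard and the per-shard top-K lists are merged:

import heapq

def broker_search(query, shards, K=20):
    # scatter: each shard scores the query against its own subset of the documents
    per_shard = [shard.search(query, K) for shard in shards]   # could run in parallel
    # gather: merge the per-shard ranked lists (each sorted by descending score)
    merged = heapq.merge(*per_shard, key=lambda hit: -hit[1])
    return list(merged)[:K]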
In the following, we cover distributed retrieval
IR H/M Course
Distributed Retrieval Architectures (1)
So how do we partition data between nodes?
t1 t2 t3 t4
Option 1: Term Partitioning
Each row represents a document dj and each column represents an indexing term t
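A minimal sketch of term partitioning as described above: each node is assigned a subset of the terms (columns of the term-document matrix) and stores the full posting lists for those terms only. The hash-based assignment and toy data are assumptions for illustration:

def term_partition(inverted_index, num_nodes):
    # assign every term (and its whole posting list) to exactly one node
    shards = [dict() for _ in range(num_nodes)]
    for term, postings in inverted_index.items():
        shards[hash(term) % num_nodes][term] = postings
    return shards

inverted_index = {"t1": [(0, 2), (3, 1)], "t2": [(1, 4)], "t3": [(0, 1), (2, 2)], "t4": [(3, 5)]}
shards = term_partition(inverted_index, num_nodes=2)

At query time, the postings for each query term must then be fetched from whichever node holds that term.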