CE306/CE706 Information Retrieval
Evaluation
Dr Alba Garcia Seco de Herrera
Brief Module Outline (Reminder)
Motivation + introduction
Processing pipeline / indexing + query processing
Large-scale open-source search tools
Information Retrieval models
Evaluation
Log analysis
User profiles / personalisation / contextualisation
IR applications, e.g. enterprise search
Information Retrieval Evaluation
Questions that arise:
How good is a particular IR system?
How does it compare to alternative systems?
What is the best system for a particular application?
Information Retrieval Evaluation
Some measures that might help:
Accuracy
User satisfaction
Price
Response time
Search time
???
Information Retrieval Evaluation
Evaluation is key to building effective and efficient search engines
measurement is usually carried out in controlled laboratory experiments
online testing can also be done
Effectiveness, efficiency and cost are related
e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration
efficiency and cost targets may impact effectiveness
Information Retrieval
Given a query and a corpus, find relevant documents
query: the user's textual description of their information need
corpus: a repository of textual documents
relevance: satisfaction of the user's information need
we will focus on effectiveness measures
IR Effectiveness
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
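To make the two formulas concrete, here is a minimal Python sketch (an illustration added to these notes, not part of the original slides) that computes precision and recall from the usual confusion counts tp, fp and fn:

    def precision(tp: int, fp: int) -> float:
        """Proportion of retrieved documents that are relevant."""
        return tp / (tp + fp) if (tp + fp) else 0.0

    def recall(tp: int, fn: int) -> float:
        """Proportion of relevant documents that are retrieved."""
        return tp / (tp + fn) if (tp + fn) else 0.0

    # Example: 30 documents retrieved, 20 of them relevant,
    # and 40 relevant documents that the system missed.
    tp, fp, fn = 20, 10, 40
    print(precision(tp, fp))  # 0.666...
    print(recall(tp, fn))     # 0.333...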
IR Effectiveness (put differently)
A is the set of relevant documents
B is the set of retrieved documents
Precision and Recall
Precision = (no. of relevant docs retrieved) / (no. of docs retrieved)
Recall = (no. of relevant docs retrieved) / (total no. of relevant docs)
Precision indicates what proportion of the documents returned are relevant
Recall indicates what proportion of all relevant documents are retrieved
Both precision and recall vary between zero and one
Should we use accuracy?
Given a query, an engine classifies each document as relevant or non-relevant
The accuracy of an engine is the fraction of these classifications that are correct:
Accuracy = (tp + tn) / (tp + fp + fn + tn)
Accuracy is a commonly used evaluation measure in machine learning classification work
Information Retrieval Evaluation
Evaluation is a fundamental issue of information retrieval
an area of IR research in its own right (much of it due to uncertainty/ambiguity)
Often the only way to show that A is better than B is through extensive experimentation
Evaluation methods:
batch evaluation
user-study evaluation
online evaluation
Each method has advantages and disadvantages
Batch Evaluation Overview
Collect a set of queries (to test average performance)
Construct a more complete description of the information being sought for each query
Batch Evaluation Overview: query + description (example)
QUERY: pet therapy
DESCRIPTION: Relevant documents must include details of how pet- and animal-assisted therapy is or has been used. Relevant details include information about pet therapy programs, descriptions of the circumstances in which pet therapy is used, the benefits of this type of therapy, the degree of success of this therapy, and any laws or regulations governing it…
a TREC example (more on that later)
Batch Evaluation Overview
Using these descriptions, have human judges determine which documents are relevant for each query
Evaluate systems based on their ability to retrieve the relevant documents for these queries
evaluation metric: a measurement that quantifies the quality of a particular ranking of results with known relevant/non-relevant documents
Batch Evaluation Overview: metrics
Which ranking is better? Possible criteria include:
rank of the first relevant document
precision at rank 1
precision at rank 10
precision at rank 30
recall at rank 10
Batch Evaluation Overview: trade-offs
Advantages:
inexpensive (once the test collection is constructed)
the experimental condition is fixed: same queries and same relevance judgements
evaluations are reproducible; this keeps us honest
by experimenting on the same set of queries and judgements, we can better understand how system A is better than B
Disadvantages:
high initial cost: human assessors (the ones who judge documents relevant/non-relevant) are expensive
human assessors are not the users; judgements are made out of context
assumes that relevance is the same, independent of the user and the user's context
Batch Evaluation Overview: trade-offs
Many factors affect whether a document satisfies a particular user's information need:
topicality, novelty, freshness, formatting, reading level, assumed level of expertise (remember?)
Topical relevance: the document is on the same topic as the query
User relevance: everything else
Which kind of relevance does batch evaluation address?
Topical relevance: whether the document contains the sought-after information
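To illustrate the "Should we use accuracy?" question above, here is a small Python sketch (my own illustration, with made-up numbers) showing why accuracy is misleading for IR: on a realistically skewed collection, a system that retrieves nothing still scores near-perfect accuracy while its recall is zero.

    # Hypothetical numbers chosen only for illustration:
    # a collection of 1,000,000 documents, of which 100 are relevant.
    collection_size = 1_000_000
    relevant = set(range(100))    # doc ids 0..99 are relevant
    retrieved = set()             # a "lazy" system that returns nothing

    tp = len(relevant & retrieved)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    tn = collection_size - tp - fp - fn

    accuracy = (tp + tn) / collection_size
    recall = tp / len(relevant)
    print(accuracy)  # 0.9999 -> looks excellent
    print(recall)    # 0.0    -> the system found nothing relevant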
User-Study Evaluation Overview
Provide a small set of users with several retrieval systems
Ask them to complete several (potentially different) search tasks
Learn about system performance by:
observing what they do
asking about their actions and thought processes
measuring success (task completion, time, etc.)
measuring perceived success (questionnaire data)
An Essex-based example will be presented soon…
User-Study Evaluation Overview: trade-offs
Advantages:
very detailed data about users' reactions to systems
in reality, a search is done to accomplish a higher-level task; in user studies, this task can be manipulated and studied
in other words, the experimental starting point need not be the query
Disadvantages:
user studies are expensive (paying users/subjects, scientists' time, data coding)
difficult to generalise from small studies to broad populations
the laboratory setting is not the user's normal environment
need to re-run the experiment every time a new system is considered
Online Evaluation Overview
Given a search service with an existing user population (e.g., Google, Yandex, Bing)…
Have x% of query traffic use system A and y% of query traffic use system B
Compare system effects on logged user interactions (implicit feedback):
clicks: surrogates for perceived relevance (good)
skips: surrogates for perceived non-relevance (bad)
Happens every time you submit a Web search query!
Implicit feedback
click! Can we say that the first result is more relevant than the second?
skip! click! Can we say that the second result is more relevant than the first?
skip! skip! skip! click! Can we say that the fourth result is more relevant than the first/second/third?
a click is a noisy surrogate for relevance!
the user sees the results and closes the browser: the absence of a click is a noisy surrogate for non-relevance
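The traffic-splitting idea behind online evaluation can be sketched in a few lines of Python. This is only a toy illustration under assumed names (assign_bucket, a log of tuples with a clicked flag); real systems use far more careful experiment assignment and interpretation of click signals.

    import hashlib

    def assign_bucket(user_id: str, treatment_share: float = 0.5) -> str:
        """Deterministic traffic split: the same user always sees the same system."""
        h = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
        return "B" if h < treatment_share * 100 else "A"

    # Hypothetical interaction log: (user_id, system, clicked_a_result)
    log = [("u1", "A", True), ("u2", "B", True), ("u3", "A", False),
           ("u4", "B", True), ("u5", "A", False), ("u6", "B", False)]

    def click_rate(system: str) -> float:
        rows = [clicked for _, s, clicked in log if s == system]
        return sum(rows) / len(rows)

    print(assign_bucket("u42"))               # either "A" or "B", stable for this user
    print(click_rate("A"), click_rate("B"))   # e.g. 0.33 vs 0.67 on this toy log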
Online Evaluation Overview: trade-offs
Advantages:
system usage is naturalistic; users are situated in their natural context and often don't know that a test is being conducted
evaluation can include lots of users
Disadvantages:
requires a service with lots of users (enough of them to potentially hurt performance for some)
requires a good understanding of how different implicit feedback signals correlate with positive and negative user experiences
experiments are difficult to repeat
Information Retrieval Evaluation (recap)
Evaluation is a fundamental issue of IR, an area of IR research in its own right
Evaluation methods: batch evaluation, user-study evaluation, online evaluation
Each method has advantages and disadvantages
Evaluation Metrics
Remember: we focus on effectiveness measures
Many of the metrics are ultimately derived from precision and recall
For example, the F-measure: F1 = 2PR / (P + R)
Question 1: What is the trade-off between P and R?
Question 2: Why use the harmonic mean for F1 rather than the arithmetic mean?
Ranked Retrieval: precision and recall
In most situations, the system outputs a ranked list of documents rather than an unordered set
User behaviour assumption: the user examines the output ranking from top to bottom until he/she is satisfied or gives up
Precision/Recall @ rank K
Ranked Retrieval: precision and recall example
Assume 20 relevant documents in the collection
K    P@K             R@K
1    (1/1)  = 1.00   (1/20) = 0.05
2    (1/2)  = 0.50   (1/20) = 0.05
3    (2/3)  = 0.67   (2/20) = 0.10
4    (3/4)  = 0.75   (3/20) = 0.15
5    (4/5)  = 0.80   (4/20) = 0.20
6    (5/6)  = 0.83   (5/20) = 0.25
7    (6/7)  = 0.86   (6/20) = 0.30
8    (6/8)  = 0.75   (6/20) = 0.30
9    (7/9)  = 0.78   (7/20) = 0.35
10   (7/10) = 0.70   (7/20) = 0.35
Ranked Retrieval: P/R @ K
Advantages:
easy to compute
easy to interpret
Disadvantages:
the value of K has a huge impact on the metric
how do we pick K?
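A minimal Python sketch of P@K and R@K (added for illustration), assuming the ranking is encoded as a list of 0/1 relevance labels in rank order. With the top-10 labels from the table above it reproduces, for example, P@8 = 0.75 and R@10 = 0.35.

    def precision_at_k(ranking, k):
        """ranking: list of 0/1 relevance labels in rank order."""
        return sum(ranking[:k]) / k

    def recall_at_k(ranking, k, total_relevant):
        return sum(ranking[:k]) / total_relevant

    # Top-10 relevance labels matching the example above (1 = relevant)
    ranking = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0]
    print(precision_at_k(ranking, 8))      # 0.75
    print(recall_at_k(ranking, 10, 20))    # 0.35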
Ranked Retrieval: motivation for average precision
Ideally, we want the system to achieve high precision for varying values of K
The metric average precision accounts for precision and recall without having to set K
Ranked Retrieval: average precision
1. Go down the ranking one rank at a time
2. If the document at rank K is relevant, measure P@K (the proportion of the top-K documents that are relevant)
3. Finally, take the average of the P@K values; the number of P@K values will equal the number of relevant documents
Ranked Retrieval: average precision, worked examples (10 relevant documents for the query)
Relevant documents retrieved at ranks 1, 3, 4, 5, 6, 7, 9, 11, 14 and 20:
P@K at those ranks = 1.00, 0.67, 0.75, 0.80, 0.83, 0.86, 0.78, 0.73, 0.64, 0.50; average precision = 0.76
All 10 relevant documents at ranks 1-10 (perfect ranking): average precision = 1.00
All 10 relevant documents at ranks 11-20: average precision = 0.33
Ranks 2 and 3 swapped (a mistake near the top corrected): average precision rises from 0.76 to 0.79
Ranks 8 and 9 swapped: average precision rises from 0.76 to 0.77, a smaller gain because the change happens lower in the ranking
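A short Python sketch of the three-step average-precision procedure (an illustration, again assuming 0/1 relevance labels in rank order); on the worked examples above it gives approximately 0.76, 1.00 and 0.33.

    def average_precision(ranking, total_relevant):
        """Average of P@K over the ranks K at which a relevant document appears."""
        hits, precisions = 0, []
        for k, rel in enumerate(ranking, start=1):
            if rel:
                hits += 1
                precisions.append(hits / k)
        # Relevant documents that are never retrieved contribute a precision of 0
        return sum(precisions) / total_relevant

    mixed   = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
    perfect = [1] * 10 + [0] * 10
    worst   = [0] * 10 + [1] * 10
    for r in (mixed, perfect, worst):
        print(round(average_precision(r, 10), 2))   # 0.76, 1.0, 0.33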
Ranked Retrieval: average precision
Advantages:
no need to choose K
accounts for both precision and recall
mistakes at the top are more influential
mistakes at the bottom are still accounted for
Disadvantages:
not quite as easy to interpret as P/R@K
recall can be difficult to assess in large collections
Ranked Retrieval: MAP (mean average precision)
So far, we have talked about average precision for a single query
Mean Average Precision (MAP): average precision averaged across a set of queries
MAP is one of the most common metrics in IR evaluation
Ranked Retrieval: precision-recall curves
In some situations, we want to understand the trade-off between precision and recall
A precision-recall (PR) curve expresses precision as a function of recall
Ranked Retrieval: precision-recall curves, general idea
Different tasks require different levels of recall
Sometimes the user wants a few relevant documents; other times the user wants most of them
Suppose a user wants some level of recall R
The goal for the system is to minimise the number of non-relevant documents the user must look at in order to achieve that level of recall R
Ranked Retrieval: precision-recall curves (figures: example PR curves; which system is better?)
Ranked Retrieval: other effectiveness metrics
In some retrieval tasks, we really want to focus on precision at the top of the ranking
A classic example is Web search:
users rarely care about recall
users rarely navigate beyond the first page of results
Discounted Cumulative Gain (DCG) gives more weight to the top-ranked results
Metric Review
set-retrieval evaluation: we want to evaluate the set of documents retrieved by the system, without considering the ranking
ranked-retrieval evaluation: we want to evaluate the ranking of documents returned by the system
Metric Review: set-retrieval evaluation
precision: the proportion of retrieved documents that are relevant
recall: the proportion of relevant documents that are retrieved
F-measure: harmonic mean of precision and recall; a difficult metric to cheat by getting very high precision and abysmal recall (or vice versa)
Metric Review: ranked-retrieval evaluation
P@K: precision under the assumption that the top-K results are the set retrieved
R@K: recall under the assumption that the top-K results are the set retrieved
average precision: the average of the P@K values at every K where recall increases
DCG: ignores recall, considers multiple levels of relevance, and focuses on the top ranks
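DCG is only named on the slides, so the following is a hedged Python sketch of the standard formulation (graded relevance gain discounted by log2 of the rank); the graded labels here are made up for illustration. Normalising by the DCG of the ideal ordering gives nDCG.

    from math import log2

    def dcg(gains, k=None):
        """Discounted Cumulative Gain: gains are graded relevance labels in rank order."""
        gains = gains[:k] if k else gains
        return sum(g / log2(i + 1) for i, g in enumerate(gains, start=1))

    def ndcg(gains, k=None):
        """DCG normalised by the DCG of the ideal (best possible) ordering."""
        ideal = sorted(gains, reverse=True)
        return dcg(gains, k) / (dcg(ideal, k) or 1.0)

    # Made-up graded judgements: 2 = highly relevant, 1 = partially relevant, 0 = not relevant
    run = [2, 0, 1, 2, 0, 1]
    print(round(dcg(run, 5), 3))
    print(round(ndcg(run, 5), 3))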
Evaluation Campaigns
Several large-scale (annual) evaluation campaigns exist that allow you to enter your own system and compare against others
Metrics tend to be the ones we discussed (effectiveness)
Examples:
Text REtrieval Conference (TREC), US-based
Conference and Labs of the Evaluation Forum (CLEF), European
MediaEval Benchmarking Initiative for Multimedia Evaluation, European
TREC
Annual conference to evaluate IR systems (since 1992)
Forum for IR researchers
TREC data collections are large (e.g. ClueWeb09, ClueWeb12: about 1 billion Web pages)
A number of different tracks, examples at TREC 2021:
Conversational Assistance Track
Deep Learning Track
Health Misinformation Track
Podcasts Track
CLEF
Annual conference to evaluate cross-language IR systems (since 2000)
Forum for IR researchers
A number of different tracks, examples at CLEF 2021:
Answer Retrieval for Questions on Math
Living Labs for Academic Search
ImageCLEF: Multimedia Retrieval in Medicine, Lifelogging, and Internet
Touche: Argument Retrieval
LifeCLEF: Multimedia Retrieval in Nature
MediaEval
Annual conference to evaluate multimedia IR systems (since 2000)
The "multi" in multimedia: speech, audio, visual content, tags, users, context
A number of different tracks, examples at MediaEval 2020:
FakeNews: Coronavirus and 5G conspiracy
Predicting Media Memorability
Insight for Wellbeing: Multimodal personal health lifelog data analysis
Emotion and Theme Recognition in Music using Jamendo
Caption task @ ImageCLEF 2021
Goal: image understanding by aligning visual content and textual descriptors to interpret medical images
Example concepts: C0006141 Breast, C0024671 Mammography, C0006826 Malignant Neoplasms, C0772294 Epinastine Hydrochloride, C0242114 Suspicion, C4282132 Malignancy, C0221198 Lesion
Mailing list: groups.google.com/d/forum/imageclefcaption
Coral task @ ImageCLEF 2021
Goal: Coral Reef Image Annotation and Localisation
Mailing list: https://groups.google.com/d/forum/imageclefcoral
Predicting Media Memorability @ MediaEval 2020
Goal: predicting how memorable a video is to viewers
https://multimediaeval.github.io/editions/2020/tasks/memorability/
Evaluation Campaigns
Systematic and quantitative evaluation
Shared tasks on shared resources
Uniform scoring procedures, i.e. the same conditions for each participant:
a collection of documents
example information requests
relevant documents for each request
Evaluation Campaigns: tools we need
Uniform scoring procedures, i.e. the same conditions for each participant:
a collection of documents (the dataset)
example information requests (a set of questions/queries/topics)
relevant documents for each request (a decision: relevant or not relevant)
Evaluation Campaigns: cycle
Task definition → topic definition → data preparation → experiments submission → results evaluation → results analysis → scientific production
Evaluation Campaigns: relevance assessment
Who says which documents are relevant and which are not?
Ideally: sit down and look at all documents
Pooled assessment:
each system finds some relevant documents
different systems find different relevant documents
together, enough systems will find most of them
Assumption: if there are enough systems being tested and not one of them returned a document, the document is not relevant
Evaluation Campaigns: relevance assessment by pooling
Combine the results retrieved by all systems
Choose a parameter k (typically 100)
Take the top k documents as ranked in each submitted run
The pool is the union of these sets of documents
Give the pool to judges for relevance assessments
Evaluation Campaigns: pooling trade-offs
Advantages:
fewer documents must be manually assessed for relevance
Disadvantages:
we cannot be certain that all documents satisfying the query are found (recall values may not be accurate)
runs that did not participate in the pooling may be disadvantaged
if only one run finds certain relevant documents, but ranked lower than 100, it will not get credit for these
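The pooling procedure described above is easy to sketch in Python: take the union of the top-k documents of each submitted run and send that pool for judging. The run format here (a dict of ranked doc-id lists) is just an assumption for the illustration.

    def build_pool(runs, k=100):
        """runs: {run_name: [doc_id, doc_id, ...]} with documents already in rank order."""
        pool = set()
        for ranked_docs in runs.values():
            pool.update(ranked_docs[:k])   # only the top-k of each run is judged
        return pool

    # Toy example with a pool depth of 2
    runs = {
        "systemA": ["d1", "d7", "d3"],
        "systemB": ["d7", "d2", "d9"],
    }
    print(sorted(build_pool(runs, k=2)))   # ['d1', 'd2', 'd7']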
Evaluation Campaigns: relevance assessment
Not only are our judgements incomplete, we might also be inconsistent in our judgements!
Good news: the idiosyncratic nature of relevance judgements does not affect comparative results (E. Voorhees)
Similar results hold for:
different query sets
different evaluation measures
different assessor types
single-opinion vs. group-opinion judgements
… wait for the 2nd assignment
Summary
We have seen how to evaluate IR systems
We have seen evaluation methods (batch, user-study, online) and metrics (precision, recall, …)
We have explored the cycle of evaluation campaigns
Relevance assessments are needed for the consistent evaluation of IR systems
Reading
Chapter 8
Fuhr, N. "Some Common Mistakes in IR Evaluation, and How They Can Be Avoided." SIGIR Forum 51(3), December 2017.
Next week: Chapter 3, Sections 4.4 and 4.5, Section 7.5
Henrik Nordmark (Profusion) at 4pm: information on paid placement opportunities for MSc Data Science
Acknowledgements
These slides contain substantial material prepared by Udo Kruschwitz and Fernando Diaz
Additional material as provided by Croft et al. (2015)