CE306/CE706 Information Retrieval
Evaluation
Dr Alba García Seco de Herrera

Brief Module Outline (Reminder)
§ Motivation + introduction
§ Processing pipeline / indexing + query processing
§ Large-scale open-source search tools
§ Information Retrieval models
§ Evaluation
§ Log analysis
§ User profiles / personalisation / contextualisation
§ IR applications, e.g. enterprise search

Information Retrieval Evaluation
§ Questions that arise:
§ How good is a particular IR system?
§ How does it compare to alternative systems?
§ What is the best system for a particular application?

Information Retrieval Evaluation
§ Some measures that might help:
§ Accuracy
§ User satisfaction
§ Price
§ Response time
§ Search time
§ ???

Information Retrieval Evaluation
§ Evaluation is key to building effective and efficient search engines
§ measurement usually carried out in controlled laboratory experiments
§ online testing can also be done
§ Effectiveness, efficiency and cost are related
§ e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration
§ efficiency and cost targets may impact effectiveness

Information Retrieval
§ Given a query and a corpus, find relevant documents
§ query: user’s textual description of their information need
§ corpus: a repository of textual documents
§ relevance: satisfaction of the user’s information need
… we will focus on effectiveness measures

IR Effectiveness
§ Precision = tp/(tp+fp)
§ Recall = tp/(tp+fn)

IR Effectiveness (put differently)
§ A is a set of relevant documents
§ B is a set of retrieved documents
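To make the set-based view concrete, here is a minimal Python sketch of precision, recall and accuracy computed from a set of relevant documents (A) and a set of retrieved documents (B). The document IDs and the collection size of 1,000 are invented purely for illustration; they are not taken from the slides.

```python
# Set-based precision, recall and accuracy; document IDs and collection size are invented.
relevant  = {"d3", "d7", "d12", "d25"}          # A: the set of relevant documents
retrieved = {"d3", "d7", "d40", "d41", "d42"}   # B: the set of retrieved documents
collection_size = 1000                          # total number of documents in the corpus

tp = len(relevant & retrieved)                  # relevant and retrieved
fp = len(retrieved - relevant)                  # retrieved but not relevant
fn = len(relevant - retrieved)                  # relevant but not retrieved
tn = collection_size - tp - fp - fn             # neither relevant nor retrieved

precision = tp / (tp + fp)                      # 2 / 5 = 0.40
recall    = tp / (tp + fn)                      # 2 / 4 = 0.50
accuracy  = (tp + tn) / collection_size         # 995 / 1000 = 0.995

print(precision, recall, accuracy)
# Accuracy looks excellent even though half of the relevant documents were missed:
# with so few relevant documents, the true negatives dominate, which is why
# precision and recall (not accuracy) are the standard IR measures.
```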

Precision and Recall
§ Precision = (number of relevant documents retrieved) / (number of documents retrieved)
§ Recall = (number of relevant documents retrieved) / (total number of relevant documents)
§ Precision indicates what proportion of the documents returned are relevant
§ Recall indicates what proportion of all relevant documents are recalled
§ Both precision and recall vary between zero and one

Should we use accuracy? (Sec. 8.3)
§ Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant”
§ The accuracy of an engine: the fraction of these classifications that are correct
§ (tp + tn) / (tp + fp + fn + tn)
§ Accuracy is a commonly used evaluation measure in machine learning classification work

Information Retrieval Evaluation
§ Evaluation is a fundamental issue of information retrieval
§ an area of IR research in its own right (much of it due to uncertainty/ambiguity)
§ Often the only way to show that A is better than B is through extensive experimentation
§ Evaluation methods:
§ batch evaluation
§ user-study evaluation
§ online evaluation
§ Each method has advantages and disadvantages

Batch Evaluation Overview
§ Collect a set of queries (to test average performance)
§ Construct a more complete description of the information being sought for each query

Batch Evaluation Overview: query + description (example)
§ QUERY: pet therapy
§ DESCRIPTION: Relevant documents must include details of how pet- and animal-assisted therapy is or has been used. Relevant details include information about pet therapy programs, descriptions of the circumstances in which pet therapy is used, the benefits of this type of therapy, the degree of success of this therapy, and any laws or regulations governing it.
… a TREC example (more on that later)

Batch Evaluation Overview
§ Using these descriptions, have human judges determine which documents are relevant for each query
§ Evaluate systems based on their ability to retrieve the relevant documents for these queries
§ evaluation metric: a measurement that quantifies the quality of a particular ranking of results with known relevant/non-relevant documents

Batch Evaluation Overview: metrics
§ Which ranking is better? Possible criteria:
§ rank of the first relevant document
§ precision at rank 10
§ precision at rank 1
§ recall at rank 10
§ precision at rank 30
… a small code sketch comparing two example rankings on such criteria follows
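As a toy illustration of how the criteria above can disagree, here is a small Python sketch that scores two invented rankings (written as 0/1 relevance labels) with the rank of the first relevant document and precision at rank K. Both rankings and the choice of K = 5 are hypothetical, not taken from the slides.

```python
# Two invented rankings for the same query, written as 0/1 relevance labels (rank 1 first).
run_a = [0, 1, 1, 0, 1, 0, 0, 0, 0, 0]
run_b = [1, 0, 0, 0, 0, 0, 0, 1, 1, 1]

def first_relevant_rank(run):
    """Rank of the first relevant document (None if there is none)."""
    for k, rel in enumerate(run, start=1):
        if rel:
            return k
    return None

def precision_at(run, k):
    """Proportion of the top-k results that are relevant."""
    return sum(run[:k]) / k

print(first_relevant_rank(run_a), first_relevant_rank(run_b))   # 2 vs 1: run_b wins
print(precision_at(run_a, 5), precision_at(run_b, 5))           # 0.6 vs 0.2: run_a wins
# The two criteria disagree, which is why the choice of metric matters.
```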
Batch Evaluation Overview: trade-offs
§ Advantages:
§ inexpensive (once the test collection is constructed)
§ the experimental condition is fixed: same queries and same relevance judgements
§ evaluations are reproducible; keeps us “honest”
§ by experimenting on the same set of queries and judgements, we can better understand how system A is better than B

Batch Evaluation Overview: trade-offs
§ Disadvantages:
§ high initial cost: human assessors (the ones who judge documents relevant/non-relevant) are expensive
§ human assessors are not the users; judgements are made “out of context”
§ assumes that relevance is the same, independent of the user and the user’s context

Batch Evaluation Overview: trade-offs
§ Many factors affect whether a document satisfies a particular user’s information need
§ topicality, novelty, freshness, formatting, reading level, assumed level of expertise (remember?)
§ Topical relevance: the document is on the same topic as the query
§ User relevance: everything else
§ Which kind of relevance does batch evaluation address?
§ whether the document contains the sought-after information

User-Study Evaluation Overview
§ Provide a small set of users with several retrieval systems
§ Ask them to complete several (potentially different) search tasks
§ Learn about system performance by:
§ observing what they do
§ asking about their actions and thought processes
§ measuring success (task completion, time, etc.)
§ measuring perceived success (questionnaire data)
§ Will present you an Essex-based example soon …

User-Study Evaluation Overview: trade-offs
§ Advantages:
§ very detailed data about users’ reaction to systems
§ in reality, a search is done to accomplish a higher-level task
§ in user studies, this task can be manipulated and studied
§ in other words, the experimental ‘starting point’ need not be the query

User-Study Evaluation Overview: trade-offs
§ Disadvantages:
§ user studies are expensive (pay users/subjects, scientist’s time, data coding)
§ difficult to generalize from small studies to broad populations
§ the laboratory setting is not the user’s normal environment
§ need to re-run the experiment every time a new system is considered

Online Evaluation Overview
§ Given a search service with an existing user population (e.g., Google, Yandex, Bing) …
§ Have x% of query traffic use system A and y% of query traffic use system B
§ Compare system effects on logged user interactions (implicit feedback)
§ clicks: surrogates for perceived relevance (good)
§ skips: surrogates for perceived non-relevance (bad)
§ Happens every time you submit a Web search query!

Implicit feedback
§ click! Can we say that the first result is more relevant than the second?
§ skip! click! Can we say that the second result is more relevant than the first?
§ skip! skip! skip! click! Can we say that the fourth result is more relevant than the first/second/third?
§ a click is a noisy surrogate for relevance!
§ the user sees the results and closes the browser: the absence of a click is a noisy surrogate for non-relevance
… a small sketch of aggregating such signals in an A/B comparison follows
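The sketch below illustrates, under heavy simplification, how an online A/B comparison might aggregate implicit feedback: each logged query records which system served it and whether a result was clicked. The log format and the numbers are invented for illustration; real click analysis has to be far more careful about position bias and other noise.

```python
from collections import defaultdict

# Hypothetical interaction log for an A/B test: (system, rank_clicked or None).
# The entries are invented for illustration only.
log = [
    ("A", 1), ("A", None), ("A", 2), ("A", 1),
    ("B", None), ("B", 4), ("B", None), ("B", 1),
]

queries = defaultdict(int)
clicks = defaultdict(int)
for system, rank_clicked in log:
    queries[system] += 1
    if rank_clicked is not None:    # a click: a noisy surrogate for perceived relevance
        clicks[system] += 1         # no click at all: a noisy surrogate for non-relevance

for system in sorted(queries):
    print(f"system {system}: click-through rate {clicks[system] / queries[system]:.2f}")
# system A: 0.75, system B: 0.50 -- a noisy hint that A is serving better results
```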
Online Evaluation Overview: trade-offs
§ Advantages:
§ system usage is naturalistic; users are situated in their natural context and often don’t know that a test is being conducted
§ evaluation can include lots of users

Online Evaluation Overview: trade-offs
§ Disadvantages:
§ requires a service with lots of users (enough of them to potentially hurt performance for some)
§ requires a good understanding of how different implicit feedback signals correlate with positive and negative user experiences
§ experiments are difficult to repeat

Information Retrieval Evaluation
§ Evaluation is a fundamental issue of IR
§ an area of IR research in its own right
§ Evaluation methods:
§ batch evaluation
§ user-study evaluation
§ online evaluation
§ Each method has advantages and disadvantages

Evaluation Metrics
§ Remember: we focus on effectiveness measures
§ Many of the metrics are ultimately derived from precision and recall
§ For example, the F-measure, the harmonic mean of precision and recall: F1 = 2PR / (P + R)
§ Question 1: What is the trade-off between P & R?
§ Question 2: Why use the harmonic mean for F1 rather than the arithmetic mean?

Ranked Retrieval: precision and recall
§ In most situations, the system outputs a ranked list of documents rather than an unordered set
§ User behaviour assumption: the user examines the output ranking from top to bottom until he/she is satisfied or gives up
§ Precision/Recall @ rank K

Ranked Retrieval: precision and recall (example)
§ Assume 20 relevant documents

K    P@K             R@K
1    (1/1)  = 1.00   (1/20) = 0.05
2    (1/2)  = 0.50   (1/20) = 0.05
3    (2/3)  = 0.67   (2/20) = 0.10
4    (3/4)  = 0.75   (3/20) = 0.15
5    (4/5)  = 0.80   (4/20) = 0.20
6    (5/6)  = 0.83   (5/20) = 0.25
7    (6/7)  = 0.86   (6/20) = 0.30
8    (6/8)  = 0.75   (6/20) = 0.30
9    (7/9)  = 0.78   (7/20) = 0.35
10   (7/10) = 0.70   (7/20) = 0.35

Ranked Retrieval: P/R @ K
§ Advantages:
§ easy to compute
§ easy to interpret
§ Disadvantages:
§ the value of K has a huge impact on the metric
§ how do we pick K?

Ranked Retrieval: motivation for average precision
§ Ideally, we want the system to achieve high precision for varying values of K
§ The metric average precision accounts for precision and recall without having to set K

Ranked Retrieval: average precision
1. Go down the ranking one rank at a time
2. If the document at rank K is relevant, measure P@K
§ the proportion of the top-K documents that are relevant
3. Finally, take the average of the P@K values
§ the number of P@K values will equal the number of relevant documents
… a short code sketch of this procedure and several worked examples follow
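A minimal Python sketch of the three-step procedure above, assuming binary relevance labels supplied in rank order; the function name average_precision is my own choice, not from the slides.

```python
def average_precision(ranking, num_relevant):
    """Average precision for a single query.

    ranking: 0/1 relevance labels in rank order (rank 1 first).
    num_relevant: total number of relevant documents for the query.
    """
    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(ranking, start=1):
        if rel:                          # recall increases at this rank
            hits += 1
            precision_sum += hits / k    # P@K at this rank
    return precision_sum / num_relevant if num_relevant else 0.0
```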
Ranked Retrieval: average precision (worked examples)
§ Number of relevant documents for this query: 10; each ranking is 20 documents long
§ At each rank where a relevant document appears (i.e., where recall goes up), record P@K; when recall reaches 1.0, average the recorded P@K values
§ Baseline ranking: relevant documents at ranks 1, 3, 4, 5, 6, 7, 9, 11, 14 and 20
§ recorded P@K values: 1.00, 0.67, 0.75, 0.80, 0.83, 0.86, 0.78, 0.73, 0.64, 0.50; average precision = 0.76
§ Best case: all 10 relevant documents at ranks 1 to 10; every recorded P@K is 1.00; average precision = 1.00
§ Worst case: all 10 relevant documents at ranks 11 to 20; recorded P@K values: 0.09, 0.17, 0.23, 0.29, 0.33, 0.38, 0.41, 0.44, 0.47, 0.50; average precision = 0.33
§ Ranks 2 and 3 of the baseline swapped (a mistake near the top corrected): average precision rises from 0.76 to 0.79
§ Ranks 8 and 9 of the baseline swapped (a mistake lower down corrected): average precision rises from 0.76 to only 0.77
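As a usage example, the snippet below feeds the five worked rankings above into the average_precision sketch from the previous block and reproduces the figures 0.76, 1.00, 0.33, 0.79 and 0.77 (to two decimal places). It assumes that function is already in scope.

```python
# Uses the average_precision() sketch defined above.
baseline = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]  # relevant at 1,3,4,5,6,7,9,11,14,20
best     = [1] * 10 + [0] * 10                                            # all relevant in the top 10
worst    = [0] * 10 + [1] * 10                                            # all relevant in the bottom 10
swap_2_3 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]  # baseline with ranks 2 and 3 swapped
swap_8_9 = [1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]  # baseline with ranks 8 and 9 swapped

for name, run in [("baseline", baseline), ("best case", best), ("worst case", worst),
                  ("swap ranks 2/3", swap_2_3), ("swap ranks 8/9", swap_8_9)]:
    print(f"{name}: {average_precision(run, 10):.2f}")
# baseline: 0.76, best case: 1.00, worst case: 0.33, swap ranks 2/3: 0.79, swap ranks 8/9: 0.77
# The swap near the top of the ranking moves the score far more than the swap at ranks 8 and 9.
```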
Ranked Retrieval: average precision
§ Advantages:
§ no need to choose K
§ accounts for both precision and recall
§ mistakes at the top are more influential
§ mistakes at the bottom are still accounted for
§ Disadvantages:
§ not quite as easy to interpret as P/R@K
§ recall can be difficult to assess in large collections

Ranked Retrieval: MAP (mean average precision)
§ So far, we’ve talked about average precision for a single query
§ Mean Average Precision (MAP): average precision averaged across a set of queries
§ MAP is one of the most common metrics in IR evaluation

Ranked Retrieval: precision-recall curves
§ In some situations, we want to understand the trade-off between precision and recall
§ A precision-recall (PR) curve expresses precision as a function of recall

Ranked Retrieval: precision-recall curves (general idea)
§ Different tasks require different levels of recall
§ Sometimes, the user wants a few relevant documents
§ Other times, the user wants most of them
§ Suppose a user wants some level of recall R
§ The goal for the system is to minimize the number of non-relevant documents the user must look at in order to achieve recall R
§ Which system is better? Compare their PR curves

Ranked Retrieval: other effectiveness metrics
§ In some retrieval tasks, we really want to focus on precision at the top of the ranking
§ A classic example is Web search
§ users rarely care about recall
§ users rarely navigate beyond the first page of results
§ Discounted Cumulative Gain (DCG) gives more weight to the top-ranked results

Metric Review
§ set-retrieval evaluation: we want to evaluate the set of documents retrieved by the system, without considering the ranking
§ ranked-retrieval evaluation: we want to evaluate the ranking of documents returned by the system

Metric Review: set-retrieval evaluation
§ precision: the proportion of retrieved documents that are relevant
§ recall: the proportion of relevant documents that are retrieved
§ F-measure: harmonic mean of precision and recall
§ a difficult metric to “cheat” by getting very high precision and abysmal recall (or vice versa)

Metric Review: ranked-retrieval evaluation
§ P@K: precision under the assumption that the top-K results are the ‘set’ retrieved
§ R@K: recall under the assumption that the top-K results are the ‘set’ retrieved
§ average precision: the average of the P@K values for every K where recall increases
§ DCG: ignores recall, considers multiple levels of relevance, and focuses on the top ranks
… a small code sketch of MAP and DCG follows
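To round off the metric review, here is a small sketch of MAP (average precision averaged over queries) and one common formulation of DCG@K, in which graded relevance gains are discounted by log2(rank + 1). The example AP values and gain vectors are illustrative only.

```python
import math

def mean_average_precision(per_query_ap):
    """MAP: average precision averaged across a set of queries."""
    return sum(per_query_ap) / len(per_query_ap)

def dcg_at_k(gains, k):
    """A common form of DCG@k: graded gains discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

# Three hypothetical queries whose AP values happen to match the worked examples above.
print(mean_average_precision([0.76, 1.00, 0.33]))   # about 0.70

# Graded relevance labels (0 = non-relevant ... 3 = highly relevant) for a 6-result page.
print(dcg_at_k([3, 2, 3, 0, 1, 2], k=6))            # good results near the top
print(dcg_at_k([0, 1, 2, 3, 2, 3], k=6))            # same labels pushed down: lower DCG
```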
Evaluation Campaigns
§ Several large-scale (annual) evaluation campaigns exist that allow you to enter your own system and compare against others
§ Metrics tend to be the ones we discussed (effectiveness)
§ Examples:
§ Text REtrieval Conference (TREC) … US-based
§ Conference and Labs of the Evaluation Forum (CLEF) … European
§ MediaEval Benchmarking Initiative for Multimedia Evaluation … European

TREC
§ Annual conference to evaluate IR systems (since 1992)
§ Forum for IR researchers
§ TREC data collections are large (e.g. ClueWeb09, ClueWeb12: about 1 billion Web pages)
§ Number of different tracks, examples at TREC 2021:
§ Conversational Assistance Track
§ Deep Learning Track
§ Health Misinformation Track
§ Podcasts Track

CLEF
§ Annual conference to evaluate cross-language IR systems (since 2000)
§ Forum for IR researchers
§ Number of different tracks, examples at CLEF 2021:
§ Answer Retrieval for Questions on Math
§ Living Labs for Academic Search
§ ImageCLEF: Multimedia Retrieval in Medicine, Lifelogging, and Internet
§ Touché: Argument Retrieval
§ LifeCLEF: Multimedia Retrieval in Nature

MediaEval
§ Annual conference to evaluate multimedia IR systems (since 2000)
§ The “multi” in multimedia: speech, audio, visual content, tags, users, context
§ Number of different tracks, examples at MediaEval 2020:
§ FakeNews: Corona virus and 5G conspiracy
§ Predicting Media Memorability
§ Insight for Wellbeing: Multimodal personal health lifelog data analysis
§ Emotion and Theme recognition in music using Jamendo

Caption task @ ImageCLEF 2021
§ Goal: image understanding by aligning visual content and textual descriptors to interpret medical images
§ Example concept annotations: C0006141: Breast; C0024671: Mammography; C0006826: Malignant Neoplasms; C0772294: Epinastine Hydrochloride; C0242114: Suspicion; C4282132: Malignancy; C0221198: Lesion
§ Mailing list: groups.google.com/d/forum/imageclefcaption

Coral task @ ImageCLEF 2021
§ Goal: Coral Reef Image Annotation and Localisation
§ Mailing list: https://groups.google.com/d/forum/imageclefcoral

Predicting media memorability @ MediaEval 2020
§ Goal: predicting how memorable a video is to viewers
§ Textual caption
§ https://multimediaeval.github.io/editions/2020/tasks/memorability/

Evaluation Campaigns
§ Systematic and quantitative evaluation
§ Shared tasks on shared resources
§ Uniform scoring procedures, i.e. same conditions for each participant:
§ Collection of “documents”
§ Example information requests
§ Relevant “documents” for each request

Evaluation Campaign: tools we need
§ Uniform scoring procedures, i.e. same conditions for each participant:
§ Collection of documents (the “dataset”)
§ Example information requests
§ a set of questions/queries/topics
§ Relevant documents for each request
§ a decision: relevant or not relevant

Evaluation Campaigns: cycle
§ Stages in the cycle: task definition, data preparation, topic definition, experiments submission, results evaluation, results analysis, scientific production

Evaluation Campaign: relevance assessment
§ Who says which documents are relevant and which not?
§ Ideally: sit down and look at all documents
§ Pooled assessment:
§ each system finds some relevant documents
§ different systems find different relevant documents
§ together, enough systems will find most of them
§ Assumption: if there are enough systems being tested and not one of them returned a document, the document is not relevant

Evaluation Campaign: relevance assessment (pooling)
§ Combine the results retrieved by all systems
§ Choose a parameter k (typically 100)
§ Choose the top k documents as ranked in each submitted run
§ The pool is the union of these sets of documents
§ Give the pool to judges for relevance assessments
… a small code sketch of pool construction follows

Evaluation Campaign: relevance assessment (pooling)
§ Advantages of pooling:
§ fewer documents must be manually assessed for relevance
§ Disadvantages of pooling:
§ can’t be certain that all documents satisfying the query are found (recall values may not be accurate)
§ runs that did not participate in the pooling may be disadvantaged
§ if only one run finds certain relevant documents, but ranks them lower than 100, it will not get credit for them
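A minimal sketch of pool construction as described above: take the top k documents from every submitted run and hand the union to the assessors. The run contents and the tiny k = 3 are invented for illustration; TREC-style pooling typically uses k around 100.

```python
# Pool construction for a single topic; run contents are invented for illustration.
runs = {
    "run_1": ["d12", "d7", "d3", "d44", "d9"],
    "run_2": ["d7", "d88", "d12", "d5", "d61"],
    "run_3": ["d3", "d12", "d90", "d7", "d2"],
}

def build_pool(runs, k=100):
    """Union of the top-k documents from every submitted run."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:k])   # take the top k of this run
    return pool

pool = build_pool(runs, k=3)           # k = 3 only so the toy example stays small
print(sorted(pool))
# Only the pooled documents go to the assessors; anything outside the pool
# is treated as non-relevant, which is exactly the limitation noted above.
```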
Evaluation Campaign: relevance assessment
§ Not only are we incomplete, but we might also be inconsistent in our judgements!

Evaluation Campaign: relevance assessment
§ Good news:
§ the “idiosyncratic nature of relevance judgments does not affect comparative results” (E. Voorhees)
§ Similar results held for:
§ different query sets
§ different evaluation measures
§ different assessor types
§ single-opinion vs. group-opinion judgements
… wait for the 2nd assignment

Summary
§ We have seen how to evaluate IR systems
§ We have seen evaluation methods (batch, user-study, online) and metrics (precision, recall, …)
§ We have explored the cycle of the evaluation campaigns
§ Relevance assessments are needed for the consistent evaluation of IR systems

Reading
§ Chapter 8
§ Fuhr, N. “Some Common Mistakes In IR Evaluation, And How They Can Be Avoided”. SIGIR Forum 51(3), December 2017.
§ Next week:
§ Chapter 3
§ Sections 4.4 and 4.5
§ Section 7.5

Henrik Nordmark
§ 4pm!
§ Profusion
§ MSc Data Science
§ Information on paid placement opportunities

Acknowledgements
§ These slides contain substantial material prepared by Udo Kruschwitz and Fernando Diaz; thanks again
§ Additional material as provided by Croft et al. (2015)
