
Assessed Coursework

Course Name: Information Retrieval H/M
Coursework Number: 1
Deadline: 04:30 pm, 15th February 2019
% Contribution to final course mark: 8%
Solo or Group: Solo
Anticipated Hours: 10 hours
Submission Instructions: See details in assignment.

Please Note: This Coursework cannot be Re-Done
Code of Assessment Rules for Coursework Submission
Deadlines for the submission of coursework which is to be formally assessed will be published in course documentation, and work which is submitted later than the deadline will be subject to penalty as set out below.
The primary grade and secondary band awarded for coursework which is submitted after the published deadline will be calculated as follows:
(i) in respect of work submitted not more than five working days after the deadline
a. the work will be assessed in the usual way;
b. the primary grade and secondary band so determined will then be reduced by two secondary bands for each working day (or part of a working day) the work was submitted late.
(ii) work submitted more than five working days after the deadline will be awarded Grade H.
Penalties for late submission of coursework will not be imposed if good cause is established for the late submission. You should submit documents supporting good cause via MyCampus.
Penalty for non-adherence to Submission Instructions is 2 bands
You must complete an Own Work form via https://studentltc.dcs.gla.ac.uk/ for all coursework

Information Retrieval H/M
Exercise 1, January 2019

Introduction
The general objective of this exercise is to deploy an IR system and evaluate it on a medium-sized Web dataset. Students will use the Terrier Information Retrieval platform (http://terrier.org) to conduct their experiments. Terrier is a modular information retrieval platform, written in Java, which allows experimentation with various test collections and retrieval techniques. This first exercise involves installing and configuring Terrier to evaluate a number of retrieval models and approaches. You will familiarise yourself with Terrier by deploying various retrieval approaches and evaluating their impact on retrieval performance, as well as learning how to conduct an experiment in IR and how to analyse its results.
You will need to download the latest version of Terrier from http://terrier.org. We will provide a sample of Web documents (a signed user agreement to access and use this sample will be required), on which you will conduct your experiments. We will give a tutorial on Terrier in the class, and you could also use the Terrier Public Forum to ask questions.
Your work will be submitted through the Exercise 1 Quiz Instance available on Moodle. The Quiz asks you various questions, which you should answer based on the experiments you have conducted.
Collection:
You will use a sample of a TREC Web test collection, of approx. 800k documents, with corresponding topics & relevance assessments. Note that you will need to sign an agreement to access this collection (See Moodle). The agreement needs to be signed by 16th January 2019, so that we can open the directory the following day. You can find the document corpus and other resources in the Unix directory /users/level4/software/IR/. This directory contains:
Dotgov_50pc/  (approx. 2.8 GB) the collection to index.
TopicsQrels/  topics & qrels for three topic sets from TREC 2004: homepage, named-page, and topic-distillation.
Resources/  features & indices provided by us for Exercise 2; not used for Exercise 1.
Exercise Specification
There is little programming in this exercise but there are numerous experiments that need to be conducted. In particular, you will conduct three tasks:
1. Index the provided Web collection using Terrier's default indexing setup.
2. Implement a Simple TF*IDF weighting model (as described in Lecture 3), which you will have to add to Terrier.
3. Evaluate and analyse the resulting system by performing the following experiments:
- Vary the weighting model: Simple TF*IDF vs. Terrier's implemented TF*IDF vs. BM25 vs. PL2.
- Apply a Query Expansion mechanism: use of a Query Expansion mechanism vs. non-use of Query Expansion.
These are too many experimental parameters to address all at once, hence you must follow the prescribed activities given below. Once you conduct an activity, you should answer the
corresponding questions on the Exercise 1 Quiz instance. Ensure that you click the Next Page button to save your answers on the Quiz instance.
Q1. Start by using Terrier's default indexing setup: Porter stemming & stopwords removed. You will need to index the collection, following the instructions in Terrier's documentation. In addition, we would like you to configure Terrier with the following additional property during indexing:
indexer.meta.reverse.keys=docno
Once you have indexed the collection, answer the Quiz questions asking you to enter the main indexing statistics you obtained (number of tokens, size of files, time to index, etc.).
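As a rough illustration, the default indexing setup and the extra property above would normally sit together in Terrier's etc/terrier.properties file along the following lines. The termpipelines line restates Terrier's default pipeline (stopword removal followed by Porter stemming); treat any property name other than the one given above as an assumption to check against the documentation of the Terrier version you downloaded.

# etc/terrier.properties -- illustrative indexing configuration (sketch)
# Default term pipeline: stopword removal, then Porter stemming
termpipelines=Stopwords,PorterStemmer
# Property required by the exercise: keep a reverse docno lookup
indexer.meta.reverse.keys=docno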
[1 mark]
Q2. Implement and add a new Simple TF*IDF class in Terrier, following the instructions in Terrier's documentation. The Simple TF*IDF weighting model you are required to implement is highlighted in Lecture 3. Use the template class provided in the IRcourseHM project, available from the Github repo (https://github.com/cmacdonald/IRcourseHM).
Paste your Simple TF*IDF Java method code when prompted by the Quiz instance. Then, answer the corresponding questions by inspecting the retrieved results for the mentioned weighting models.
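For orientation, a minimal sketch of what such a class might look like is given below. It assumes the two-argument score(tf, docLength) method and the documentFrequency, numberOfDocuments and keyFrequency fields exposed by Terrier's WeightingModel base class in recent versions; the exact formula, log base and method signatures must be taken from Lecture 3 and from the IRcourseHM template class, not from this sketch.

package org.terrier.matching.models;

// Minimal sketch of a Simple TF*IDF weighting model for Terrier.
// Field and method names assume Terrier's WeightingModel base class;
// verify them against the IRcourseHM template before use.
public class SimpleTFIDF extends WeightingModel {

    private static final long serialVersionUID = 1L;

    public SimpleTFIDF() {
        super();
    }

    @Override
    public String getInfo() {
        return "SimpleTFIDF";
    }

    // Scores a single (term, document) pair.
    // tf        = frequency of the term in the document
    // docLength = length of the document in tokens (unused in this sketch)
    @Override
    public double score(double tf, double docLength) {
        // Illustrative TF*IDF: term frequency multiplied by an inverse
        // document frequency component. Substitute the exact Lecture 3
        // formulation (including the base of the logarithm) here.
        double idf = Math.log(numberOfDocuments / documentFrequency);
        return keyFrequency * tf * idf;
    }
}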
[5 marks]
Q3. Now you will experiment with all four weighting models (Simple TF*IDF, Terrier TF*IDF, BM25 and PL2) and analyse their results on 3 different topic sets, representing different Web retrieval tasks: homepage finding (HP04), named page finding (NP04), and topic distillation (TD04). A description of these topic sets and the underlying search tasks is provided on Moodle.
Provide the required MAP performances of each of the weighting models over the 3 topic sets. Report your MAP performances to 4 decimal places. Also, provide the average MAP performance of each weighting model across the three topic sets, when prompted by the Quiz instance.
[16 marks]
Next, for each topic set (HP04, NP04, TD04), draw a single Recall-Precision graph showing the performances for each of the 4 alternative weighting models (three Recall-Precision graphs in total). Upload the resulting graphs into the Moodle instance when prompted. Then, answer the corresponding question(s) on the Quiz instance.
[5 marks]
Finally, you should now indicate on the quiz the most effective weighting model (in terms of Mean Average Precision), which you will use for the rest of Exercise 1. To identify this model, simply take the weighting model with the highest average MAP across the 3 topic sets.
[1 mark]
Q4. You will now conduct the Query Expansion experiments using the weighting model that produces the highest Mean Average Precision (MAP) across the 3 topic sets in Q3.
Query expansion has a few parameters, e.g. the query expansion model, the number of documents to analyse, and the number of expansion terms. You should simply use Terrier's default query expansion settings: Bo1, 3 documents, 10 expansion terms.
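These defaults can usually be left untouched; if you want to make them explicit in your configuration, the values above would be restated roughly as follows. The property names below are assumptions to verify against Terrier's query expansion documentation (Bo1 is Terrier's default expansion model, so it is not set explicitly here).

# Illustrative query expansion settings (restating the defaults above; verify property names)
expansion.documents=3
expansion.terms=10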
First, run the best weighting model you identified in Q3 with Query Expansion on the homepage finding (HP04), named page finding (NP04), and topic distillation (TD04) topic sets. Report the obtained MAP performances in the Quiz instance. Report your MAP performances to 4 decimal places.
[6 marks]
Next, for each topic set (HP04, NP04, TD04), draw a single Recall-Precision graph comparing the performances of your system with and without the application of Query Expansion. Upload these graphs into the Quiz instance.
[3 marks]
Now, for each topic set (HP04, NP04, TD04) draw a separate query-by-query histogram comparing the MAP performance of your system with and without query expansion (three histograms to produce in total). Each histogram should show two bars for each query of the topic set: one bar corresponding to the MAP performance of the system on that given query with query expansion and one bar corresponding to the MAP performance of the system on that given query without query expansion. Using these histograms and their corresponding data, you should now be able to answer the corresponding questions of the Quiz instance.
[6 marks]
Finally, answer the final analysis questions and complete your Quiz submission.
[7 marks]
Hand-in Instructions: All your answers to Exercise 1 must be submitted on the Exercise 1 Quiz instance, which will be available on Moodle. This exercise is worth 50 marks and 8% of the final course grade.
NB 1: You can (and should) complete the answers to the quiz over several iterations. However, please ensure that you save your intermediate work on the Quiz instance by clicking the Next Page button every time you make a change on a page of the quiz that you want saved.
NB 2: To save yourself a lot of time, you are encouraged to write scripts for the collection and management of your experimental data (and to ensure that you don't mix up your results), as well as for the production of graphs using any of the many existing tools.