Name: [Solved] Cmpe493 Assignment 1 – A Simple Document Retrieval System for Boolean Queries - Assignment Chef
SKU: [Solved] Cmpe493 Assignment 1 – A Simple Document Retrieval System for Boolean Queries
Price: 30 USD
Availability: InStock

In this assignment you will implement a document retrieval system for simple boolean queries using the positional inverted indexing scheme. You will use the Reuters-21578 data set, which can be download from Moodle (reuters21578.zip). Reuters-21578 contains 21578 news stories from Reuters newswire classified under one or more of 118 categories. There are 22 SGML files, each containing 1000 news articles, except the last file, which contains 578 articles. You should perform the following steps:

Pre-processing the Data Set : The text of a news story is enclosed under the <TEXT> You should use the <TITLE> and the <BODY> fields to extract the text of a news story. Implement your own tokenizer to get the tokens from the news texts and perform normalization operations including case-folding, stopword removal, and stemming. Please use the stopword list on Moodle (stopwords.txt) for stopword removal and the Porter Stemmer (http://tartarus.org/martin/PorterStemmer/) for obtaining the stem of each token. Note that you should perform the same preprocessing steps for the queries as well. (Assume that the files in the reuters21578.zip file are in the folder named as “Dataset”.)
Building the Positional Inverted Index: Instead of taking each SGML file as a document unit, you should index each news article as a separate document and use the NEWID field as document IDs. Then, you need to generate two files, namely the dictionary and the inverted index. The first file should contain the stems of the words and their ids as a hashmap. The second file should contain the positional indexes of each stem based on their ids in the first file. When you implement the boolean query processor, you should use only these two files, not the original Reuters-21578 dataset. Note that the inverted index construction and query processor should be designed as two separate modules. That is, the query processor should NOT construct the inverted index each it is run. It should just use the dictionary and inverted index files generated by the indexing module.
Implementing a Boolean query processor: You should implement the postings merge algorithm for conjunctive, phrase, and proximity queries. Assume that each query consists of either a conjunction, a phrase, or a proximity of terms. You do NOT need to handle disjunctive queries, wildcard queries, or queries containing the NOT operator or parenthesis. That is, the queries will be of the following three types:
- w₁AND w₂AND w₃… AND w_n
- w₁w₂w₃… w_n
- w₁/k₁w₂/k₂w₃/k₃… /k_n₋₁w_nwhere each w_iis a single-word keyword and k_i₋₁is a constant number which implies that w_iis within k_i₋₁words of w_i₋₁. That is, a matching document will have w_i₋₁in the text followed by at most k_i₋₁words, followed by w_i. For example, w₁/0w₂is equivalent to the phrase query w₁w₂. Note that the order of the query terms is important for the proximity queries: w_i₋₁k_i₋₁w_iis NOT the same query as w_ik_i₋₁w_i₋₁.

The query processor should take as input a query preceded by its query type information (1 for conjunctive query, 2 for phrase query, and 3 for proximity query). Here are some example input queries:

1 oil AND price
2 oil price
3 oil /3 price

The query processor should return the IDs of the matching documents sorted in ascending order.

You may use any programming language of your choice. However, we should be able to run your program by following the instructions in your readme file and you should not use any third party libraries except for stemming.

Submission: You should submit a “.zip” file named as YourNameSurname.zip containing the following files using the Moodle system:

Report:
- Describe the steps you have performed for data preprocessing and provide answers for thefollowing questions.
- How many tokens does the corpus contain before stopword removal and stemming?
- How many tokens does the corpus contain after stopword removal and stemming?
- How many terms (unique tokens) are there before stopword removal, stemming, and case-folding?
- How many terms (unique tokens) are there after stopword removal, stemming, and casefolding?
- List the top 20 most frequent terms before stopword removal, stemming, and casefolding?
- List the top 20 most frequent terms after stopword removal, stemming, and case-folding?
- Describe the data structures that you used for representing the positional index (dictionaryand postings).
- Provide a screenshot of running your system for a conjunctive query.
- Provide a screenshot of running your system for a phrase query.

(iv) Provide a screenshot of running your system for a proximity query.

Source code and executable: Commented source code and executables of your document retrieval system.
Readme: Detailed readme describing how to run your program.

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Whatsapp Us

[Solved] Cmpe493 Assignment 1 – A Simple Document Retrieval System for Boolean Queries

Reviews

Related products

[Solved] Cmpe493 Assignment3-Single Document Extractive Text Summarization with LexRank

[Solved] Cmpe493 Assignment2-Text Classification using Naive Bayes