Project 2: Boolean Query and Inverted Index
In this project, you will be given a sample input text file consisting of document IDs and sentences. Your task is to build an inverted index from this data. The index should be stored in memory as linked lists, following the examples shown in the textbook (refer to Chapter 1, Boolean Retrieval). Having built this index, you are required to implement a document-at-a-time (DAAT) strategy to answer Boolean queries. Your implementation must use Python 3 only.
Input Dataset
input_corpus.txt is a tab-delimited file where each line is a document; the first field is the document ID, and the second is a sentence. The two fields are separated by a tab.
Example:
Step 1: Build Your Own Inverted Index
Implement a pipeline which takes as input the given corpus, and returns an inverted index.
- Extract the document ID and document from each line.
- Perform a series of preprocessing steps on each document:
- Convert the document text to lowercase.
- Remove special characters. Only letters, digits, and whitespace should remain in the document.
- Remove excess whitespace. There should be exactly one space between tokens, and no whitespace at the start or end of the document.
- Tokenize the document into terms using a whitespace tokenizer.
- Remove stop words from the document.
- Perform Porter stemming on the tokens.
- Sample preprocessing output
Note: Do not use NLTK or any other package for whitespace tokenization or special-character removal; you might end up with different results. Simple Python regexes and built-in functions are recommended.
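The preprocessing steps above can be sketched as follows. The stop-word set here is a small placeholder for illustration (use whichever list the course specifies), and the stemmer is left pluggable so that, e.g., NLTK's PorterStemmer can be passed in for the stemming step:

```python
import re

# Placeholder stop-word set for illustration only; substitute the
# list required by the assignment.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "on", "in"}

def preprocess(text, stemmer=None):
    """Lowercase, strip special characters, collapse whitespace,
    tokenize on whitespace, drop stop words, and optionally stem."""
    text = text.lower()
    # Keep only letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    if stemmer is not None:  # e.g. nltk.stem.PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens

print(preprocess("The cat, sat -- on   the mat!"))  # ['cat', 'sat', 'mat'] (no stemming)
```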
- For each token, create a postings list. Each postings list must be stored as a linked list, with postings ordered by increasing document ID.
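One way to satisfy the linked-list requirement is a minimal singly linked list with ordered insertion; the class and method names below are illustrative, not mandated by the assignment:

```python
class Node:
    """A single posting in a singly linked list."""
    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.next = None

class PostingsList:
    """Postings kept in increasing doc-ID order."""
    def __init__(self):
        self.head = None

    def insert(self, doc_id):
        # Empty list, or the new doc ID precedes the head.
        if self.head is None or doc_id < self.head.doc_id:
            node = Node(doc_id)
            node.next = self.head
            self.head = node
            return
        cur = self.head
        while cur.next is not None and cur.next.doc_id < doc_id:
            cur = cur.next
        # Skip duplicates so each doc ID appears once per term.
        if cur.doc_id == doc_id or (cur.next and cur.next.doc_id == doc_id):
            return
        node = Node(doc_id)
        node.next = cur.next
        cur.next = node

    def to_list(self):
        """Traverse the list, yielding doc IDs in increasing order."""
        out, cur = [], self.head
        while cur:
            out.append(cur.doc_id)
            cur = cur.next
        return out

# Building the index: one PostingsList per term.
index = {}
for doc_id, tokens in [(3, ["cat"]), (1, ["cat", "dog"]), (2, ["cat"])]:
    for t in tokens:
        index.setdefault(t, PostingsList()).insert(doc_id)

print(index["cat"].to_list())  # [1, 2, 3]
```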
Step 2: Boolean Query Processing
You are required to implement the following methods and provide the results for a set of Boolean AND queries on your own index. Results should be written to a JSON file in the required format.
Given a user query, the first step is to preprocess it using the same steps applied to the documents.
1. Perform the following preprocessing steps on each user query:
- Convert the query to lowercase.
- Remove special characters from the query.
- Remove excess whitespace. There should be exactly one space between query tokens, and no whitespace at the start or end of the query.
- Tokenize the query into terms using a whitespace tokenizer.
- Remove stop words from the query.
- Perform Porter stemming on the query tokens.
- Sample query processing:
2. Get postings lists
This method retrieves the postings list for each of the given query terms. Its input is a set of terms: term0, term1, ..., termN. It should output, for each term, the postings list sorted by document ID. Below is a sample format:
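Since each linked list is maintained in increasing doc-ID order, retrieval is just a traversal per term. A minimal sketch, assuming each postings-list object exposes a `to_list()` method that walks the linked list (the `FakePostings` class here is a stand-in for illustration):

```python
class FakePostings:
    """Stand-in for a linked postings list (illustration only)."""
    def __init__(self, doc_ids):
        self._ids = sorted(doc_ids)
    def to_list(self):
        return self._ids

def get_postings(index, terms):
    # Traversing each linked list yields an already doc-ID-sorted
    # postings list, so no extra sorting is needed here.
    return {term: index[term].to_list() for term in terms}

sample_index = {"cat": FakePostings([4, 1, 3]), "dog": FakePostings([2, 4])}
print(get_postings(sample_index, ["cat", "dog"]))  # {'cat': [1, 3, 4], 'dog': [2, 4]}
```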
3. Document-at-a-time AND query
This function implements a multi-term Boolean AND query on the index using the document-at-a-time (DAAT) strategy. Its input is a set of query terms: term0, term1, ..., termN. You will have to implement the MERGE algorithm and return a sorted list of document IDs, along with the number of comparisons made. Below is a sample of the output:
NOTE: A comparison (for the field num_comparisons) is counted whenever you compare two document IDs during the union or intersection operation.
Choose the merge order that minimizes num_comparisons.
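The two-pointer MERGE intersection can be sketched as follows. Two caveats: this version operates on plain sorted lists for brevity (your submission must traverse your linked lists with your own pointers), and it counts one comparison per pair of doc IDs examined, which may differ from the assignment's exact counting convention. Merging the shortest lists first tends to reduce the count:

```python
def daat_and(postings_lists):
    """Intersect sorted doc-ID lists document-at-a-time.
    Returns (sorted result list, number of doc-ID comparisons)."""
    if not postings_lists:
        return [], 0
    # Heuristic: merge shortest lists first to keep the running
    # intersection (and hence the comparison count) small.
    lists = sorted(postings_lists, key=len)
    result, comparisons = lists[0], 0
    for other in lists[1:]:
        merged, i, j = [], 0, 0
        while i < len(result) and j < len(other):
            comparisons += 1  # one doc-ID comparison per step
            if result[i] == other[j]:
                merged.append(result[i])
                i += 1
                j += 1
            elif result[i] < other[j]:
                i += 1
            else:
                j += 1
        result = merged
    return result, comparisons

docs, n = daat_and([[1, 2, 4, 8], [2, 4, 6], [4, 8]])
print(docs, n)  # [4] 6
```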
Step 3: JSON output in the correct format
The results of the postings-list and DAAT AND queries must be combined in a single Python dictionary and saved as a JSON file. Below is a sample of the final output.
Sample JSON output format:
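The assemble-and-save step can be sketched with the standard `json` module. The key names below are placeholders of my own; replace them with the exact keys the assignment's required format specifies:

```python
import json

def write_output(path, postings, daat_results):
    """Combine both result sets into one dictionary and dump it as JSON.
    Key names here are illustrative; match the required format exactly."""
    output = {
        "postingsList": postings,   # term -> sorted list of doc IDs
        "daatAnd": daat_results,    # query -> results / num_docs / num_comparisons
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(output, f, indent=2)
```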
NOTE:
There is no requirement on the names of your methods. However, you must submit a .py file named exactly UBITName_project2.py, and your program should run by executing the following command on Timberlake:
python3 UBITName_project2.py path_of_input_corpus output.json queries.txt
python3 sougatas_project2.py ./data/input_corpus.txt ./output.json ./queries.txt
Here, the first input parameter, path_of_input_corpus, accepts a string giving the path of the tab-delimited file containing the corpus. The second input parameter, output.json, is the output file to write to, while the third, queries.txt, is the input file containing the query terms.
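A minimal sketch of the entry point, assuming one `doc_id<TAB>sentence` pair per line and integer document IDs (adjust if the corpus uses a different ID type):

```python
import sys

def read_corpus(path):
    """Parse the tab-delimited corpus: one document per line,
    'doc_id<TAB>sentence'. Returns {doc_id: sentence}."""
    docs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            doc_id, sentence = line.split("\t", 1)
            docs[int(doc_id)] = sentence  # assuming integer doc IDs
    return docs

if __name__ == "__main__" and len(sys.argv) >= 4:
    corpus_path, output_path, query_path = sys.argv[1:4]
    docs = read_corpus(corpus_path)
    # ... build the index, run the queries, write output_path ...
```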
The following assumptions can be made:
- The number of input query terms can vary.
- All of the input query terms are selected from the vocabulary.
- Query terms should be processed in the order in which they appear in the query; that is, process term0 first, then term1, and so on.
- The document ID and the corresponding sentence text will be separated by a tab in the input file referenced by path_of_input_corpus.
- DO NOT use Python built-in methods to perform unions and intersections on postings lists directly. Create your own pointers/references!
- We will use Moss to check for plagiarism.
- Output must be formatted exactly as required; otherwise, you will not receive credit, because grading is automated!