A dataset is provided for this assignment. You are free to use it or to use your own dataset instead.
Assignment
In this assignment we continue building the information retrieval system from Assignment 1, which outputs a sorted list of term-document pairs.
The tasks in this assignment are:
- Build an inverted index;
- Enable simple Boolean search;
- Implement compression techniques.
1. Inverted Index
Input: a file with sorted term-doc pairs
Output: an inverted index
In this part you need to take the file containing the sorted list of term-doc pairs and transform it into a simple inverted index. You don't have to worry about the list or the inverted index being too big for main memory; however, you may take that into account.
Bonus Points: persist the inverted index as a file so you don't need to rebuild it every time you launch the program.
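As an illustration of this step, here is a minimal sketch in Python. It assumes a hypothetical input format of one pair per line, written as "<term> <doc_id>" with integer document IDs; the actual output of your Assignment 1 may look different, and the function names are made up for this example.

```python
from itertools import groupby


def read_pairs(lines):
    """Parse lines of the assumed form '<term> <doc_id>' (hypothetical format)."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            yield parts[0], int(parts[1])


def build_inverted_index(pairs):
    """Map each term to a sorted list of unique doc IDs.

    Expects (term, doc_id) pairs already sorted by term, then by doc_id.
    """
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            # The input is sorted, so duplicates are always adjacent.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        index[term] = postings
    return index


sample_lines = ["car 1", "car 3", "horse 1", "horse 2", "horse 3",
                "phone 3", "phone 3"]
index = build_inverted_index(read_pairs(sample_lines))
```

For the bonus part, a dictionary like this could be written to disk and reloaded with Python's standard json or pickle modules; which one fits best depends on how you represent the postings.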
2. Boolean Search
Input: a search query, an inverted index
Output: a list of documents satisfying the query
Implement a simple AND-based Boolean search, i.e., the query "horse car phone" should be treated as "horse AND car AND phone" and return only documents that contain all three words.
Bonus Points: Implement OR and NOT in addition to AND.
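One possible way to approach the AND search is to intersect sorted postings lists with a linear merge. The sketch below assumes the index is a dict mapping each term to a sorted list of integer doc IDs, and that query terms are normalized the same way as the indexed terms; the names intersect and and_search are illustrative only.

```python
def intersect(a, b):
    """Merge-intersect two sorted lists of doc IDs."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def and_search(query, index):
    """Return the doc IDs that contain every whitespace-separated query term."""
    terms = query.split()
    if not terms:
        return []
    # A term missing from the index contributes an empty postings list,
    # which makes the whole conjunction empty.
    postings = [index.get(term, []) for term in terms]
    # Start with the shortest list to keep intermediate results small.
    postings.sort(key=len)
    result = list(postings[0])
    for other in postings[1:]:
        if not result:
            break
        result = intersect(result, other)
    return result


example_index = {"horse": [1, 2, 3, 5], "car": [1, 3, 4], "phone": [3, 5]}
hits = and_search("horse car phone", example_index)
```

With the same sorted-list representation, OR corresponds to a merge-union of two lists and NOT to a difference against the set of all doc IDs.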
3. Index Compression/Optimization
Implement compression and optimization techniques that were discussed in the lectures. In particular, implement at least the Dictionary-as-a-String approach. Implementing other techniques (blocking, front-coding, skip pointers, variable-length gap encoding) is encouraged.
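To make the Dictionary-as-a-String idea concrete: all terms are concatenated in sorted order into one long string, and a table of integer offsets marks where each term starts, so a lookup is a binary search over the offsets. The following is only a sketch under that reading; the class name StringDictionary and its layout are assumptions, and the version from the lectures may differ in detail.

```python
from array import array


class StringDictionary:
    """Stores all terms in one string plus an array of start offsets."""

    def __init__(self, index):
        terms = sorted(index)
        self.offsets = array("I")  # start position of each term in self.text
        position = 0
        for term in terms:
            self.offsets.append(position)
            position += len(term)
        self.text = "".join(terms)
        # Postings are kept in a parallel list, in the same order as the terms.
        self.postings = [index[term] for term in terms]

    def _term_at(self, i):
        """Recover the i-th term; it ends where the next term begins."""
        start = self.offsets[i]
        if i + 1 < len(self.offsets):
            end = self.offsets[i + 1]
        else:
            end = len(self.text)
        return self.text[start:end]

    def lookup(self, term):
        """Binary search over the offsets; returns [] for unknown terms."""
        lo, hi = 0, len(self.offsets) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            candidate = self._term_at(mid)
            if candidate == term:
                return self.postings[mid]
            if candidate < term:
                lo = mid + 1
            else:
                hi = mid - 1
        return []


dictionary = StringDictionary({"horse": [1, 2], "car": [1], "phone": [3]})
```

Here the three terms are stored as the single string "carhorsephone" with offsets 0, 3 and 8. Blocking and front-coding build on the same layout by storing fewer offsets or by sharing common prefixes.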
Compare your search engine's speed and memory requirements before and after implementing compression and optimizations. Include the comparison in your report.
Bonus Points: implementing several techniques and achieving substantial savings in speed and/or memory may earn you bonus points.
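Of the optional techniques listed in this section, variable-length gap encoding can be sketched as follows. The sketch stores the differences between consecutive doc IDs and encodes each gap with the common variable byte scheme (7 payload bits per byte, high bit set on the last byte of each number). It assumes doc IDs are non-negative integers in strictly increasing order; the function names are illustrative, and the exact variant taught in the lectures may differ.

```python
def vb_encode_number(n):
    """Encode one non-negative integer as variable-length bytes."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # the high bit marks the final byte of this number
    return bytes(chunks)


def vb_decode(data):
    """Decode a byte string produced by repeated vb_encode_number calls."""
    numbers = []
    n = 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


def compress_postings(doc_ids):
    """Store the first doc ID, then the gaps between consecutive IDs."""
    encoded = bytearray()
    previous = 0
    for doc_id in doc_ids:
        encoded += vb_encode_number(doc_id - previous)
        previous = doc_id
    return bytes(encoded)


def decompress_postings(data):
    """Rebuild the doc IDs by summing up the decoded gaps."""
    doc_ids = []
    total = 0
    for gap in vb_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


packed = compress_postings([824, 829, 215406])
```

In this example the three doc IDs become the gaps 824, 5 and 214577 and occupy six bytes in total. Small gaps, which dominate for frequent terms, take a single byte each.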