[Solved] COMPSCI121/IN4MATX141 Project 3 Search Engine


In this assignment you will build a simple search engine based on the concepts you have learned in the lectures so far. The project has two milestones, each with its own deliverables and deadline. When planning for milestone #1, keep the requirements of milestone #2 in mind so that you stay on track to complete the project successfully. After completing milestone #2, you will be required to demonstrate your search engine to the TAs face to face (F2F).

You can use code that you or any classmate wrote for the previous projects. You cannot use code written for this project by non-group-member classmates. Use code found on the Internet at your own peril: it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose the origin of the code. As stated in the course policy document, concealing the origin of a piece of code is plagiarism.

Use the Discussion Board on Canvas to post your questions about this assignment so that the answers can help you and other students as well.

Goal: Implement a complete search engine.

Milestones Overview

Milestone #1
Deadline: 5/19
Goal: Produce an initial index for the corpus and a basic retrieval component
Deliverables: Short report (no demo)
Score (out of 50): 50% (25)

PROJECT: SEARCH ENGINE

Corpus: all ICS web pages

We will provide you with the crawled data as a zip file (webpages.zip). This contains the downloaded content of the ICS web pages that were crawled by us. You are expected to build your search engine index off of this data.

Main challenges: Full HTML parsing, file/DB handling, handling user input (using either the command line, a desktop GUI application, or a web interface)

COMPONENT 1: INVERTED INDEX

Create an inverted index for the whole corpus given to you. You can either use a database to store your index (MongoDB, Redis, memcached are some examples) or you can store the index in a file. You are free to choose an approach here.

The index should store more than just a simple list of documents where the token occurs. At the very least, your index should store the TF-IDF of every term/document pair.

Sample Index:

Note: This is a simplistic example provided for your understanding. A good inverted index will store more information than this in order to obtain better search results.

o Index Structure: token docId1, tf-idf1 ; docId2, tf-idf2
o Example: informatics doc_1, 2.3 ; doc_2, 3.1 ; doc_3, 1.3

You are encouraged to come up with heuristics that make sense and will help in retrieving relevant search results. For example, words in bold and in heading tags (h1, h2, h3) could be treated as more important than the other words. These are useful metadata that could be added to your inverted index data.

Optional:

Extra credit will be given for ideas that improve the quality of the retrieval, so you may add more metadata to your index if you think it will help, for instance the positions of the words in the page, HTML tag weights, etc. To store this information, you need to design your index in such a way that it can store and retrieve all this metadata efficiently. Your index lookup during search should not be horribly slow, so pay attention to the structure of your index and retrieval algorithms.
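As a minimal sketch of the index structure described for Component 1, the code below builds a token-to-postings map over a tiny in-memory corpus and prints one entry in the sample layout. The three documents, the helper names (tokenize, build_index, format_posting) and the weighting (1 + log10(tf)) * log10(N / df) are assumptions made for illustration; the handout does not prescribe a particular TF-IDF variant.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build {token: {doc_id: tf-idf}} from {doc_id: text}.

    Assumed weighting: (1 + log10(tf)) * log10(N / df).
    """
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for token, tf in counts.items():
            idf = math.log10(n_docs / doc_freq[token])
            index[token][doc_id] = (1 + math.log10(tf)) * idf
    return dict(index)


def format_posting(token, postings):
    """Render one entry as 'token docId1, tf-idf1 ; docId2, tf-idf2'."""
    pairs = " ; ".join(
        f"{doc_id}, {score:.2f}" for doc_id, score in sorted(postings.items())
    )
    return f"{token} {pairs}"


# Hypothetical three-document corpus, standing in for the parsed web pages.
docs = {
    "doc_1": "Informatics at Irvine",
    "doc_2": "Informatics informatics research",
    "doc_3": "Computer science at Irvine",
}
index = build_index(docs)
print(format_posting("informatics", index["informatics"]))
```

A dictionary of dictionaries keeps lookups by token cheap; extra metadata such as word positions or tag weights could be stored by replacing each score with a small record, at the cost of a larger index.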

COMPONENT 2: SEARCH AND RETRIEVE

Your program should prompt the user for a query. This doesn't need to be a Web interface; it can be a console prompt. At the time of the query, your program will look up your index, perform some calculations (see ranking below) and give out the ranked list of pages that are relevant for the query.

Optional:

Extra credit will be given if your search interface has a good GUI. The GUI has to be an intuitive search engine interface and should be able to run in a web browser.

COMPONENT 3: RANKING

At the very least, your ranking formula should include TF-IDF (term frequency-inverse document frequency) scoring, but you should feel free to add additional components to your ranking formula to improve the retrieval.

Optional:

Extra credit will be given if your ranking formula includes heuristics beyond TF-IDF. You can apply the ranking schemes that were taught in class.

NOTE: The TA will ask detailed questions regarding your search engine's ranking heuristics, so ensure you thoroughly understand the ranking methods that you'll implement.

Use of libraries:

To extract the content from the HTML tags, you will be using an HTML parser. There are many libraries available to achieve this task and we encourage you to compare the available options before selecting a library to perform HTML parsing for you (suggestions: BeautifulSoup, HTMLParser).

It is strictly not allowed to use libraries that perform the entire task of index creation or ranking for you. Hence, libraries such as Lucene or Elasticsearch, or TF-IDF calculating libraries, are not allowed.

You may use libraries that help you achieve specific tasks such as parsing. For example, you can use a tokenizer such as NLTK to tokenize your content, as text processing is not the primary focus of this assignment.
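To illustrate the HTML parsing step, here is a rough sketch that uses Python's standard-library HTMLParser to pull tokens out of a page and flag those that appear inside tags that might be treated as more important. The class name TokenExtractor, the helper extract_tokens, the sample page and the exact tag sets are assumptions for illustration; a library such as BeautifulSoup could fill the same role.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed tag sets: which tags mark "important" text and which to skip entirely.
IMPORTANT_TAGS = {"title", "h1", "h2", "h3", "b", "strong"}
SKIPPED_TAGS = {"script", "style"}


class TokenExtractor(HTMLParser):
    """Collect (token, is_important) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.open_count = Counter()  # how many tracked tags are currently open
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS or tag in SKIPPED_TAGS:
            self.open_count[tag] += 1

    def handle_endtag(self, tag):
        # Ignore stray closing tags that were never opened.
        if self.open_count[tag] > 0:
            self.open_count[tag] -= 1

    def handle_data(self, data):
        if any(self.open_count[tag] for tag in SKIPPED_TAGS):
            return  # text inside <script> or <style> is not page content
        important = any(self.open_count[tag] for tag in IMPORTANT_TAGS)
        for token in re.findall(r"[a-z0-9]+", data.lower()):
            self.tokens.append((token, important))


def extract_tokens(html):
    """Parse an HTML string and return its (token, is_important) pairs."""
    parser = TokenExtractor()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Hypothetical page used only to exercise the parser.
page = (
    "<html><head><title>ICS Home</title><style>p { color: red }</style></head>"
    "<body><h1>Informatics</h1><p>Welcome to <b>Irvine</b> campus</p></body></html>"
)
tokens = extract_tokens(page)
print(tokens)
```

The importance flag could later feed a tag weight in the index, for example by counting a flagged occurrence more heavily than a plain one.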

Milestone #1: Due on May 19th

Goal: Build an index and a basic retrieval component. (Component 1 as described above)

By basic retrieval component, we mean that at this point you just need to be able to query your index for links (the query can be as simple as a single word at this point). These links do not need to be accurate/ranked. We will cover ranking in the next milestone.

At least the following queries should be used to test your retrieval:

1. Informatics
2. Mondego
3. Irvine
4. artificial intelligence
5. computer science

Note: queries 4 and 5 are for milestone #2.

Deliverables: Submit a report (PDF) in Canvas with the following content:

1. A table with details pertaining to your index. It should have the following details regarding your inverted index:
   a. Number of documents of the corpus
   b. Number of [unique] tokens present in the index
   c. The total size (in KB) of your index on disk.
2. URLs retrieved for each of the queries above. You can submit the first 10 results for each of the search queries requested above.
   a. Please do not submit all the URLs that you obtain by running the query.
   b. The quality of the search results does not matter for this milestone. You will work on improving the results in the second milestone.

Evaluation criteria: Were the reported details regarding your inverted index plausible? Are the reported URLs plausible? Was the report submitted on time?
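The report figures and the unranked lookup for this milestone could be produced along the following lines. This is only a sketch: the index contents, the doc_urls mapping with its placeholder URLs, the JSON file layout and the function names are all hypothetical, and a database-backed index would measure its size differently.

```python
import json
import os
import tempfile


def index_stats(index, doc_urls, index_path):
    """Gather the three figures requested for the report table."""
    return {
        "documents": len(doc_urls),
        "unique_tokens": len(index),
        "size_kb": os.path.getsize(index_path) / 1024,
    }


def lookup(index, doc_urls, query, limit=10):
    """Return up to `limit` URLs of documents containing every query word.

    Results are unranked; they are sorted by doc id only to be repeatable.
    """
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))
    return [doc_urls[doc_id] for doc_id in sorted(matches)][:limit]


# Hypothetical data: placeholder URLs and made-up TF-IDF weights.
doc_urls = {
    "doc_1": "https://example.org/informatics",
    "doc_2": "https://example.org/mondego",
    "doc_3": "https://example.org/irvine",
}
index = {
    "informatics": {"doc_1": 0.5, "doc_2": 0.2},
    "mondego": {"doc_2": 1.1},
    "irvine": {"doc_1": 0.3, "doc_3": 0.4},
}

# Write the index to a temporary file so its on-disk size can be measured.
with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = os.path.join(tmp_dir, "index.json")
    with open(index_path, "w", encoding="utf-8") as index_file:
        json.dump(index, index_file)
    stats = index_stats(index, doc_urls, index_path)

print(stats)
print(lookup(index, doc_urls, "Informatics"))
```

Capping the result list at 10 mirrors the instruction not to submit every URL a query returns.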

Milestone #2: Due on May 27th

Goal: Complete Search Engine. Components 2 and 3 as described above.

For this milestone, you will complete your search engine's search and ranking components. You are free to implement more than TF-IDF for your ranking scheme. If you feel there is a better ranking scheme that does not use TF-IDF at all, you're free to choose that as well, but ensure that you do get better results than TF-IDF.

Deliverables: Submit a zip file containing all the artifacts/programs you wrote for your search engine project.

Following this milestone, your project group will meet with a TA to show a live demo of your search engine.

Evaluation criteria: Does your program work as expected of search engines? How good are the heuristics that you employed to improve the retrieval? Was there an actual improvement in the search from the results obtained in Milestone 1? Do you demonstrate in-depth knowledge of how your search engine works? Are you able to answer detailed questions pertaining to any aspect of its implementation?

Extra Credit: How good is the web GUI? (e.g. links to the actual pages, snippets, etc.) Does the ranking formula incorporate more than just TF-IDF?
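One simple way to approach the search and ranking components is to score each document by the sum of its TF-IDF weights over the query terms and return the highest scores first. The sketch below assumes an index shaped as {token: {doc_id: weight}}; the function name rank, the weights and the document ids are made up for illustration, and this is only one of many possible ranking schemes.

```python
from collections import defaultdict


def rank(index, query, top_k=10):
    """Score documents by summing their TF-IDF weights over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties are broken by doc id so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical index with made-up weights.
index = {
    "artificial": {"doc_1": 1.0, "doc_3": 0.5},
    "intelligence": {"doc_1": 0.75, "doc_2": 1.5},
}
results = rank(index, "artificial intelligence")
for doc_id, score in results:
    print(doc_id, score)
```

With these weights doc_1 ranks first because it matches both terms. Additional heuristics, such as a boost for terms found in headings, could be added as extra terms in the score.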
