The goal of this assignment is to build a search engine from scratch, which is an example of an information retrieval system. In class, we have seen the various modules that serve as the building blocks of a search engine; we will build them progressively as the course proceeds. The first part of this assignment is to build a basic text processing module that implements sentence segmentation, tokenization, stemming/lemmatization and stopword removal. The Cranfield dataset, which has been uploaded separately on Moodle, will be used for this purpose.
- What is the simplest and most obvious top-down approach to sentence segmentation for English texts?
- Does the top-down approach (your answer to the above question) always perform correct sentence segmentation? If yes, justify. If no, substantiate with a counterexample.
- Python NLTK is one of the most commonly used packages for Natural Language Processing. What does the Punkt Sentence Tokenizer in NLTK do differently from the simple top-down approach? You can read about the tokenizer here.
- Perform sentence segmentation on the documents in the Cranfield dataset using:
- The top-down method stated above
- The pre-trained Punkt Tokenizer for English
State a possible scenario along with an example where:
- the first method performs better than the second one (if any)
- the second method performs better than the first one (if any)
- What is the simplest top-down approach to word tokenization for English texts?
- Read about NLTK's Penn Treebank tokenizer here. What type of knowledge does it use: top-down or bottom-up?
- Perform word tokenization of the sentence-segmented documents using
- The simple method stated above
- Penn Treebank Tokenizer
State a possible scenario along with an example where:
- the first method performs better than the second one (if any)
- the second method performs better than the first one (if any)
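As a sketch of what the comparison might look like (the example sentence is hypothetical; `TreebankWordTokenizer` needs no downloaded data):

```python
from nltk.tokenize import TreebankWordTokenizer

def naive_tokenize(sentence):
    """Simple top-down approach: split on whitespace only."""
    return sentence.split()

sentence = "Don't panic, it's only a 2.5% drop."

naive = naive_tokenize(sentence)
treebank = TreebankWordTokenizer().tokenize(sentence)

print(naive)     # punctuation stays glued to words: "panic,", "drop."
print(treebank)  # contractions and punctuation become separate tokens
```

Note how the Treebank tokenizer separates the contraction "Don't" into "Do" and "n't" and detaches the comma and final period, while the whitespace split leaves them attached.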
- What is the difference between stemming and lemmatization?
- For the search engine application, which is better? Give a proper justification for your answer. This is a good reference on stemming and lemmatization.
- Perform stemming/lemmatization (as per your answer to the previous question) on the word-tokenized text.
- Remove stopwords from the tokenized documents using a curated list of stopwords (for example, the NLTK stopwords list).
- In the above question, the list of stopwords denotes top-down knowledge. Can you think of a bottom-up approach for stopword removal?
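One possible bottom-up sketch (an assumption, not the expected answer): derive the stopword list from the corpus itself, treating its most frequent terms as stopwords, since words that occur in nearly every document carry little discriminating power. The toy corpus below is hypothetical:

```python
from collections import Counter

def corpus_stopwords(docs, top_k=5):
    """Bottom-up stopword detection: return the top_k most frequent
    tokens across the corpus, with no curated list involved."""
    counts = Counter(token.lower() for doc in docs for token in doc)
    return {word for word, _ in counts.most_common(top_k)}

docs = [
    ["the", "flow", "over", "the", "wing"],
    ["the", "heat", "transfer", "in", "the", "boundary", "layer"],
    ["a", "study", "of", "the", "supersonic", "flow"],
]
print(corpus_stopwords(docs, top_k=2))
```

A refinement would threshold on document frequency rather than raw counts, or combine frequency with an inverse-document-frequency cutoff.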
