The goal of this assignment is to build a search engine from scratch, which is an example of an information retrieval system. In class, we have seen the various modules that serve as the building blocks of a search engine; we will build them progressively as the course proceeds. The first part of this assignment is to build a basic text processing module that implements sentence segmentation, tokenization, stemming/lemmatization and stopword removal. The Cranfield dataset, which has been uploaded separately on Moodle, will be used for this purpose.
- What is the simplest and most obvious top-down approach to sentence segmentation for English texts?
- Does the top-down approach (your answer to the above question) always perform correct sentence segmentation? If yes, justify. If no, substantiate with a counterexample.
- Python NLTK is one of the most commonly used packages for Natural Language Processing. What does the Punkt Sentence Tokenizer in NLTK do differently from the simple top-down approach? You can read about the tokenizer here.
- Perform sentence segmentation on the documents in the Cranfield dataset using:
- The top-down method stated above
- The pre-trained Punkt Tokenizer for English
State a possible scenario along with an example where:
- the first method performs better than the second one (if any)
- the second method performs better than the first one (if any)
- What is the simplest top-down approach to word tokenization for English texts?
- Read about NLTK's Penn Treebank tokenizer here. What type of knowledge does it use: top-down or bottom-up?
- Perform word tokenization of the sentence-segmented documents using
- The simple method stated above
- Penn Treebank Tokenizer
State a possible scenario along with an example where:
- the first method performs better than the second one (if any)
- the second method performs better than the first one (if any)
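As a sketch of what the comparison might look like (the example sentence is hypothetical; `TreebankWordTokenizer` needs no downloaded data):

```python
from nltk.tokenize import TreebankWordTokenizer

def naive_tokenize(sentence):
    """Simple top-down approach: split on whitespace only."""
    return sentence.split()

sentence = "Don't panic, it's only a 2.5% drop."

naive = naive_tokenize(sentence)
treebank = TreebankWordTokenizer().tokenize(sentence)

print(naive)     # punctuation stays glued to words: "panic,", "drop."
print(treebank)  # contractions and punctuation become separate tokens
```

Note how the Treebank tokenizer separates the contraction "Don't" into "Do" and "n't" and detaches the comma and final period, while the whitespace split leaves them attached.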
- What is the difference between stemming and lemmatization?
- For the search engine application, which is better? Give a proper justification for your answer. This is a good reference on stemming and lemmatization.
- Perform stemming/lemmatization (as per your answer to the previous question) on the word-tokenized text.
- Remove stopwords from the tokenized documents using a curated list of stopwords (for example, the NLTK stopwords list).
- In the above question, the list of stopwords denotes top-down knowledge. Can you think of a bottom-up approach for stopword removal?
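One possible bottom-up sketch (an assumption, not the expected answer): derive the stopword list from the corpus itself, treating its most frequent terms as stopwords, since words that occur in nearly every document carry little discriminating power. The toy corpus below is hypothetical:

```python
from collections import Counter

def corpus_stopwords(docs, top_k=5):
    """Bottom-up stopword detection: return the top_k most frequent
    tokens across the corpus, with no curated list involved."""
    counts = Counter(token.lower() for doc in docs for token in doc)
    return {word for word, _ in counts.most_common(top_k)}

docs = [
    ["the", "flow", "over", "the", "wing"],
    ["the", "heat", "transfer", "in", "the", "boundary", "layer"],
    ["a", "study", "of", "the", "supersonic", "flow"],
]
print(corpus_stopwords(docs, top_k=2))
```

A refinement would threshold on document frequency rather than raw counts, or combine frequency with an inverse-document-frequency cutoff.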
