This assignment is on classification of text documents. It is highly recommended that you use Python 3 for this assignment, as libraries like nltk make many things easier (stop word removal and lemmatization). If you use any other language, you will most probably have to design these modules yourself, and they might not perform as well as the nltk library in Python. You can also use packages from scipy for classification.
Assumptions: Remove stop words and punctuation marks, convert everything to lowercase, and perform lemmatization to generate tokens from each document (use the nltk library in Python). Assume positional and class-conditional independence of the terms in a document for Naive Bayes. For vector space classification, assume each document is represented by its normalized tf-idf vector. The tf-idf vector construction follows the same procedure as stated in Assignment 2.
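The token generation steps above can be sketched roughly as follows. This is a minimal illustration, not the required implementation: the tiny `STOP_WORDS` set and the identity default for `lemmatize` are assumptions made so the sketch runs without downloading nltk data. In an actual submission the stop word list would come from `nltk.corpus.stopwords.words("english")` and the lemmatizer from `nltk.stem.WordNetLemmatizer().lemmatize`.

```python
import string

# Tiny illustrative stop-word list (an assumption for this sketch); the full
# list would normally come from nltk.corpus.stopwords.words("english").
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "on"}


def tokenize(text, stop_words=STOP_WORDS, lemmatize=lambda word: word):
    """Lowercase, strip punctuation, drop stop words, then lemmatize.

    `lemmatize` defaults to the identity function so the sketch runs without
    nltk; nltk.stem.WordNetLemmatizer().lemmatize can be passed in instead.
    """
    # Map every punctuation character to a space so "end.Start" still splits.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return [lemmatize(word) for word in words if word not in stop_words]


tokens = tokenize("The cats are sitting on the mat, quietly!")
print(tokens)
```

With the identity lemmatizer, "cats" and "sitting" are left unchanged; a real lemmatizer would normally reduce at least the plural noun to its base form.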
- Naive Bayes with feature selection
- Select the top x features using mutual information computed on the train data. Vary x in {1, 10, 100, 1000, 10000}.
- For each value of x above, train a multinomial Naive Bayes classifier on the given train data, with add-one smoothing.
- For each value of x above, train a Bernoulli Naive Bayes classifier on the given train data.
- Print the F1 score of each classifier on the test data for each value of x.
- Vector space classification, linear: Use the Rocchio classifier to classify documents in the test data and print the F1 score. For the Rocchio classifier, use the following decision rule: assign d to class c iff $|\vec{\mu}(c) - \vec{v}(d)| < |\vec{\mu}(\bar{c}) - \vec{v}(d)| - b$, where $\vec{\mu}(c)$ is the centroid of class c and $\bar{c}$ is its complement. Vary b in {0, .01, .05, .1} and print the F1 score for each value of b.
- Vector space classification, non-linear: Use a kNN classifier to classify documents in the test data and report the F1 score. Vary k in {1, 10, 50}. For the similarity score, use the inner product of the vector representations of two documents. Print the F1 scores on the test data.
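The two vector space classifiers can be sketched on sparse vectors as below. This is an illustrative sketch under several assumptions: documents are stored as `{term: weight}` dicts that have already been tf-idf weighted and normalized, the Rocchio rule is read as "assign d to c iff the distance to the centroid of c is smaller than the distance to the centroid of the complement class minus b", and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def dot(u, v):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def normalize(v):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm else dict(v)


def centroid(vectors):
    """Mean of a list of sparse vectors (the Rocchio class prototype)."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def distance(u, v):
    """Euclidean distance between two sparse vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def rocchio_assign(d, mu_c, mu_other, b=0.0):
    """True iff |mu(c) - v(d)| < |mu(c_bar) - v(d)| - b."""
    return distance(mu_c, d) < distance(mu_other, d) - b


def knn_predict(d, train, k):
    """Majority label among the k training docs with the largest inner product."""
    ranked = sorted(train, key=lambda item: dot(d, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two documents per class.
train = [
    (normalize({"good": 1.0}), "pos"),
    (normalize({"good": 1.0, "fine": 1.0}), "pos"),
    (normalize({"bad": 1.0}), "neg"),
    (normalize({"bad": 1.0, "poor": 1.0}), "neg"),
]
mu_pos = centroid([v for v, label in train if label == "pos"])
mu_neg = centroid([v for v, label in train if label == "neg"])
doc = normalize({"good": 1.0, "poor": 0.2})

for b in (0, 0.01, 0.05, 0.1):
    print(b, rocchio_assign(doc, mu_pos, mu_neg, b))
print(knn_predict(doc, train, k=1), knn_predict(doc, train, k=3))
```

A larger b makes the rule more conservative about assigning a document to c, which is why the F1 score can change as b varies. The kNN sketch does not define a tie-breaking policy for even k; that choice is left open here.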
Instructions for submission: Submit your code named as Rollno Taskno.py.
Your code will be executed as: python3 your-code.py path-data-directory output-file
Print the F1 scores, separated by spaces, to the output file as plain text. Sample output should look like
The remaining two files follow a similar structure.
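Computing an F1 score and formatting a row of scores as space-separated text could look roughly like the sketch below. The handout does not specify the averaging scheme or the number of decimal places, so the single positive class and the four-decimal format are assumptions, and `f1_score` and `format_scores` are hypothetical helper names.

```python
def f1_score(y_true, y_pred, positive):
    """F1 of the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def format_scores(scores):
    """Join scores with single spaces (four decimals is an assumption)."""
    return " ".join(f"{score:.4f}" for score in scores)


y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
line = format_scores([f1_score(y_true, y_pred, positive=1)])
print(line)
# Writing to the output file named on the command line might then be:
# with open(sys.argv[2], "w") as out: out.write(line + "\n")
```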
