Download the 20 Newsgroups dataset. Use the documents from the following five classes for text classification: comp.graphics, sci.med, talk.politics.misc, rec.sport.hockey, and sci.space.
Implement the following algorithms for text classification:
- Naive Bayes
- kNN (vary k=1,3,5)
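The two classifiers listed above can be sketched roughly as follows. This is a minimal illustration, not a prescribed solution: it assumes documents are already tokenized into lists of terms, uses a multinomial Naive Bayes model with add-one smoothing, and uses cosine similarity as the kNN distance. The function names (`train_nb`, `predict_nb`, `cosine`, `predict_knn`) are hypothetical and chosen only for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class document counts and term counts (multinomial model)."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, y in zip(docs, labels):
        term_counts[y].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    return class_docs, term_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior for a token list."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for y in class_docs:
        total = sum(term_counts[y].values())
        score = math.log(class_docs[y] / n_docs)  # log prior
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                # Add-one (Laplace) smoothing, an assumption of this sketch
                score += math.log((term_counts[y][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = y, score
    return best

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict_knn(train_vecs, train_labels, vec, k):
    """Majority vote among the k most similar training vectors."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(pair[0], vec), reverse=True)
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]
```

Calling `predict_knn` with k = 1, 3, and 5 on the same test vectors gives the three kNN variants to compare.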
Feature selection techniques to be used with both algorithms:
- TF-IDF
- Mutual Information
Implementation Points:
- Perform the data pre-processing steps.
- Split your dataset randomly into training and test sets according to the train:test ratio. Documents must be selected at random; do not split them in sequential order. For instance, with an 80:20 ratio, do not simply take the first 800 documents as the training set and the last 200 as the test set.
- Implement the TF-IDF scoring technique and mutual information technique for efficient feature selection.
- For each class, train your Naive Bayes classifier and kNN on the training data.
- Test your classifiers on testing data and report the confusion matrix and overall accuracy.
- Perform the above steps on 50:50, 70:30, and 80:20 training and testing split ratios.
- Compare and analyze the performance of the two classification algorithms for both feature selection techniques across the different train:test ratios. Use graphs to report the performance comparison, and state the inferences you draw from them. One example is a graph showing the performance of kNN for different values of k.
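The implementation points above (random splitting, TF-IDF and mutual information scoring, confusion matrix, and accuracy) could be sketched as below. This is only one possible reading of the steps: it assumes tokenized documents, a raw-count TF with `log(N / df)` as the IDF variant, and mutual information computed between term presence and membership in a single class. All function names and the `seed` parameter are hypothetical.

```python
import math
import random
from collections import Counter

def random_split(docs, labels, train_fraction, seed=0):
    """Shuffle indices before cutting, so the split is not sequential."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_fraction)
    train, test = idx[:cut], idx[cut:]
    return ([docs[i] for i in train], [labels[i] for i in train],
            [docs[i] for i in test], [labels[i] for i in test])

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors: raw term count times log(N / document frequency)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def mutual_information(docs, labels, term, cls):
    """MI (in bits) between presence of `term` and membership in `cls`."""
    n = len(docs)
    counts = Counter((term in d, y == cls) for d, y in zip(docs, labels))
    mi = 0.0
    for (t, c), n_tc in counts.items():  # empty cells contribute zero
        n_t = sum(v for (tt, _), v in counts.items() if tt == t)
        n_c = sum(v for (_, cc), v in counts.items() if cc == c)
        mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

def confusion_matrix(true, pred, classes):
    """Rows are true classes, columns are predicted classes."""
    pos = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true, pred):
        matrix[pos[t]][pos[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of documents on the diagonal of the confusion matrix."""
    total = sum(map(sum, matrix))
    return sum(matrix[i][i] for i in range(len(matrix))) / total if total else 0.0
```

Running `random_split` with `train_fraction` set to 0.5, 0.7, and 0.8 gives the three required ratios; ranking terms by `mutual_information` (or by TF-IDF weight) and keeping the top-scoring ones is one way to perform the feature selection step.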