- Download and unzip the Project4_sentences.zip and Project4_code.zip files.
A set of sentences is given in the file sentences.txt; each sentence is on its own line. Create the feature vector by writing a program that applies the following text-mining techniques to this set of sentences (a minimal sketch follows the notes below).
- A) Tokenize sentences.
- B) Remove punctuation and special characters.
- C) Remove numbers.
- D) Convert upper-case to lower-case.
- E) Remove stop words. A set of stop words is provided in the accompanying .txt file.
- F) Perform stemming. Use the Porter stemming code provided in Project4_code.zip.
- G) Combine stemmed words.
- H) Extract the most frequent words.
Provide the feature vector in your report.
Note:
The feature vector is the set of unique words that appear in the provided sentences.
The file Project4_code.zip contains implementations of the Porter Stemmer in several languages (Java, Matlab, Python, and C); you can use any of them. Make sure you rename your file accordingly. More source code for the Porter Stemmer can be found at http://tartarus.org/martin/PorterStemmer/
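A minimal Python sketch of steps A) to H), assuming the provided stop-word list is saved as stop_words.txt and the most-frequent-word cutoff is 50 (both are assumptions, not values given in the assignment). The stemming function is a stand-in to be replaced with a call into the Porter code from Project4_code.zip:

```python
import re
from collections import Counter

SENTENCES_FILE = "sentences.txt"
STOP_WORDS_FILE = "stop_words.txt"   # assumed name of the provided stop-word file

def stem(word):
    """Stand-in for the provided Porter stemmer.
    Replace the body with a call into the Porter code shipped in Project4_code.zip."""
    return word

def preprocess(line, stop_words):
    line = re.sub(r"[^A-Za-z\s]", " ", line)              # B) punctuation/specials, C) numbers
    tokens = line.lower().split()                          # A) tokenize, D) lower-case
    tokens = [t for t in tokens if t not in stop_words]    # E) remove stop words
    return [stem(t) for t in tokens]                       # F) stem

with open(STOP_WORDS_FILE) as f:
    stop_words = {w.strip().lower() for w in f if w.strip()}

with open(SENTENCES_FILE) as f:
    processed = [preprocess(line, stop_words) for line in f if line.strip()]

# G) combine stemmed words across all sentences, H) keep the most frequent ones
counts = Counter(w for sent in processed for w in sent)
feature_vector = [w for w, _ in counts.most_common(50)]    # 50 is an arbitrary cutoff
print(feature_vector)
```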
- Using the feature vector generated in the first task, write a program that generates the Term Document Matrix (TDM) for ALL the sentences in sentences.txt, similar to the example TDM below.
| Keyword set | anonymous | identify | car | ... |
| --- | --- | --- | --- | --- |
| Sentence 1 | 1 | 4 | 3 | ... |
| Sentence 2 | 2 | 0 | 1 | ... |
| ... | | | | |
| Sentence 20 | 2 | 0 | 0 | ... |
- Provide the TDM in your report.
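Given `processed` and `feature_vector` from the preprocessing sketch above, the TDM can be built with a simple per-sentence count; this is a sketch of one possible layout, not the required report format:

```python
# Term Document Matrix: one row per sentence, one column per feature-vector word.
# Assumes `processed` and `feature_vector` from the preprocessing sketch above.
tdm = [[sent.count(word) for word in feature_vector] for sent in processed]

# Print it in a readable form for the report.
print("Keyword set\t" + "\t".join(feature_vector))
for i, row in enumerate(tdm, start=1):
    print(f"Sentence {i}\t" + "\t".join(str(c) for c in row))
```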
- For each of the text-mining steps (A to H), explain why it is used and what sort of information is lost when it is applied.
- Write a program implementing the clustering algorithm of your choice (WTA or FCAN). Apply that algorithm to the TDM to group similar sentences together (a sketch follows the questions below).
- How many clusters/topics have you identified?
- What drives the dimensionality of the TDM? What can you do to reduce that dimensionality? Does the order in which the data are fed to the algorithm matter?
- Show and comment on the results.
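A minimal winner-take-all (WTA) sketch over the TDM rows, assuming `tdm` from the previous sketch; the cluster count, learning rate, and number of epochs are arbitrary assumptions, and the shuffled presentation order is one place where the order-of-data question above shows up:

```python
import random

def wta_cluster(rows, n_clusters=5, lr=0.3, epochs=20, seed=0):
    """Winner-take-all clustering: only the closest centroid moves toward each sample."""
    random.seed(seed)
    dim = len(rows[0])

    def dist2(a, b):
        return sum((a[d] - b[d]) ** 2 for d in range(dim))

    # Initialise centroids from randomly chosen TDM rows.
    centroids = [list(rows[i]) for i in random.sample(range(len(rows)), n_clusters)]
    for _ in range(epochs):
        order = list(range(len(rows)))
        random.shuffle(order)  # presentation order can change the final clusters
        for i in order:
            win = min(range(n_clusters), key=lambda k: dist2(rows[i], centroids[k]))
            for d in range(dim):  # move only the winning centroid toward the sample
                centroids[win][d] += lr * (rows[i][d] - centroids[win][d])
    labels = [min(range(n_clusters), key=lambda k: dist2(row, centroids[k])) for row in rows]
    return labels, centroids

labels, _ = wta_cluster(tdm)  # `tdm` comes from the TDM sketch above
for i, c in enumerate(labels, start=1):
    print(f"Sentence {i} -> cluster {c}")
```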