[Solved] CMSC409 Project 4-Feature vector


  1. Download and unzip the “Project4_sentences.zip” and “Project4_code.zip” files.

A set of sentences is given in the file “sentences.txt”. Each sentence is a line in the file. Create the feature vector by writing a program that applies the following text mining techniques to this set of sentences.

  A. Tokenize sentences.
  B. Remove punctuation and special characters.
  C. Remove numbers.
  D. Convert upper-case to lower-case.
  E. Remove stop words. A set of stop words is provided in the file “txt
  F. Perform stemming. Use the Porter stemming code provided in the file “txt
  G. Combine stemmed words.
  H. Extract most frequent words.
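
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the expected solution: the function names `preprocess` and `feature_vector`, the `top_k` cutoff, and the whitespace tokenizer are all hypothetical choices. The `stem` argument is a placeholder for whichever Porter stemmer version you use from “Project4_code.zip”; its actual interface is not shown here, so the sketch defaults to an identity function.

```python
import re
from collections import Counter


def preprocess(sentence, stop_words, stem=lambda w: w):
    """Return the cleaned, stemmed tokens of one sentence.

    `stem` is a hypothetical hook: pass in the stemming function from the
    Porter stemmer you chose. The default leaves words unchanged.
    """
    tokens = sentence.split()                                  # tokenize on whitespace
    tokens = [re.sub(r"[^A-Za-z0-9]", "", t) for t in tokens]  # remove punctuation and special characters
    tokens = [re.sub(r"[0-9]", "", t) for t in tokens]         # remove numbers
    tokens = [t.lower() for t in tokens if t]                  # lower-case, skipping emptied tokens
    tokens = [t for t in tokens if t not in stop_words]        # remove stop words
    return [stem(t) for t in tokens]                           # stem each remaining word


def feature_vector(sentences, stop_words, stem=lambda w: w, top_k=10):
    """Count stemmed words over all sentences and keep the top_k most frequent."""
    counts = Counter()
    for sentence in sentences:
        # Words that share a stem are combined because they land on the same key.
        counts.update(preprocess(sentence, stop_words, stem))
    return [word for word, _ in counts.most_common(top_k)]
```

A typical call would read the lines of “sentences.txt” into a list, load the stop words into a set, and pass both to `feature_vector`. The value of `top_k` controls how long the feature vector is, so it is worth reporting which cutoff you chose.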

Provide the feature vector in your report.

Note:

The feature vector contains the set of unique words that appear in the provided sentences.

The file “Project4_code.zip” contains implementations of the Porter Stemmer in several languages. You can use any of the provided versions (Java, Matlab, Python, and C). Make sure you rename your file accordingly. More source code for the Porter Stemmer can be found here: http://tartarus.org/martin/PorterStemmer/


CMSC 409: Artificial Intelligence

Project 4

  1. Using the feature vector generated in the first task, write a program that generates the Term Document Matrix (TDM) for ALL the sentences in “txt”, similar to the example TDM below.

Example TDM

Keyword set    anonymous   identify   car
Sentence 1     1           4          3
Sentence 2     2           0          1
…..
Sentence 20    2           0          0
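
As a rough sketch of how such a matrix could be built, assume each sentence has already been reduced to a list of cleaned, stemmed tokens and that the feature vector is a list of keywords. The function name `term_document_matrix` is hypothetical. Each row corresponds to one sentence and each column to one keyword, with the cell holding how often that keyword occurs in that sentence.

```python
def term_document_matrix(token_lists, keywords):
    """Build one row per sentence and one column per keyword.

    token_lists: list of token lists, one per sentence (already preprocessed).
    keywords:    the feature vector, in the column order you want to report.
    """
    return [[tokens.count(keyword) for keyword in keywords]
            for tokens in token_lists]


# Illustrative input only; these tokens are made up, not taken from the data set.
example_rows = term_document_matrix(
    [["car", "identify", "car"], ["anonymous"]],
    ["anonymous", "identify", "car"],
)
```

Note that the same preprocessing must be applied to the sentences and to the keywords, otherwise a stemmed keyword will never match an unstemmed token.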
  1. Provide the TDM in your report.
  2. For each of the text mining steps (A to H), explain why it is used and what sort of information is lost when it is applied.
  3. Write a program implementing the clustering algorithm of your choice (WTA or FCAN). Apply that algorithm to the TDM to group similar sentences together.
    1. How many clusters/topics have you identified?
    2. What drives the dimensionality of the TDM? What can you do to reduce that dimensionality? Does the order in which data is fed to the algorithm matter?
    3. Show and comment on the results.
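
For orientation, here is a minimal sketch of one common form of winner-take-all (WTA) competitive learning applied to TDM rows. It is an assumption-laden illustration, not necessarily the variant taught in the course: the function names, the learning rate `lr`, the number of epochs, the dot-product similarity, and seeding the weights from the first `n_clusters` rows are all choices made here for simplicity.

```python
import math


def normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def wta_cluster(rows, n_clusters, lr=0.1, epochs=50):
    """Assign each row to a cluster index using winner-take-all updates.

    Assumes n_clusters <= len(rows), because the weight vectors are seeded
    from the first n_clusters rows (an assumption of this sketch).
    """
    data = [normalize(row) for row in rows]
    weights = [list(data[i]) for i in range(n_clusters)]

    def winner(x):
        # The winning neuron is the one whose weight vector has the largest dot product with x.
        return max(range(n_clusters),
                   key=lambda j: sum(w * a for w, a in zip(weights[j], x)))

    for _ in range(epochs):
        for x in data:
            j = winner(x)
            # Only the winner moves toward the input; it is then re-normalized.
            weights[j] = normalize([w + lr * (a - w) for w, a in zip(weights[j], x)])

    return [winner(x) for x in data]


labels = wta_cluster([[3, 0, 0], [0, 2, 0], [4, 1, 0], [0, 5, 1]], n_clusters=2)
```

In this sketch the weights are updated one sample at a time and are seeded from the first rows, so its output can depend on the order of the input; running it on shuffled copies of the TDM is one way to observe that.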
