[Solved] CSE5334 Assignment 1-TF-IDF and Cosine Similarity


The instructions for this assignment are written in an .ipynb file. You can install the Jupyter notebook viewer with the following command (pip is the command for installing Python packages). You are required to use Python 3.6.x (any 3.6 release, that is, 3.6.0 or later) in this project.

pip install jupyter

To run the Jupyter notebook viewer, use the following command:

jupyter notebook P1.ipynb

The above command starts a web service at http://localhost:8888/ and displays the instructions in the .ipynb file.

Requirements

This assignment must be done individually. You must implement the whole assignment by yourself.

Academic dishonesty will have serious consequences.

You can discuss topics related to the assignment with your fellow students, but you are not allowed to discuss or share your solution or code.

Assignment Files

All the files for this assignment can be downloaded from Blackboard (Course Materials > Programming Assignments > Programming Assignment 1 (P1) > Attached Files).

  1. This instruction file itself: P1.ipynb
  2. Data file: debate.txt

We use the transcript of the latest Texas Senate race debate between Senator Ted Cruz and Representative Beto O'Rourke, which took place on October 16, 2018. We pre-processed the transcript and provide you with a text file, debate.txt. In the file, each paragraph is a segment of the debate from one of the candidates or moderators.

  3. Sample results: sampleresults.txt
  4. Grading rubrics: rubrics.txt

Programming Language

  1. We will test your code under Python 3.6.x, so make sure you develop your code using the same version.
  2. You are free to use anything from the Python Standard Library that comes with Python 3.6.x (https://docs.python.org/3.6/library/).
  3. You are expected to use several modules in NLTK, a natural language processing toolkit for Python. NLTK doesn't come with Python by default. You need to install it and import it in your .py file. NLTK's website (http://www.nltk.org/index.html) provides a lot of useful information, including a book (http://www.nltk.org/book/) as well as installation instructions (http://www.nltk.org/install.html).
  4. You are NOT allowed to use any non-standard Python package, except NLTK.

Task: TF-IDF and Cosine Similarity

1. Description of Task

Your code should accomplish the following tasks:

  • Read the text file debate.txt. This is the transcript of the latest Texas Senate race debate between Ted Cruz and Beto O'Rourke. The following code does it.
  • Tokenize the content of the file. For this, you need a tokenizer. For example, the following piece of code uses a regular expression tokenizer to return all course numbers in a string. Play with it and edit it. You can change the regular expression and the string to observe different output results.

For tokenizing the Texas Senate debate transcript, let's all use RegexpTokenizer(r'[a-zA-Z]+'). What tokens will it produce? What limitations does it have?

In [ ]: from nltk.tokenize import RegexpTokenizer
        # Match 2-3 capital letters followed by a four-digit course number
        tokenizer = RegexpTokenizer(r'[A-Z]{2,3}[1-9][0-9]{3,3}')
        tokens = tokenizer.tokenize("CSE4334 and CSE5334 are taught together. IE3013 is an undergraduate course.")
        print(tokens)

['CSE4334', 'CSE5334', 'IE3013']
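The reading and tokenizing steps for the debate transcript can be sketched as follows. This is a minimal sketch under stated assumptions, not the assignment's reference code: the helper names split_paragraphs and tokenize are hypothetical, the transcript is assumed to hold one paragraph per non-empty line, and lowercasing is an added assumption. It uses re.findall from the standard library; RegexpTokenizer(r'[a-zA-Z]+').tokenize should yield the same tokens for this pattern. The sample sentence is made up to expose the pattern's limitations.

```python
import re

# Same pattern the assignment asks for: runs of ASCII letters only.
TOKEN_PATTERN = re.compile(r'[a-zA-Z]+')


def split_paragraphs(text):
    """Split raw file content into paragraphs.

    Assumption: each paragraph sits on its own line, possibly with
    blank lines in between; empty lines are dropped.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def tokenize(paragraph):
    """Return lowercase alphabetic tokens (lowercasing is an assumption)."""
    return TOKEN_PATTERN.findall(paragraph.lower())


# Made-up sentence: the apostrophe splits "Don't" in two, and "10%" is dropped.
print(tokenize("Don't cut taxes by 10%."))
# ['don', 't', 'cut', 'taxes', 'by']
```

To apply it to the data file, something like `paragraphs = split_paragraphs(open('debate.txt', encoding='utf-8').read())` followed by `[tokenize(p) for p in paragraphs]` would give one token list per paragraph; the encoding is also an assumption.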

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

After the stopword list is downloaded, you will find a file named english in the folder nltk_data/corpora/stopwords, where nltk_data is the download directory from the step above. The file contains 179 stopwords, and nltk.corpus.stopwords gives you this list. Try the following piece of code.

In [ ]: from nltk.corpus import stopwords
        print(stopwords.words('english'))
        print(sorted(stopwords.words('english')))
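The stopword lookup above, together with the Porter stemming step described in the next bullet, can be combined into a small preprocessing helper. The sketch below is illustrative only: the function names are hypothetical, and removing stopwords before stemming is an assumed order. The NLTK imports are placed inside nltk_preprocess so the generic helpers can be tried out even where NLTK or its stopword data is not yet installed.

```python
def remove_stopwords(tokens, stop_words):
    """Drop every token that appears in stop_words."""
    stop_set = set(stop_words)  # set membership is O(1) per lookup
    return [token for token in tokens if token not in stop_set]


def stem_tokens(tokens, stem):
    """Apply the callable `stem` to each token, preserving order."""
    return [stem(token) for token in tokens]


def nltk_preprocess(tokens):
    """Stopword removal followed by Porter stemming, using NLTK.

    Requires NLTK and its downloaded 'stopwords' corpus.
    """
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    kept = remove_stopwords(tokens, stopwords.words('english'))
    return stem_tokens(kept, stemmer.stem)


# Toy stopword list, used here only to show the filtering behaviour.
print(remove_stopwords(['the', 'senate', 'and', 'debate'], ['the', 'and']))
# ['senate', 'debate']
```

Passing the stemmer in as a callable keeps the helpers independent of NLTK; in the assignment itself, nltk_preprocess would be the function applied to each paragraph's tokens.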

  • Also perform stemming on the obtained tokens. NLTK comes with a Porter stemmer. Try the following code and learn how to use the stemmer.
  • Using the tokens, compute the TF-IDF vector for each paragraph. In this assignment, for calculating inverse document frequency, treat debate.txt as the whole corpus and the paragraphs as documents. That is also why we ask you to compute the TF-IDF vectors separately for all the paragraphs, one vector per paragraph.

Use the following equation that we learned in the lectures to calculate the term weights, in which t is a token and d is a document (i.e., paragraph):

$$w_{t,d} = \left(1 + \log_{10} \mathrm{tf}_{t,d}\right) \times \log_{10} \frac{N}{\mathrm{df}_t}.$$

Note that the TF-IDF vectors should be normalized (i.e., their lengths should be 1).
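As a worked instance of the weighting and normalization, consider hypothetical counts (not taken from debate.txt): a corpus of N = 100 paragraphs and a paragraph d with only two distinct tokens, a (occurring 3 times in d and present in 10 paragraphs) and b (occurring once in d and present in 1 paragraph). Here N is read as the number of paragraphs and df_t as the number of paragraphs containing t, following the corpus definition above.

```latex
\begin{align*}
w_{a,d} &= (1 + \log_{10} 3)\,\log_{10}\tfrac{100}{10} \approx 1.4771 \times 1 = 1.4771 \\
w_{b,d} &= (1 + \log_{10} 1)\,\log_{10}\tfrac{100}{1} = 1 \times 2 = 2 \\
\lVert w_d \rVert &= \sqrt{1.4771^2 + 2^2} \approx 2.4863 \\
\hat{w}_{a,d} &\approx \frac{1.4771}{2.4863} \approx 0.5941, \qquad
\hat{w}_{b,d} \approx \frac{2}{2.4863} \approx 0.8044
\end{align*}
```

The normalized vector has length 1, since 0.5941^2 + 0.8044^2 is approximately 1.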

Represent a TF-IDF vector by a dictionary. The following is a sample TF-IDF vector.

In [ ]: {'sanction': 0.014972337775895645, 'lack': 0.008576372825970286, 'regret': 0.009491784747267843, 'winter': 0.030424375278541155}
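One way to build such dictionaries from the preprocessed paragraphs is sketched below, using only the standard library. The function name tfidf_vectors and its return shape are hypothetical; the weights follow the formula above, with N taken as the number of paragraphs and df_t as the number of paragraphs containing t. A paragraph whose weights are all zero is left unnormalized, which is a choice made for this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute one normalized TF-IDF dictionary per document.

    docs: list of token lists, one per paragraph.
    Returns (vectors, df) where df maps token -> document frequency.
    """
    n_docs = len(docs)

    # Document frequency: count each token once per paragraph.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        weights = {
            t: (1 + math.log10(count)) * math.log10(n_docs / df[t])
            for t, count in tf.items()
        }
        # Scale to unit Euclidean length; skip all-zero vectors.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors, df
```

Note that a token appearing in every paragraph gets weight 0, because log10(N / df_t) is then log10(1).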

  • Given a query string, calculate the query vector. Compute the cosine similarity between the query vector and each paragraph vector in the transcript. Return the paragraph that attains the highest cosine similarity score. The query vector must also be normalized.
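The retrieval step can be sketched as below. The text does not spell out how the query terms are weighted, so this sketch assumes log-scaled term frequency only (no IDF factor) for the query; that choice, like the function names, is an assumption rather than the assignment's stated method. Because both vectors have unit length, the cosine similarity reduces to a dot product over shared tokens.

```python
import math
from collections import Counter


def query_vector(tokens):
    """Normalized query vector; weights are 1 + log10(tf) (assumed scheme)."""
    tf = Counter(tokens)
    weights = {t: 1 + math.log10(count) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller dictionary
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def best_paragraph(query_tokens, paragraph_vectors):
    """Return (index, score) of the most similar paragraph vector.

    Returns (None, 0.0) when there are no paragraph vectors.
    """
    if not paragraph_vectors:
        return None, 0.0
    q = query_vector(query_tokens)
    scores = [cosine(q, vec) for vec in paragraph_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

The query tokens should go through the same tokenizing, stopword removal, and stemming as the paragraphs, otherwise the dictionary keys will not line up.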
