This assignment uses TextBlob and other open-source libraries to perform NLP-based analysis on documents in Python. All parts should use the same three documents (as outlined in Part 1 below). In addition to your .ipynb and/or .py files, you must submit a report document (in .doc or .pdf format) that answers the questions posed below.
Part 1:
Select and download three texts of your choosing that represent different media or writing formats (for example, you could choose i. a novel, movie script, and play script or ii. a short story, poem, and novel, etc.). Briefly describe your documents and explain the differences between them in a paragraph.
Part 2:
(a) Compute word counts for each of your documents after excluding English stop words (and optionally, performing lemmatization).
(b) Create and display a bar plot for each document that includes word counts for the 25 most frequent words (after the above processing).
(c) Create and display a word cloud for each document (using a mask image of your choice) that includes only the 100 most frequent words. Note that you’ll likely want to use the approach outlined in Session 25 that utilizes the fit_words method, since you will want data consistent with those for part (b).
(d) Do you see any notable differences between the documents with respect to (b) and/or (c) above? Try to explain why or why not, and whether you would expect such differences.
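One possible shape for steps (a) through (c) is sketched below. It is a minimal illustration, not the approach from the course sessions: the function names are hypothetical, the tokenizer is a plain regular expression rather than TextBlob, and STOP_WORDS is a tiny stand-in for a full English stop-word list (for example, the one shipped with NLTK). The plotting and word-cloud helpers assume matplotlib and wordcloud are installed and import them only when called.

```python
import re
from collections import Counter

# Stand-in stop-word list (an assumption for this sketch). A real run would
# use a fuller list, e.g. nltk.corpus.stopwords.words("english").
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "it", "that",
    "was", "i", "on", "for", "with", "as", "at", "by", "he", "she",
}


def word_counts(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens in `text`, skipping stop words."""
    # Words are runs of letters, optionally followed by one apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


def top_n(counts, n):
    """Return the n most frequent words as a {word: count} dict."""
    return dict(counts.most_common(n))


def plot_top_words(counts, n=25, title=""):
    """Bar plot of the n most frequent words (requires matplotlib)."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    words, freqs = zip(*counts.most_common(n))
    plt.bar(words, freqs)
    plt.xticks(rotation=90)
    plt.title(title)
    plt.tight_layout()
    plt.show()


def make_word_cloud(counts, n=100, mask=None):
    """Build a word cloud from the n most frequent words (requires wordcloud).

    `mask` is assumed to be a NumPy array loaded from the mask image.
    """
    from wordcloud import WordCloud  # lazy import: only needed for the cloud

    cloud = WordCloud(mask=mask, background_color="white")
    # fit_words takes a {word: frequency} mapping, so the cloud is built
    # from the same counts that feed the bar plot.
    return cloud.fit_words(top_n(counts, n))
```

Because both helpers read from the same Counter, the bar plot and the word cloud stay consistent with each other; only the cutoff (25 versus 100 words) differs.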
Part 3:
(a) Use Textatistic to compute the average of the Flesch–Kincaid, Gunning Fog, SMOG, and Dale–Chall scores for each document.
(b) Are there noticeable differences among your documents’ readability scores? Would you expect any such difference to be present (or think one should be)?
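The averaging in step (a) might look like the sketch below. The helper names are hypothetical, and the dictionary keys are assumed to match those returned by Textatistic(text).dict(); check them against the installed version before relying on them. Textatistic is imported only inside the function that needs it.

```python
from statistics import mean

# Keys assumed to match the dictionary returned by Textatistic(text).dict().
SCORE_KEYS = (
    "fleschkincaid_score",
    "gunningfog_score",
    "smog_score",
    "dalechall_score",
)


def average_readability(scores):
    """Average the four grade-level scores found in a scores dictionary."""
    return mean(scores[key] for key in SCORE_KEYS)


def readability_for(text):
    """Compute the averaged score for raw text (requires textatistic)."""
    from textatistic import Textatistic  # lazy import

    return average_readability(Textatistic(text).dict())
```

Other entries in the dictionary, such as the Flesch reading-ease score, are ignored here because they are not on the same grade-level scale as the four scores being averaged.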
Part 4:
(a) Use spaCy to compute the pairwise similarity between your documents (i.e. doc. 1 to doc. 2, doc. 1 to doc. 3, doc. 2 to doc. 3).
(b) Do any of these similarity scores seem higher or lower than you would expect? Explain your response.
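A sketch of the pairwise comparison in step (a) follows. The function names are hypothetical, and the model name en_core_web_md is an assumption (a model that includes word vectors generally gives more meaningful similarity scores than a small model without them). The pairing logic is kept separate from spaCy so it can be checked with any similarity function.

```python
from itertools import combinations


def pairwise_similarity(docs, similarity):
    """Score every unordered pair of documents.

    docs: dict mapping a document name to any object
    similarity: callable taking two such objects and returning a float
    """
    return {
        (a, b): similarity(docs[a], docs[b])
        for a, b in combinations(docs, 2)
    }


def spacy_pairwise(texts, model="en_core_web_md"):
    """Pairwise spaCy similarity for a {name: text} dict (requires spaCy)."""
    import spacy  # lazy import

    nlp = spacy.load(model)
    docs = {name: nlp(text) for name, text in texts.items()}
    return pairwise_similarity(docs, lambda a, b: a.similarity(b))
```

For three documents this yields exactly the three pairs listed in (a). Note that spaCy raises an error for texts longer than nlp.max_length (1,000,000 characters by default), so a full novel may need that limit raised or the text processed in pieces.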
Part 5:
(a) Use spaCy to find the named entities in your documents.
(b) Produce a bar plot for each document that includes the count for the 20 most common named entities (by name).
(c) Produce a second bar plot per document based on the counts of every named entity type (PERSON, ORG, etc.).
(d) Do you notice any meaningful differences (or similarities) among the documents with respect to these plots? If so, explain what they are.
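The two tallies behind plots (b) and (c) could be produced along the lines below. This is a sketch with hypothetical function names; the model name en_core_web_sm is an assumption, and the counting helper works on plain (text, label) pairs so it does not depend on spaCy itself.

```python
from collections import Counter


def entity_counts(entities, top=20):
    """Tally named entities by name and by type.

    entities: iterable of (entity_text, entity_label) pairs
    Returns (list of the `top` most common names with counts,
             Counter of every entity label).
    """
    entities = list(entities)  # allow generators to be traversed twice
    by_name = Counter(text for text, _ in entities).most_common(top)
    by_type = Counter(label for _, label in entities)
    return by_name, by_type


def spacy_entities(text, model="en_core_web_sm"):
    """Extract (text, label) pairs with spaCy (requires spaCy and the model)."""
    import spacy  # lazy import

    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```

The by_name list feeds the first bar plot directly, and by_type.most_common() gives the label counts for the second in descending order.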