[SOLVED] CS python


The Tasks (1 mark each)
1. Find the top stems
Implement a function get_top_stems that returns a list with the n most frequent stems that are not in the list of NLTK stop words. To determine whether a word is a stop word, remember to lowercase the word. The list must be sorted by frequency in descending order and the words must preserve the original casing. The input arguments of the function are:
document: The name of the Gutenberg document, e.g. austen-emma.txt.
n: The number of stems to return.
To produce the correct results, the function must do this (a sketch follows the list):
Use the NLTK libraries to find the tokens and the stems.
Use NLTK's sentence tokeniser before NLTK's word tokeniser.
Use NLTK's list of stop words, and compare your words with those of the list after lowercasing.
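The following is a minimal sketch of one possible approach, not necessarily the paid solution: the choice of the Porter stemmer is an assumption (the task does not name a stemmer), and it presumes the relevant NLTK data (punkt, gutenberg, stopwords) has been downloaded.

import collections

from nltk.corpus import gutenberg, stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

def get_top_stems(document, n):
    stop = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    counts = collections.Counter()
    # Sentence tokenisation first, then word tokenisation, as the task requires.
    for sentence in sent_tokenize(gutenberg.raw(document)):
        for word in word_tokenize(sentence):
            if word.lower() in stop:  # lowercase only for the stop-word test
                continue
            counts[stemmer.stem(word)] += 1
    # most_common returns (stem, count) pairs sorted by frequency, descending.
    return [stem for stem, _ in counts.most_common(n)]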
2. Find the top PoS bigrams
Implement a function get_top_pos_bigrams that returns a list with the n most frequent bigrams of parts of speech. Do not remove stop words. The list of bigrams must be sorted by frequency in descending order. The input arguments are:
document: The name of the Gutenberg document, e.g. austen-emma.txt.
n: The number of bigrams to return.
To produce the correct results, the function must do this (a sketch follows the list):
Use NLTK's pos_tag_sents instead of pos_tag.
Use NLTK's universal PoS tagset.
When computing bigrams, do not consider parts of speech of words that are in different sentences. For example, if we have the text "Sentence 1. And sentence 2", the bigrams are: (NOUN, NUM), (NUM, .), (CONJ, NOUN), (NOUN, NUM). Note that (., CONJ) would not be a valid bigram, since the punctuation mark and the word "And" are in different sentences.
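A minimal sketch under the same assumptions about the NLTK data being available:

import collections

from nltk import pos_tag_sents
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize, word_tokenize

def get_top_pos_bigrams(document, n):
    sentences = [word_tokenize(s) for s in sent_tokenize(gutenberg.raw(document))]
    counts = collections.Counter()
    # Tag whole sentences at once and count tag bigrams within each sentence,
    # so that tags never pair up across a sentence boundary.
    for tagged in pos_tag_sents(sentences, tagset='universal'):
        tags = [tag for _, tag in tagged]
        counts.update(zip(tags, tags[1:]))
    return [bigram for bigram, _ in counts.most_common(n)]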
3. Find the distribution of frequencies of parts of speech after a given word
Implement a function get_pos_after that returns the distribution of the parts of speech of the words that follow a word given as an input to the function. The result must be returned in descending order of frequency. The input arguments of the function are:
document: The name of the Gutenberg document, e.g. austen-emma.txt.
word: The word.
To produce the correct results, the function must do this (a sketch follows the list):
First do sentence tokenisation, then word tokenisation.
Do not consider words that occur in different sentences. Thus, if a word ends a sentence, there are no words following it.
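A minimal sketch; matching the word case-sensitively is an assumption, since the task does not say whether "The" and "the" should be treated alike:

import collections

from nltk import pos_tag_sents
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize, word_tokenize

def get_pos_after(document, word):
    sentences = [word_tokenize(s) for s in sent_tokenize(gutenberg.raw(document))]
    counts = collections.Counter()
    for tagged in pos_tag_sents(sentences, tagset='universal'):
        # Pair each token with its successor; the sentence-final token has no
        # successor, so nothing is counted across sentence boundaries.
        for (token, _), (_, next_tag) in zip(tagged, tagged[1:]):
            if token == word:
                counts[next_tag] += 1
    return counts.most_common()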
4. Get the words with highest tf.idf
In this exercise you will implement a simple approach to find keywords in a document.
Implement a function get_top_word_tfidf that returns the list of n words with the highest tf.idf. The result must be returned in descending order of tf.idf. The input arguments are:
document: The name of the Gutenberg document, e.g. austen-emma.txt.
n: The number of words to return.
To produce the correct results, the function must do this (a sketch follows the list):
Use Scikit-learn's TfidfVectorizer.
Fit the tf.idf vectorizer using the documents of the NLTK Gutenberg corpus.
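A minimal sketch, assuming the default TfidfVectorizer settings (which lowercase the text) and scikit-learn 1.0 or later for get_feature_names_out:

import numpy as np
from nltk.corpus import gutenberg
from sklearn.feature_extraction.text import TfidfVectorizer

def get_top_word_tfidf(document, n):
    fileids = gutenberg.fileids()
    vectorizer = TfidfVectorizer()
    # One tf.idf row per Gutenberg document.
    tfidf = vectorizer.fit_transform([gutenberg.raw(f) for f in fileids])
    row = tfidf[fileids.index(document)].toarray()[0]
    words = vectorizer.get_feature_names_out()
    # Indices of the n largest tf.idf values, largest first.
    top = np.argsort(row)[::-1][:n]
    return [words[i] for i in top]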
5. Get the sentences with highest average of tf.idf
In this exercise you will implement a simple document summariser that returns the most important sentences based on the average tf.idf.
Implement a function get_top_sentence_tfidf that returns the positions of the sentences which have the largest average tf.idf. The list of sentence positions must be returned in the order of occurrence in the document. The input arguments are:
document: The name of the Gutenberg document, e.g. austen-emma.txt.
n: The number of sentence positions to return.
The reason for returning the sentence positions in the order of occurrence, and not in order of average tf.idf, is that this is what document summarisers normally do.
To produce the correct results, the function must do this (a sketch follows the list):
Use Scikit-learn's TfidfVectorizer.
Fit the tf.idf vectorizer using the sentences of the documents of the NLTK Gutenberg corpus. This is different from task 4. Now you want to compute the tf.idf of sentences, not of documents.
Use NLTK's sentence tokeniser to find the sentences.
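A minimal sketch; averaging over the non-zero entries of each sentence row is one reading of "average tf.idf" (averaging over the whole vocabulary would be another), so treat that choice as an assumption:

import numpy as np
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def get_top_sentence_tfidf(document, n):
    # Fit on the sentences of every Gutenberg document, as the task requires.
    all_sentences = [s for f in gutenberg.fileids()
                     for s in sent_tokenize(gutenberg.raw(f))]
    vectorizer = TfidfVectorizer().fit(all_sentences)
    doc_sentences = sent_tokenize(gutenberg.raw(document))
    tfidf = vectorizer.transform(doc_sentences)
    averages = []
    for i in range(tfidf.shape[0]):
        row = tfidf.getrow(i)
        # Mean tf.idf over the words that actually occur in the sentence.
        averages.append(row.sum() / row.getnnz() if row.getnnz() else 0.0)
    top = np.argsort(averages)[::-1][:n]
    # Report positions in order of occurrence, not by score.
    return sorted(int(i) for i in top)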

Anaconda Navigator, JupyterLab, Python 3 notebook

import collections

import nltk
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('gutenberg')
# Also needed for the tasks below: stop words and the universal PoS tagger data.
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

# Task 1 (1 mark)

def get_top_stems(document, n):
    """Return a list of the n most frequent stems of a Gutenberg document, sorted by
    frequency in descending order. Don't forget to remove stop words before counting
    the stems.
    >>> get_top_stems('austen-emma.txt', 10)
    [',', '.', ..., ';', ..., 'mr.', '!', ..., 'emma']
    >>> get_top_stems('austen-sense.txt', 7)
    [',', '.', ..., ';', ..., 'elinor']
    """
    return []

# Task 2 (1 mark)

def get_top_pos_bigrams(document, n):
    """Return the n most frequent bigrams of parts of speech. Return the list sorted
    in descending order of frequency. The parts of speech of words in different
    sentences cannot form a bigram. Use the universal PoS tag set.
    >>> get_top_pos_bigrams('austen-emma.txt', 3)
    [('NOUN', '.'), ('PRON', 'VERB'), ('DET', 'NOUN')]
    """
    return []

# Task 3 (1 mark)

def get_pos_after(document, word):
    """Return the distribution of frequencies of the parts of speech occurring after
    a word. Return the result sorted by frequency in descending order. Do not
    consider words that occur in different sentences. Use the universal PoS tag set.
    >>> get_pos_after('austen-emma.txt', 'the')
    [('NOUN', 3434), ('ADJ', 1148), ('ADV', 170), ('NUM', 61), ('VERB', 24), ('.', 7)]
    """
    return []

# Task 4 (1 mark)

def get_top_word_tfidf(document, n):
    """Return the list of n words with highest tf.idf. The reference for computing
    tf.idf is the NLTK Gutenberg corpus. The list of words must be sorted by tf.idf
    in descending order.
    >>> get_top_word_tfidf('austen-emma.txt', 3)
    ['emma', 'mr', 'harriet']
    """
    return []

# Task 5 (1 mark)

def get_top_sentence_tfidf(document, n):
    """Return the positions of the n sentences which have the largest average tf.idf.
    The list of sentence positions must be returned in the order of occurrence in the
    document. The reference for computing tf.idf is the list of sentences from the
    NLTK Gutenberg corpus.
    >>> get_top_sentence_tfidf('austen-emma.txt', 3)
    [5668, 5670, 6819]
    """
    return []

# DO NOT MODIFY THE CODE BELOW

if __name__ == '__main__':
    import doctest
    doctest.testmod(optionflags=doctest.ELLIPSIS)
