The Tasks (1 mark each)
1. Find the top stems
Implement a function get_top_stems that returns a list with the n most frequent stems which are not in the list of NLTK stop words. To determine whether a word is a stop word, remember to lowercase the word. The list must be sorted by frequency in descending order and the words must preserve the original casing. The input arguments of the function are:
document: The name of the Gutenberg document, e.g. austen-emma.txt.
n: The number of stems to return.
To produce the correct results, the function must do this:
Use the NLTK libraries to find the tokens and the stems.
Use NLTK's sentence tokeniser before NLTK's word tokeniser.
Use NLTK's list of stop words, and compare your words with those of the list after lowercasing. A sketch of one possible pipeline follows this list.
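By way of illustration, here is a minimal sketch of that pipeline. It assumes the Porter stemmer (the task does not name a stemmer, so that choice, and the helper name get_top_stems_sketch, are illustrative); the stop word list also requires nltk.download('stopwords').

import collections
from nltk.corpus import gutenberg, stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

def get_top_stems_sketch(document, n):
    stemmer = PorterStemmer()
    stop = set(stopwords.words('english'))
    # Sentence tokenisation first, then word tokenisation within each sentence.
    words = [w for sent in sent_tokenize(gutenberg.raw(document))
               for w in word_tokenize(sent)]
    # Compare with the stop list after lowercasing; stem only the survivors.
    stems = [stemmer.stem(w) for w in words if w.lower() not in stop]
    return [stem for stem, _ in collections.Counter(stems).most_common(n)]

Note that the stop word comparison is done on the lowercased token, while the token passed to the stemmer keeps its original casing, matching the requirements above.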
2. Find the top PoS bigrams
Implement a function get_top_pos_bigrams that returns a list with the n most frequent bigrams of parts of speech. Do not remove stop words. The list of bigrams must be sorted by frequency in descending order. The input arguments are:
document: The name of the Gutenberg document, e.g. austen-emma.txt.
n: The number of bigrams to return.
To produce the correct results, the function must do this:
Use NLTK's pos_tag_sents instead of pos_tag.
Use NLTK's universal PoS tagset.
When computing bigrams, do not consider parts of speech of words that are in different sentences. For example, if we have the text "Sentence 1. And sentence 2", the bigrams are: (NOUN, NUM), (NUM, .), (CONJ, NOUN), (NOUN, NUM). Note that (., CONJ) would not be a valid bigram, since the punctuation mark and the word "And" are in different sentences. A sketch of one possible approach follows.
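A minimal sketch of these requirements (the helper name is illustrative; the tagger and tag set require nltk.download('averaged_perceptron_tagger') and nltk.download('universal_tagset')):

import collections
import nltk
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize, word_tokenize

def get_top_pos_bigrams_sketch(document, n):
    sentences = [word_tokenize(s) for s in sent_tokenize(gutenberg.raw(document))]
    # Tag all sentences in a single call, using the universal tag set.
    tagged = nltk.pos_tag_sents(sentences, tagset='universal')
    counter = collections.Counter()
    for sentence in tagged:
        tags = [tag for _, tag in sentence]
        # Pairing each tag with its successor within the sentence guarantees
        # that no bigram crosses a sentence boundary.
        counter.update(zip(tags, tags[1:]))
    return [bigram for bigram, _ in counter.most_common(n)]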
3. Find the distribution of frequencies of parts of speech after a given word
Implement a function get_pos_after that returns the distribution of the parts of speech of the words that follow a word given as an input to the function. The result must be returned in descending order of frequency. The input arguments of the function are:
document: The name of the Gutenberg document, e.g. austen-emma.txt.
word: The word.
To produce the correct results, the function must do this:
First do sentence tokenisation, then word tokenisation.
Do not consider words that occur in different sentences. Thus, if a word ends a sentence, there are no words following it. A sketch of one possible approach follows.
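A minimal sketch, assuming an exact, case-sensitive match on the input word (the task does not say whether the match should ignore case, and the helper name is illustrative):

import collections
import nltk
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize, word_tokenize

def get_pos_after_sketch(document, word):
    sentences = [word_tokenize(s) for s in sent_tokenize(gutenberg.raw(document))]
    tagged = nltk.pos_tag_sents(sentences, tagset='universal')
    counter = collections.Counter()
    for sentence in tagged:
        # Pair each tagged word with its successor in the same sentence; a
        # sentence-final word has no successor and so contributes nothing.
        for (w, _), (_, next_tag) in zip(sentence, sentence[1:]):
            if w == word:
                counter[next_tag] += 1
    # most_common() returns all (tag, count) pairs in descending order of count.
    return counter.most_common()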
4. Get the words with highest tf.idf
In this exercise you will implement a simple approach to find keywords in a document.
Implement a function get_top_word_tfidf that returns the list of n words with highest tf.idf. The result must be returned in descending order of tf.idf. The input arguments are:
document: The name of the Gutenberg document, e.g. austen-emma.txt.
n: The number of words to return.
To produce the correct results, the function must do this:
Use Scikit-learn's TfidfVectorizer.
Fit the tf.idf vectorizer using the documents of the NLTK Gutenberg corpus. A sketch of this follows.
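A minimal sketch under those constraints (get_feature_names_out assumes scikit-learn 1.0 or later; the helper name is illustrative):

import numpy as np
from nltk.corpus import gutenberg
from sklearn.feature_extraction.text import TfidfVectorizer

def get_top_word_tfidf_sketch(document, n):
    fileids = gutenberg.fileids()
    vectorizer = TfidfVectorizer()
    # Fit on every Gutenberg document so the idf component is corpus-wide.
    tfidf = vectorizer.fit_transform(gutenberg.raw(f) for f in fileids)
    row = tfidf[fileids.index(document)].toarray()[0]
    words = vectorizer.get_feature_names_out()
    # argsort is ascending, so reverse it and keep the first n indices.
    return [str(words[i]) for i in np.argsort(row)[::-1][:n]]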
5. Get the sentences with highest average of tf.idf
In this exercise you will implement a simple document summariser that returns the most important sentences based on the average tf.idf.
Implement a function get_top_sentence_tfidf that returns the positions of the n sentences which have the largest average tf.idf. The list of sentence positions must be returned in the order of occurrence in the document. The input arguments are:
document: The name of the Gutenberg document, e.g. austen-emma.txt.
n: The number of sentence positions to return.
The reason for returning the sentence positions in the order of occurrence, and not in order of average tf.idf, is that this is what document summarisers normally do.
To produce the correct results, the function must do this:
Use Scikit-learn's TfidfVectorizer.
Fit the tf.idf vectorizer using the sentences of the documents of the NLTK Gutenberg corpus. This is different from task 4: now you want to compute the tf.idf of sentences, not of documents.
Use NLTK's sentence tokeniser to find the sentences. A sketch of one possible implementation follows.
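A minimal sketch, with one loudly flagged assumption: the task does not define "average tf.idf", so here it is taken to be the sum of a sentence's tf.idf weights divided by its number of non-zero terms.

import numpy as np
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def get_top_sentence_tfidf_sketch(document, n):
    vectorizer = TfidfVectorizer()
    # Fit on the sentences of every Gutenberg document, not on whole documents.
    vectorizer.fit([s for f in gutenberg.fileids()
                      for s in sent_tokenize(gutenberg.raw(f))])
    sentences = sent_tokenize(gutenberg.raw(document))
    tfidf = vectorizer.transform(sentences)
    # Assumed definition of the average: sum of weights / non-zero terms.
    scores = [tfidf[i].sum() / tfidf[i].getnnz() if tfidf[i].getnnz() else 0.0
              for i in range(len(sentences))]
    # Take the n best-scoring sentences, then report their positions in
    # document order rather than score order.
    return sorted(int(i) for i in np.argsort(scores)[::-1][:n])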
Environment: Anaconda Navigator, JupyterLab, Python 3 notebook.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import numpy as np
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('stopwords')                    # stop word list used in task 1
nltk.download('averaged_perceptron_tagger')   # PoS tagger used in tasks 2 and 3
nltk.download('universal_tagset')             # universal tag set used in tasks 2 and 3
from sklearn.feature_extraction.text import TfidfVectorizer
import collections
# Task 1 (1 mark)
import collections
def get_top_stems(document, n):
    """Return a list of the n most frequent stems of a Gutenberg document, sorted by
    frequency in descending order. Don't forget to remove stop words before counting
    the stems.
    >>> get_top_stems('austen-emma.txt', 10)
    [',', '.', ..., ';', ..., 'mr.', '!', "'s", 'emma']
    >>> get_top_stems('austen-sense.txt', 7)
    [',', '.', ..., ';', ..., 'elinor']
    """
    return []
# Task 2 (1 mark)
def get_top_pos_bigrams(document, n):
    """Return the n most frequent bigrams of parts of speech. Return the list sorted in descending order of frequency.
    The parts of speech of words in different sentences cannot form a bigram. Use the universal PoS tag set.
    >>> get_top_pos_bigrams('austen-emma.txt', 3)
    [('NOUN', '.'), ('PRON', 'VERB'), ('DET', 'NOUN')]
    """
    return []
# Task 3 (1 mark)
def get_pos_after(document, word):
    """Return the distribution of frequencies of the parts of speech occurring after a word. Return the result sorted by
    frequency in descending order. Do not consider words that occur in different sentences. Use the
    universal PoS tag set.
    >>> get_pos_after('austen-emma.txt', 'the')
    [('NOUN', 3434), ('ADJ', 1148), ('ADV', 170), ('NUM', 61), ('VERB', 24), ('.', 7)]
    """
    return []
# Task 4 (1 mark)
def get_top_word_tfidf(document, n):
    """Return the list of n words with highest tf.idf. The reference for computing
    tf.idf is the NLTK Gutenberg corpus. The list of words must be sorted by tf.idf
    in descending order.
    >>> get_top_word_tfidf('austen-emma.txt', 3)
    ['emma', 'mr', 'harriet']
    """
    return []
# Task 5 (1 mark)
def get_top_sentence_tfidf(document, n):
    """Return the positions of the n sentences which have the largest average tf.idf. The list of sentence
    positions must be returned in the order of occurrence in the document. The reference for computing
    tf.idf is the list of sentences from the NLTK Gutenberg corpus.
    >>> get_top_sentence_tfidf('austen-emma.txt', 3)
    [5668, 5670, 6819]
    """
    return []
# DO NOT MODIFY THE CODE BELOW
if __name__ == "__main__":
    import doctest
    doctest.testmod(optionflags=doctest.ELLIPSIS)