1 Getting Started with NLTK
Before going further you should install NLTK 3.0, downloadable for free from http://www.nltk.org/ install.html. Follow the instructions there to download the appropriate version for your platform. In this course we will use Python 3.7 ONLY, so you MUST download that version. Also, you MUST download NumPy this is not optional as the website says.
(Note: Macs with Yosemite or higher have Python 2.7 as default in the terminal. You will need to run
sudo pip3 install -U nltk instead of sudo pip install -U nltk in order to install NLTK for Python 3.7. You should make a separate installation rather than updating/replacing your system default Python version.)
Once youve installed NLTK, start up the Python interpreter:
- Windows: find and open Python 3.7 (command line 32bit) in your start menu.
- Mac: type python3 and press enter in your terminal.
Install the data required for the book by typing the following two commands at the Python prompt, then selecting the book collection as shown in Figure 1 and pressing Download.
Figure 1: Installing NLTK data.
2 Working with Corpora
In this assignment you will write a program to analyze two different narrative corpora and compare them. The first corpora is a collection of 25 Aesops fables. The second corpus is a collection of 11 personal narratives told by ordinary people on their weblogs. Both corpora are provided as zip archives with each individual file containing a different fable or blog.
Your program should take one command line argument that specifies which corpus you are going to analyze (fables or blogs). The input file for the fables corpus is the zip archive containing the 26 fables as text (fables.zip). The input for the blogs corpus is the zip archive containing the 12 blogs as text
(blogs.zip). Use the stub code assignment1-stub-s15.py to get started. For each of the corpora your program should do the following:
- Tokenization
- Write the name of the corpus to stdout.
- Delimit the sentences for each document in the corpus. (c) Tokenize the words in each sentence of each document.
(d) Count the number of total words in the corpus and write the result to stdout.
- Part-of-Speech
- Apply the default part-of-speech tagger to each tokenized sentence.
- Write a file named CORPUS NAME-pos.txt that has each part-of-speech tagged sentence on a separate line and a blank newline separating documents. Where CORPUS NAME is either fables or blogs. The format of the tagging should be a word-tag pair with a / in between. For example:
The/DT boy/NN jumped/VBD ./.
- Frequency
- Write the vocabulary size of corpus to stdout. Please note that you should use the lowercased word.
- Write the most frequent part-of-speech tag and its frequency to the stdout.
- Write down the top 10 most frequent part-of-speech tags in the corpus with their frequency and relative frequency to the stdout in a decreasing order of frequency. Relative frequency should be rounded to 3 digits and can be computed by dividing the frequency by the total number of tokens in the corpus.
- Find the frequency of each unique word (after lowercasing) using the FreqDist module and write the list in decreasing order to a file named CORPUS NAME-word-freq.txt.
- Find the frequency of each word given its part-of-speech tag. Use a conditional frequency distribution for this (CondFreqDist) where the first item in the pair is the part-of-speech and the second item is the lowercased word. Note, the part-of-speech tagger requires uppercase words and returns the word/tag pair in the inverse order of what we are asking here. Use the tabulate() method of the CondFreqDist class to write the results to a file named CORPUS NAME-pos-word-freq.txt.
- Similar Words
- For the most frequent word in the NN (nouns), VBD (past-tense verbs), JJ (adjectives) and RB (adverbs) part-of-speech tags, find the most similar words using similar(). Write the output to stdout (this will happen by default).
- Collocations
- Write the collocations to the stdout.
3 Output Results
In conclusion, if you follow the steps in Section 2, your program should output the following information in standard output for each corpus:
- Corpus name:
- Total words in the corpus:
- Vocabulary size of the corpus:
- The most frequent part-of-speech tag is with frequency
- Frequencies and relative frequencies of all part-of-speech tags in the corpus in decreasing order of frequency are: tag has frequency and relative frequency .
- The most frequent word in the POS (NN/VBD/JJ/RB, note that you need to output result for each of these POS) is and its similar words are:
- Collocations:
Your python program should also output three files for each corpus. For example, for fables corpus, it should output fables-pos.txt, fables-word-freq.txt and fables-pos-word-freq.txt.
Reviews
There are no reviews yet.