University of Toronto, Department of Computer Science
CSC 485/2501F: Computational Linguistics, Fall 2016
Assignment 3
Due date: 23:59, Wednesday 7 December 2016, at the course drop box in BA 2220.
Late assignments will not be accepted without a valid medical certificate or other documentation of an emergency.
This assignment is worth either 25% (CSC 2501) or 33% (CSC 485) of your final grade.
- Fill out both sides of the assignment cover sheet, and staple together all answer sheets (in order) with the cover sheet (sparse side up) on the front. (Don't turn in a copy of this handout.)
- Please type your answers in no less than 12pt font; diagrams and tree structures may be drawn with software or neatly by hand.
- What you turn in must be your own work. You may not work with anyone else on any of the problems in this assignment, except for discussion in very general terms. If you need assistance, contact the instructor or TA.
- Any clarifications to the problems will be posted on the course website announcements page. You will be responsible for taking into account in your solutions any information that is posted there, or discussed in class, so you should check the page regularly between now and the due date.
1. Word sense disambiguation [20 marks]
Read section 2.5 of Bird et al. (pages 67–73 of the printed edition) for a quick introduction to the NLTK interface to WordNet, and try out some of the examples. For a quick overview of the content of WordNet, you might also find Princeton's Web interface to be useful: wordnetweb.princeton.edu/perl/webwn
A. (14 marks)
With NLTK, and using WordNet as your dictionary, write a program that uses the Lesk algorithm as described in the lecture notes to disambiguate the part-of-speech-tagged words (in boldface) in the following sentence:
A salad/N with greens/N and tomato/N is a healthful addition/N to any meal/N, but add an avocado/N and you have something really special/Adj.
In your results, indicate for each word whether you agree or disagree with the algorithm's decision, and if you disagree then say what you think the correct answer is.
To keep things simple, take the entire sentence as the context window, but ignore the other words of the sentence; i.e., construct your bags of words only from the definitions of the seven boldface words that are themselves to be disambiguated. Do not include uninformative words (prepositions, conjunctions, modal verbs, copulas, etc.) in your bags. Assume that two words match if they have the same lemma (e.g., tomatoes matches tomato).
So at the simplest level, your input will be a set of word/PoS pairs and your output will be a synset number for each. In addition, for each synset of the seven words, print its gloss, the count of words that overlap, and the actual overlaps. This trace of your system's operation will help in debugging your code and also in answering part B below. Your output should use the following format for each word that your system disambiguates:
[input-word/POS]
[system-prediction]
[target-word#1]<tab>[gloss]<tab>[count]<tab>[words-that-overlap]
[target-word#2]<tab>[gloss]<tab>[count]<tab>[words-that-overlap]
...
[Description of whether you agree with the system prediction.]
For example, the format of the output for tomato might look like this (this is just an example showing the format; your system's actual output might be very different!):
tomato/N
tomato#2
tomato#1	mildly acid red or yellow pulpy fruit eaten as a vegetable	3	mildly, acid, red
tomato#2	native to South America; widely cultivated in many varieties	4	native, south
I disagree with the system's output because ...
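To make the scoring concrete, here is a minimal sketch of the overlap computation. The helper names (bag_of_words, best_sense) and the exact cleanup of the gloss are our own assumptions, not part of the assignment; the bags here use only the glosses, whereas you should also consider lemmas and usage examples (see the hint below). Note that in older NLTK releases, definition is an attribute rather than a method.

from nltk.corpus import wordnet as wn

def bag_of_words(synset, stopwords):
    # Build a bag from the synset's gloss, dropping stopwords.
    # (You will also want to lemmatize; see the hints below.)
    words = [w.strip('.,;()').lower() for w in synset.definition().split()]
    return set(w for w in words if w and w not in stopwords)

def best_sense(word, pos, context_bags, stopwords):
    # Score each sense of word/pos by its overlap with the bags
    # built from the other target words; return the best synset.
    best, best_score = None, -1
    for synset in wn.synsets(word, pos=pos):
        bag = bag_of_words(synset, stopwords)
        score = sum(len(bag & other) for other in context_bags)
        if score > best_score:
            best, best_score = synset, score
    return best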
Hint: Recall that in WordNet, the synset associated with the sense of a word has three components that can be included in the bags of words: the other words that also have a sense in the synset; the gloss of the sense (the actual definition); and, usually, one or more examples of usage.
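In NLTK, these three components can be read along the following lines (a sketch; in older NLTK releases, lemma_names, definition, and examples are attributes rather than methods):

from nltk.corpus import wordnet as wn

sense = wn.synsets('tomato', pos=wn.NOUN)[0]
print(sense.lemma_names())   # the other words in the synset
print(sense.definition())    # the gloss (the actual definition)
print(sense.examples())      # usage examples, when present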
Hint: For lemmatizing in WordNet, see section 3.6 of Bird et al.
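For instance (a sketch; both calls below are standard NLTK, but check your installed version):

from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

wn.morphy('tomatoes', wn.NOUN)                  # -> 'tomato'
WordNetLemmatizer().lemmatize('tomatoes', 'n')  # -> 'tomato'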
Hint: For a list of uninformative words, NLTK offers a suggested list of stopwords: high-frequency, low-information words that would normally be stripped out of any lexical-semantic analysis.
nltk.corpus.stopwords.words(fileids='english')
You may use this list, and may add to it or subtract from it at your discretion. But tell us what you added and/or subtracted in the report you write in part B.
B. (6 marks)
In light of the information in the trace from your program of part A, briefly discuss the
following:
- To what extent does the fine-grainedness of WordNet help or hinder the algorithm? What errors, if any, in the results can be attributed to WordNet's fine-grainedness?
- What changes, if any, would you make to the WordNet definitions that are used in this exercise? Are there any senses of the words in this exercise that you would delete or add? What errors, if any, in the results can be attributed to infelicities in WordNet's synsets or definitions?
- To what extent did the inclusion in the bags of words of the other words in the synset and the examples of usage help or hinder the algorithm? What errors, if any, in the results can be attributed to including these components? What errors, if any, were averted by including them (that is, what cases did they save from error)?
- Did you use the NLTK stopword list? What changes or alternatives did you use, if any?
2. Discovering hyperonym relations [40 marks]
The goal of this question is to replicate the system for discovering new lexical relations by Marti Hearst that was described in lecture notes number 7.
With NLTK and the resources described below, and using Hearst's lexicosyntactic patterns, build and evaluate a system for finding suggested hyperonym / hyponym relations between words in a corpus of text.
For each suggested relation that you find, maintain a simple measure of confidence in its correctness as the count of the number of times you find the same suggestion. (Hearst did not do this.) If the relation is found just once, it is low confidence; if it's found twice, call it medium confidence; and if it's found at least three times, call it high confidence.
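One simple way to maintain these counts; the names here (suggestion_counts, record, confidence) are our own illustration:

from collections import defaultdict

suggestion_counts = defaultdict(int)   # (hyponym, hyperonym) -> count

def record(hyponym, hyperonym):
    suggestion_counts[(hyponym, hyperonym)] += 1

def confidence(pair):
    # Map an occurrence count onto the three levels defined above.
    n = suggestion_counts[pair]
    if n >= 3:
        return 'high'
    if n == 2:
        return 'medium'
    return 'low'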
A Python module Asst3 is available on teach.cs to help you get started.
Corpus: A corpus of about 2.6 million sentences (60 million words) from the 2004 New York Times,* tagged with parts of speech, is available at teach.cs:/u/csc485h/include/a3/nltk/corpora/nyt.zip. The corpus is about 145 MB in compressed form, and three times that if uncompressed. But there is no need to uncompress it, as the compressed file can be read directly by NLTK. You can use Asst3 to access the corpus through NLTK as follows:
import sys
sys.path.append('/u/csc485h/include/a3')
from Asst3 import nyt_big, nyt_mini

for s in nyt_big.tagged_sents():
    # do something with s
The Python object nyt_mini is a small development corpus that contains 100,000 sentences selected from nyt_big. We have biased the selection towards sentences that are likely to match the patterns mentioned in this handout. You can use it for developing, tuning, and debugging your code. For your final submission, you must use nyt_big.
Chunker: The NYT corpus has already been part-of-speech tagged but not partially parsed or chunked. So you will need to use NLTK's regular-expression chunk parser to get the data into a form ready for pattern-matching. The code snippet below creates a regular-expression chunk parser, then uses it to parse sentences of the NYT corpus. A chunking rule for NPs, DefaultNpPattern, is available in Asst3:
*The data are from the New York Times Annotated Corpus, distributed by the Linguistic Data Consortium and licensed by the University of Toronto. The data are owned by The New York Times Company, which generously makes them available for research and education, and they are protected by copyright. You must not copy the data or use them for purposes other than this assignment.
Thanks to Paul Cook for the tagging and Uli Germann for conversion to the format required by NLTK.
DefaultNpPattern = ''.join([r'(<DT|AT>?<RB>?)?',
                            r'<JJ.*|CD.*>*',
                            r'(<JJ.*|CD.*><,>)*',
                            r'(<N.*>)+'])
However, you might wish to modify the rule; to understand this code, read section 7.2 of Bird et al., and consult the API documentation to learn what to do with the Tree objects produced by the chunker. (The code below is available in teach.cs:/u/csc485h/include/a3/A3-sample-code.py.)
import sys                                 # only necessary once in your code
sys.path.append('/u/csc485h/include/a3')   # only necessary once in your code
from Asst3 import nyt_big                  # only necessary once in your code

from nltk.chunk.regexp import *
from Asst3 import DefaultNpPattern
from nltk import Tree

rule = ChunkRule(DefaultNpPattern, 'Chunk NPs')
parser = RegexpChunkParser([rule],
                           chunk_node='NP',  # the name to assign to matched substrings (optional)
                           top_node='S')     # the name to assign the top tree node (optional)

taggedText = nyt_big.tagged_sents()
for t in taggedText:
    chunked = parser.parse(t)
    # do something with the chunked sentence
Programming hint: The full NYT corpus is so big that you will run into memory problems if you try to generate a list of chunked sentences with code like this:
dont_chunk_like_this = [parser.parse(x) for x in nyt_big.tagged_sents()]
Instead, process each sentence individually, as shown in the sample code above.
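For example, continuing the sample code above, here is one way to walk each chunked sentence, assuming only that NP chunks appear as Tree subtrees and everything else as (word, tag) pairs (the helper name chunks_and_words is ours):

from nltk import Tree

def chunks_and_words(chunked):
    # Flatten a chunked sentence into a sequence of NP chunks
    # (lists of (word, tag) leaves) and ordinary (word, tag) tokens.
    for child in chunked:
        if isinstance(child, Tree):          # an NP chunk
            yield ('NP', child.leaves())
        else:                                # an unchunked token
            yield ('TOKEN', child)

for t in nyt_big.tagged_sents():
    seq = list(chunks_and_words(parser.parse(t)))
    # match Hearst's patterns against seq, one sentence at a time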
Lexicosyntactic patterns: You should use Hearst's patterns,* which are shown in Table 1, and add your own if you wish. Note that in the patterns, elements in braces are optional: they may be present or not. But elements separated by a | in braces require a choice of exactly one: either and or or, but not both or neither. Adapt the patterns as necessary for your code, and make sure that they are well-documented in your code.
A particular point to note about rule number 3: When you chunk your text, you'll find that other is grouped as part of the NP (as it should be). So strictly speaking, you are not looking for other NPj, but for an NPj that begins with other; then you'll have to remove the other when extracting the hyperonym. The same may be true, in all rules, for determiners such as the, several, and some (e.g., university officials, including some deans, several chairs, and the provost). A sketch of matching the first pattern appears after Table 1.
*Hearst, Marti. Automated discovery of WordNet relations. In: Fellbaum, Christiane (editor), WordNet: An electronic lexical database. The MIT Press, 1998, pages 131–151.
Table 1. Hearst's patterns for discovering hyponyms in text.

- Pattern: NP0 {,} such as NP1 {, NP2 ..., {and | or} NPj}
  Extracted hyponyms: For each NPi (1 ≤ i ≤ j), HYPONYM(NPi, NP0)
  Example: Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.
  HYPONYM(Gelidium, red algae)

- Pattern: such NP0 as NP1 {, NP2 ..., {and | or} NPj}
  Extracted hyponyms: For each NPi (1 ≤ i ≤ j), HYPONYM(NPi, NP0)
  Example: . . . works by such authors as Herrick, Goldsmith, and Shakespeare.
  HYPONYM(Herrick, author)
  HYPONYM(Goldsmith, author)
  HYPONYM(Shakespeare, author)

- Pattern: NP0 {, NP1 ...,} {and | or} other NPj
  Extracted hyponyms: For each NPi (0 ≤ i ≤ j-1), HYPONYM(NPi, NPj)
  Example: Bruises, lacerations, or other injuries . . . .
  HYPONYM(bruise, injury)
  HYPONYM(laceration, injury)
  Example: . . . bistros, coffee shops, and other cheap eating places.
  HYPONYM(bistro, eating place)
  HYPONYM(coffee shop, eating place)

- Pattern: NP0 {,} including NP1 {, NP2 ..., {and | or} NPj}
  Extracted hyponyms: For each NPi (1 ≤ i ≤ j), HYPONYM(NPi, NP0)
  Example: All common law countries, including Canada and England . . . .
  HYPONYM(Canada, common law country)
  HYPONYM(England, common law country)

- Pattern: NP0 {,} especially NP1 {, NP2 ..., {and | or} NPj}
  Extracted hyponyms: For each NPi (1 ≤ i ≤ j), HYPONYM(NPi, NP0)
  Example: . . . most European countries, especially France, England, and Spain.
  HYPONYM(France, European country)
  HYPONYM(England, European country)
  HYPONYM(Spain, European country)
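As promised above, here is a sketch of how the first pattern might be matched against the flattened ('NP' / 'TOKEN') sequence produced by chunks_and_words in the earlier sketch. The function name and sequence format are our own assumptions, and the sketch ignores capitalization, determiner stripping, and the other refinements discussed above:

def match_such_as(seq):
    # Hearst's pattern 1: NP0 {,} such as NP1 {, NP2 ..., {and|or} NPj}
    # Yields (hyponym-chunk, hyperonym-chunk) pairs of (word, tag) lists.
    for i, (kind, item) in enumerate(seq):
        if kind != 'NP':
            continue
        j = i + 1
        # optional comma after NP0
        if j < len(seq) and seq[j][0] == 'TOKEN' and seq[j][1][0] == ',':
            j += 1
        # the literal words "such as"
        if not (j + 1 < len(seq)
                and seq[j][0] == 'TOKEN' and seq[j][1][0] == 'such'
                and seq[j + 1][0] == 'TOKEN' and seq[j + 1][1][0] == 'as'):
            continue
        k = j + 2
        while k < len(seq):                  # the list NP1, ..., NPj
            kind_k, item_k = seq[k]
            if kind_k == 'NP':
                yield (item_k, item)         # HYPONYM(NPi, NP0)
                k += 1
            elif item_k[0] in (',', 'and', 'or'):
                k += 1                       # list separators
            else:
                break                        # anything else ends the match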
Hint: Along with simple words, WordNet also contains synsets for commonly used phrases, called WordNet compounds. For the examples in Table 1, these include red algae, coffee shop, and eating place, but not cheap eating place or common law country. It is up to you to handle these as you see fit.
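In NLTK, compounds are looked up with underscores in place of spaces, e.g.:

from nltk.corpus import wordnet as wn

wn.synsets('eating_place')        # non-empty: a WordNet compound
wn.synsets('common_law_country')  # empty list: not in WordNet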
Evaluation and report: The evaluation of your system will necessarily be only semi-automatic. For each suggested pair your system finds, count how often each of the following holds (count separately for each confidence level):
1. Both words are already in WordNet, and the relation holds between one or more senses of each.
2. Both words are already in WordNet, but the relation is contradicted by WordNet for at least one sense of each word. For example, your program might suggest that an animal is a dog when WordNet already says that a dog is an animal.
3. Both words are already in WordNet, but the relation is not present.
4. One or both of the words is missing from WordNet.
Cases 1 and 2 are confirming and disconfirming cases, respectively, and can be counted automatically. Cases 3 and 4 need to be evaluated by a human judge. Choose (at least) 50 examples of each case from your data (or all that you obtain, if fewer than 50), trying to include suggestions from all of your patterns. Score each of these suggestions as either correct, incorrect, or uncertain, according to your best judgement. Try to include both high- confidence and low-confidence examples in your set.
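Cases 1 and 2 might be detected along the following lines; this is only a sketch under our own assumptions (the function name is ours, it considers noun senses only, and it uses the full transitive hypernym closure, which you may want to restrict):

from nltk.corpus import wordnet as wn

def wordnet_case(hypo, hyper):
    # Classify a suggested HYPONYM(hypo, hyper) pair into the four
    # cases above.  Multiword expressions need underscores.
    hypo_senses = wn.synsets(hypo.replace(' ', '_'), pos=wn.NOUN)
    hyper_senses = wn.synsets(hyper.replace(' ', '_'), pos=wn.NOUN)
    if not hypo_senses or not hyper_senses:
        return 4                    # one or both words missing
    def subsumes(upper, lower):
        # True iff upper appears in the hypernym closure of lower
        return upper in lower.closure(lambda s: s.hypernyms())
    if any(subsumes(t, s) for s in hypo_senses for t in hyper_senses):
        return 1                    # relation confirmed by WordNet
    if any(subsumes(s, t) for s in hypo_senses for t in hyper_senses):
        return 2                    # relation contradicted (reversed)
    return 3                        # both present, relation absent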
Write a short report on your results, discussing both the successes and shortcomings of your program. Be sure to document the additional patterns that you used.
Hint: It will take your program a fair amount of time to work through the entire New York Times corpus. Develop and test your program on the small dataset nyt_mini before turning it loose on the complete corpus.
Hint: False positives (and false negatives) can provide interesting insights and help you improve your algorithm. For example, instead of just tossing out false positives as wrong during the evaluation, think (and write in your report!) about what went wrong. Is the instance a hopeless case, or could a small (or even big) change in your set-up help get things right? We do not expect you to implement such changes, especially if they are non-trivial. But we do want you to think about them. For analyzing errors, it is often worthwhile to look at things in context. For manual evaluation (and debugging), consider printing out the entire sentence in addition to the candidates to be evaluated, so that you can look at them in context. But for your submission, print out only the suggestions from your program without context.
What to submit
What to hand in on paper: Staple together (no paperclips or folded corners, please) the
following items in the following order:
- The project cover sheet (attached to this handout).
- Your code for part A of the first question, called lesk.py. Your code should bind the input sentence to a global string variable near the beginning of this file, called sentence. Your code should directly use our instance of the WordNet database in teach.cs:/u/csc485h/include/a3/nltk/corpora/wordnet.
- The plaintext output of your program, called leskoutput.txt, with descriptions of your (dis)agreements added by you.
- Your report for part B of the first question.
- Your code for the second question, called hearst.py. Your code should bind the chunking rule that you use to the global variable NpPattern near the top of this file, even if you choose to use the default pattern from Asst3. Again, refer to the class's instance of the New York Times corpus; do not copy the file.
- A printout of the (approximately 200) suggestions from your program for the second question that you use in your evaluation, called hearstoutput.txt.
- Your report for the second question, including your evaluation of the suggestions.
What to submit electronically: In addition to your paper submission, please submit electronically all of your code and its text output for both questions. Submit the files using the submit command on teach.cs:
% submit -c <course> -a A3 <filename-1> ... <filename-n>
where <course> is either csc485h for undergraduates or csc2501h for graduate students, and <filename-1> to <filename-n> are the n files you are submitting. Make sure every file you turn in contains a comment at the top that gives your name, your login ID on teach.cs, and your student ID number.
CSC 485 / 2501, Fall 2016: Assignment 3
Family name:
First name:
Staple to assignment this side up
CSC 485 / 2501, Fall 2016: Assignment 3
Family name:
First name:
Student #:
Date:
I declare that this assignment, both my paper and electronic submissions, is my own work, and is in accordance with the University of Toronto Code of Behaviour on Academic Matters and the Code of Student Conduct.
Signature:
Grade:
- / 20
- / 40
TOTAL / 60