
The University of Melbourne

School of Computing and Information Systems

COMP90042

Natural Language Processing Final Exam

Semester 1 2023

Exam duration: 165 minutes (15 minutes reading time + 120 minutes writing time + 30 minutes upload time)

Length: This paper has 4 pages (including this cover page) and 7 questions. You should attempt all questions.

Instructions to students:

  • This exam is worth a total of 120 marks and counts for 40% of your final grade.

  • You can read the question paper on a monitor, or print it.

  • You are advised to write your answers on blank A4 paper. Note that some answers require drawing diagrams or tables.

  • You will need to scan or take a photo of your answers and upload them via Gradescope. Be sure to label the scans/photos with the question numbers; each unlabelled question incurs a -10% penalty (e.g. if 4 questions are unlabelled, each of those 4 questions is marked with a -10% penalty).

  • Please answer all questions. Please write your student ID and question number on every page.

Format: Open Book

  • While you are undertaking this assessment you are permitted to:

    • make use of the textbooks, lecture slides and workshop materials.

  • While you are undertaking this assessment you must not:

    • make use of any messaging or communications technology;

    • make use of any world-wide web or internet-based resources such as Wikipedia, Stackoverflow, Google, AI services (e.g. ChatGPT) or other web services;

    • act in any manner that could be regarded as providing assistance to another student who is undertaking this assessment, or will in the future be undertaking this assessment.

  • The work you submit must be based on your own knowledge and skills, without assistance from any other person.


Section A: Short Answer Questions [33 marks]

Answer each of the questions in this section briefly. Each answer should be no longer than several sentences.

Question 1: General Concepts [24 marks]

  1. Compare and contrast “lemmatisation” and “stemming”. With the aid of an example word, show how lemmatisation and stemming differ (a code sketch follows this question). [6 marks]

  2. What is “Relation Extraction”? Your answer should explain the two different assumptions made about the set of relations. [6 marks]

  3. Why is “topic model” evaluation difficult? Describe two intrinsic evaluation methods, and also their shortcomings (one for each method). [9 marks]

  4. “Copy mechanism” is introduced for encoder-decoder summarisation models to handle out-of-vocabulary words in the source documents. If we use “subword tokenisation” to tokenise words, justify whether the copy mechanism is still necessary. [3 marks]
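To make the lemmatisation/stemming contrast in part 1 concrete, here is a minimal Python sketch using NLTK’s PorterStemmer and WordNetLemmatizer. The example word "studies" is our own choice (it is not from the paper), and the WordNet data is assumed to be available via nltk.download("wordnet").

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes by rule; the output need not be a real word.
print(stemmer.stem("studies"))                   # studi

# Lemmatisation maps a word to its dictionary headword, using the POS.
print(lemmatizer.lemmatize("studies", pos="n"))  # study
print(lemmatizer.lemmatize("better", pos="a"))   # good (irregular form)
```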

Question 2: Distributional Semantics [9 marks]

  1. Compare and contrast the two approaches as described in the lecture for learning “word vectors” (see the sketch after this question). [6 marks]

  2. Explain one limitation of word vectors, with the aid of an example. [3 marks]
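As an illustration for part 1, here is a minimal sketch of the count-based approach (co-occurrence counts followed by truncated SVD, as in LSA). The toy corpus, the window size of 1, and the dimensionality of 2 are all assumptions; the prediction-based alternative (e.g. skip-gram) would instead learn vectors by training a classifier to predict nearby words.

```python
import numpy as np

corpus = [["he", "loads", "elephants"], ["he", "loads", "trucks"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts with a context window of 1.
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                M[idx[w], idx[sent[j]]] += 1

# Truncated SVD compresses the counts into dense word vectors.
U, S, Vt = np.linalg.svd(M)
vectors = U[:, :2] * S[:2]
print({w: vectors[idx[w]].round(2) for w in vocab})
```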

Section B: Method Questions [46 marks]

In this section you are asked to demonstrate your conceptual understanding of the methods that we have studied in this subject.

Question 3: Hidden Markov Models [20 marks]

“Text chunking” refers to the preprocessing step where we divide a text into syntactically related non-overlapping groups of words. The following is an example of a chunked sentence:

[money] [could be added] [to] [a spending bill] [covering] [the federal emergency management agency] , [which] [coordinates] [federal disaster relief] .

  1. Show how a “Hidden Markov model” can be used to perform text chunking. In your answer, you should: (i) explain the state inventory; (ii) show how the given sentence would be labelled for training the HMM; (iii) describe how the model parameters can be learned (assuming we have a labelled corpus) and illustrate with examples from the given sentence (a labelling sketch follows this question). [12 marks]

  2. Explain two drawbacks of using an HMM for this task, and suggest a solution for each of the drawbacks. [8 marks]
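The following sketch illustrates part 1 under one possible labelling scheme. The exam’s brackets carry no chunk type, so plain B/I/O tags (B = chunk-initial, I = chunk-internal, O = outside any chunk) are an assumption; HMM transition and emission parameters are then estimated by relative frequency.

```python
import re
from collections import Counter

chunked = ("[money] [could be added] [to] [a spending bill] [covering] "
           "[the federal emergency management agency] , [which] "
           "[coordinates] [federal disaster relief] .")

pairs = []  # (word, state) training sequence
for piece in re.findall(r"\[[^\]]+\]|\S+", chunked):
    if piece.startswith("["):
        words = piece[1:-1].split()
        pairs.append((words[0], "B"))
        pairs += [(w, "I") for w in words[1:]]
    else:
        pairs.append((piece, "O"))  # punctuation sits outside chunks

tags = [t for _, t in pairs]
trans = Counter(zip(tags, tags[1:]))      # C(t_i, t_{i+1})
emit = Counter((t, w) for w, t in pairs)  # C(t, w)
state = Counter(tags)                     # C(t)

# MLE: A[t -> t'] = C(t, t') / C(t); B[t -> w] = C(t, w) / C(t).
print({f"{a}->{b}": round(c / state[a], 2) for (a, b), c in trans.items()})
print("P(money | B) =", emit[("B", "money")], "/", state["B"])
```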

Question 4: Discourse Segmentation [14 marks]

Consider the following paragraph where we are applying the “Text Tiling” algorithm described in the lecture to partition it into several segments of cohesive text spans, where sim_i denotes the cosine similarity between the two bag-of-words vectors constructed using 1 sentence on either side of gap i:

Saturn is the sixth planet from the Sun located in the Solar System.

sim_0 = 0.9

It is the second largest planet in the Solar System, after Jupiter.

sim_1 = 0.3

Saturn is the English transliteration from the Latin word Saturnus, which means the god of sowing or seed.

sim_2 = 0.1

The Romans equated him with the Greek agricultural deity Kronos.

sim_3 = 0.2

Saturn has 82 known moons orbiting the planet.

sim_4 = 0.7

The largest moon is Titan, which is larger than the planet Mercury.

  1. Perform “discourse segmentation” on the paragraph, assuming t = 0.2. Your answer should show at which gaps a boundary will be inserted and the computation involved in producing each decision (a worked sketch follows this question). [3 marks]

  2. Discuss one drawback of using bag-of-words vectors for computing similarity, and illustrate how it creates an erroneous boundary based on your answer to part 1. You may assume that the bag-of-words vectors are created using “TF-IDF” weights with the following preprocessing steps: (1) symbols and numbers are removed; and (2) words are lowercased and tokenised using white space. [5 marks]

  3. Propose 2 solutions that alleviate the drawback. With the aid of examples, show how the solutions will help improve discourse segmentation for the given paragraph. [6 marks]
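A worked sketch for part 1, using the gap similarities above and t = 0.2. The depth-score formulation here (climbing to the nearest peak on each side of the gap) is one common presentation of TextTiling; check it against the lecture’s exact definition before relying on it.

```python
sims = [0.9, 0.3, 0.1, 0.2, 0.7]  # sim_0 .. sim_4 from the question
t = 0.2

def depth(i):
    # Climb left/right while similarity keeps rising, then measure how far
    # this gap sits below the two neighbouring peaks.
    left = i
    while left > 0 and sims[left - 1] > sims[left]:
        left -= 1
    right = i
    while right < len(sims) - 1 and sims[right + 1] > sims[right]:
        right += 1
    return (sims[left] - sims[i]) + (sims[right] - sims[i])

for i in range(len(sims)):
    d = depth(i)
    print(f"gap {i}: depth = {d:.1f} ->", "boundary" if d > t else "no boundary")
```

Under this formulation gaps 1, 2 and 3 exceed the threshold; part 2 then asks which of those boundaries is erroneous and why.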

Question 5: Ethics [12 marks]

Given the first application described in the guest lecture (automatic triaging of legal requests), discuss three ethical implications of this application.

Section C: Algorithmic Questions [41 marks]

In this section you are asked to demonstrate your understanding of the methods that we have studied in this subject by performing algorithmic calculations.

Question 6: Context-free Grammar [23 marks]

Consider a “context-free grammar” with the following production rules:

S → NP VP
S → NP VP PP
NP → NP PP
NP → N
VP → VP NP
VP → V
PP → IN NP
N → he
N → elephants
N → trucks
V → loads
IN → onto

  1. Convert the grammar into “Chomsky normal form”. Your answer should show what rules are modified, removed, and added. [6 marks]

  2. Given the sentence: he loads elephants onto trucks, perform “CYK parsing” on it using the converted grammar. You should include the full chart described in the lecture, which includes the edges. Your solution should produce two possible parses (a recogniser sketch follows this question). [9 marks]

  3. One of the parses is more sensible than the other. Assume that the original (i.e. unconverted) grammar above has now been modified to “probabilistic context-free grammar” to score the parses. Propose a change to the probabilistic grammar that will help the “probabilistic CYK parser” to produce a higher probability for the more sensible parse. Your answer should describe the change, use parse trees to illustrate how the change helps, and detail any assumptions that will lead to the parser producing a higher probability for the more sensible interpretation. [8 marks]
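The sketch below is a minimal CYK recogniser over one possible CNF conversion of the grammar in this question: the ternary rule S → NP VP PP is binarised with a new symbol X1, and the unary rules NP → N and VP → V are removed by promoting their terminals. X1 is introduced by the conversion and is not part of the original grammar, and this conversion is one valid answer rather than the only one.

```python
from itertools import product

# Binary rules after conversion: (B, C) -> set of parents A with A -> B C.
binary = {
    ("NP", "VP"): {"S"},
    ("NP", "X1"): {"S"},   # S -> NP X1 and X1 -> VP PP replace S -> NP VP PP
    ("VP", "PP"): {"X1"},
    ("NP", "PP"): {"NP"},
    ("VP", "NP"): {"VP"},
    ("IN", "NP"): {"PP"},
}
# Preterminals, with NP -> N and VP -> V collapsed into the lexicon.
lexical = {
    "he": {"N", "NP"}, "elephants": {"N", "NP"}, "trucks": {"N", "NP"},
    "loads": {"V", "VP"}, "onto": {"IN"},
}

words = "he loads elephants onto trucks".split()
n = len(words)
chart = [[set() for _ in range(n + 1)] for _ in range(n)]

for i, w in enumerate(words):
    chart[i][i + 1] = set(lexical[w])

for span in range(2, n + 1):
    for i in range(n - span + 1):
        j = i + span
        for k in range(i + 1, j):
            for B, C in product(chart[i][k], chart[k][j]):
                chart[i][j] |= binary.get((B, C), set())

# "S" in the top cell means the sentence is grammatical; with backpointers
# the chart yields the two parses (PP attached via X1 vs. via NP -> NP PP).
print(sorted(chart[0][n]))
```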

Question 7: N-gram Language Models [18 marks]

This question asks you to calculate the probabilities for an “N-gram language model”. You should leave your answers as fractions. Consider the following corpus, where each line is a sentence:

i wish to wish the wish you wish to wish

if you won’t wish the wish i wish

i won’t wish the wish you wish to wish

  1. Compute the probability of all bigrams given the context word wish under an “unsmoothed bigram language model” (a counting sketch follows this question). [4 marks]

  2. Compute the probability of all bigrams given the context word wish under a “bigram language model with absolute discounting” where the discount factor d = 0.2. [8 marks]

  3. Compute the probability of unseen bigrams given the context word wish under a “bigram language model with Kneser-Ney smoothing” where the discount factor d = 0.2. [6 marks]
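A counting sketch for parts 1 and 2 over the corpus above. Treating each line as a sentence and appending a single end token "</s>" is an assumption; follow the lecture’s boundary convention in an actual answer.

```python
from collections import Counter

corpus = [
    "i wish to wish the wish you wish to wish",
    "if you won’t wish the wish i wish",
    "i won’t wish the wish you wish to wish",
]

bigrams = Counter()
for line in corpus:
    toks = line.split() + ["</s>"]
    bigrams.update(zip(toks, toks[1:]))

followers = {w2: c for (w1, w2), c in bigrams.items() if w1 == "wish"}
total = sum(followers.values())

# Part 1: unsmoothed MLE, P(w | wish) = C(wish, w) / C(wish).
print({w: f"{c}/{total}" for w, c in followers.items()})

# Part 2: absolute discounting removes d from each seen bigram count and
# frees len(followers) * d / total probability mass for unseen bigrams.
d = 0.2
print({w: f"({c} - {d})/{total}" for w, c in followers.items()})
print("freed mass:", len(followers) * d, "/", total)
```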

— End of Exam —
