
NLE Coursework 1


Important Information

Submission format:
You should submit a single file, which should be a Jupyter notebook.

Due date:
Your notebook should be submitted on Study Direct before 4pm on Wednesday 30th November. The standard late penalties will apply.

Return date:
Marks and feedback will be provided on Sussex Direct on Friday December 21st for all submissions that are submitted by the due date.

Weighting:
This assessment contributes 50% of the mark for the module.

Word limit:
Your Jupyter notebook should contain no more than 3000 words. The standard penalties apply for deviations from this limit (see below). You must specify the number of words in your report.

Failure to observe limits of length

The following is taken from the Examination and Assessment Regulations Handbook

The maximum length for each assessment is publicised to students. The limits as stated include quotations in the text, but do not include the bibliography, footnotes/endnotes, appendices, abstracts, maps, illustrations, transcriptions of linguistic data, or tabulations of numerical or linguistic data and their captions. Any excess in length should not confer an advantage over other students who have adhered to the guidance. Students are requested to state the word count on submission. Where a student has marginally (within 10%) exceeded the word length the Marker should penalise the work where the student would gain an unfair advantage by exceeding the word limit. In excessive cases (>10%) the Marker need only consider work up to the designated word count, and discount any excessive word length beyond that to ensure equity across the cohort. Where an assessment is submitted and falls significantly short (>10%) of the word length, the Marker must consider in assigning a mark, if the argument has been sufficiently developed and is sufficiently supported and not assign the full marks allocation where this is not the case.

Note that code and the content of cell outputs are excluded from the word count.


Overview

For this assignment, you are asked to write a report on some of the activities covered in the notebooks for Topic 3 and Topic 6.

  • Topic 3 concerned the classification of Amazon product reviews according to whether the overall sentiment being expressed was positive or negative. You explored factors that have an impact on classifier accuracy.
  • Topic 6 involved extending an opinion extractor function so that it deals with a variety of ways that opinions are expressed in Amazon DVD reviews.

For this report you should create a Jupyter notebook containing the following sections (further details regarding the content of each section are given below).

Section 1 :
A report on what you discovered while undertaking some of the exercises in the Topic 3 notebook.

Section 2 :
The Python code for your opinion extractor and a description of how it works. This section should also include a demonstration that your opinion extractor produces the correct output for the example sentences that we have given for each of the required extensions.

Section 3 :
An assessment of the limitations of your extractor based on applying it to your own personalised random sample of 100 sentences from the corpus of Amazon DVD reviews (see below for details of how to obtain your personalised random sample).


Details of Requirements

Your submission will be marked out of 100. Please read the following guidelines carefully.

Section 1: Document Level Sentiment Analysis

There are 30 marks available for this section. Section 1 should include the following four subsections.

Section 1.1 (7 marks)

You investigated the differences in accuracy between the wordlist classifier and the Naïve Bayes classifier. In this subsection, describe what you found, and, by looking at classifier performance on a sample of product reviews (it is up to you to decide how many), discuss the reasons behind the differences in performance of the methods being compared.
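The following is a minimal sketch (not the module's own code) of how such a comparison might be set up, using NLTK's NaiveBayesClassifier alongside a hand-built wordlist classifier. The word lists and toy reviews are placeholders; in your notebook you would use the Topic 3 data and your own word lists.

# A minimal comparison sketch: a hand-built wordlist classifier versus NLTK's
# NaiveBayesClassifier. The word lists and reviews below are placeholders.
from nltk.classify import NaiveBayesClassifier

pos_words = {"great", "exciting", "fresh"}
neg_words = {"dull", "boring", "weak"}

def wordlist_classify(tokens):
    # positive if at least as many positive as negative words appear
    score = sum(t in pos_words for t in tokens) - sum(t in neg_words for t in tokens)
    return "pos" if score >= 0 else "neg"

def features(tokens):
    # bag-of-words presence features for the Naive Bayes classifier
    return {tok: True for tok in tokens}

train = [(features("a great fresh plot".split()), "pos"),
         (features("a dull boring film".split()), "neg")]
test = [("an exciting plot but weak dialogue".split(), "pos")]

nb = NaiveBayesClassifier.train(train)
for tokens, gold in test:
    print(gold, wordlist_classify(tokens), nb.classify(features(tokens)))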

Section 1.2 (7 marks)

For the methods that require training data, you investigated the impact on classifier accuracy when the amount of training data is varied. In this subsection you should discuss your findings, and then attempt to predict what the accuracy of the Naïve Bayes classifier would be if you massively increased the amount of training data.
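As an illustration of one way to run such an experiment, the sketch below trains NLTK's NaiveBayesClassifier on increasing amounts of data, repeating each training size over several random shuffles and averaging accuracy. Here labelled_data is a stand-in for the Topic 3 reviews, formatted as (feature_dict, label) pairs.

import random
from nltk.classify import NaiveBayesClassifier, accuracy

# placeholder data; replace with feature dictionaries built from the Topic 3 reviews
labelled_data = [({"w%d" % i: True}, "pos" if i % 2 else "neg") for i in range(200)]

def average_accuracy(data, train_size, repeats=5, test_size=50):
    # average accuracy over several random train/test splits of the given size
    scores = []
    for _ in range(repeats):
        random.shuffle(data)
        train, test = data[:train_size], data[train_size:train_size + test_size]
        classifier = NaiveBayesClassifier.train(train)
        scores.append(accuracy(classifier, test))
    return sum(scores) / len(scores)

for size in [20, 50, 100]:
    print(size, average_accuracy(labelled_data, size))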

Section 1.3 (8 marks)

You investigated the impact on classifier accuracy of training a classifier with data in one domain (the source domain), and testing the same classifier on data from a different domain (the target domain). In this subsection you should describe what you found, and then investigate the reasons why some source to target changes in domain have more impact on classifier accuracy than others.
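One possible shape for this experiment is sketched below: train on each source domain and test on every target domain, tabulating the accuracies. reviews_by_domain is a placeholder mapping domain names to (feature_dict, label) pairs built from the Topic 3 data; the domain names shown are illustrative.

from nltk.classify import NaiveBayesClassifier, accuracy

# placeholder domains and data; substitute the real domains used in Topic 3
reviews_by_domain = {
    "dvd":     [({"fun": True}, "pos"), ({"dull": True}, "neg")] * 30,
    "kitchen": [({"sturdy": True}, "pos"), ({"broke": True}, "neg")] * 30,
}

for source, source_data in reviews_by_domain.items():
    classifier = NaiveBayesClassifier.train(source_data)
    for target, target_data in reviews_by_domain.items():
        acc = accuracy(classifier, target_data)
        print(f"{source:>8} -> {target:<8} accuracy: {acc:.2f}")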

Section 1.4 (8 marks)

In this subsection you should discuss what you found when investigating the impact on classifier accuracy of various feature extraction methods.
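If it helps to structure this subsection, the sketch below shows three interchangeable feature extraction functions (token presence, token frequency, and bigrams). The function names are illustrative; the point is that swapping the feature function while holding the classifier and data fixed isolates the effect of the features.

from nltk import bigrams

def presence_features(tokens):
    # binary bag-of-words: was the token present at all?
    return {tok: True for tok in tokens}

def frequency_features(tokens):
    # bag-of-words with counts
    feats = {}
    for tok in tokens:
        feats[tok] = feats.get(tok, 0) + 1
    return feats

def bigram_features(tokens):
    # adjacent word pairs as features
    return {" ".join(bg): True for bg in bigrams(tokens)}

print(bigram_features("an excessively dull plot".split()))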

Things to note

  • Quality of your experimental method
    • make sure that you make an appropriate division of labelled data for training and testing
    • make sure that you have repeated experiments with different random samples and presented averaged results
  • Presentation of results
    • graphs or tables should be used when presenting your empirical findings
    • when you are making a comparison between two or more results, you should display them together on the same graph so that the comparison can be seen directly (see the plotting sketch after this list).
    • in cases where your empirical investigation is not complete, be explicit as to what you have not managed to achieve.
  • Quality of analysis
    • do not just present tables of numbers and graphs without any discussion. You need to put forward a reasonable explanation as to why you have observed the results that you have obtained.
    • Always back up claims with evidence. Be careful not to make bold conclusions from small-scale testing.
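The following is a minimal plotting sketch showing two classifiers' results on the same axes, as suggested above. The accuracy values are placeholders; use the averaged results from your own experiments.

import matplotlib.pyplot as plt

training_sizes = [20, 50, 100, 200]
wordlist_acc = [0.62, 0.63, 0.63, 0.64]      # placeholder numbers
naive_bayes_acc = [0.58, 0.66, 0.72, 0.78]   # placeholder numbers

plt.plot(training_sizes, wordlist_acc, marker="o", label="wordlist")
plt.plot(training_sizes, naive_bayes_acc, marker="o", label="Naive Bayes")
plt.xlabel("number of training documents")
plt.ylabel("average accuracy")
plt.legend()
plt.show()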

Section 2: Opinion Extractor

There are 30 marks available for this section. Section 2 should include the following subsections.

Section 2.1 (20 marks)

In this subsection you should present the code defining your opinion extractor together with a clear explanation as to how it works. You only need to cover the five required extensions described in the Topic 6 notebook.

Section 2.2 (10 marks)

In this subsection you should show that for each of the five required extensions, A-E, your opinion extractor produces the desired output when given the corresponding core sentence(s), as defined here:

core = [("A.1", "It has an exciting fresh plot.", set(["fresh", "exciting"])),
        ("B.1", "The plot was dull.", set(["dull"])),
        ("C.1", "It has an excessively dull plot.", set(["excessively-dull"])),
        ("C.2", "The plot was excessively dull.", set(["excessively-dull"])),
        ("D.1", "The plot wasn't dull.", set(["not-dull"])),
        ("D.2", "It wasn't an exciting fresh plot.", set(["not-exciting", "not-fresh"])),
        ("D.3", "The plot wasn't excessively dull.", set(["not-excessively-dull"])),
        ("E.1", "The plot was cheesy, but fun and inspiring.", set(["cheesy", "fun", "inspiring"])),
        ("E.2", "The plot was really cheesy and not particularly special.", set(["really-cheesy", "not-particularly-special"]))]
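A simple way to demonstrate this is to loop over core, parse each sentence, run your extractor, and compare the result against the expected set. The sketch below assumes a function extract_opinions(parsed_sentence, target_tokens) returning a set of opinion strings; that name and signature are illustrative only, so adapt the call to your own extractor.

target_tokens = {"plot", "characters", "cinematography", "dialogue"}

def check_core(core, nlp, extract_opinions):
    # parse each core sentence, run the extractor and compare with the expected set
    for label, sentence, expected in core:
        parsed = nlp(sentence)
        found = extract_opinions(list(parsed.sents)[0], target_tokens)
        status = "OK" if found == expected else "MISMATCH"
        print(f"{label}: expected {expected}, got {found} -> {status}")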

Things to note

  • Do not write separate opinion extractor functions for each of the different extensions. It is important that all of the different extensions are integrated, so that a single opinion extractor function deals with all of the cases you are covering (a skeletal sketch of one possible structure follows this list).
  • You should state explicitly whether or not you have been successful in extending your opinion extractor in the ways required. Note that we will be running and testing your code in order to establish whether the claims that you make are valid.
  • Do not forget to include comments in your code where necessary. Split long functions into smaller coherent sub-functions.
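As an illustration of the kind of structure the notes above describe, here is a skeletal sketch (not a solution): a single opinion_extractor that delegates to small helper functions. The signature, helper names and stub bodies are illustrative only.

def adjectival_modifiers(target_token):
    return set()   # e.g. adjectives directly modifying the target noun

def predicative_adjectives(target_token):
    return set()   # e.g. adjectives linked to the target via the copula

def opinion_extractor(target_token, parsed_sentence):
    # one integrated entry point; each helper handles one kind of construction
    opinions = set()
    opinions |= adjectival_modifiers(target_token)
    opinions |= predicative_adjectives(target_token)
    # further helpers would handle adverbial modification, negation and conjunction
    return opinions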

Section 3: Assessment of Opinion Extractor Performance

There are 40 marks available for this section. Section 3 should include the following subsections.

Section 3.1 (5 marks)

In Section 3, you will be making an assessment of the effectiveness of your opinion extractor on your own personalised random sample of 100 sentences from DVD reviews. In this subsection you are asked to give details of the makeup of your personalised sample.

The following code cell contains code that you should use to produce your personalised sample.

It is absolutely essential that you change the first line of the code in this cell so that your own candidate number replaces the number 12345678

If you do not use your own candidate number to generate the sample then you will receive zero marks for Section 3 of the coursework, i.e. a loss of 40 marks!

Note that the code cell below presumes that the first two code cells of the notebook for Topic 6 have been run, giving an appropriate value for dvd_reviews.

Once you have inserted your candidate number and then run this cell, my_sample will be a list of 100 randomly chosen sentences, each of which contains at least one of the target tokens.

In this subsection, you should report on how many of the sentences in my_sample contain each of the target tokens, "plot", "characters", "cinematography", and "dialogue".

In[]:

import random
from random import seed   # both seed and random.choice are used below

seed(12345678) # you ***MUST*** replace '12345678' with your candidate number

# assumes that dvd_reviews and nlp have been defined by running the first two
# code cells of the Topic 6 notebook
def target_sentence(sentence, target_tokens):
    for token in sentence:
        if token.orth_ in target_tokens:
            return True
    return False

target_tokens = {"plot", "characters", "cinematography", "dialogue"}
sample_size = 100
my_sample = []
num_found = 0
while num_found < sample_size:
    review = random.choice(dvd_reviews)
    parsed_review = nlp(review)
    sentence = random.choice(list(parsed_review.sents))
    if target_sentence(sentence, target_tokens):
        my_sample.append(sentence)
        num_found += 1
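For the counts requested in this subsection, a small sketch along the following lines can be used; it assumes my_sample and target_tokens are as produced by the cell above.

from collections import Counter

# count, for each target token, how many sentences in my_sample contain it
token_counts = Counter()
for sentence in my_sample:
    tokens_in_sentence = {token.orth_ for token in sentence}
    for target in target_tokens & tokens_in_sentence:
        token_counts[target] += 1

for target in sorted(target_tokens):
    print(f"{target}: {token_counts[target]} sentences")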

Section 3.2 (30 marks)

In this subsection, you should present an assessment as to how well your opinion extractor performs on the 100 sentences in my_sample.

  • You should identify the number of sentences where each of your five extensions applied.
    Give illustrative examples taken from my_sample.
  • You should indicate what proportion of the sentences were correctly analysed by the opinion extractor (a bookkeeping sketch follows this list).
    Give illustrative examples from my_sample.
  • You should report on those cases where the opinion extractor does not produce a desirable output, identifying the reason for the error. Possible reasons include a deficiency (which you should describe) in your opinion extractor algorithm, an error or errors made by the part-of-speech tagger, or an error or errors made by the dependency parser.
    Do not restrict yourself to just these types of errors; identify other types of errors as needed.
    Illustrate each kind of error with examples taken from my_sample.
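One way to organise this assessment is sketched below: run the extractor over my_sample and record each sentence together with its output, leaving fields to fill in by hand when you judge which extensions applied and whether the analysis is correct. The extractor name and signature are illustrative only.

def survey_sample(my_sample, extract_opinions, target_tokens):
    # collect extractor output per sentence, ready for manual annotation
    rows = []
    for i, sentence in enumerate(my_sample):
        opinions = extract_opinions(sentence, target_tokens)
        rows.append({"id": i,
                     "sentence": sentence.text,
                     "opinions": sorted(opinions),
                     "extensions": [],   # note by hand which of A-E applied
                     "correct": None})   # mark by hand after inspection
    return rows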

Section 3.3 (5 marks)

In this subsection, you should discuss ways in which your opinion extractor could be improved to overcome the problems that you have observed in Subsection 3.2.
