[The lecturer will go through the specification at the beginning of Lecture 11]
COMP5046 Assignment 2 [20 marks]
Question Answering
In this assignment, you are to propose and implement a QA (Question Answering) framework using a sequence model and different NLP features. The QA framework should be able to read a document/text and answer questions about it. The detailed requirements for each implementation step are specified in the following sections. Note that the lab exercises are a good starting point for the assignment; the relevant lab exercises are listed in each section.
1. Data for Assignment 2 [Compulsory]
In Assignment 2, you are asked to use the Microsoft Research WikiQA Corpus. The WikiQA corpus includes a set of question and sentence pairs, collected and annotated for research on open-domain question answering. The questions were derived from Bing query logs, and each question is linked to a Wikipedia page that potentially has the answer. More detail on this data can be found in the paper WikiQA: A Challenge Dataset for Open-Domain Question Answering (Yang et al. 2015).
Download datasets
You will be provided with two datasets: a training dataset and a testing dataset. Both datasets contain the following attributes: QuestionID, Question, DocumentID, DocumentTitle, SentenceID, Sentence, Label (the sentence is an answer sentence if Label=1). If you want to explore or use the full dataset, you can download it via the Link
Data Wrangling: You first need to wrangle the dataset. As you can see in Figure 1, each row corresponds to one sentence of a document. You need to construct three different types of data for training the model: Question, Document and Answer. To construct the document data, concatenate (with a space) the sentences that have the same DocumentID. To construct the answer data, use the sentence whose Label is 1.
Figure 1. Sample Raw Data
Note
Some questions do not have an answer.
Some documents have multiple questions.
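The data wrangling step can be sketched as follows. This is only an illustration, not a required implementation: the function name `wrangle`, the output keys and the sample rows are hypothetical, and the rows are assumed to have already been read into dictionaries keyed by the attribute names listed above. It also assumes that a sentence repeated for several questions on the same document keeps the same SentenceID, so that it is added to the document only once, and that the first sentence with Label 1 is taken as the answer.

```python
def wrangle(rows):
    """Group sentence-level rows into question/document/answer records.

    Assumes every row has the keys QuestionID, Question, DocumentID,
    SentenceID, Sentence and Label, and that rows are in document order.
    """
    doc_sentences = {}  # DocumentID -> sentences in first-seen order
    seen = set()        # SentenceIDs already added to a document
    records = {}        # QuestionID -> record under construction

    for row in rows:
        doc_id = row["DocumentID"]
        # Assumption: repeated sentences share a SentenceID, so add each once.
        if row["SentenceID"] not in seen:
            seen.add(row["SentenceID"])
            doc_sentences.setdefault(doc_id, []).append(row["Sentence"])

        record = records.setdefault(
            row["QuestionID"],
            {"question": row["Question"], "doc_id": doc_id, "answer": ""},
        )
        # Label is compared as a string because it usually comes from a text file.
        if str(row["Label"]) == "1" and not record["answer"]:
            record["answer"] = row["Sentence"]

    # Questions without an answer keep an empty answer string.
    return [
        {
            "question": r["question"],
            "document": " ".join(doc_sentences[r["doc_id"]]),
            "answer": r["answer"],
        }
        for r in records.values()
    ]


# Made-up rows for illustration only (not taken from the corpus).
FIELDS = ("QuestionID", "Question", "DocumentID", "SentenceID", "Sentence", "Label")
sample_rows = [dict(zip(FIELDS, values)) for values in [
    ("Q1", "how do glacier caves form", "D1", "D1-0", "Glacier caves are found in ice.", "0"),
    ("Q1", "how do glacier caves form", "D1", "D1-1", "A glacier cave forms within a glacier.", "1"),
    ("Q2", "who named the glacier", "D1", "D1-0", "Glacier caves are found in ice.", "0"),
    ("Q2", "who named the glacier", "D1", "D1-1", "A glacier cave forms within a glacier.", "0"),
]]

records = wrangle(sample_rows)
```

Here the second question has no sentence with Label 1, so its answer stays empty, which covers the first note above; both questions share one document, which covers the second.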
2. QA Framework Implementation [13 Marks]
In Assignment 2, you are to propose and implement an open-ended QA framework using word embeddings, different combinations of features, and a deep learning model. The following architecture gives an overview of the QA framework. (Click this link to view the high resolution image.)
Figure 2. Overview of the architecture of the QA framework
1) Word Embeddings and Feature Extraction [5 marks]
You are asked to generate the word vectors by using a word embedding model and different types of features. For example, in Figure 2, the word Perito is converted to the vector [word2vec, PER, 9, NNP, ].
1. Word embedding - refer to Lab 2 and 10
You are to apply a pre-trained word embedding model, such as word2vec (CBOW or skip-gram), fastText, or GloVe. For example, you can find various pre-trained word embedding models at the following link: https://github.com/RaRe-Technologies/gensim-data#datasets
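One possible way to turn a pre-trained model into an embedding matrix for the input layers is sketched below. The function name `build_embedding_matrix`, the `<pad>` and `<unk>` entries and the toy vectors are all hypothetical; `vectors` is assumed to be any mapping from a word to a list of floats, here a small dictionary standing in for a real pre-trained model. Mapping unknown words to a zero vector is just one simple choice among several.

```python
def build_embedding_matrix(vocab, vectors, dim):
    """Return (word_to_index, matrix) with one row per vocabulary word.

    Index 0 is reserved for padding and index 1 for unknown words; both
    rows are all zeros. Words missing from `vectors` also get a zero row.
    """
    word_to_index = {"<pad>": 0, "<unk>": 1}
    matrix = [[0.0] * dim, [0.0] * dim]
    for word in vocab:
        if word in word_to_index:
            continue  # skip duplicates in the vocabulary
        word_to_index[word] = len(matrix)
        if word in vectors:
            matrix.append([float(x) for x in vectors[word]])
        else:
            matrix.append([0.0] * dim)
    return word_to_index, matrix


# Toy 3-dimensional vectors standing in for a real pre-trained model.
toy_vectors = {"glacier": [0.1, 0.2, 0.3], "cave": [0.4, 0.5, 0.6]}
word_to_index, matrix = build_embedding_matrix(
    ["glacier", "cave", "perito"], toy_vectors, dim=3
)
```

With gensim, the models in the linked repository can typically be loaded through `gensim.downloader.load(name)`, and the returned object supports `word in model` and `model[word]`, so it could be passed in place of `toy_vectors`.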
2. Feature extraction
Different types of features should be extracted in order to improve performance and to evaluate different combinations of model specification (discussed further in the evaluation part of the documentation section). You are asked to extract at least four types of features from the following list:
PoS tags - refer to Lab 6
TF-IDF - refer to Lecture 2
Named Entity tags - refer to Lab 9
Dependency Path (such as head or dependency relations) - refer to Lab 7
Word match feature: check whether the word appears in the question, using decapitalisation or lemmatization - refer to Lab 5 and Lecture 11
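Two of the listed features, word match and TF-IDF, can be sketched in plain Python as follows. The function names and the example tokens are hypothetical, the word match uses decapitalisation only (no lemmatization), and the TF-IDF weighting shown (term count divided by document length, times log(N / df)) is one common variant assumed here rather than one prescribed by this specification.

```python
import math
from collections import Counter


def word_match_feature(doc_tokens, question_tokens):
    """Return 1 for each document token that also occurs in the question.

    Tokens are compared after lowercasing (decapitalisation).
    """
    question_words = {t.lower() for t in question_tokens}
    return [1 if t.lower() in question_words else 0 for t in doc_tokens]


def tf_idf(documents):
    """Return one {word: score} dict per tokenised document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


question = ["How", "are", "glacier", "caves", "formed"]
document = ["A", "Glacier", "cave", "is", "formed", "in", "ice"]
match = word_match_feature(document, question)

scores = tf_idf([["ice", "cave"], ["ice", "age"]])
```

In this example "cave" does not match "caves", which is the kind of case lemmatization would be expected to catch.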
2) Sequence Model (RNN with Attention) [5 marks]
- refer to Lecture 11 and Lab 11
In this QA framework, you are to implement a sequence model using an RNN (such as RNN, LSTM, Bi-LSTM, GRU, Bi-GRU, etc.) with attention. Figure 2 shows a Bi-LSTM-based model as an example. As can be seen in Figure 2, your sequence model should have two input layers (one for the Question and another for the Document) and one output layer (Answer). The output layer should predict the answer (start token and end token). Your sequence model is also required to include an attention layer to improve performance (this needs to be shown in your architecture). The recommended positions for the attention layer are given in the blue box (top-right) of Figure 2.
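Since the output layer predicts a start token and an end token, the training targets can be derived by locating the answer sentence inside the document. A minimal sketch, assuming the document and the answer are tokenised in the same way (whitespace splitting is used here purely for illustration) and using the hypothetical helper name `find_answer_span`:

```python
def find_answer_span(doc_tokens, answer_tokens):
    """Return (start, end) token indices of the answer inside the document.

    `end` is inclusive. Returns None when the answer is empty or not found,
    which covers questions that have no answer.
    """
    n = len(answer_tokens)
    if n == 0:
        return None
    for start in range(len(doc_tokens) - n + 1):
        if doc_tokens[start:start + n] == answer_tokens:
            return start, start + n - 1
    return None


# Made-up example text, tokenised by whitespace.
doc_tokens = "the ice is old . a glacier cave forms in ice .".split()
answer_tokens = "a glacier cave forms in ice .".split()
span = find_answer_span(doc_tokens, answer_tokens)
```

How to encode the "no answer" case for the model (for example with a special index) is left open here and would be a design decision to justify in the report.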
For the sequence model, you can use a single layer or multiple layers, but you should determine and report the optimal number of layers.
The detailed architecture of your sequence model needs to be drawn and described in the report (refer to section 3, Documentation). You need to justify (in the report) why you chose the specific type of RNN and why you placed the attention layer in the specific position.
The final trained model needs to be submitted in your Python package.
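To make the role of the attention layer concrete, the sketch below computes dot-product attention in plain Python. This specification does not prescribe a particular attention variant, so dot-product scoring is an assumption chosen for simplicity; the function names and the toy vectors are hypothetical, and in a real model this would be a layer inside your deep learning framework operating on RNN hidden states.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, states):
    """Attend over `states` (one vector per token) using `query`.

    Returns (weights, context): the weights sum to 1 and the context is
    the weighted sum of the state vectors.
    """
    # Score each state by its dot product with the query.
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, states))
        for i in range(dim)
    ]
    return weights, context


# Toy example: the query is most similar to the first state.
weights, context = dot_product_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

For instance, `query` could stand for a question representation and `states` for the document hidden states, so that document tokens relevant to the question receive larger weights.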
3) Testing [3 marks]
You are to implement a testing program that uses the trained model. When the testing program is executed, it should show the testing results for different combinations of model specification (the different features that you used in section 2.1). The testing results should include precision, recall and F-value - refer to Lab 9.
You need to write a manual (readme) for the assessor. Your manual should explain how to test your program, and should also include a list of the packages (with versions) that you used. If you work on Google Colab or Jupyter Notebook (.ipynb), your manual should tell the assessor where to upload the required files (trained model, dataset, etc.), unless you have a function that downloads the required files from a URL or Google Drive: the assessor will use Google Colab to open your ipynb file.
The testing results and discussion should be described in the report (refer to section 3, Documentation).
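As an illustration of the required metrics, the sketch below computes precision, recall and F-value from the token overlap between a predicted answer and the gold answer. Measuring overlap at the token level is an assumption made here, not something this specification prescribes, as is treating an empty prediction for a question without an answer as a perfect score; the function name is hypothetical.

```python
from collections import Counter


def precision_recall_f1(predicted_tokens, gold_tokens):
    """Return (precision, recall, f1) based on token overlap."""
    # Assumption: predicting nothing for an unanswerable question is correct.
    if not predicted_tokens and not gold_tokens:
        return 1.0, 1.0, 1.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(predicted_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(predicted_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = ["a", "glacier", "cave"]
gold = ["a", "glacier", "cave", "forms"]
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Averaging these values over all test questions, once per feature combination, would give the kind of comparison the testing program is asked to show.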
3. Documentation [6 Marks]
Please download the Assignment 2 report template (4 pages maximum). You should submit a PDF version of the Assignment 2 report.
4. Programming (coding) styles [1 Mark]
Your program needs to be easily readable and well commented. The following are expected to be satisfied:
Readability: Easy to read and maintain
Consistency & Naming: Names are consistent in style
Coding Comments: Comments clarify meaning where needed
Robustness: Handles erroneous or unexpected input
Assignment 2 Submission Method
Due date: 11:59 PM, Sunday 09 June 2019
Submission: Canvas Assignment 2 Submission Box
You must submit two files:
1. zip file (for section 1 and 2): program (either python package or ipynb file), trained model, dataset, manual (readme) and all other required files.
file name: your_unikey_Ass2.zip
2. pdf file (for section 3): a report with the given template
file name: your_unikey_Ass2.pdf