
Deadline + Late Penalty
$\textbf{Note:}$ It will take you quite some time to complete this project, so we earnestly recommend that you start working as early as possible. You should read the specs carefully at least 2-3 times before you start coding.
$\textbf{Submission deadline for the Project (Part-1) is 20:59:59 (08:59:59 PM) on 4th Nov, 2019.}$
$\textbf{LATE PENALTY: 10\% on day-1 and 20\% on each subsequent day.}$

Instructions
1. This notebook contains instructions for $\textbf{COMP6714-Project (Part-1)}$. We will release the instructions for $\textbf{Part-2 of the Project}$ in a separate notebook.
2. You are required to complete your implementation for Part-1 in a file project_part1.py provided along with this notebook. Please $\textbf{DO NOT ALTER}$ the name of the file.
3. You are not allowed to print out unnecessary output. We will not consider anything printed on the screen; all results should be returned in appropriate data structures via the corresponding functions.
4. You can submit your implementation for Project (Part-1) via the submission system: http://kg.cse.unsw.edu.au/submit/ . We have already sent out the invitations for you to join the submission system. In case of problems, please post your request @ Piazza.
5. For each question, we have provided detailed instructions along with the question headings. In case of problems, you can post your query @ Piazza.
6. You are allowed to add other functions and/or import modules (you may have to for this project), but you are not allowed to define global variables. Only functions are allowed in project_part1.py.
7. You should not import unnecessary or non-standard modules/libraries. Loading such libraries at test time will lead to errors and hence 0 marks for your project. If you are not sure, please ask @ Piazza.
8. We will provide immediate feedback on your submission. You can access your scores using the online submission portal on the same day.
9. For the Final Evaluation, we will be using a different dataset, so your final scores may vary.
10. You are allowed a limited number of Feedback Attempts $\textbf{(15 Attempts for each student)}$; we will use your LAST submission for the Final Evaluation.
Allowed Libraries:
You are required to write your implementation for the project (part-1) using Python 3.6.5. You are only allowed to use the following python libraries:
$\textbf{spacy (v2.1.8)}$

Q1: Compute TF-IDF score for query (80 Points)

Introduction
In this project, you are required to compute the $TF\text{-}IDF$ score of a document $D_{j}$ $\textit{w.r.t.}$ an input query $Q$ and a Dictionary of Entities ($DoE$).
Inputs (Q1):
Inputs to your model are as follows:
1.Documents ($D$) as a dictionary with $key:$ doc_id; $value:$ document text
2.Query ($Q$), as a string of words
3.Dictionary of Entities ($DoE$), with $key:$ entity; $value:$ entity_id
The procedure for computing the $TF\text{-}IDF$ score consists of the following steps:
1. $\textbf{TF-IDF index construction for Entities and Tokens}$
2. $\textbf{Split the query into lists of Entities and Tokens}$
3. $\textbf{Query Score Computation}$
A detailed description of these steps is given below:

1. TF-IDF index construction for Entities and Tokens
We require you to separately learn a TF-IDF index for tokens ($TF\text{-}IDF_{token}$) and for entities ($TF\text{-}IDF_{entity}$). The computation of each TF-IDF index is as follows:
TF-IDF index For Tokens:
The term frequency of the token $t$ in a document $D_{j}$ is computed as follows:
$$ TF_{token}(t,D_{j}) = \#\; \text{of times token } t \text{ appears in } D_{j} $$
To de-emphasize high token frequencies, we apply a double logarithm to normalize the term frequency. The normalized term frequency of token $t$ is computed as follows:
$$ TF_{norm\_token}(t,D_{j}) = 1.0 + \ln(1.0 + \ln(TF_{token}(t,D_{j}))) $$
And, the Inverse Document Frequency of the token $t$ is computed as follows:
$$ IDF_{token}(t) = 1.0 + \ln\left(\frac{\text{total } \#\text{ of docs}}{1.0 + \#\text{ of docs containing token } t}\right) $$
The TF-IDF score of token $t$ in document $D_{j}$ is computed as:
$$ TF\text{-}IDF_{token}(t,D_{j}) = TF_{norm\_token}(t,D_{j}) \times IDF_{token}(t) $$
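For concreteness, here is a minimal sketch of the token-side computation, assuming the raw term frequencies and document frequencies have already been collected into plain dictionaries; the names tfidf_token_score, tf_token, df_token and num_docs are illustrative only and not part of the required interface:

import math

def tfidf_token_score(t, doc_id, tf_token, df_token, num_docs):
    # tf_token: {token: {doc_id: count}}; df_token: {token: number of docs containing the token}
    tf = tf_token.get(t, {}).get(doc_id, 0)
    if tf == 0:
        return 0.0  # zero term frequency => TF-IDF score of 0.0 (see the Note below)
    tf_norm = 1.0 + math.log(1.0 + math.log(tf))                 # double-log normalisation
    idf = 1.0 + math.log(num_docs / (1.0 + df_token.get(t, 0)))  # smoothed IDF
    return tf_norm * idf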
TF-IDF index for Entities:
The term frequency of the entity $e$ in a document $D_{j}$ is computed as follows:
$$ TF_{entity}(e,D_{j}) = \#\; \text{of times entity } e \text{ appears in } D_{j} $$
We simply use natural log to normalize the term frequency of the entities, as given below:
$$ TF_{norm\_entity}(e,D_{j}) = 1.0 + \ln(TF_{entity}(e,D_{j})) $$
And, the Inverse Document Frequency of the entity $e$ is computed as follows:
$$ IDF_{entity}(e) = 1.0 + \ln\left(\frac{\text{total } \#\text{ of docs}}{1.0 + \#\text{ of docs containing entity } e}\right) $$
The TF-IDF score of the entity $e$ in the document $D_{j}$ is computed as:
$$ TF\text{-}IDF_{entity}(e,D_{j}) = TF_{norm\_entity}(e,D_{j}) \times IDF_{entity}(e) $$
$\textbf{Note:}$ We assign TF-IDF score = 0.0 for the cases where the term frequency (TF) of the token and/or entity is ZERO.
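The entity-side index follows the same pattern and differs only in the single-log normalisation; a corresponding sketch (again with illustrative names, mirroring the token-side helper above) is:

import math

def tfidf_entity_score(e, doc_id, tf_entity, df_entity, num_docs):
    # tf_entity: {entity: {doc_id: count}}; df_entity: {entity: number of docs containing the entity}
    tf = tf_entity.get(e, {}).get(doc_id, 0)
    if tf == 0:
        return 0.0  # zero term frequency => TF-IDF score of 0.0
    tf_norm = 1.0 + math.log(tf)                                  # single natural log for entities
    idf = 1.0 + math.log(num_docs / (1.0 + df_entity.get(e, 0)))
    return tf_norm * idf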

2. Split the Query into Entities and Tokens:
First, you are required to split the query ($Q$) into all possible combinations of free keywords, i.e., tokens ($K = \{k_{i}\}_{i=1}^{N}$) and entities ($E = \{e_{i}\}_{i=1}^{N}$), where the entities correspond to a subset of the entities in $DoE$ that can be formed by individual tokens and/or combinations of tokens in $Q$. This process is explained below:
$\textbf{Step 1:}$ We look for probable entities in $Q$ by considering individual query tokens and/or combinations of query tokens formed in the increasing order of the query string. Amongst them, we only select the entities present in $DoE$.
$\textbf{Step 2:}$ Based on the list of entities selected in $\textbf{Step 1}$, enumerate all possible subsets of entities.
$\textbf{Step 3:}$ Filter the subsets of entities found in $\textbf{Step 2}$ such that, for each subset, the token count does not exceed the corresponding token count in $Q$. We treat each filtered subset as the final entity list of the corresponding query split.
$\textbf{Step 4:}$ For each filtered entity subset, the rest of the keywords in the query, i.e., $(Q \setminus wordsInEntities(e_{i}))$, are treated as the tokens of the query split.
(A code sketch of this procedure is given after the notes below.)
Formally, let the query be a string of tokens, e.g., $Q = A\; B\; C\; D\; E\; F\; G$, and the dictionary of entities be $DoE = \{AB, DF, GK\}$. The list of entities formed by individual query tokens and/or combinations of query tokens (and contained in $DoE$) is $[AB, DF]$, and upon enumerating the possible subsets of these entities, we get the following possible splits of the query into lists of entities and tokens:
$\textbf{Split-1:}$ $e_{1} = []$; $k_{1} = [A,B,C,D,E,F,G]$
$\textbf{Split-2:}$ $e_{2} = [AB]$; $k_{2} = [C,D,E,F,G]$
$\textbf{Split-3:}$ $e_{3} = [DF]$; $k_{3} = [A,B,C,E,G]$
$\textbf{Split-4:}$ $e_{4} = [AB, DF]$; $k_{4} = [C,E,G]$
$\textbf{Note:}$
1. In order to split the query, we only care about the subset of entities contained in $DoE$ that can be formed by individual tokens and/or combinations of tokens in $Q$.
2. Entities in $DoE$ may correspond to single and/or multiple tokens, e.g., in the example given above, $A$, $ABC$, etc., may also correspond to valid entities and may appear in $DoE$.
3. The maximum number of query splits is upper-bounded by the number of subsets of entities in $DoE$ that can be formed by the tokens in $Q$.
4. For every query split, the leftover keywords $Q \setminus wordsInEntities(e_{i})$ are considered as the corresponding token split.
5. In order to form entities, we only combine keywords in the increasing order of the elements in the query string. For example, in $Q = A\; B\; C\; D\; E\; F\; G$, strings such as $BA$, $CB$, etc., will not be considered as entities and hence will not appear in $DoE$.
6. In the example given above, if $DoE = \{AB, BC\}$, then there will be only three possible splits of the query. Because $Q$ contains only one instance of the token $B$, it is not possible to form a subset with both entities $[AB, BC]$, as that would require at least two instances of the token $B$ in $Q$ (also discussed in $\textbf{Step 3}$ above).
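To make Steps 1-4 concrete, here is a minimal sketch of the splitting procedure. It assumes the query has already been reduced to a list of tokens (query_tokens) and that the keys of DoE are whitespace-separated entity strings; the helper names are illustrative only:

from collections import Counter
from itertools import combinations

def is_ordered_subsequence(entity_tokens, query_tokens):
    # Step 1 check: the entity's tokens must occur in the query in increasing order.
    it = iter(query_tokens)
    return all(tok in it for tok in entity_tokens)

def split_query_sketch(query_tokens, doe):
    # Step 1: keep only the DoE entities that can be formed from the query tokens.
    found = [e for e in doe if is_ordered_subsequence(e.split(), query_tokens)]
    q_counts = Counter(query_tokens)
    splits = []
    # Step 2: enumerate every subset of the found entities (including the empty subset).
    for r in range(len(found) + 1):
        for subset in combinations(found, r):
            need = Counter(tok for e in subset for tok in e.split())
            # Step 3: discard subsets needing more copies of a token than Q provides.
            if any(need[t] > q_counts[t] for t in need):
                continue
            # Step 4: the leftover query tokens form the token part of this split.
            leftover = q_counts - need
            splits.append({'entities': list(subset), 'tokens': list(leftover.elements())})
    return splits

Ignoring the Spacy-specific processing of the query, applying this sketch to the toy query and $DoE$ given in the illustration section below reproduces the three splits listed there.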

3. Query Score Computation:
Next, you are required to use the corresponding $TF\text{-}IDF$ index to separately compute scores for the list of tokens and the list of entities of each query split, i.e., $(k_{i},e_{i})$, $\textit{w.r.t.}$ the document $D_{j}$, as follows:
$$ s_{i1} = \sum_{entity \in e_{i}} TF_{norm\_entity}(entity,D_{j}) \times IDF_{entity}(entity) \\ s_{i2} = \sum_{token \in k_{i}} TF_{norm\_token}(token,D_{j}) \times IDF_{token}(token) \\ score_{i}(\{k_{i},e_{i}\}, D_{j}) = s_{i1} + \lambda \times s_{i2}\,\big|_{\lambda = 0.4} $$
Finally, you are required to return the maximum score among all the query splits, i.e.,
$$ score_{max} = \max\{score_{i}\}_{i=1}^{N} $$
Note that in the above-mentioned equations, we use two separate $TF\text{-}IDF$ indexes, i.e., $TF\text{-}IDF_{token}$ and $TF\text{-}IDF_{entity}$, to compute the scores for the token split and the entity split of the query, respectively.
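A minimal sketch of this final step, assuming scoring helpers like the ones sketched earlier (here wrapped as callables entity_score(e, doc_id) and token_score(t, doc_id) that already know the index statistics), could be:

def max_score_over_splits(splits, doc_id, entity_score, token_score, lam=0.4):
    # splits: list of {'tokens': [...], 'entities': [...]} dictionaries
    best_score, best_split = float('-inf'), None
    for split in splits:
        s1 = sum(entity_score(e, doc_id) for e in split['entities'])   # entity part of the score
        s2 = sum(token_score(t, doc_id) for t in split['tokens'])      # token part of the score
        score = s1 + lam * s2                                          # combined score, lambda = 0.4
        if score > best_score:
            best_score, best_split = score, split
    return best_score, best_split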

Some key instructions regarding TF-IDF indexing, parsing and text processing are as follows:
Key Instructions:
1. Note that for a given set of documents, you only need to index the documents once and later use your index to compute the query scores.
2. You are only allowed to use Spacy (v2.1.8) for text processing and parsing. You can install Spacy via the following web-link: Spacy
3. We assume the parsing result of Spacy is always correct; we will not cater to any inconsistency in Spacy's parsing results.
4. All the tokens in the documents ($D$), query ($Q$) and dictionary of entities ($DoE$) are case-sensitive. You $\textbf{SHOULD NOT ALTER}$ the case of the tokens.
5. You are required to compute two separate indexes, i.e., (i) for tokens, and (ii) for entities, such that:
1. In order to compute the index of the entities (i.e., $TF\text{-}IDF_{entity}$), you should index all the entities detected by Spacy irrespective of their entity types and/or presence in $DoE$. For details on Spacy's parsing and entity recognition, please see the web-link: Spacy Parsing
2. For single-word entities, e.g., Trump, you should only compute the index corresponding to the entities. For such entities, you should not consider the corresponding token when computing the TF-IDF index of tokens.
3. For multi-word entities, e.g., New York Times, the individual tokens corresponding to the entities should be considered as free tokens and should be indexed during the TF-IDF index construction for tokens (i.e., $TF\text{-}IDF_{token}$).
Stopwords: You should only use the token's attribute is_stop on a string parsed by Spacy to declare a token as a stopword and eventually remove it. This also applies to stopwords within multi-word entities, e.g., Times of India.
Punctuation: You should only use the token's attribute is_punct on a string parsed by Spacy to declare a token as a punctuation mark and eventually remove it.
Special Cases: You should not explicitly strip out punctuation or amend Spacy's tokenization and parsing results. Some examples in this regard are as follows (a sketch of the corresponding Spacy calls is given after this list):
1. In the sentence "I am going to U.S.", the correctly extracted entity is "U.S.".
2. Likewise, in the sentence "I am going to school.", Spacy will extract the token "school" and will consider the full stop "." as a punctuation mark.
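A minimal sketch of how these rules translate into Spacy calls is shown below. It assumes the en_core_web_sm model is installed (the spec only fixes the Spacy version, not the model), and the handling of single-word versus multi-word entities is one plausible reading of key instruction 5 above:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('New York Times mentioned an interesting story about Trump.')

# Entities are taken exactly as detected by Spacy, irrespective of type or DoE membership.
entities = [ent.text for ent in doc.ents]

# Positions of tokens that form single-word entities: indexed only on the entity side.
single_word_ent_positions = {ent.start for ent in doc.ents if len(ent) == 1}

# Free tokens: drop stopwords and punctuation strictly via is_stop / is_punct, and skip
# tokens that constitute single-word entities; tokens inside multi-word entities remain free tokens.
tokens = [tok.text for tok in doc
          if not tok.is_stop
          and not tok.is_punct
          and tok.i not in single_word_ent_positions]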

Toy Example for Illustration
Here, we provide a small toy example for illustration: Let the dictionary of documents ($D$) be:
documents = {1: 'President Trump was on his way to new New York in New York City.', 2: 'New York Times mentioned an interesting story about Trump.', 3: 'I think it would be great if I can travel to New York this summer to see Trump.'}

The term frequencies corresponding to the tokens (i.e., $TF_{token}$) are shown below as a dictionary of dictionaries of the form {token: {doc_id: count}}.
{'President': {1: 1}, 'way': {1: 1}, 'new': {1: 1}, 'New': {1: 2, 2: 1, 3: 1}, 'York': {1: 2, 2: 1, 3: 1}, 'City': {1: 1}, 'Times': {2: 1}, 'mentioned': {2: 1}, 'interesting': {2: 1}, 'story': {2: 1}, 'think': {3: 1}, 'great': {3: 1}, 'travel': {3: 1}, 'summer': {3: 1}}

Likewise, the term frequencies corresponding to the entities (i.e., $TF_{entity}$) are shown below as a dictionary of dictionaries of the form {entity: {doc_id: count}}.
{'Trump': {1: 1, 2: 1, 3: 1}, 'New York': {1: 1, 3: 1}, 'New York City': {1: 1}, 'New York Times': {2: 1}, 'this summer': {3: 1}}

Let the query ($Q$) be:
Q = 'New York Times Trump travel'

Let the $DoE$ be:
DoE = {'New York Times': 0, 'New York': 1, 'New York City': 2}

The possible query splits are:
$e_1$ = [], $k_1$ = [New, York, Times, Trump, travel]
$e_2$ = [New York Times], $k_2$ = [Trump, travel]
$e_3$ = [New York], $k_3$= [Times, Trump, travel]
$\textbf{Note:}$ We cannot select a query split whose entity part is the combination of the entities $e_{i}$ = [New York, New York Times], because there is only a single instance of each of the tokens New and York in $Q$.

For doc_id = 3, after applying the formulas given in sub-headings 2 and 3 above, we get the following scores for all the query splits:
query = {'tokens': ['New', 'York', 'Times', 'Trump', 'travel'], 'entities': []} {'tokens_score': 2.8301009632046026, 'entities_score': 0.0, 'combined_score': 1.132040385281841}
query = {'tokens': ['Trump', 'travel'], 'entities': ['New York Times']} {'tokens_score': 1.4054651081081644, 'entities_score': 0.0, 'combined_score': 0.5621860432432658}
query = {'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']} {'tokens_score': 1.4054651081081644, 'entities_score': 1.0, 'combined_score': 1.562186043243266}
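For example, the combined score of the third split is obtained as entities_score $+\; \lambda \times$ tokens_score $= 1.0 + 0.4 \times 1.4054651081081644 = 1.562186043243266$.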

And the maximum score max_score among all the query splits is:
1.562186043243266
And, the corresponding query split is:
{'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']}

Output Format (Q1):
Your output should be a tuple of the form: (max_score, {'tokens': [], 'entities': []}), where
max_score corresponds to the maximum TF-IDF score among all the query splits based on $Q$ and $DoE$.
The second element is the query split corresponding to max_score, i.e., a Python dictionary containing the tokens and entities lists of that query split: {'tokens': [], 'entities': []}.
Running Time (Q1):
On CSE machines, your implementation for $\textbf{parsing and indexing}$ approximately 500 documents of average length 500 tokens $\textbf{SHOULD NOT take more than 120 seconds}$.
Once all the documents are indexed, $\textbf{the query splitting and score}$ computation $\textbf{SHOULD NOT take more than 15 seconds}$.

How we test implementation of Q1
In[1]:
import pickle
import project_part1 as project_part1
In[2]:
fname = './Data/sample_documents.pickle'
documents = pickle.load(open(fname, 'rb'))

documents
Out[2]:
{1: 'President Trump was on his way to new New York in New York City.',
 2: 'New York Times mentioned an interesting story about Trump.',
 3: 'I think it would be great if I can travel to New York this summer to see Trump.'}
In[3]:
## Step- 1. Construct the index
index = project_part1.InvertedIndex()

index.index_documents(documents)
In[4]:
## Test cases
Q = 'New York Times Trump travel'
DoE = {'New York Times': 0, 'New York': 1, 'New York City': 2}
doc_id = 3

## 2. Split the query
query_splits = index.split_query(Q, DoE)

## 3. Compute the max-score
result = index.max_score_query(query_splits, doc_id)
result
Out[4]:
(1.562186043243266,
 {'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']})

Project Submission and Feedback
For project submission, you are required to submit the following files:
1. Your implementation in a Python file project_part1.py.
2. A report project_part1.pdf. You need to write a concise and simple report illustrating the implementation details of $Q1$.
Note: Every student is entitled to 15 Feedback Attempts (use them wisely); we will use the last submission for the final evaluation of Part-1.
