Name: [Solved] CSE225 PROJECT #1-Text representation with Higher-Order Paths

The advantages of using higher-order paths [1] between documents are illustrated in Fig. 1. The figure shows three documents, d₁, d₂ and d₃, which contain the term sets {t₁, t₂, t₃}, {t₃, t₄, t₅} and {t₄, t₅} respectively. Under a traditional similarity measure based on shared terms (e.g. the dot product), the similarity between d₁ and d₃ is zero because they share no terms. In the context of the dataset, however, the two documents are related through d₂, as can be seen in Fig. 1. Higher-order paths between documents therefore make it possible to obtain a non-zero similarity between d₁ and d₃, which the traditional Bag of Words (BOW) [2] representation cannot provide. The value grows as the number of interconnecting documents like d₂ between d₁ and d₃ increases. Such a connection may arise because the two documents are written on the same topic using two different but semantically close sets of terms.

This project aims to represent these higher-order paths using linked lists. It is a programming assignment in C in which you design a linked-list-based algorithm that builds an efficient representation of the documents.

Fig. 1. a) Illustration of higher-order paths

1st-order term co-occurrence: {t₁, t₂}, {t₂, t₃}, {t₃, t₄}, {t₂, t₄}, {t₄, t₅}

2nd-order term co-occurrence: {t₁, t₃}, {t₁, t₄}, {t₂, t₅}, {t₃, t₅}

3rd-order term co-occurrence: {t₁, t₅}

Fig. 2. b) Graphical demonstration of first-order, second-order and third-order paths between terms through documents

Your program needs to open and read the text files under the following directories: econ, magazine and health. These are three categories of the 1150Haber dataset [3]. The number of documents in each category is arbitrary, as is the number of terms in each document; in other words, the files may be of any length.

Your program is expected to do the following:

a) (60 points) Read all the documents under all the categories and build a Linked-List (MLL). Each node in the MLL represents a distinct term, and every term that appears in the documents must be in the MLL. The same word may occur in different documents or several times in the same document; do not add a term to the MLL if it already exists. In other words, take care not to insert duplicate records. After reading and storing all the data in the linked list, find the 1st-, 2nd- and 3rd-order term co-occurrences as shown below.

1st-order term co-occurrence: {t₁, t₂}, {t₂, t₃}, {t₃, t₄}, {t₂, t₄}, {t₄, t₅}

2nd-order term co-occurrence: {t₁, t₃}, {t₁, t₄}, {t₂, t₅}, {t₃, t₅}

3rd-order term co-occurrence: {t₁, t₅}
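Before any co-occurrences can be found, the terms have to be stored without duplicates. The following is a minimal sketch of one way to do that, not the required design: the struct layout, the count field and the function names mll_find_or_insert, mll_length and mll_free are all hypothetical and chosen only for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* One node of the term list. The field names are illustrative assumptions. */
typedef struct TermNode {
    char *term;              /* heap-allocated copy of the word */
    int count;               /* how many times the word has been seen */
    struct TermNode *next;
} TermNode;

/* Returns the node for `word`. A new node is appended only when the word is
 * not already in the list; otherwise the existing node's count is increased.
 * Returns NULL if memory allocation fails. */
TermNode *mll_find_or_insert(TermNode **head, const char *word) {
    TermNode **link = head;
    while (*link != NULL) {
        if (strcmp((*link)->term, word) == 0) {
            (*link)->count++;          /* duplicate: update, do not insert */
            return *link;
        }
        link = &(*link)->next;
    }
    TermNode *node = malloc(sizeof *node);
    if (node == NULL) return NULL;
    size_t len = strlen(word) + 1;
    node->term = malloc(len);
    if (node->term == NULL) {
        free(node);
        return NULL;
    }
    memcpy(node->term, word, len);
    node->count = 1;
    node->next = NULL;
    *link = node;                      /* append at the end of the list */
    return node;
}

/* Number of distinct terms currently stored. */
size_t mll_length(const TermNode *head) {
    size_t n = 0;
    for (; head != NULL; head = head->next) n++;
    return n;
}

/* Releases every node and its term string. */
void mll_free(TermNode *head) {
    while (head != NULL) {
        TermNode *next = head->next;
        free(head->term);
        free(head);
        head = next;
    }
}
```

With this layout each lookup walks the whole list, so inserting one word costs time linear in the number of distinct terms stored so far. That cost is worth stating in the analysis the assignment asks for.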

There are several ways to build a suitable linked-list data structure for finding higher-order paths. In this project, you are also expected to write an analysis of your algorithm (i.e. give the big-O time complexity of your code).
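One possible reading of the listing above is that the order of a term pair is the length of the shortest path between the two terms in a graph whose edges join terms appearing in the same document. The sketch below follows that reading, which is an assumption, and uses breadth-first search from every term. The toy corpus (d1 = {t1, t2}, d2 = {t2, t3, t4}, d3 = {t4, t5}) is also an assumption, chosen because it yields the same pairs as the listing. Fixed arrays replace linked lists here purely to keep the example short, and all names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define NUM_TERMS 5
#define NUM_DOCS 3

/* Hypothetical toy corpus: doc_terms[d][t] is 1 when document d+1
 * contains term t+1. */
static const int doc_terms[NUM_DOCS][NUM_TERMS] = {
    {1, 1, 0, 0, 0},   /* d1 = {t1, t2}     */
    {0, 1, 1, 1, 0},   /* d2 = {t2, t3, t4} */
    {0, 0, 0, 1, 1},   /* d3 = {t4, t5}     */
};

/* Fills dist[a][b] with the length of the shortest co-occurrence path
 * between terms a and b, or -1 when no path exists. */
void cooccurrence_orders(int dist[NUM_TERMS][NUM_TERMS]) {
    int adj[NUM_TERMS][NUM_TERMS] = {{0}};

    /* Two terms are adjacent when some document contains both. */
    for (int d = 0; d < NUM_DOCS; d++)
        for (int a = 0; a < NUM_TERMS; a++)
            for (int b = 0; b < NUM_TERMS; b++)
                if (a != b && doc_terms[d][a] && doc_terms[d][b])
                    adj[a][b] = 1;

    /* Breadth-first search from every term. */
    for (int src = 0; src < NUM_TERMS; src++) {
        int queue[NUM_TERMS], head = 0, tail = 0;
        for (int t = 0; t < NUM_TERMS; t++) dist[src][t] = -1;
        dist[src][src] = 0;
        queue[tail++] = src;
        while (head < tail) {
            int cur = queue[head++];
            for (int nxt = 0; nxt < NUM_TERMS; nxt++) {
                if (adj[cur][nxt] && dist[src][nxt] < 0) {
                    dist[src][nxt] = dist[src][cur] + 1;
                    queue[tail++] = nxt;
                }
            }
        }
    }
}

/* Writes every pair of the given order into buf as "{t1, t2}, {t2, t3}".
 * buf must have room for at least one byte; output stops when it is full. */
void format_order(int dist[NUM_TERMS][NUM_TERMS], int order,
                  char *buf, size_t size) {
    size_t used = 0;
    buf[0] = '\0';
    for (int a = 0; a < NUM_TERMS; a++) {
        for (int b = a + 1; b < NUM_TERMS; b++) {
            if (dist[a][b] != order) continue;
            int n = snprintf(buf + used, size - used, "%s{t%d, t%d}",
                             used ? ", " : "", a + 1, b + 1);
            if (n < 0 || (size_t)n >= size - used) return;
            used += (size_t)n;
        }
    }
}
```

Calling cooccurrence_orders and then format_order with order 1, 2 and 3 produces the same sets of pairs as the listing, although the pairs within a line are emitted in index order. With T terms and D documents, this array-based version spends O(D·T²) time building the adjacency matrix and O(T³) time on the searches; a linked-list implementation will have different costs, which the required analysis should derive for the structure actually used.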

Output:

1st-order term co-occurrence: {t₁, t₂}, {t₂, t₃}, {t₃, t₄}, {t₂, t₄}, {t₄, t₅}

2nd-order term co-occurrence: {t₁, t₃}, {t₁, t₄}, {t₂, t₅}, {t₃, t₅}

3rd-order term co-occurrence: {t₁, t₅}

and

big-O time (order) of your code

(20 points)

Output: Your program will output the following information:

The 10 most frequent words in the input set of documents for each category, sorted in descending order of term frequency (tf) and paired with their tf values. Example format:

Econ	Health	Magazine
Dolar,8	Operation,12	Cinema,24
Bank,7	Medicine,10	Actor,21
Strategy,6	Doctor,8	Theatre,18
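Ranking the terms of one category by tf can be sketched with the standard qsort function. The TermCount struct, the comparator and top_k_by_tf are hypothetical names, and the sketch assumes the per-category counts have already been copied out of the linked list into an array. Breaking ties alphabetically is also an assumption, since the assignment does not say how ties are ordered.

```c
#include <stdlib.h>
#include <string.h>

/* A term together with its frequency in one category (illustrative type). */
typedef struct {
    const char *term;
    int tf;
} TermCount;

/* qsort comparator: higher tf first, ties broken alphabetically. */
static int by_tf_desc(const void *a, const void *b) {
    const TermCount *x = a;
    const TermCount *y = b;
    if (x->tf != y->tf)
        return (y->tf > x->tf) - (y->tf < x->tf);
    return strcmp(x->term, y->term);
}

/* Sorts counts in place and returns how many entries to report:
 * k, or n when fewer than k terms exist. */
size_t top_k_by_tf(TermCount *counts, size_t n, size_t k) {
    qsort(counts, n, sizeof counts[0], by_tf_desc);
    return n < k ? n : k;
}
```

After the call, the first entries of the array can be printed as term,tf pairs in the format shown above. Sorting n distinct terms this way typically costs O(n log n) comparisons.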

Output: The top 10 words in the input set of documents for each category, sorted in descending order of term frequency × inverse document frequency (tf-idf) and paired with their tf-idf values.

IDF(t) = logₑ(N / n)

N: total number of documents
n: number of documents that contain term t

Econ	Health	Magazine
Dolar,1.8	Operation,2.12	Cinema,3.24
Bank,1.7	Medicine,2.10	Actor,3.21
Strategy,0.6	Doctor,1.8	Theatre,2.18
