[Solved] BIA660-Assignment6- Clustering and Topic Modeling




In this assignment, you'll need to use the following dataset:

text_train.json: This file contains a list of documents. It's used for training models.

text_test.json: This file contains a list of documents and their ground-truth labels. It's used for testing performance. This file is in the format shown below. Note that each document has a list of labels.

You can load these files using json.load().

Text                                        Labels
paraglider collides with hot air balloon    [Disaster and Accident, Travel & Transportation]
faa issues fire warning for lithium         [Travel & Transportation]
.
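A minimal sketch of loading the two files and keeping only the first label of each test document. It assumes that text_train.json is a JSON list of strings and that each entry of text_test.json is a [text, labels] pair; the table above suggests this layout but the assignment does not spell it out. The helper names load_train and load_test are hypothetical.

```python
import json


def load_train(train_file):
    """Return the training documents (assumed: a JSON list of strings)."""
    with open(train_file, encoding="utf-8") as f:
        return json.load(f)


def load_test(test_file):
    """Return (texts, first_labels) from the test file.

    Assumption: each entry is a [text, [label, ...]] pair.
    """
    with open(test_file, encoding="utf-8") as f:
        data = json.load(f)
    texts = [text for text, _ in data]
    # Keep only the first ground-truth label of each document.
    first_labels = [labels[0] for _, labels in data]
    return texts, first_labels
```

If the real files use a different layout (for example a list of dicts), only the two list comprehensions would need to change.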

Q1: K-Means Clustering

Define a function cluster_kmean() as follows:

Take two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json

Use KMeans to cluster documents in train_file into 3 clusters by cosine similarity.

Test the clustering model performance using test_file:

Predict the cluster ID for each document in test_file.

Let's only use the first label in the ground-truth label list of each test document, e.g. for the first document in the table above, you set the ground_truth label to Disaster and Accident only.

Apply the majority vote rule to dynamically map the predicted cluster IDs to the ground-truth labels in test_file. Be sure not to hardcode the mapping (e.g. write code like {0: "Disaster and Accident"}), because a cluster may correspond to a different topic in each run.

Calculate precision/recall/f-score for each label
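The majority vote rule above can be sketched in plain Python as follows. The function name majority_vote_mapping and the toy data are illustrative assumptions, not part of the assignment; the shortened label names are made up for brevity.

```python
from collections import Counter, defaultdict


def majority_vote_mapping(cluster_ids, true_labels):
    """Map each cluster ID to the most frequent ground-truth label inside it."""
    votes = defaultdict(Counter)
    for cid, label in zip(cluster_ids, true_labels):
        votes[cid][label] += 1
    # most_common(1) returns the label with the highest count in that cluster.
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}


# Toy data, made up for illustration only.
cluster_ids = [0, 0, 1, 1, 1, 2]
true_labels = ["Travel", "Travel", "Disaster", "Disaster", "Travel", "News"]

mapping = majority_vote_mapping(cluster_ids, true_labels)
predicted = [mapping[cid] for cid in cluster_ids]
```

Because the mapping is rebuilt from the data on every run, it stays valid even when the clustering assigns different IDs to the same topic. Note that two clusters can end up mapped to the same label if one label dominates both.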

This function returns nothing. It prints out the confusion matrix and the precision/recall/f-score.
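A sketch of how per-label precision, recall and F1 can be computed from paired true and predicted labels without any library. The helper name per_label_scores and the labels "A" and "B" are hypothetical; in practice a library report function would typically be used instead.

```python
from collections import Counter


def per_label_scores(true_labels, pred_labels):
    """Return {label: (precision, recall, f1)} from paired labels."""
    # pairs[(t, p)] is one cell of the confusion matrix.
    pairs = Counter(zip(true_labels, pred_labels))
    labels = sorted(set(true_labels) | set(pred_labels))
    scores = {}
    for label in labels:
        tp = pairs[(label, label)]
        predicted = sum(n for (_, p), n in pairs.items() if p == label)
        actual = sum(n for (t, _), n in pairs.items() if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data, made up for illustration only.
toy_true = ["A", "A", "A", "B", "B"]
toy_pred = ["A", "A", "B", "B", "B"]
toy_scores = per_label_scores(toy_true, toy_pred)
```

In the toy data, label "A" is predicted twice and both are correct (precision 1.0), but one of its three true documents is missed (recall 2/3).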

Q2: LDA Clustering

Define a function cluster_lda() as follows:

Take two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json

Use LDA to train a topic model with documents in train_file and the number of topics K = 3.

Predict the topic distribution of each document in test_file, and select only the topic with the highest probability as the predicted topic.

Evaluate the topic model performance as follows:

Similar to Q1, let's use the first label in the label list of test_file as the ground_truth label.

Apply majority vote rule to map the topics to the labels.

Calculate precision/recall/f-score for each label and print out precision/recall/f-score.

Return the topic distribution and the original ground-truth labels of each document in test_file.

Also, provide a document which contains:

a performance comparison between Q1 and Q2

a description of how you tune the model parameters, e.g. min_df, alpha, max_iter etc.
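For the prediction step in Q2, selecting the single most probable topic per document amounts to a row-wise argmax over the topic distribution. A minimal sketch, assuming the distribution is available as a list of rows with one probability per topic (for instance the document-topic matrix returned by a fitted LDA model's transform step); top_topics is a hypothetical helper name.

```python
def top_topics(doc_topic):
    """Return the index of the highest-probability topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


# Toy distributions over K = 3 topics, made up for illustration only.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
]
predicted_topics = top_topics(doc_topic)
```

The resulting topic indices play the same role as the cluster IDs in Q1, so the same majority vote mapping can be applied to them.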

Q3 (Bonus): Overlapping Clustering

In Q2, you predict one label for each document in test_file. In this question, try to discover multiple labels if appropriate. Define a function overlapping_cluster as follows:

Take the outputs of Q2 (i.e. topic distribution and the labels of each document in test_file) as inputs

Set a threshold for each topic (i.e. TH = [th0, th1, th2]). A document is predicted to belong to a topic i only if the topic probability > th_i, for i in [0, 1, 2].

The threshold is determined as follows:

Vary the threshold for each topic from 0.05 to 0.95 with an increase of 0.05 in each round to evaluate the topic model performance:

Apply majority vote rule to map the predicted topics to the ground-truth labels in test_file

Calculate f1-score for each label

For each label, pick the threshold value which maximizes the f1-score

Return the threshold and f1-score of each label
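The threshold search can be sketched as below. This is a simplified illustration under stated assumptions: the topic-to-label mapping is passed in (e.g. obtained once by majority vote) instead of being recomputed in every round, and a document counts as truly belonging to a label when that label appears anywhere in its label list. The names f1_binary and best_thresholds are hypothetical.

```python
def f1_binary(truth, pred):
    """F1 for two equally long lists of booleans."""
    tp = sum(t and p for t, p in zip(truth, pred))
    fp = sum((not t) and p for t, p in zip(truth, pred))
    fn = sum(t and (not p) for t, p in zip(truth, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_thresholds(doc_topic, label_lists, topic_to_label):
    """Return {label: (threshold, f1)} maximizing F1 per label."""
    # 0.05, 0.10, ..., 0.95; rounding avoids floating-point drift.
    thresholds = [round(0.05 * k, 2) for k in range(1, 20)]
    result = {}
    for topic, label in topic_to_label.items():
        truth = [label in labels for labels in label_lists]
        best_th, best_f1 = None, -1.0
        for th in thresholds:
            pred = [row[topic] > th for row in doc_topic]
            score = f1_binary(truth, pred)
            if score > best_f1:  # keeps the lowest threshold on ties
                best_th, best_f1 = th, score
        result[label] = (best_th, best_f1)
    return result


# Toy data, made up for illustration only.
toy_doc_topic = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.1, 0.4], [0.1, 0.1, 0.8]]
toy_labels = [["A", "B"], ["B"], ["A", "C"], ["C"]]
toy_result = best_thresholds(toy_doc_topic, toy_labels, {0: "A", 1: "B", 2: "C"})
```

Because each topic is thresholded independently, a document can be assigned several labels, or none at all if every probability falls below its threshold.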

In [145]:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.cluster import KMeansClusterer, cosine_distance
from sklearn.decomposition import LatentDirichletAllocation

# add more

In [146]:

actual_class  Disaster and Accident  News and Economy  Travel & Transportation
cluster
0                                70                 0                      135
1                               130                 7                        8
2                                10               199                       41

Cluster 0: Topic Travel & Transportation

Cluster 1: Topic Disaster and Accident

Cluster 2: Topic News and Economy

                         precision    recall  f1-score   support

  Disaster and Accident       0.90      0.62      0.73       210
       News and Economy       0.80      0.97      0.87       206
Travel & Transportation       0.66      0.73      0.69       184

              micro avg       0.77      0.77      0.77       600
              macro avg       0.78      0.77      0.77       600
           weighted avg       0.79      0.77      0.77       600

iteration: 1 of max_iter: 25
iteration: 2 of max_iter: 25
iteration: 3 of max_iter: 25
iteration: 4 of max_iter: 25
iteration: 5 of max_iter: 25
iteration: 6 of max_iter: 25
iteration: 7 of max_iter: 25
iteration: 8 of max_iter: 25
iteration: 9 of max_iter: 25
iteration: 10 of max_iter: 25
iteration: 11 of max_iter: 25
iteration: 12 of max_iter: 25
iteration: 13 of max_iter: 25
iteration: 14 of max_iter: 25
iteration: 15 of max_iter: 25
iteration: 16 of max_iter: 25
iteration: 17 of max_iter: 25
iteration: 18 of max_iter: 25
iteration: 19 of max_iter: 25
iteration: 20 of max_iter: 25
iteration: 21 of max_iter: 25
iteration: 22 of max_iter: 25
iteration: 23 of max_iter: 25
iteration: 24 of max_iter: 25
iteration: 25 of max_iter: 25

actual_class  Disaster and Accident  News and Economy  Travel & Transportation
cluster
0                                30                18                      138
1                                12               182                        8
2                               168                 6                       38

Cluster 0: Topic Travel & Transportation

Cluster 1: Topic News and Economy

Cluster 2: Topic Disaster and Accident

                         precision    recall  f1-score   support

  Disaster and Accident       0.79      0.80      0.80       210
       News and Economy       0.90      0.88      0.89       206
Travel & Transportation       0.74      0.75      0.75       184

              micro avg       0.81      0.81      0.81       600
              macro avg       0.81      0.81      0.81       600
           weighted avg       0.81      0.81      0.81       600

Disaster and Accident      0.45
News and Economy           0.55
Travel & Transportation    0.30
dtype: float64

Disaster and Accident      0.798122
News and Economy           0.888889
Travel & Transportation    0.773218
