In this assignment, you'll need to use the following dataset:

- text_train.json: a list of documents, used for training models.
- text_test.json: a list of documents and their ground-truth labels, used for testing performance. This file is in the format shown below. Note that each document has a list of labels.

You can load both files using json.load().
| Text | Labels |
| --- | --- |
| paraglider collides with hot air balloon | [Disaster and Accident, Travel & Transportation] |
| faa issues fire warning for lithium | [Travel & Transportation] |
| ... | ... |
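As a minimal loading sketch (the exact JSON layout is an assumption; here test documents are taken to be [text, label_list] pairs, matching the table above):

```python
import json

# Toy JSON string mimicking the assumed layout of text_test.json:
# a list of [text, label_list] pairs. Use json.load(f) for a real file.
sample = ('[["paraglider collides with hot air balloon",'
          ' ["Disaster and Accident", "Travel & Transportation"]]]')
test_data = json.loads(sample)

text, labels = test_data[0]
first_label = labels[0]    # Q1 and Q2 keep only the first label
print(first_label)
```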
Q1: K-Means Clustering
Define a function cluster_kmean() as follows:

- Take two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json.
- Use KMeans to cluster the documents in train_file into 3 clusters by cosine similarity.
- Test the clustering model performance using test_file:
  - Predict the cluster ID for each document in test_file.
  - Use only the first label in the ground-truth label list of each test document, e.g. for the first document in the table above, set the ground-truth label to Disaster and Accident only.
  - Apply the majority vote rule to dynamically map the predicted cluster IDs to the ground-truth labels in test_file. Be sure not to hardcode the mapping (e.g. do not write code like {0: Disaster and Accident}), because a cluster may correspond to a different topic in each run.
  - Calculate precision/recall/f-score for each label.
- This function has no return value. Print out the confusion matrix and precision/recall/f-score.
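One way the steps above can be sketched, assuming the JSON layout from the table (test documents as [text, label_list] pairs); the min_df, stop_words, and repeats values are tunable assumptions, and majority_vote is a hypothetical helper:

```python
import json
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.cluster import KMeansClusterer, cosine_distance
from sklearn.metrics import classification_report

def majority_vote(cluster_ids, true_labels):
    """Map each cluster ID to its most frequent ground-truth label.
    Recomputed on every run, so nothing is hardcoded."""
    crosstab = pd.crosstab(index=pd.Series(cluster_ids, name="cluster"),
                           columns=pd.Series(true_labels, name="actual_class"))
    print(crosstab)
    return crosstab.idxmax(axis=1).to_dict()

def cluster_kmean(train_file, test_file):
    with open(train_file) as f:
        train_docs = json.load(f)                       # list of strings
    with open(test_file) as f:
        test_data = json.load(f)                        # [text, labels] pairs
    test_docs = [text for text, _ in test_data]
    first_labels = [labels[0] for _, labels in test_data]

    # TF-IDF vectors; nltk's KMeansClusterer needs dense arrays and,
    # unlike sklearn's KMeans, accepts a custom distance function.
    tfidf = TfidfVectorizer(stop_words="english", min_df=5)
    dtm_train = tfidf.fit_transform(train_docs).toarray()
    dtm_test = tfidf.transform(test_docs).toarray()

    clusterer = KMeansClusterer(3, cosine_distance, repeats=10)
    clusterer.cluster(dtm_train, assign_clusters=True)
    predicted = [clusterer.classify(v) for v in dtm_test]

    mapping = majority_vote(predicted, first_labels)
    predicted_labels = [mapping[c] for c in predicted]
    print(classification_report(first_labels, predicted_labels))
```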
Q2: LDA Clustering
Define a function cluster_lda() as follows:

- Take two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json.
- Use LDA to train a topic model with the documents in train_file and the number of topics K = 3.
- Predict the topic distribution of each document in test_file, and select only the topic with the highest probability as the predicted topic.
- Evaluate the topic model performance as follows:
  - Similar to Q1, use the first label in the label list of test_file as the ground-truth label.
  - Apply the majority vote rule to map the topics to the labels.
  - Calculate precision/recall/f-score for each label and print them out.
- Return the topic distribution and the original ground-truth labels of each document in test_file.

Also, provide a document which contains:

- a performance comparison between Q1 and Q2;
- a description of how you tuned the model parameters, e.g. min_df, alpha, max_iter, etc.
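A sketch of the LDA variant under the same assumed JSON layout; min_df and max_iter are tunable assumptions (verbose=1 produces per-iteration progress lines during fitting):

```python
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import classification_report

def cluster_lda(train_file, test_file):
    with open(train_file) as f:
        train_docs = json.load(f)
    with open(test_file) as f:
        test_data = json.load(f)
    test_docs = [text for text, _ in test_data]
    first_labels = [labels[0] for _, labels in test_data]

    # LDA works on raw term counts, not TF-IDF.
    cv = CountVectorizer(stop_words="english", min_df=5)
    dtm_train = cv.fit_transform(train_docs)
    dtm_test = cv.transform(test_docs)

    lda = LatentDirichletAllocation(n_components=3, max_iter=25,
                                    evaluate_every=1, verbose=1,
                                    random_state=0)
    lda.fit(dtm_train)

    topic_dist = lda.transform(dtm_test)     # each row sums to 1
    predicted = topic_dist.argmax(axis=1)    # highest-probability topic

    crosstab = pd.crosstab(index=pd.Series(predicted, name="cluster"),
                           columns=pd.Series(first_labels, name="actual_class"))
    mapping = crosstab.idxmax(axis=1)        # majority vote, per run
    predicted_labels = [mapping[t] for t in predicted]

    print(crosstab)
    print(classification_report(first_labels, predicted_labels))
    return topic_dist, first_labels
```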
Q3 (Bonus): Overlapping Clustering
In Q2, you predict one label for each document in test_file. In this question, try to discover multiple labels where appropriate. Define a function overlapping_cluster() as follows:

- Take the outputs of Q2 (i.e. the topic distribution and the labels of each document in test_file) as inputs.
- Set a threshold for each topic (i.e. TH = [th0, th1, th2]). A document is predicted to belong to topic i only if its topic probability > thi, for i ∈ {0, 1, 2}.
- Determine the thresholds as follows:
  - Vary each topic's threshold from 0.05 to 0.95 in steps of 0.05, and in each round evaluate the topic model performance:
    - Apply the majority vote rule to map the predicted topics to the ground-truth labels in test_file.
    - Calculate the f1-score for each label.
  - For each label, pick the threshold value which maximizes the f1-score.
- Return the threshold and f1-score of each label.
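The threshold search can be sketched as below. One simplification relative to the spec: the topic→label mapping is passed in as an extra, hypothetical argument (e.g. the Q2 majority-vote result) rather than recomputed in every round:

```python
import numpy as np
from sklearn.metrics import f1_score

def overlapping_cluster(topic_dist, labels, topic_to_label):
    """Per-topic threshold search.

    topic_dist     : (n_docs, 3) array of topic probabilities (from Q2)
    labels         : full ground-truth label list of each test document
    topic_to_label : dict, topic id -> label name (assumed to come from
                     the Q2 majority vote)
    """
    topic_dist = np.asarray(topic_dist)
    best_th, best_f1 = {}, {}
    for i in range(topic_dist.shape[1]):
        label = topic_to_label[i]
        # Binary ground truth: does the document carry this label at all?
        y_true = np.array([label in doc_labels for doc_labels in labels])
        th_best, f1_best = None, -1.0
        for th in np.arange(0.05, 1.0, 0.05):      # 0.05, 0.10, ..., 0.95
            y_pred = topic_dist[:, i] > th
            f1 = f1_score(y_true, y_pred)
            if f1 > f1_best:
                th_best, f1_best = round(float(th), 2), f1
        best_th[label], best_f1[label] = th_best, f1_best
    return best_th, best_f1
```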
In [145]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.cluster import KMeansClusterer, cosine_distance
from sklearn.decomposition import LatentDirichletAllocation
# add more
In [146]:
actual_class  Disaster and Accident  News and Economy  Travel & Transportation
cluster
0                                70                 0                      135
1                               130                 7                        8
2                                10               199                       41
Cluster 0: Topic Travel & Transportation
Cluster 1: Topic Disaster and Accident
Cluster 2: Topic News and Economy

                         precision    recall  f1-score   support
Disaster and Accident         0.90      0.62      0.73       210
News and Economy              0.80      0.97      0.87       206
Travel & Transportation       0.66      0.73      0.69       184
micro avg                     0.77      0.77      0.77       600
macro avg                     0.78      0.77      0.77       600
weighted avg                  0.79      0.77      0.77       600
iteration: 1 of max_iter: 25
iteration: 2 of max_iter: 25
iteration: 3 of max_iter: 25
...
iteration: 25 of max_iter: 25
actual_class  Disaster and Accident  News and Economy  Travel & Transportation
cluster
0                                30                18                      138
1                                12               182                        8
2                               168                 6                       38
Cluster 0: Topic Travel & Transportation
Cluster 1: Topic News and Economy
Cluster 2: Topic Disaster and Accident

                         precision    recall  f1-score   support
Disaster and Accident         0.79      0.80      0.80       210
News and Economy              0.90      0.88      0.89       206
Travel & Transportation      0.74      0.75      0.75       184
micro avg                     0.81      0.81      0.81       600
macro avg                     0.81      0.81      0.81       600
weighted avg                  0.81      0.81      0.81       600
Disaster and Accident      0.45
News and Economy           0.55
Travel & Transportation    0.30
dtype: float64
Disaster and Accident      0.798122
News and Economy           0.888889
Travel & Transportation    0.773218