BIA660 Assignment 5: Classification


Q1. Classification

This assignment needs train.csv and test.csv: train.csv is for training and test.csv is for testing. Both files contain samples in the following format:

label text
2 I must admit that Im addicted to Version 2.0
1 I think its such a shame that an enormous tal
2 The Sunsout No Room at The Inn Puzzle has oddl

Write a function classify to conduct a classification experiment as follows:

  1. Take the training and testing file names (strings) as inputs, e.g. classify(training_file, testing_file).
  2. Classify the text samples in the training file using a Multinomial Naive Bayes model as follows:

     a. First apply grid search with 5-fold cross validation to find the best values for the parameters min_df and stop_words of the vectorizer and alpha of the Naive Bayes model used in the modeling pipeline. Use f1_macro as the scoring metric to select the best parameter values. Candidate values for these parameters are:

        min_df: [1, 2, 3]
        stop_words: [None, 'english']
        alpha: [0.5, 1, 2]

     b. Using the best parameter values, train a Multinomial Naive Bayes classifier with all samples in the training file.
     c. Test the classifier created in Step 2.b using the test file. Report the testing performance as:

        - precision, recall, and f1-score of each label
        - treating label 2 as the positive class, plot the precision-recall curve and the ROC curve, and calculate the AUC.

  3. Your function classify has no return. However, when this function is called, the best parameter values from grid search and the testing performance from Step 2.c are printed.
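
A minimal sketch of one way to implement this, assuming scikit-learn and that both CSV files have label and text columns (the precision-recall and ROC curve plotting is omitted for brevity):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def classify(training_file, testing_file):
    train = pd.read_csv(training_file)
    test = pd.read_csv(testing_file)

    # tf-idf vectorizer feeding a Multinomial Naive Bayes classifier
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

    # candidate parameter values from the assignment spec
    grid = {
        "tfidf__min_df": [1, 2, 3],
        "tfidf__stop_words": [None, "english"],
        "clf__alpha": [0.5, 1, 2],
    }

    # 5-fold grid search scored by macro F1; refit=True (the default)
    # retrains the best configuration on all training samples (Step 2.b)
    gs = GridSearchCV(pipe, grid, scoring="f1_macro", cv=5)
    gs.fit(train["text"], train["label"])

    for name, value in gs.best_params_.items():
        print(name, ":", value)
    print("best f1_macro:", gs.best_score_)

    # per-label precision / recall / f1 on the test set (Step 2.c)
    print(classification_report(test["label"], gs.predict(test["text"])))

    # AUC with label 2 as the positive class (roc_auc_score treats the
    # greater of two labels as positive, so pass the class-2 probabilities)
    prob = gs.predict_proba(test["text"])[:, list(gs.classes_).index(2)]
    print("AUC:", roc_auc_score(test["label"], prob))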

Q2. How many samples are enough? Show the impact of sample size on classifier performance

This question will use the train_large.csv dataset.

Write a function impact_of_sample_size as follows:

Take the full file name path string for a dataset as input, e.g. impact_of_sample_size(dataset_file). Starting with 800 samples from the dataset, in each round build a classifier with 400 more samples: in round 1, use samples 0:800; in round 2, use samples 0:1200; and so on, until all samples are used.

In each round, do the following:

  1. Create a tf-idf matrix using TfidfVectorizer with stop words removed.
  2. Train a classifier using a multinomial Naive Bayes model with 5-fold cross validation.
  3. Train a classifier using a linear support vector machine model with 5-fold cross validation.
  4. For each classifier, collect the following average metrics across the 5 folds:

     - average F1 macro
     - average AUC: treat label 2 as the positive class, and set roc_auc along with f1_macro as the scoring metrics.

Plot a line chart (two lines, one for each classifier) showing the relationship between sample size and F1 score. Similarly, plot another line chart showing the relationship between sample size and AUC.

Write your analysis in a separate pdf file (not in code) on the following (1 point): How does the sample size affect each classifier's performance? How many samples do you think would be needed for each model to reach good performance? How does the performance of the SVM classifier compare with that of the Naive Bayes classifier as the sample size increases?

There is no return for this function, but the charts should be plotted.
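
A minimal sketch, assuming scikit-learn and matplotlib, the same label/text column layout as Q1, and labels 1 and 2 (so the built-in roc_auc scorer treats label 2, the greater one, as positive):

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def impact_of_sample_size(dataset_file):
    data = pd.read_csv(dataset_file)
    # 800, 1200, 1600, ... plus a final round that uses all samples
    sizes = list(range(800, len(data), 400)) + [len(data)]
    models = {"Naive Bayes": MultinomialNB(), "Linear SVM": LinearSVC()}
    scores = {name: {"f1": [], "auc": []} for name in models}

    for n in sizes:
        subset = data.iloc[:n]
        # tf-idf matrix with English stop words removed
        X = TfidfVectorizer(stop_words="english").fit_transform(subset["text"])
        y = subset["label"]
        for name, model in models.items():
            cv = cross_validate(model, X, y, cv=5,
                                scoring=["f1_macro", "roc_auc"])
            scores[name]["f1"].append(cv["test_f1_macro"].mean())
            scores[name]["auc"].append(cv["test_roc_auc"].mean())

    # one chart per metric, one line per classifier
    for metric, label in [("f1", "average F1 macro"), ("auc", "average AUC")]:
        plt.figure()
        for name in models:
            plt.plot(sizes, scores[name][metric], label=name)
        plt.xlabel("sample size")
        plt.ylabel(label)
        plt.legend()
        plt.show()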

Q3 (Bonus): Predict duplicate questions by classification

You have tried to predict duplicate questions in the dataset quora_duplicate_question_500.csv by similarity. This time, try to use a classification model to predict whether a question pair (q1, q2) is indeed a duplicate.

q1 q2 is_duplicate
How do you take a screenshot on a Mac laptop? How do I take a screenshot on my MacBook Pro? 1
Is the US election rigged? Was the US election rigged? 1
How scary is it to drive on the road to Hana g Do I need a four-wheel-drive car to drive all 0

In your Assignment 4, with cosine similarity, the AUC was about 74%. In this assignment, define a function classify_duplicate to achieve the following:

Take the full name of the dataset file (i.e. quora_duplicate_question_500.csv) as the input. Do feature engineering to extract a number of good features; a few possible options are:

Unigram, bigram, trigram etc.

Keep or remove stop words

Different metrics, e.g. cosine similarity, BM25 score (https://en.wikipedia.org/wiki/Okapi_BM25), etc.
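
For reference, the BM25 score defined on the linked page, for a document $D$ and a query $Q$ containing terms $q_1, \ldots, q_n$, is

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(q_i, D)$ is the frequency of $q_i$ in $D$, $|D|$ is the length of $D$ in words, $\mathrm{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are free parameters (typically $k_1 \in [1.2, 2.0]$ and $b = 0.75$).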

Build a classification model (e.g. SVM) using these features to predict whether a pair of questions is a duplicate or not.

Your target is to improve the average AUC of the positive class through 5-fold cross validation by at least 1%, reaching 75% or higher. Return the average AUC.
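
A minimal sketch, assuming scikit-learn, the q1/q2/is_duplicate columns shown above, and tf-idf cosine similarities at a few n-gram settings as the (hypothetical) engineered features; richer features such as BM25 scores could be appended to the same matrix:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def classify_duplicate(datafile):
    data = pd.read_csv(datafile)

    features = []
    # one cosine-similarity feature per n-gram setting; the vectorizer is
    # fit on both question columns so q1 and q2 share one vocabulary
    for ngram in [(1, 1), (1, 2), (1, 3)]:
        tfidf = TfidfVectorizer(ngram_range=ngram).fit(
            pd.concat([data["q1"], data["q2"]]))
        q1 = tfidf.transform(data["q1"])
        q2 = tfidf.transform(data["q2"])
        features.append(cosine_similarity(q1, q2).diagonal())

    X = np.column_stack(features)
    y = data["is_duplicate"]

    # average AUC of a linear SVM over 5-fold cross validation
    return cross_val_score(LinearSVC(), X, y, cv=5, scoring="roc_auc").mean()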

Sample output from classify (Q1):

tfidf__stop_words: None
best f1_macro: 0.7134380001639543

              precision    recall  f1-score   support

           1       0.74      0.76      0.75        99
           2       0.76      0.74      0.75       102

   micro avg       0.75      0.75      0.75       201
   macro avg       0.75      0.75      0.75       201
weighted avg       0.75      0.75      0.75       201

AUC: 0.835016835016835

Sample output from classify_duplicate (Q3): 0.760092681967682
