Dataset

The dataset was adapted from the Wine Quality Dataset (https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

Attribute Information:

For more information, read [Cortez et al., 2009: http://dx.doi.org/10.1016/j.dss.2009.05.016].

Input variables (based on physicochemical tests):

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol

Output variable (based on sensory data):

quality (0: normal wine, 1: good wine)

Problem statement

Predict the quality of a wine given its input variables. Use AUC (area under the receiver operating characteristic curve) as the evaluation metric.
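To make the metric concrete, here is a minimal sketch of AUC computed from its ranking definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the assignment; in practice `sklearn.metrics.roc_auc_score` does this work.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = good wine, 0 = normal wine
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, scores_toy)
print(auc_toy)
```

Because the value depends only on how the scores are ordered, it needs predicted probabilities (or decision scores) rather than hard class labels.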

First, let's load and explore the dataset.

In [1]:

In [2]:

Out[2]:

fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide tot

0 0.27 0.36 20.7 0.045 45.0
3 0.30 0.34 1.6 0.049 14.0
1 0.28 0.40 6.9 0.050 30.0
2 0.23 0.32 8.5 0.058 47.0
2 0.23 0.32 8.5 0.058 47.0

In [3]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4715 entries, 0 to 4714
Data columns (total 12 columns):
fixed_acidity           4715 non-null float64
volatile_acidity        4715 non-null float64
citric_acid             4715 non-null float64
residual_sugar          4715 non-null float64
chlorides               4715 non-null float64
free_sulfur_dioxide     4715 non-null float64
total_sulfur_dioxide    4715 non-null float64
density                 4715 non-null float64
pH                      4715 non-null float64
sulphates               4715 non-null float64
alcohol                 4715 non-null float64
quality                 4715 non-null int64
dtypes: float64(11), int64(1)
memory usage: 442.2 KB

In [4]:

Out[4]:

0    3655
1    1060
Name: quality, dtype: int64

Please note that this dataset is imbalanced: 3655 normal wines (about 78%) versus 1060 good wines (about 22%).

Questions and Code

[1]. Split the given data using stratified sampling into 2 subsets: training (80%) and test (20%) sets. Use random_state = 42. [1 point]

In [5]:
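A minimal sketch of such a split with `train_test_split`, assuming scikit-learn and NumPy are installed. The synthetic arrays `X_demo` and `y_demo` are stand-ins invented for this example (about 22% positives, similar to the class balance above); they are not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption): 1000 rows, 22% positives
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 11))
y_demo = np.array([1] * 220 + [0] * 780)

# stratify=y_demo keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```

Without `stratify`, the share of good wines could drift between the two subsets, which matters for an imbalanced target.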

[2]. Use GridSearchCV and Pipeline to tune hyper-parameters for 3 different classifiers, KNeighborsClassifier, LogisticRegression and svm.SVC, and report the corresponding AUC values on the training and test sets. Note that a scaler may need to be inserted into each pipeline. [6 points]

Hint: You may want to use kernel='rbf' and tune C and gamma for svm.SVC. Find out how to enable probability estimates (for Question 3).

Document: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
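As a rough illustration of the pattern this question asks for, the sketch below tunes a single scaled KNN pipeline with `GridSearchCV` and scores it by AUC. The synthetic data, the step name `scale`, and the small grid are assumptions made for this example (only the `clf__` prefix mirrors the parameter names reported in the output); it is not the grid used to produce the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the wine data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, weights=[0.78, 0.22], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The scaler sits inside the pipeline, so it is refitted on each CV training fold
knn_pipe = Pipeline(steps=[("scale", StandardScaler()), ("clf", KNeighborsClassifier())])

# Illustrative grid only; keys use the <step name>__<parameter> convention
knn_grid = {"clf__n_neighbors": [5, 15, 25], "clf__p": [1, 2]}

knn_search = GridSearchCV(knn_pipe, param_grid=knn_grid, cv=3, scoring="roc_auc")
knn_search.fit(X_tr, y_tr)

# AUC needs scores, so use the positive-class probability rather than predict()
test_auc = roc_auc_score(y_te, knn_search.predict_proba(X_te)[:, 1])
print(knn_search.best_params_, test_auc)
```

The same structure applies to the other two classifiers. For `svm.SVC`, passing `probability=True` to the constructor is what makes `predict_proba` available.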

In [12]:

K-Nearest Neighbors best parameters: {'clf__n_neighbors': 45, 'clf__p': 1}
K-Nearest Neighbors AUC score(training set): 1.0
K-Nearest Neighbors AUC score(test set): 0.9349366337144774
K-Nearest Neighbors Confusion Matrix(training set):
[[2924    0]
 [   0  848]]
K-Nearest Neighbors Confusion Matrix(test set):
[[701  30]
 [ 66 146]]
time: 0.13440759579340616

Logistic Regression best parameters: {'clf__C': 100, 'clf__penalty': 'l1'}
Logistic Regression AUC score(training set): 0.7867747883488629
Logistic Regression AUC score(test set): 0.7987184781767029
Logistic Regression Confusion Matrix(training set):
[[2754  170]
 [ 605  243]]
Logistic Regression Confusion Matrix(test set):
[[690  41]
 [158  54]]
time: 0.03498464822769165

SVC best parameters: {'clf__C': 1, 'clf__gamma': 100}
SVC AUC score(training set): 0.9991603321890405
SVC AUC score(test set): 0.9088480499703171
SVC Confusion Matrix(training set):
[[2918    6]
 [  43  805]]
SVC Confusion Matrix(test set):
[[718  13]
 [112 100]]
time: 0.6369452118873596

[3]. Train a soft VotingClassifier whose estimators are the three tuned pipelines obtained in [2]. Report the AUC values on the training and test sets. Comment on the performance of the ensemble model. [1 point]

Hint: consider the voting method.

Document: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
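The hint about the voting method can be illustrated with a small sketch in plain Python. The three probabilities below are made-up numbers for one hypothetical sample, chosen so that hard and soft voting disagree; the names `knn`, `logreg` and `svc` are labels for this example only.

```python
# Hypothetical P(good wine) from three classifiers for one sample
probs = {"knn": 0.45, "logreg": 0.40, "svc": 0.90}

# Hard voting: each model casts one vote for its predicted class (threshold 0.5)
votes = [int(p >= 0.5) for p in probs.values()]
hard_prediction = int(sum(votes) > len(votes) / 2)

# Soft voting: average the probabilities, then threshold the mean
soft_score = sum(probs.values()) / len(probs)
soft_prediction = int(soft_score >= 0.5)

print(hard_prediction, soft_prediction, round(soft_score, 4))
```

Soft voting also produces a continuous score for every sample, which is what an AUC calculation needs; a `VotingClassifier` with `voting='hard'` does not offer `predict_proba`.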

In [13]:

import time
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix

# pipelines: list of (name, tuned pipeline) tuples from [2]
start = time.time()

ensemble = VotingClassifier(estimators=pipelines, voting='soft', n_jobs=-1).fit(X_train, y_train)

# AUC is computed from the predicted probability of the positive class (good wine)
ensemble_train = roc_auc_score(y_train, ensemble.predict_proba(X_train)[:, 1], average='macro')
ensemble_test = roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1], average='macro')

print('VotingClassifier AUC score(training set): {}'.format(ensemble_train))
print('VotingClassifier AUC score(test set): {}'.format(ensemble_test))
print('VotingClassifier Confusion Matrix(training set):\n{}'.format(confusion_matrix(y_train, ensemble.predict(X_train))))
print('VotingClassifier Confusion Matrix(test set):\n{}'.format(confusion_matrix(y_test, ensemble.predict(X_test))))

# Elapsed time in minutes
end = time.time()
print('time: {}\n'.format((end - start) / 60))

VotingClassifier AUC score(training set): 0.9999903208321503
VotingClassifier AUC score(test set): 0.9399956121105748
VotingClassifier Confusion Matrix(training set):
[[2923    1]
 [   8  840]]
VotingClassifier Confusion Matrix(test set):
[[709  22]
 [ 84 128]]
time: 0.691833249727885

The ensemble model performs marginally better than K-Nearest Neighbors on the test set (AUC 0.9400 vs 0.9349, a difference of about 0.005, so it might as well be the same performance), somewhat better than SVC (0.9088) and significantly better than logistic regression (0.7987). The ensemble model doesn't improve on the best-performing estimator (KNN) in any meaningful way.

[4]. Redo [3] with a sensible set of weights for the estimators. Comment on the performance of the ensemble model in this case. [1 point]

In [14]:

import time
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

start = time.time()

# Candidate weights: every [w1, w2, w3] with each weight in {1, 2, 3}
weight_params = []
for w1 in range(1, 4):
    for w2 in range(1, 4):
        for w3 in range(1, 4):
            weight_params.append([w1, w2, w3])

ensemble_weighted = VotingClassifier(estimators=pipelines, voting='soft', n_jobs=-1)
ensemble_gs = GridSearchCV(ensemble_weighted, param_grid={'weights': weight_params}, n_jobs=-1, cv=3, scoring='roc_auc')
ensemble_fit = ensemble_gs.fit(X_train, y_train)

# With scoring='roc_auc', score() returns the AUC of the refitted best estimator
weighted_train = ensemble_fit.score(X_train, y_train)
weighted_test = ensemble_fit.score(X_test, y_test)

best_weights = ensemble_fit.best_params_['weights']
print('VotingClassifier best weights: {}'.format(ensemble_fit.best_params_))
print('VotingClassifier(weights={}) AUC score(training set): {}'.format(best_weights, weighted_train))
print('VotingClassifier(weights={}) AUC score(test set): {}'.format(best_weights, weighted_test))
print('VotingClassifier(weights={}) Confusion Matrix(training set):\n{}'.format(best_weights, confusion_matrix(y_train, ensemble_fit.predict(X_train))))
print('VotingClassifier(weights={}) Confusion Matrix(test set):\n{}'.format(best_weights, confusion_matrix(y_test, ensemble_fit.predict(X_test))))

# Elapsed time in minutes
end = time.time()
print('time: {}\n'.format((end - start) / 60))

VotingClassifier best weights: {'weights': [2, 1, 1]}
VotingClassifier(weights=[2, 1, 1]) AUC score(training set): 1.0
VotingClassifier(weights=[2, 1, 1]) AUC score(test set): 0.9410732261311721
VotingClassifier(weights=[2, 1, 1]) Confusion Matrix(training set):
[[2924    0]
 [   0  848]]
VotingClassifier(weights=[2, 1, 1]) Confusion Matrix(test set):
[[710  21]
 [ 76 136]]
time: 24.4167094151179

KNN classified the training set perfectly and had the highest test AUC of the three individual classifiers, so it is sensible to give KNN a weight of 2 and the others a weight of 1. I also tested this with GridSearchCV, which returned [2, 1, 1] as the best weights. Giving KNN more voting power yields a training AUC of 1.0, which the unweighted voting classifier did not reach (0.99999), and a slightly better test AUC (0.9411 vs 0.9400).
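One property of soft-voting weights is worth keeping in mind when building such a grid: the weighted average depends only on the ratios between the weights, so proportional triples such as [1, 1, 1] and [2, 2, 2] describe the same ensemble. The sketch below counts how many distinct ensembles a grid over {1, 2, 3} contains; the helper `normalise` is a name invented for this example.

```python
from functools import reduce
from itertools import product
from math import gcd

# All weight triples with entries in {1, 2, 3}, in nested-loop order
weight_grid = [list(w) for w in product(range(1, 4), repeat=3)]


def normalise(weights):
    """Divide by the greatest common divisor so proportional triples coincide."""
    g = reduce(gcd, weights)
    return tuple(w // g for w in weights)


# Only the ratios between weights matter for a weighted average
unique_grid = sorted({normalise(w) for w in weight_grid})

print(len(weight_grid), len(unique_grid))
```

Dropping the redundant triples saves only a couple of fits here, but the same idea trims larger weight grids more noticeably.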

[5]. Use the VotingClassifier with GridSearchCV to tune the hyper-parameters of the individual estimators. The parameter grid should be a combination of those in [2]. Report the AUC values on the training and test sets. Comment on the performance of the ensemble model. [1 point] Note that it may take a long time to run your code for this question.

Document: https://scikit-learn.org/stable/modules/ensemble.html#using-the-votingclassifier-with-gridsearchcv
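When the voting classifier is itself a step in a pipeline, grid keys have to be prefixed twice: once with the pipeline step name and once with the estimator name. The sketch below checks a candidate grid against `get_params()` before any fitting. The short estimator names (`knn`, `logreg`, `svc`) and the grid values are assumptions chosen for this example.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Short estimator names chosen for this sketch (assumption)
vote = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("logreg", LogisticRegression()),
        ("svc", SVC(kernel="rbf", probability=True)),
    ],
    voting="soft",
)
pipe = Pipeline(steps=[("scale", StandardScaler()), ("vote", vote)])

# Grid keys follow <pipeline step>__<estimator name>__<parameter>
param_grid = {
    "vote__knn__n_neighbors": [5, 15],
    "vote__logreg__C": [1, 100],
    "vote__svc__gamma": [0.1, 1],
}

# Any key missing from get_params() would make GridSearchCV fail at fit time
valid_keys = pipe.get_params()
unknown = [k for k in param_grid if k not in valid_keys]
print(unknown)
```

A joint grid multiplies the sizes of the individual grids, which is why this kind of search can take far longer than tuning each classifier separately.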

In [9]:

import time
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

start = time.time()

# Build the estimator list and rename each grid key from clf__<param>
# to vote__<estimator name>__<param> so it addresses the nested estimator
params = {}
estimators = []
for name, classifier, param in zip(names, classifiers, parameters):
    estimators.append((name, classifier))
    for k in param:
        params[k.replace('clf', 'vote__' + name)] = param[k]

vot_ = VotingClassifier(estimators=estimators, voting='soft', n_jobs=-1)
pipe = Pipeline(steps=[('scale', scaler), ('vote', vot_)])

gs_clf_cv = GridSearchCV(estimator=pipe, param_grid=params, cv=3, n_jobs=-1, scoring='roc_auc')
clf_cv = gs_clf_cv.fit(X_train, y_train)

# With scoring='roc_auc', score() returns the AUC of the refitted best estimator
cv_train_score = clf_cv.score(X_train, y_train)
cv_test_score = clf_cv.score(X_test, y_test)

print('VotingClassifier with GridSearchCV best parameters: {}'.format(clf_cv.best_params_))
print('VotingClassifier with GridSearchCV AUC score(training set): {}'.format(cv_train_score))
print('VotingClassifier with GridSearchCV AUC score(test set): {}'.format(cv_test_score))
print('VotingClassifier with GridSearchCV Confusion Matrix(training set):\n{}'.format(confusion_matrix(y_train, clf_cv.predict(X_train))))
print('VotingClassifier with GridSearchCV Confusion Matrix(test set):\n{}'.format(confusion_matrix(y_test, clf_cv.predict(X_test))))

# Elapsed time in minutes
end = time.time()
print('time: {}\n'.format((end - start) / 60))

VotingClassifier with GridSearchCV best parameters: {'vote__K-Nearest Neighbors__n_neighbors': 70, 'vote__K-Nearest Neighbors__p': 2, 'vote__Logistic Regression__C': 1000, 'vote__Logistic Regression__penalty': 'l1', 'vote__SVC__C': 1, 'vote__SVC__gamma': 100}
VotingClassifier with GridSearchCV AUC score(training set): 0.999991127429471
VotingClassifier with GridSearchCV AUC score(test set): 0.9399633482177426
VotingClassifier with GridSearchCV Confusion Matrix(training set):
[[2923    1]
 [   8  840]]
VotingClassifier with GridSearchCV Confusion Matrix(test set):
[[715  16]
 [ 88 124]]
time: 107.44066168467204

Imagine taking over 100 minutes to execute and still getting a lower score than the previous two voting classifiers. The base voting classifier and the voting classifier tuned with GridSearchCV yielded incredibly similar test AUC (0.93999 and 0.93996, both within about 0.005 of KNN alone), whilst the voting classifier with estimator weights of [2, 1, 1] seems to pull ahead by a whopping 0.001 on the test set!
