You have to build a predictive model for targeting offers to consumers, and conduct some model performance analytics on the result.

A financial company keeps records on individuals who had been previously targeted with a direct marketing offer for an identity theft protection (risk management) subscription, including their household income, the average amount sold, the frequency of their transactions, and whether or not they bought a subscription in the most recent campaign. This company would like to use data mining techniques to build customer profile models.

We will use historical data on past customer responses (contained in the file directMarketing.csv) in order to build a classification model. The model can then be applied to a new set of prospective customers whom the organization may contact in a direct marketing campaign.

Using Python and the package scikit-learn (http://scikit-learn.org/stable/documentation.html), build predictive models using CART (decision trees), support vector machine, and logistic regression to evaluate whether or not the customer will buy a subscription in this campaign. You may need to pre-process the data. Logistic regression becomes the benchmark that you will use to compare the rest of the algorithms.

You must randomly split your data set using 70% and 30% of the observations for the training and test data sets respectively.

1) Compare the different models explored using the test error rate (percent incorrectly classified), the area under the ROC curve and the confusion matrix against the benchmark (logistic regression).
2) Use matplotlib to plot the ROC and the precision-recall curves for your models.
Discuss and compare the performance of each model according to these curves against the benchmark (logistic regression).

confusion matrix:
  |____________ p __________|___________ n ___________|
Y | 1s predicted to be 1s   | 0s predicted to be 1s   |
N | 1s predicted to be 0s   | 0s predicted to be 0s   |

In [2]:
import os
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
%matplotlib inline
sns.set(style="ticks", palette="Set2")

In [3]:
# Load data
path = "./directMarketing.csv"
columns = ["income", "firstDate", "lastDate", "amount", "freqSales",
           "saleSizeCode", "starCustomer", "lastSale", "avrSale", "class"]
df = pd.read_csv(path)[columns].dropna()
# Transform starCustomer column to a numeric variable
df["starCustomers"] = (df.starCustomer == "X").astype(int)
df = df.drop("starCustomer", axis="columns")  # delete the original starCustomer column
# Transform saleSizeCode column to a numeric variable
df["saleSizeCodes"] = df.saleSizeCode.replace(["D", "E", "F", "G"], [1, 2, 3, 4])
df = df.drop("saleSizeCode", axis="columns")  # delete the original saleSizeCode column
print(df["saleSizeCodes"].value_counts())
# Take a look at the data
df.head(5)
df.shape

In [4]:
# Move the class column to the last position
class_new = df["class"]
df.drop(labels=["class"], axis=1, inplace=True)
df["class"] = class_new
df.head(5)

In [5]:
# Box plot of every predictor, grouped by class
predictor_columns = df.columns[:-1]
print(predictor_columns)
rows = 3
cols = 3
fig, axs = plt.subplots(ncols=cols, nrows=rows, figsize=(5*cols, 6*rows))
axs = axs.flatten()
for i in range(len(predictor_columns)):
    df.boxplot(predictor_columns[i], by="class", grid=False, ax=axs[i], sym="k.")
plt.tight_layout()

There is no single feature that can separate the data perfectly, so I use all of the remaining features to predict the class.

In [6]:
# Split the dataset: 70% training, 30% test
X_train, X_test, Y_train, Y_test = train_test_split(df[predictor_columns], df["class"], test_size=.3)
print(X_test.shape)

1.
The decision tree model

In [25]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
## decision tree: try depths 1 to 10 and keep the most accurate one
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
acc_max = 0
acclist = []
for i in depths:
    depth = i
    decision_tree = DecisionTreeClassifier(max_depth=depth, criterion="entropy")
    decision_tree.fit(X_train, Y_train)
    acc = metrics.accuracy_score(Y_test, decision_tree.predict(X_test))
    if acc > acc_max:
        acc_max = acc
        depth_max = i
        ## keep the predictions of the best tree for the confusion matrix and the AUC
        ## (the reported confusion matrix matches the reported error rate, so these belong to the best depth)
        y_test = Y_test
        y_pred = decision_tree.predict(X_test)
        y_predpro = decision_tree.predict_proba(X_test)
        # print(y_test)
        # print(y_pred)
    acclist.append(acc)
## test error rate
# print("the ideal depth is %s and the accuracy is %.3f" % (depth_max, acclist[depth_max-1]))
dt_error_rate = 1 - acclist[depth_max-1]
print("The test error rate is %.3f" % dt_error_rate)
## the AUC (the area under the ROC curve) for the decision tree
auc_score = roc_auc_score(y_test, y_predpro[:, 1])
print("AUC for dt: ", auc_score)
## confusion matrix for dt (transposed so that rows are predictions and columns are true classes)
confusion_matrix_dt = pd.DataFrame(metrics.confusion_matrix(y_test, y_pred, labels=[1, 0]).T,
                                   columns=["p", "n"], index=["Y", "N"])
confusion_matrix_dt.head(2)

2.
The logistic regression model

In [26]:
## logistic regression
from sklearn import linear_model
import warnings
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_auc_score
warnings.filterwarnings("ignore")
lin_model = linear_model.LogisticRegression()
lin_model.fit(X_train, Y_train)
# print("Accuracy = %.3f" % (metrics.accuracy_score(Y_test, lin_model.predict(X_test))))
y_predpro1 = lin_model.predict_proba(X_test)
# print("prob of y_pred", y_predpro1)
# print(y_pred)
acc = metrics.accuracy_score(Y_test, lin_model.predict(X_test))
lin_error_rate = 1 - acc
print("The test error rate is %.3f" % lin_error_rate)
## predictions for the confusion matrix
y_test1 = Y_test
y_pred1 = lin_model.predict(X_test)
# print(y_test)
# print(y_pred)
## the AUC (the area under the ROC curve) for logistic regression
auc_score1 = roc_auc_score(y_test1, y_predpro1[:, 1])
print("AUC for logistic regression: ", auc_score1)
## confusion matrix (rows are predictions, columns are true classes)
confusion_matrix_lg = pd.DataFrame(metrics.confusion_matrix(y_test1, y_pred1, labels=[1, 0]).T,
                                   columns=["p", "n"], index=["Y", "N"])
confusion_matrix_lg.head(2)

3.
The SVM model

In [27]:
## svm
from sklearn.model_selection import cross_val_score
from sklearn import svm
import warnings
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
warnings.filterwarnings("ignore")
clf = svm.SVC(probability=True)
clf.fit(X_train, Y_train)
# print("Accuracy = %.3f" % (metrics.accuracy_score(Y_test, clf.predict(X_test))))
acc = metrics.accuracy_score(Y_test, clf.predict(X_test))
svm_error_rate = 1 - acc
print("The test error rate is %.3f" % svm_error_rate)
y_predpro2 = clf.predict_proba(X_test)
# print("prob of y_pred", y_predpro2)
## predictions for the confusion matrix
y_test2 = Y_test
y_pred2 = clf.predict(X_test)
# print(y_test)
# print(y_pred)
## the AUC (the area under the ROC curve) for svm
auc_score2 = roc_auc_score(y_test2, y_predpro2[:, 1])
print("AUC for SVM : ", auc_score2)
## confusion matrix (rows are predictions, columns are true classes)
confusion_matrix_svm = pd.DataFrame(metrics.confusion_matrix(y_test2, y_pred2, labels=[1, 0]).T,
                                    columns=["p", "n"], index=["Y", "N"])
confusion_matrix_svm.head(2)

Compare the different models above

In [28]:
cols = ["test error rate", "auc"]
res = [[dt_error_rate, auc_score], [lin_error_rate, auc_score1], [svm_error_rate, auc_score2]]
index = ["dt", "lg", "svm"]
ans = pd.DataFrame(res, index, cols)
ans.head(3)

In [29]:
print("Decision tree confusion matrix")
print(confusion_matrix_dt)
print("Logistic regression confusion matrix")
print(confusion_matrix_lg)
print("Svm confusion matrix")
print(confusion_matrix_svm)

I used three different models, namely decision tree, logistic regression, and support vector machine, to predict the class based on the same training and test data sets. In terms of test error rate, logistic regression is the best at 0.415657, while the decision tree at 0.422863 behaves better than the SVM at 0.466099. The area under the ROC curve is the probability that, given a randomly chosen positive sample and a randomly chosen negative sample, the model assigns the larger score to the positive sample.
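The ranking interpretation of the AUC stated above can be checked by brute force on a tiny example. The following is only a sketch under stated assumptions: the labels and scores are made up for illustration, `pairwise_auc` is a hypothetical helper (not a scikit-learn function), and ties are counted as half a win.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive sample
    receives the higher score; a tie counts as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy data (not taken from directMarketing.csv)
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print("pairwise AUC: %.4f" % toy_auc)

# A scorer that gives every sample the same score cannot rank at all
constant_auc = pairwise_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print("constant scorer AUC: %.1f" % constant_auc)
```

Five of the six positive/negative pairs are ordered correctly in the toy data, and the constant scorer lands exactly on the random-guessing value of 0.5.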
Hence, the larger the AUC score, the better the model. In this case, logistic regression is the best with an AUC of 0.61, while the SVM is the worst with an AUC of 0.54. All of them are better than random guessing, which has an AUC of 0.5. In the confusion matrix, the Y/p and N/n cells count the correct classifications. Logistic regression correctly predicts the largest number of test observations (863 + 921 = 1784 of 3053).

In [23]:
from sklearn.metrics import precision_recall_curve

def get_pr(test, predpro):
    true = test
    scores = predpro[:, 1]  # predicted probability of class 1
    precision, recall, thresholds = precision_recall_curve(true, scores)
    return precision, recall, thresholds

plt.figure(1)
plt.title("Precision/Recall Curve")  # give plot a title
plt.xlabel("Recall")  # make axis labels
plt.ylabel("Precision")
# recall goes on the x-axis and precision on the y-axis, matching the axis labels
precision, recall, thresholds = get_pr(y_test, y_predpro)
plt.plot(recall, precision, label="P-R curve for dt")
precision1, recall1, thresholds1 = get_pr(y_test1, y_predpro1)
plt.plot(recall1, precision1, label="P-R curve for logistic regression")
precision2, recall2, thresholds2 = get_pr(y_test2, y_predpro2)
plt.plot(recall2, precision2, label="P-R curve for svm")
plt.legend()
plt.show()

Precision is the ratio of correctly predicted 1s to all predicted 1s. Recall is the ratio of correctly predicted 1s to all true 1s. The closer a curve gets to the top-right corner, the better the model behaves.
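As a worked example of these two definitions, the sketch below recomputes precision, recall, and the test error rate from the logistic regression confusion matrix reported in this notebook's output (Out[26]). The variable names are illustrative assumptions; the four counts are the ones printed by the notebook. This corresponds to a single point on the precision-recall curve, the one given by the default decision threshold of `predict`.

```python
# Counts from the logistic regression confusion matrix (Out[26]).
# Rows are predictions (Y/N), columns are true classes (p/n).
tp, fp = 863, 628   # predicted 1: true 1s, true 0s
fn, tn = 641, 921   # predicted 0: true 1s, true 0s

total = tp + fp + fn + tn

precision = tp / (tp + fp)       # correctly predicted 1s / all predicted 1s
recall = tp / (tp + fn)          # correctly predicted 1s / all true 1s
error_rate = (fp + fn) / total   # misclassified / all test observations

print("test observations: %d" % total)
print("precision:  %.4f" % precision)
print("recall:     %.4f" % recall)
print("error rate: %.6f" % error_rate)
```

The total matches the 3053 test rows, and the error rate reproduces the 0.415657 reported for logistic regression in the comparison table.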
On the precision-recall curves, logistic regression behaves best among the three, while the SVM behaves the worst.

In [24]:
## plot the ROC curves
def get_roc(y_test, y_predpro):
    fpr, tpr, threshold_roc = metrics.roc_curve(y_test, y_predpro[:, 1])
    roc_auc = metrics.auc(fpr, tpr)
    return fpr, tpr, roc_auc

fpr, tpr, roc_auc = get_roc(y_test, y_predpro)
fpr1, tpr1, roc_auc1 = get_roc(y_test1, y_predpro1)
fpr2, tpr2, roc_auc2 = get_roc(y_test2, y_predpro2)
plt.plot(fpr, tpr, label="AUC = %0.3f for dt" % roc_auc)
plt.plot(fpr1, tpr1, label="AUC = %0.3f for lg" % roc_auc1)
plt.plot(fpr2, tpr2, label="AUC = %0.3f for svm" % roc_auc2)
plt.title("ROC Curve")  # give plot a title
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.legend()
plt.show()

The ROC curve shows how well a model separates the two classes across all decision thresholds. The true positive rate is the fraction of true 1s that are predicted as 1s, while the false positive rate is the fraction of true 0s that are predicted as 1s. The line y = x corresponds to random guessing. In the chart above, all models behave better than random guessing.
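To make the two rates concrete, the sketch below builds ROC points by hand: it lowers the decision threshold one distinct score at a time and records the false positive rate and true positive rate at each step. It is an illustration under stated assumptions, not the implementation behind `metrics.roc_curve`; `roc_points` and `trapezoid_area` are hypothetical helpers, and the labels and scores are made up.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists, starting at (0, 0), obtained by predicting 1
    for every sample whose score is at or above each distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        tpr.append(tp / n_pos)   # predicted 1s among true 1s
        fpr.append(fp / n_neg)   # predicted 1s among true 0s
    return fpr, tpr


def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2 for i in range(1, len(x)))


# Made-up toy data (not taken from directMarketing.csv)
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.8]

toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
roc_area = trapezoid_area(toy_fpr, toy_tpr)

for f, t in zip(toy_fpr, toy_tpr):
    print("FPR = %.2f, TPR = %.2f" % (f, t))
print("area under the toy ROC curve: %.4f" % roc_area)
```

The curve always ends at (1, 1), where every sample is predicted as 1; a model is useful only to the extent that its curve rises above the diagonal before getting there.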
The closer a curve gets to the top-left corner, the better the model, since there the TPR is 1 while the FPR goes to 0. Logistic regression behaves the best among them; the decision tree behaves almost the same but slightly worse than logistic regression, while the SVM behaves the worst.

Cell outputs:

In [3] (printed):
3    4666
2    2622
4    1778
1    1108
Name: saleSizeCodes, dtype: int64

Out[3]: (10174, 10)

Out[4]:
   income  firstDate  lastDate  amount  freqSales  lastSale  avrSale  starCustomers  saleSizeCodes  class
0       3       9409      9509    0.06          1        50    30.00              0              4      0
1       2       9201      9602    0.16          4        20    20.55              1              4      1
2       0       9510      9603    0.20          4         5     8.75              0              2      0
3       6       9409      9603    0.13          2        25    22.50              0              4      0
4       0       9310      9511    0.10          1        25    12.50              0              4      0

In [5] (printed):
Index(['income', 'firstDate', 'lastDate', 'amount', 'freqSales', 'lastSale',
       'avrSale', 'starCustomers', 'saleSizeCodes'],
      dtype='object')

In [6] (printed):
(3053, 9)

In [25] (printed):
The test error rate is 0.423
AUC for dt: 0.596371801299397

Out[25]:
     p    n
Y  783  570
N  721  979

In [26] (printed):
The test error rate is 0.416
AUC for logistic regression: 0.6106571415326292

Out[26]:
     p    n
Y  863  628
N  641  921

In [27] (printed):
The test error rate is 0.466
AUC for SVM : 0.5448200108512012

Out[27]:
     p    n
Y  820  739
N  684  810

Out[28]:
     test error rate       auc
dt          0.422863  0.596372
lg          0.415657  0.610657
svm         0.466099  0.544820

In [29] (printed):
Decision tree confusion matrix
     p    n
Y  783  570
N  721  979
Logistic regression confusion matrix
     p    n
Y  863  628
N  641  921
Svm confusion matrix
     p    n
Y  820  739
N  684  810
[Solved] COMS4995 Homework 2-Support vector machine logistic regression