week11-2021-sem2-answers
COMP20008 2021S1 workshop week 11
Copyright By Assignmentchef assignmentchef
Chi Squared Feature Selection
The following code implements the example in Slide 19 of the Experimental design lecture
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency
data = pd.DataFrame(np.array([[1,1,1],[1,0,1],[0,1,0],[0,0,0]]), columns=[a1,a2,c])
features=data[[a1,a2]]
class_label = data[c]
cont_table = pd.crosstab(class_label,features[a1])
chi2_val, p, dof, expected = stats.chi2_contingency(cont_table.values, correction=False)
print(Chi2 value: ,chi2_val)
if(p<0.05) : print(‘Null hypothesis rejected, p value: ‘, p)print(‘Null hypothesis accepted, p value: ‘, p) Chi2 value:4.0Null hypothesis rejected, p value:0.04550026389635857Question 2)Adapt the example above to calculate the Chi2 values for each feature in Question 1.Ensure that the results agree with your answer to Question 1. import scipy.stats as statsfrom scipy.stats import chi2_contingencydata = pd.DataFrame(np.array([[1,0,1],[1,1,1],[1,1,1],[1,0,0],[1,1,1],[0,0,0],[0,0,0],[0,0,0],[1,1,0],[0,0,0]]), columns=[‘A’,’B’,’Class’])features=data[[‘A’,’B’]]class_label = data[‘Class’]for feature in [‘A’,’B’] :cont_table = pd.crosstab(class_label,features[feature])chi2_val, p, dof, expected = stats.chi2_contingency(cont_table.values, correction=False)print(‘Chi2 value for feature’, feature,’: ‘,chi2_val)if(p<0.05) : print(‘Null hypothesis rejected for feature’, feature, ‘p value:’, p)print(‘Null hypothesis accepted for feature’, feature, ‘p value:’, p) Chi2 value for feature A :4.444444444444445Null hypothesis rejected for feature A p value: 0.03501498101966245Chi2 value for feature B :3.4027777777777777Null hypothesis accepted for feature B p value: 0.0650867264927665Experimental EvaluationK-fold cross validation is important to ensure that the results we report are reliable, and not merely the result of a ‘lucky’ test_train split.Below is an example of K-fold cross validation applied to the World Development Index dataset.Note also the process in the loop below – we’ve started by doing the test_train split then performed other functions like scaling and imputation on the training set and applied the results to the testing set.This is important to ensure that we don’t violate the test_train split and apply our understanding of the testing set when building our model. world= pd.read_csv(‘world_org.csv’, encoding=”iso-8859-1″)life = pd.read_csv(‘life.csv’)world.set_index(‘Country Code’)life.set_index(‘Country Code’)all_data = world.merge(life)import pandas as pdimport matplotlib.pyplot as pltfrom sklearn import neighborsfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import accuracy_scorefrom sklearn import preprocessingfrom sklearn.impute import SimpleImputerfrom sklearn.model_selection import KFold##get just the featuresdata=all_data.iloc[:, 5:-1]##get just the class labelsclasslabel=all_data[‘Life expectancy at birth (years)’]kf = KFold(n_splits=k, shuffle=True, random_state=42)acc_score = []for train_index, test_index in kf.split(data):#Perform the split for this foldX_train, X_test = data.iloc[train_index, :], data.iloc[test_index, :]y_train, y_test = classlabel[train_index], classlabel[test_index]#Scale the datascaler = preprocessing.StandardScaler().fit(X_train)X_train=scaler.transform(X_train)X_test=scaler.transform(X_test)#Impute missing values via mean imputationimp = SimpleImputer(missing_values=np.nan, strategy=’mean’)X_train = imp.fit_transform(X_train)X_test = imp.transform(X_test)#Train k-nn classifierknn = neighbors.KNeighborsClassifier(n_neighbors=5)knn.fit(X_train, y_train)#Predict resulty_pred=knn.predict(X_test)acc_score.append(accuracy_score(y_test, y_pred))print(acc_score)#Display average of accuracy scoresavg_acc_score = sum(acc_score)/kprint(avg_acc_score) [0.7368421052631579, 0.8421052631578947, 0.8421052631578947, 0.7222222222222222, 0.8333333333333334, 0.6666666666666666, 0.8333333333333334, 0.6111111111111112, 0.7222222222222222, 0.4444444444444444]0.7254385964912282Question 3Experiment with different values of k and the random_state parameter.What might be an optimal k value in this case?How could we further improve the reliability of our results?Answer: Generally we would expect larger values of k to give more accurate results as a larger amount of data is being trained on.Values like 2 or 3 are unlikely to be as effective.Higher values of k require more processing time, however.Note that there is a lot of variability still present – which you can see by changing the random_state parameter.Note also that we’re NOT trying to choose the value of k which happens to give us the highest accuracy (e.g. 12 is not a better choice than 13 just because the average accuracy is higher) – that’s simply random behavior and won’t be replicated on other datasets.To further improve the accuracy we could use a Repeated k-Fold cross validator (sklearn.RepeatedKFold), which applies cross validation several times using the same k with different random_states.Principal components analysisPrincipal components analysis can be used for transforming data into a different (lower dimensional) representation.This is particularly useful for visualisation, computational efficiency and removing noisy data.The python sci-kit learn package (sklearn) contains functions which can be used for PCA.Consider the example below of introducing PCA to the previous task import pandas as pdimport matplotlib.pyplot as pltfrom sklearn import neighborsfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import accuracy_scorefrom sklearn import preprocessingfrom sklearn.impute import SimpleImputerfrom sklearn.model_selection import KFoldfrom sklearn.decomposition import PCA##get just the featuresdata=all_data.iloc[:, 5:-1]##get just the class labelsclasslabel=all_data[‘Life expectancy at birth (years)’]kf = KFold(n_splits=k, shuffle=True, random_state=42)acc_score = []for train_index, test_index in kf.split(data):#Perform the split for this foldX_train, X_test = data.iloc[train_index, :], data.iloc[test_index, :]y_train, y_test = classlabel[train_index], classlabel[test_index]#Scale the datascaler = preprocessing.StandardScaler().fit(X_train)X_train=scaler.transform(X_train)X_test=scaler.transform(X_test)#Impute missing values via mean imputationimp = SimpleImputer(missing_values=np.nan, strategy=’mean’)X_train = imp.fit_transform(X_train)X_test = imp.transform(X_test)#Perform PCApca = PCA(n_components=2)X_train = pca.fit_transform(X_train)X_test = pca.transform(X_test)#Train k-nn classifierknn = neighbors.KNeighborsClassifier(n_neighbors=5)knn.fit(X_train, y_train)#Predict resulty_pred=knn.predict(X_test)acc_score.append(accuracy_score(y_test, y_pred))print(acc_score)#Display average of accuracy scoresavg_acc_score = sum(acc_score)/kprint(avg_acc_score) [0.7368421052631579, 0.6842105263157895, 0.631578947368421, 0.8333333333333334, 0.7222222222222222, 0.6666666666666666, 0.8888888888888888, 0.6666666666666666, 0.7777777777777778, 0.5555555555555556]0.7163742690058479Question 4Experiment with different numbers of components.What gives the best result?How could you decide on the appropriate number of principal components to use?Answer: It’s possible that we might want to use a fixed number of components (e.g. 2 or 3) for visualisation purposes, but for model building it’s useful to look at the variance explained by the principal components we’ve used so far (sklearn_pca.explained_varianceratio), as in the next example.If the vast majority of the variance has been explained by the first x principal components, then x is probably a good number to use.Visualisation using PCAConsider the example below of applying PCA on the iris dataset. iris= pd.read_csv(‘iris.csv’,dtype=None) ###read in datairis2=iris[[“SepalLength”,”SepalWidth”,”PetalLength”,”PetalWidth”]] #retain a copy with only these columns SepalLengthSepalWidthPetalLengthPetalWidth05.13.51.40.214.93.01.40.224.73.21.30.234.63.11.50.245.03.61.40.2……………1456.73.05.22.31466.32.55.01.91476.53.05.22.01486.23.45.42.31495.93.05.11.8150 rows 4 columns import matplotlib.pyplot as pltfrom sklearn.preprocessing import StandardScalerfrom sklearn.decomposition import PCA #################################################################Example of performing PCA on Iris dataset and visualising##############################################################################sklearn_pca = PCA(n_components=2) #we want just the first two PCsiris_sklearn = sklearn_pca.fit_transform(iris2)print(“Variance explained by each PC”,sklearn_pca.explained_variance_ratio_) #print out the amount of variance explained by each PC#set up the colour schemepalette=palette = [‘blue’,’green’,’red’]colors=iris.Name.replace(to_replace=iris.Name.unique(),value=palette).tolist()#plot the objects along the first two principal components, using the colour schemeplt.scatter(iris_sklearn[:,0],iris_sklearn[:,1],s=60,c=colors) #plot the PC’s in 2D – s marker sizeplt.xlabel(‘1st Principal Component’, fontsize=28)plt.ylabel(‘2nd Principal Component’, fontsize=28)plt.show() Variance explained by each PC [0.92461621 0.05301557]Question 5)What can you observe about the clustering behavior of the iris dataset from the plot above?What other techniques could you use to help visualise the clustering behavior?We can see two of the three classes are very close together while the other is more clearly separated.Scatter plot matrices could be useful, as would VAT.VAT – Visual Assessment for Clustering TendencyWe’ve already seen the VAT algorithm for visualising the clustering tendency of a dataset. Below is python code for VAT.You can treat it as a black box (not worrying about the internal coding details) – a function which can be used to execute VAT on an input dataset. import numpy as npimport math,randomfrom scipy.spatial.distance import pdist, squareformimport pandas as pdimport matplotlib.pyplot as plt%matplotlib inlinedef VAT(R):VAT algorithm adapted from matlab version:http://www.ece.mtu.edu/~thavens/code/VAT.mR (n*n double): Dissimilarity data inputR (n*D double): vector input (R is converted to sq. Euclidean distance)RV (n*n double): VAT-reordered dissimilarity dataC (n int): Connection indexes of MST in [0,n)I (n int): Reordered indexes of R, the input data in [0,n)R = np.array(R)N, M = R.shapeif N != M:R = squareform(pdist(R))J = list(range(0, N))y = np.max(R, axis=0)i = np.argmax(R, axis=0)j = np.argmax(y)y = np.max(y)y = np.min(R[I,J], axis=0)j = np.argmin(R[I,J], axis=0)I = [I, J[j]]J = [e for e in J if e != J[j]]for r in range(2, N-1): y = np.min(R[I,:][:,J], axis=0)i = np.argmin(R[I,:][:,J], axis=0)j = np.argmin(y)y = np.min(y)I.extend([J[j]])J = [e for e in J if e != J[j]]C.extend([i[j]])y = np.min(R[I,:][:,J], axis=0)i = np.argmin(R[I,:][:,J], axis=0)I.extend(J)C.extend(i)RI = list(range(N))for idx, val in enumerate(I):RI[val] = idxRV = R[I,:][:,I]return RV.tolist(), C, I Visualising iris datset using VATWe will first recreate the visualisations of the iris dataset used in lectures (lecture 7). Info about the iris dataset is here.First a heatmap of the raw iris dataset is displayed.Secondly a randomly ordered dissimilarity matrix for the objects in iris is shown – notice the lack of structure. Thirdly the VAT visualisation is produced.The heatmap function from the seaborn package is employed as a convenient tool for plotting heatmaps.Below is an example of the VAT algorithm applied to the same iris dataset iris= pd.read_csv(‘iris.csv’,dtype=None) ###read in datairis2=iris[[“SepalLength”,”SepalWidth”,”PetalLength”,”PetalWidth”]] #retain a copy with only these columns SepalLengthSepalWidthPetalLengthPetalWidth05.13.51.40.214.93.01.40.224.73.21.30.234.63.11.50.245.03.61.40.2……………1456.73.05.22.31466.32.55.01.91476.53.05.22.01486.23.45.42.31495.93.05.11.8150 rows 4 columns import seaborn as sns#################################################################Read in the datset#########################################################################iris= pd.read_csv(‘iris.csv’,dtype=None) ###read in datairis2=iris[[“SepalLength”,”SepalWidth”,”PetalLength”,”PetalWidth”]] #retain a copy with only these columns####Draw heatmap of raw Iris matrix#######j#sns.heatmap(iris2,cmap=’viridis’,xticklabels=True,yticklabels=False)#plt.show()####Visualise the dissimilarity matrix for Iris using a heatmap (without applying VAT)####iris3=iris2.copy().valuesnp.random.shuffle(iris3) ####randomise the order of rows (objects)sq = squareform(pdist(iris3)) ###compute the dissimilarity matrixax=sns.heatmap(sq,cmap=’viridis’,xticklabels=False,yticklabels=False)ax.set(xlabel=’Objects’, ylabel=’Objects’)plt.show()#####Apply VAT Algorithm to Iris dataset and visualise using heatmap########RV, C, I = VAT(iris2)x=sns.heatmap(RV,cmap=’viridis’,xticklabels=False,yticklabels=False)x.set(xlabel=’Objects’, ylabel=’Objects’)plt.show() Question 6)Plot VAT heatmap for iris data and tell how many clusters does the VAT visualisation reveal? How does this compare to the PCA scatterplot?Two clear clusters on the diagonal.There are three clusters (types of) iris, so it is at first surprising we can see only two clusters.However,two of the clusters (virginica and versicolor) are very close together and could arguably be one cluster.Setosa is well separated from these.This largely matches what we’ve seen in the PCA plot beforeIf you get time: Practicing VAT and PCAYou will now practice using the australian crabs dataset from this file. This data describes 200 crabs collected from Fremantle Western Australia. There are two species of crabs – blue and orange. Within each species there are male and female. There are 5 features:FL – frontal lipRW – rear widthCL – carapace lengthCW – carapace widthBD – body depthThe first four of these are visualised as follows:Question 7)Adapt the iris example to produce a VAT heatmap of the australian crabs dataset. How many clusters are there? ###Answer 7crabsall = pd.read_csv(‘australian-crabs.csv’)print(crabsall.shape)crabsall.head(5) speciessexindexFLRWCLCWBD0BlueMale18.16.716.119.07.01BlueMale28.87.718.120.87.42BlueMale39.27.819.022.47.73BlueMale49.67.920.123.18.24BlueMale59.88.020.323.08.2 crabsall = pd.read_csv(‘australian-crabs.csv’)crabs=crabsall[[‘FL’,’RW’,’CL’,’CW’,’BD’]]crabs_std=crabsRV, R, I = VAT(crabs_std)x = sns.heatmap(RV, cmap=’viridis’, xticklabels=False, yticklabels=False)x.set(xlabel=’Objects’, ylabel=’Objects’)plt.show()#can see two large clusters and one small cluster on the diagonal. CS: assignmentchef QQ: 1823890830 Email: [email protected]
Reviews
There are no reviews yet.