1 Setting up the data
The following snippet of code loads the dataset and shuffles it, ahead of splitting it into training and validation data:
    import numpy as np
    import mltools as ml

    # Data Loading
    X = np.genfromtxt('data/X_train.txt', delimiter=None)
    Y = np.genfromtxt('data/Y_train.txt', delimiter=None)
    X,Y = ml.shuffleData(X,Y)
- Print the minimum, maximum, mean, and the variance of all of the features. 5 points
- Split the dataset into training and validation sets, and rescale each, as:
    Xtr, Xva, Ytr, Yva = ml.splitData(X, Y)
    Xt, Yt = Xtr[:5000], Ytr[:5000]   # subsample for efficiency (you can go higher)
    XtS, params = ml.rescale(Xt)      # Normalize the features
    XvS, _ = ml.rescale(Xva, params)  # Normalize the features
Print the minimum, maximum, mean, and the variance of the rescaled features. 5 points
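As a rough sketch of the statistics requested above, per-feature summaries can be computed column-wise with NumPy. The array below is synthetic stand-in data, not the Kaggle features, and `summarize` and `standardize` are hypothetical helper names. The rescaling shown is plain standardization (zero mean, unit variance per feature); that ml.rescale behaves the same way is an assumption to confirm against the mltools source.

```python
import numpy as np

def summarize(X):
    """Return per-feature (column-wise) min, max, mean and variance of X."""
    return {
        "min": X.min(axis=0),
        "max": X.max(axis=0),
        "mean": X.mean(axis=0),
        "var": X.var(axis=0),
    }

def standardize(X, params=None):
    """Shift and scale each feature; pass params to reuse training statistics."""
    if params is None:
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant features
        params = (mu, sigma)
    mu, sigma = params
    return (X - mu) / sigma, params

# Synthetic stand-in for the training / validation features (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 10.0, 0.1], size=(200, 3))
X_train_demo, X_val_demo = X_demo[:150], X_demo[150:]

X_train_s, demo_params = standardize(X_train_demo)
X_val_s, _ = standardize(X_val_demo, demo_params)  # reuse training statistics

for name, values in summarize(X_train_s).items():
    print(name, np.round(values, 3))
```

Because the validation features are rescaled with the training parameters, their mean and variance will generally be close to, but not exactly, 0 and 1.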
2 Linear Classifiers
In this problem, you will use an existing implementation of logistic regression, from the last homework, to analyze its performance on the Kaggle dataset.
    learner = mltools.linearC.linearClassify()
    learner.train(XtS, Yt, reg=0.0, initStep=0.5, stopTol=1e-6, stopIter=100)
    learner.auc(XtS, Yt)  # train AUC
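For reference, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison on made-up scores; `pairwise_auc` is a hypothetical helper and is not a description of how learner.auc is implemented.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]  # every positive-vs-negative score gap
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return correct / (len(pos) * len(neg))

# Made-up scores and labels, for illustration only.
demo_scores = [0.9, 0.4, 0.35, 0.6, 0.5]
demo_labels = [1, 1, 0, 1, 0]
demo_auc = pairwise_auc(demo_scores, demo_labels)
print(demo_auc)
```

The pairwise form uses memory proportional to the number of positive-negative pairs, so it is only meant for small illustrations.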
- One of the important aspects of using linear classifiers is the regularization. Vary the amount of regularization, reg, in a wide enough range, and plot the training and validation AUC as the regularization weight is varied. Show the plot. 10 points
- We have also studied the use of polynomial features to make linear classifiers more complex. Add degree-2 polynomial features, print out the number of features, and explain why it is what it is. 5 points
- Reuse your code that varied regularization to compute the training and validation performance (AUC) for this transformed data. Show the plot. 5 points
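One way to organize the regularization sweep is to evaluate a log-spaced grid of reg values and collect the two AUC curves. In the sketch below, `sweep_regularization` and `toy_fit_and_score` are hypothetical names; the toy scorer returns made-up numbers so that the example runs on its own, and in practice it would be replaced by a function that trains the learner with the given reg and returns its training and validation AUC.

```python
import numpy as np

def sweep_regularization(fit_and_score, reg_values):
    """Call fit_and_score(reg) -> (train_auc, val_auc) for each reg; return two arrays."""
    train_auc = np.zeros(len(reg_values))
    val_auc = np.zeros(len(reg_values))
    for i, reg in enumerate(reg_values):
        train_auc[i], val_auc[i] = fit_and_score(reg)
    return train_auc, val_auc

def toy_fit_and_score(reg):
    """Placeholder scorer returning made-up curves (NOT real results)."""
    log_reg = np.log10(reg)
    train_auc = 0.80 - 0.01 * (log_reg + 4.0)  # falls steadily as reg grows
    val_auc = 0.70 - 0.005 * log_reg ** 2      # peaks at reg = 1
    return train_auc, val_auc

reg_grid = np.logspace(-4, 4, 9)               # 1e-4, 1e-3, ..., 1e4
tr_curve, va_curve = sweep_regularization(toy_fit_and_score, reg_grid)
best_reg = reg_grid[np.argmax(va_curve)]
print(best_reg)
```

Since reg spans several orders of magnitude, a logarithmic x-axis usually makes the resulting plot easier to read.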
3 Nearest Neighbors
In this problem, you will analyze an existing implementation of K-nearest-neighbor classification for the Kaggle dataset. The K-nearest-neighbor classifier implementation supports two hyperparameters: the size of the neighborhood, K, and how much to weigh the distance to the point, α (0 means an unweighted average, and the higher the α, the more heavily the closer ones are weighted[1]). Note that you might have to subsample a lot for KNN to be efficient.
    learner = mltools.knn.knnClassify()
    learner.train(XtS, Yt, K=1, alpha=0.0)
    learner.auc(XtS, Yt)  # train AUC
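To make the role of K and α concrete, here is a minimal sketch of a distance-weighted nearest-neighbor vote. It assumes each neighbor's vote is weighted by exp(-α · distance), so that α = 0 gives every neighbor equal weight; whether this matches the weighting used inside knnClassify is an assumption, and `weighted_knn_prob` is a hypothetical helper.

```python
import numpy as np

def weighted_knn_prob(X_train, y_train, x_query, K=3, alpha=0.0):
    """Estimate P(y=1 | x_query) from the K nearest training points."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:K]                           # indices of the K closest points
    weights = np.exp(-alpha * dists[nearest])                 # assumed weighting scheme
    return float((weights * (y_train[nearest] == 1)).sum() / weights.sum())

# Tiny made-up 1-D dataset, for illustration only.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1, 0, 0, 1])
query = np.array([0.1])

p_unweighted = weighted_knn_prob(X_demo, y_demo, query, K=3, alpha=0.0)
p_weighted = weighted_knn_prob(X_demo, y_demo, query, K=3, alpha=2.0)
print(p_unweighted, p_weighted)
```

With α = 0 the single nearby positive point is outvoted by two more distant negatives; with a larger α its vote dominates.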
- Plot the training and validation performance for an appropriately wide range of K, with α = 0. 5 points
- Do the same with unscaled/original data, and show the plots. 5 points
- Since we need to select both the value of K and α, we need to vary both, and see how the performance changes. For a range of both K and α, compute the training and validation AUC (for unscaled or scaled data, whichever you think would be a better choice), and plot them in a two dimensional plot like so:
    K = range(1,10,1)  # Or something else
    A = range(0,5,1)   # Or something else
    tr_auc = np.zeros((len(K),len(A)))
    va_auc = np.zeros((len(K),len(A)))
    for i,k in enumerate(K):
        for j,a in enumerate(A):
            tr_auc[i][j] = ...  # train learner using k and a
            va_auc[i][j] = ...
    # Now plot it
    f, ax = plt.subplots(1, 1, figsize=(8, 5))
    cax = ax.matshow(mat, interpolation='nearest')  # mat is tr_auc or va_auc
    f.colorbar(cax)
    ax.set_xticklabels(['']+list(A))
    ax.set_yticklabels(['']+list(K))
    plt.show()
Show both the plots, and recommend a choice of K and α based on these results. 10 points
4 Decision Trees
For this problem, you will be using a similar analysis of hyper-parameters for the decision tree implementation.
There are three hyper-parameters in this implementation that become relevant to its performance: maxDepth, minParent, and minLeaf, where the latter two specify the minimum number of data points necessary to split a node and form a node, respectively.
    learner = ml.dtree.treeClassify(Xt, Yt, maxDepth=15)
- Keeping minParent=2 and minLeaf=1, vary maxDepth over a range of your choosing, and plot the training and validation AUC. 5 points
- Plot the number of nodes in the tree as maxDepth is varied (using sz). Plot another line in this plot by increasing either minParent or minLeaf (choose either, and by how much). 5 points
- Set maxDepth to a fixed value, and plot the training and validation performance for the other two hyperparameters over an appropriate range, using the same 2D plot we used for nearest-neighbors. Show the plots, and recommend a choice for minParent and minLeaf based on these results. 10 points
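The two-hyperparameter comparison can be organized as a small grid evaluation that fills one matrix per metric, which can then be displayed with matshow as in the nearest-neighbor section. In this sketch `grid_scores` and `toy_score` are hypothetical names, the grids are arbitrary examples, and the toy scorer returns made-up values in place of training a tree and measuring its validation AUC.

```python
import numpy as np

def grid_scores(score_fn, row_values, col_values):
    """Return a len(row_values) x len(col_values) matrix of score_fn(row, col)."""
    scores = np.zeros((len(row_values), len(col_values)))
    for i, r in enumerate(row_values):
        for j, c in enumerate(col_values):
            scores[i, j] = score_fn(r, c)
    return scores

def toy_score(min_parent, min_leaf):
    """Placeholder with made-up values (NOT real AUCs); replace with train + evaluate."""
    return 0.7 - 0.001 * abs(min_parent - 64) - 0.002 * abs(min_leaf - 8)

min_parent_grid = [2, 8, 32, 64, 128]  # example range, chosen arbitrarily
min_leaf_grid = [1, 2, 4, 8, 16]       # example range, chosen arbitrarily

va_grid = grid_scores(toy_score, min_parent_grid, min_leaf_grid)
best_i, best_j = np.unravel_index(np.argmax(va_grid), va_grid.shape)
print(min_parent_grid[best_i], min_leaf_grid[best_j])
```

Rows of the matrix correspond to minParent and columns to minLeaf, so the tick labels on the 2D plot should follow the same order.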
5 Neural Networks
Last, we will explore the use of neural networks for the same Kaggle dataset. Neural networks have many possible hyper-parameters, such as the number of layers, the number of hidden nodes in each layer, the activation function of the hidden units, etc. These don't even take into account the different hyper-parameters of the optimization algorithm.
    nn = ml.nnet.nnetClassify()
    nn.init_weights([XtS.shape[1], 5, 2], 'random', XtS, Yt)  # as many layers nodes you want
    nn.train(XtS, Yt, stopTol=1e-8, stepsize=.25, stopIter=300)
- Vary the number of hidden layers and the nodes in each layer (we will assume each layer has the same number of nodes), and compute the training and validation performance. Show 2D plots, like for decision trees and K-NN classifiers, and recommend a network size based on the above.
- Implement a new activation function of your choosing, and introduce it as below:
    def sig(z): return np.atleast_2d(z)
    def dsig(z): return np.atleast_2d(1)
    nn.setActivation('custom', sig, dsig)
Compare the performance of this activation function with logistic and htangent, in terms of the training and validation performance.
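As one possible choice (an illustration, not a recommendation), the sketch below defines a softplus activation and its derivative in the same sig/dsig form, and checks the derivative against a finite-difference estimate before it is handed to the network. The names `softplus`, `dsoftplus`, and `check_derivative` are hypothetical.

```python
import numpy as np

def softplus(z):
    """Smooth approximation of max(0, z): log(1 + exp(z)), computed stably."""
    z = np.atleast_2d(z).astype(float)
    return np.logaddexp(0.0, z)

def dsoftplus(z):
    """Derivative of softplus, which is the logistic function 1 / (1 + exp(-z))."""
    z = np.atleast_2d(z).astype(float)
    return 0.5 * (1.0 + np.tanh(0.5 * z))  # numerically stable form of the logistic

def check_derivative(f, df, z, eps=1e-6):
    """Largest gap between df(z) and a central finite-difference estimate of f'."""
    z = np.atleast_2d(z).astype(float)
    numeric = (f(z + eps) - f(z - eps)) / (2.0 * eps)
    return float(np.max(np.abs(numeric - df(z))))

z_demo = np.linspace(-5.0, 5.0, 11)
max_gap = check_derivative(softplus, dsoftplus, z_demo)
print(max_gap)
```

A mismatch between an activation and its derivative tends to show up as training that stalls or diverges, so a check like this is cheap insurance before registering the pair with nn.setActivation.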
6 Conclusions
Pick the classifier that you think will perform best, mention all of its hyper-parameter values, and explain the reason for your choice. Train it on as much data as you can, preferably all of X, submit the predictions on Xtest to Kaggle, and include your Kaggle username and leaderboard AUC in the report. Here's the code to create the Kaggle submission:
    Xte = np.genfromtxt('data/X_test.txt', delimiter=None)
    learner = ...  # train one using X,Y
    Yte = np.vstack((np.arange(Xte.shape[0]), learner.predictSoft(Xte)[:,1])).T
    np.savetxt('Y_submit.txt', Yte, '%d, %.2f', header='ID,Prob1', comments='', delimiter=',')
Statement of Collaboration
It is mandatory to include a Statement of Collaboration in each submission, with respect to the guidelines below. Include the names of everyone involved in the discussions (especially in-person ones), and what was discussed.
All students are required to follow the academic honesty guidelines posted on the course website. For programming assignments, in particular, I encourage the students to organize (perhaps using Campuswire) to discuss the task descriptions, requirements, bugs in my code, and the relevant technical content before they start working on it. However, you should not discuss the specific solutions, and, as a guiding principle, you are not allowed to take anything written or drawn away from these discussions (i.e. no photographs of the blackboard, written notes, referring to Campuswire, etc.). Especially after you have started working on the assignment, try to restrict the discussion to Campuswire as much as possible, so that there is no doubt as to the extent of your collaboration.