Lab Assignment 1
Vasant Honavar
DS 310 Machine Learning for Data Science
Due October 2, 11:59 PM on Canvas
In all of the following exercises, if there is a need for a random seed, set it to 1957. You may use sklearn library functions as long as you understand and document the functionality of each function used in your code. When you need to interpret, explain, or discuss your results, do so in your IPython notebook by changing the cell type and writing your interpretation immediately below the code and its execution results, so that your discussion can be matched with the corresponding code and output. Submit a single IPython notebook in which all of the solutions are organized so that they can be executed and evaluated.
1. Regression. In this exercise, we will use linear regression to estimate the number of hours a person would be absent from work given the worker's attributes. We will use the Absenteeism at Work data set from the UCI repository, which you can download from https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work. The data set contains worker attributes such as age, education, reason for absence, etc., as well as the target variable, which is absenteeism time in hours.
(a) How many data samples does the data set include? How many features are used to encode each data sample?
(b) Randomly split the data into training set and test set with the ratio 80/20, that is, use 80% of the data for training the model, and the remaining 20% for testing, with the pre-specified random seed. Train a linear regression model on the training data. Then, use the trained model to estimate the hours of absence for each worker in the test data. Report the average root mean squared error (RMSE) on the test data.
(c) Perform 10-fold cross validation and report the RMSE obtained from each fold as well as their average.
(d) How does the result compare with the 80/20 split used above? Comment on why it is better to perform cross-validation as opposed to simply training and testing on a single random split of the data.
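The workflow for parts (b) and (c) can be sketched as follows. Since the UCI CSV must be downloaded separately, a synthetic regression problem from make_regression stands in for the absenteeism data here; replace X and y with the feature columns and the absenteeism-hours column loaded from the downloaded file.

```python
# Sketch of the 80/20 split and 10-fold CV workflow for Problem 1.
# Synthetic data stands in for the UCI absenteeism data set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_squared_error

SEED = 1957
X, y = make_regression(n_samples=740, n_features=20, noise=10.0, random_state=SEED)

# (b) 80/20 split, fit on the training portion, report RMSE on the test portion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=SEED)
model = LinearRegression().fit(X_tr, y_tr)
rmse_split = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"80/20 test RMSE: {rmse_split:.3f}")

# (c) 10-fold cross-validation; sklearn's scorer returns negated RMSE values
cv = KFold(n_splits=10, shuffle=True, random_state=SEED)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
fold_rmses = -scores
print("per-fold RMSE:", np.round(fold_rmses, 3))
print(f"mean RMSE: {fold_rmses.mean():.3f}")
```

Shuffling in KFold (with the pre-specified seed) matters if the CSV rows are ordered by any attribute; without it, folds may not be representative of the whole data set.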
2. K Nearest Neighbors (KNN) Classification. Dr. Doolittle, our collaborator at Penn State Hershey Medical School, is interested in predicting whether a cell is cancerous or not. Suppose we want to test the feasibility of Dr. Doolittle's idea using the Wisconsin Breast Cancer data set from the UCI repository, wherein each cell is described in terms of its characteristics, such as its size, shape, etc., along with a label, i.e., benign or malignant. Load the Breast Cancer data set from sklearn datasets.
(a) How many data samples does the data set include? How many features are used to encode each data sample?
(b) Randomly split the data into training set and test set with the ratio 80/20, that is, use 80% of the data to train the model, and the remaining 20% for testing, with the pre-specified random seed. Train a 7 nearest neighbor classifier on the training data. Then, use the trained model on the test data and report accuracy, sensitivity, specificity, false alarm rate and the area under the ROC curve (AUC).
(c) Perform 5-fold cross-validation. Plot the ROC curve for each fold of the 5-fold cross-validation and estimate the accuracy, sensitivity, specificity, false alarm rate, and the AUC. Report the performance averaged across the 5 folds as well as the average ROC curve. How does the result compare with the result obtained using the 80/20 random split above? Comment on why it is better to perform cross-validation as opposed to simply training and testing on a single random split of the data.
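A minimal sketch of part (b) follows. Note that sklearn's version of this data set encodes malignant as 0 and benign as 1; treating malignant as the positive class (so that sensitivity means catching cancerous cells) is an assumption here, and you should state whichever convention you adopt in your notebook.

```python
# Sketch of Problem 2(b): 7-NN on the sklearn breast cancer data.
# Convention assumed here: "malignant" (label 0 in sklearn) is the positive class.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

SEED = 1957
X, y = load_breast_cancer(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=SEED)
knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
y_pred = knn.predict(X_te)

# Confusion matrix with rows/columns ordered [malignant=0, benign=1]
cm = confusion_matrix(y_te, y_pred, labels=[0, 1])
tp, fn = cm[0, 0], cm[0, 1]   # malignant correctly / incorrectly classified
fp, tn = cm[1, 0], cm[1, 1]   # benign misclassified / correctly classified

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
false_alarm = fp / (fp + tn)   # = 1 - specificity
# AUC from the predicted probability of class 1; AUC itself is unchanged
# by which class is called positive
auc = roc_auc_score(y_te, knn.predict_proba(X_te)[:, 1])

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} false_alarm={false_alarm:.3f} auc={auc:.3f}")
```

For part (c), the same metrics can be computed per fold inside a KFold loop, accumulating each fold's ROC curve (e.g. via sklearn.metrics.roc_curve) for plotting.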
3. K Nearest Neighbors Regression. Return to the data set in Problem 1. This time, we are interested in k nearest neighbors regression instead of a regression on the whole data set, and we would like to analyze what would be a reasonable number of neighbors and which distance to use based on the data. To do so, perform 10-fold cross-validation. In each fold, fit a weighted regression in the following manner: for a given test data sample (the query sample), estimate the prediction based on a regression using its k ∈ {1, ..., 10} nearest neighbors, with the contribution of each of the k neighbors inversely weighted by the square of the distance of the neighbor from the query sample. Use the Minkowski distance with degree p ∈ {1, ..., 10}. For each fold, report the k and p at which you obtain the lowest RMSE. Report the average RMSE for each choice of k and p, based on the 10-fold cross-validation. Based on these results, what would be your optimal k and p? How does the average RMSE with the optimal k and p compare to the one you obtained in Problem 1? Explain.
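One natural reading of the weighting scheme above is sklearn's KNeighborsRegressor with a custom weight function: the built-in weights="distance" option uses 1/d, so a callable implements the required 1/d² weighting (with a small epsilon to avoid division by zero when a query coincides with a training point). As in Problem 1, synthetic data stands in for the absenteeism data set, and only the average RMSE per (k, p) is shown; the per-fold bookkeeping is analogous.

```python
# Sketch of Problem 3: 10-fold CV over k (neighbors) and p (Minkowski degree),
# with neighbor contributions weighted by 1/d**2.
# Synthetic data stands in for the absenteeism data set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

SEED = 1957
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=SEED)

def inverse_square(dist):
    # Weight each neighbor by 1/d**2; epsilon guards against d == 0.
    return 1.0 / (dist ** 2 + 1e-8)

cv = KFold(n_splits=10, shuffle=True, random_state=SEED)
avg_rmse = {}
for k in range(1, 11):
    for p in range(1, 11):
        reg = KNeighborsRegressor(n_neighbors=k, p=p, weights=inverse_square)
        scores = cross_val_score(reg, X, y, cv=cv,
                                 scoring="neg_root_mean_squared_error")
        avg_rmse[(k, p)] = -scores.mean()

best_k, best_p = min(avg_rmse, key=avg_rmse.get)
print(f"best k={best_k}, p={best_p}, avg RMSE={avg_rmse[(best_k, best_p)]:.3f}")
```

Using the same KFold object (same seed) for every (k, p) pair keeps the comparison fair, since each setting is evaluated on identical fold partitions.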