Hyperparameter Tuning and Cross Validation
DS4400: Machine Learning I
Bilal Ahmed
Introduction
Ridge Regression – Assignment 1 (Part II). Due by: 7/17/2023, 11:59 PM EST
In this assignment, we will implement hyperparameter tuning using grid search. Additionally, we will use random k-fold cross validation to estimate the error of our ridge regression implementation. You may use any libraries you wish for plotting. You may submit your solutions as Python programs plus well-named image files (png, jpg, or pdf), or you may submit a Jupyter notebook. If using Jupyter, please also upload the notebook saved as a PDF.
Datasets
- Concrete: The dataset is taken from the UCI machine learning repository and has nine numerical columns: eight input attributes and one response. The aim is to predict the compressive strength of concrete from its composition and age. The target column is named ‘strength’, while all other columns should be used as input features to ridge regression.
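For orientation, reading the data and separating the target from the input features might look roughly like the sketch below. The file name concrete.csv is an assumption; adjust it (and the column name, if your copy differs) to match your local file.

    # Minimal loading sketch. 'concrete.csv' is an assumed file name; the
    # 'strength' column is the regression target named in the assignment.
    import pandas as pd

    df = pd.read_csv("concrete.csv")

    X = df.drop(columns=["strength"])   # eight input attributes
    y = df["strength"]                  # compressive strength target
    print(X.shape, y.shape)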
Instructions
1. Read the concrete data set (8 attributes, one response variable) into a dataframe using pandas.
2. Read the scikit-learn documentation for the StandardScaler (here), Pipeline (here), and Ridge Regression implementation (here).
3. Create a pipeline that standardizes the input data and then trains / predicts with a ridge regressor.
4. Hyperparameter Tuning: To estimate the best value of alpha (lambda in the course notes) for our ridge regression model, we will use grid search. Scikit-learn has a built-in method for grid search called GridSearchCV (doc). Additionally, we need cross validation to estimate model performance at each grid point. To this end, we will use k-fold cross validation, implemented in scikit-learn as KFold (doc).
   a. Create a KFold object with k=5 (for five-fold cross validation), setting random_state=44 and shuffle=True. What do these parameters signify, and what is their importance for estimating model performance? (5 points)
   b. Perform grid search using the k-fold object from the previous step, optimizing mean squared error (MSE). A rough sketch of one possible setup appears after this instruction list.
      - Use a grid with alpha values = [0, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0]
5. Estimating MSE for the dataset: Using the optimal value of alpha obtained from grid search, we will now estimate the MSE of our ridge regressor on the concrete dataset.
   a. For k = [5, 10], set up a KFold object similar to the settings in part 4a. (These will be two different objects.)
   b. Use scikit-learn’s cross_val_score function (doc) to estimate the model performance (mean squared error) on the concrete dataset for both k=5 and k=10. (25 points)
   c. Using only a single run of the k-fold strategy can result in noisy estimates, and a standard way of reducing the noise is to run k-fold multiple times using a different randomly created partition in each run. Scikit-learn provides an implementation of this strategy as RepeatedKFold (doc). Re-implement 5a and 5b using repeated k-fold: keep the random_state the same and set n_repeats to 5 and 10 for each value of k in 5a. (See the second sketch after this list for one possible reading of this setup.)
      - Implementation (25 points)
      - Why would running k-fold once produce noisy estimates? (5 points)
      - How would the repeated k-fold strategy scale as the size of the dataset increases, both in the number of data points and in the number of input features? (5 points)
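For steps 3 and 4, one possible way of wiring the pipeline, KFold, and GridSearchCV pieces together is sketched below. This is a minimal sketch under stated assumptions, not a reference solution: it assumes X and y were loaded as in the Datasets section, and the pipeline step names 'scaler' and 'ridge' are arbitrary choices.

    # Sketch of steps 3-4: standardize -> ridge regression, tuned by grid
    # search with 5-fold CV. Assumes X and y are already loaded.
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, GridSearchCV

    # Step 3: standardization followed by a ridge regressor.
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("ridge", Ridge()),
    ])

    # Step 4a: five-fold CV with shuffling and a fixed seed.
    cv5 = KFold(n_splits=5, shuffle=True, random_state=44)

    # Step 4b: grid over the alpha values from the assignment. Scoring uses
    # negated MSE because scikit-learn maximizes the scoring function.
    param_grid = {"ridge__alpha": [0, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0]}
    grid = GridSearchCV(pipe, param_grid, cv=cv5, scoring="neg_mean_squared_error")
    grid.fit(X, y)

    print("best alpha:", grid.best_params_["ridge__alpha"])
    print("best CV MSE:", -grid.best_score_)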
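For step 5, a similar sketch using cross_val_score with KFold and RepeatedKFold follows. It assumes the pipe and grid objects from the previous sketch. The pairing of k=5 with n_repeats=5 and k=10 with n_repeats=10 is only one reading of part 5c; adjust it if you interpret the repeat counts differently.

    # Sketch of step 5: estimate MSE at the tuned alpha with single-run k-fold
    # (5a/5b) and with repeated k-fold (5c). Assumes `pipe` and `grid` exist.
    from sklearn.model_selection import KFold, RepeatedKFold, cross_val_score

    # Fix alpha at the value selected by grid search.
    pipe.set_params(ridge__alpha=grid.best_params_["ridge__alpha"])

    # 5a/5b: a single run of k-fold for k = 5 and k = 10.
    for k in [5, 10]:
        cv = KFold(n_splits=k, shuffle=True, random_state=44)
        scores = cross_val_score(pipe, X, y, cv=cv, scoring="neg_mean_squared_error")
        print(f"k={k}: MSE = {-scores.mean():.3f} (std {scores.std():.3f})")

    # 5c: repeated k-fold; each repeat re-partitions the data at random.
    for k, n_rep in [(5, 5), (10, 10)]:
        cv = RepeatedKFold(n_splits=k, n_repeats=n_rep, random_state=44)
        scores = cross_val_score(pipe, X, y, cv=cv, scoring="neg_mean_squared_error")
        print(f"k={k}, repeats={n_rep}: MSE = {-scores.mean():.3f}")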