Gradient Descent and Newton's Method
In this problem, you will implement (unregularized and regularized) logistic regression for binary classification using two different optimization approaches, namely, batch gradient descent and Newton's method. Two datasets are given; one of them is text data from which you will learn to construct features. For each problem, report your results on both datasets. Please note that there are 9 problems in total. Also, cross-validation is NOT needed for this programming assignment. Every plot you make MUST at least have a title, x-label, y-label, and a legend (if there is more than one curve). Please read the submission instructions carefully before submitting your report and source code.
1 Data
Ionosphere This dataset contains 351 instances, and each instance has 34 attributes (features). All feature values are continuous, and the last column is the class label (bad = b = 1, good = g = 0). Your goal is to predict the correct label, which is either b or g. We have already divided the dataset into training and test sets (iono train.dat and iono test.dat). Please use the training set to build your model and the test set to evaluate it. For more details about the Ionosphere dataset, please refer to the UCI website: https://archive.ics.uci.edu/ml/datasets/Ionosphere.
EmailSpam This dataset contains 943 instances, each labeled as either a spam or a ham (not spam) email. Each data instance is the text content of an email (subject and body), and your goal is to classify each email as either spam or ham. We have already divided this dataset into training and test sets, and each set is stored in two distinct folders (one corresponding to ham and the other corresponding to spam). In other words, to build your model you need to iterate through all the text files within /train/spam and /train/ham, and to test your model you need to iterate through /test/spam and /test/ham.
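Below is a minimal Python sketch of one way to load this data following the folder layout above. The label convention (spam = 1, ham = 0) and the lenient UTF-8 decoding are assumptions for illustration, not requirements of the assignment.

    import os

    def load_split(root):
        # Read every text file under root/spam and root/ham and return
        # (text, label) pairs. The spam = 1, ham = 0 mapping is an
        # assumption; use whichever convention your implementation expects.
        data = []
        for folder_name, label in (("spam", 1), ("ham", 0)):
            folder = os.path.join(root, folder_name)
            for fname in sorted(os.listdir(folder)):
                with open(os.path.join(folder, fname),
                          encoding="utf-8", errors="ignore") as f:
                    data.append((f.read(), label))
        return data

    train_data = load_split("train")   # iterates /train/spam and /train/ham
    test_data = load_split("test")     # iterates /test/spam and /test/ham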
2 Feature Representation
An essential part of machine learning is building the feature representation for raw input data, which is often unstructured. Once we construct the features, each data instance $x_i$ can be represented as
$$x_i = (x_{i1}, \ldots, x_{id})$$
where $x_{ij}$ denotes the $j$-th feature of the $i$-th instance.
Algorithm 1 Pseudocode for generating bag-of-words features from text

Initialize feature vector bg_feature = [0, 0, ..., 0]
for token in text.tokenize() do
    if token in dict then
        token_idx = getIndex(dict, token)
        bg_feature[token_idx] += 1
    else
        continue
    end if
end for
return bg_feature
Ionosphere The dataset is well-formatted, so you can directly use the raw feature values to build your model. Please do not apply any normalization. Also, since there are no categorical attributes, converting to binary features (as you did for HW #1) is not needed.
EmailSpam Since each data instance is text, you need a way to convert it into a feature vector. In this homework, you will use the Bag-of-Words representation. More specifically, you will convert the text into a feature vector in which each entry is the number of times the corresponding dictionary word occurs in that text. You are provided with a predefined dictionary (dic.dat), and you should only consider words that appear in that dictionary and ignore all others. Algorithm 1 above gives the pseudocode for generating bag-of-words features from text. For tokenization, split the string only on whitespace and these three delimiters: '.', ',', '?'. See below for an example:
Email: hey, i have a better offer for you, offer. better than all other spam filters. Do you like accepting offer?
Pre-defined Dictionary: [add, better, email, filters, hey, offer, like, spam, special]
Bag-of-words feature vector: [0,2,0,1,1,3,1,1,0]
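For reference, below is a minimal Python sketch of Algorithm 1 combined with the tokenization rule above. The dictionary is hard-coded to the example for illustration; in your implementation it would be read from dic.dat.

    import re

    def tokenize(text):
        # Split on whitespace plus the three delimiters '.', ',' and '?'.
        return [tok for tok in re.split(r"[\s.,?]+", text) if tok]

    def bag_of_words(text, vocab_index):
        # vocab_index maps each dictionary word to its position in the vector.
        feature = [0] * len(vocab_index)
        for token in tokenize(text):
            idx = vocab_index.get(token)
            if idx is not None:       # words outside the dictionary are ignored
                feature[idx] += 1
        return feature

    # Reproduces the worked example above.
    vocab = ["add", "better", "email", "filters", "hey", "offer",
             "like", "spam", "special"]
    vocab_index = {w: i for i, w in enumerate(vocab)}
    email = ("hey, i have a better offer for you, offer. better than all "
             "other spam filters. Do you like accepting offer?")
    print(bag_of_words(email, vocab_index))   # [0, 2, 0, 1, 1, 3, 1, 1, 0]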
(Q1) After converting all training data into bag-of-words feature vectors, what are the 3 words that occur most frequently? Report the results using this format:
{(word1: # of occurrences), (word2: # of occurrences), (word3: # of occurrences)}
3 Implementation
The regularized cross-entropy function can be written as:
$$E(w, b) = -\sum_{i=1}^{n} \Big[ y_i \log \sigma(b + w^T x_i) + (1 - y_i) \log\big(1 - \sigma(b + w^T x_i)\big) \Big] + \lambda \|w\|_2^2$$
where $x_i \in \mathbb{R}^d$, $w \in \mathbb{R}^d$, and $\sigma(\cdot)$ is the sigmoid function. $\lambda$ is the regularization coefficient and $b$ is the bias parameter. Note that we do not regularize the bias term $b$.
Stopping Criteria. For both algorithms, run for 50 iterations.
Step size. For the gradient method, you will use a fixed step size. Recall that Newton's method does not require a step size.
Initialization. For batch gradient descent, initialize the weights w to 0 and b to 0.1. For Newton's method, set the initial weights to the ones obtained from batch gradient descent after 5 iterations (when $\eta = 0.05$, $\lambda = 0.01$). (Please note that Newton's method may diverge if the initialization is not proper.)
Extreme Condition. It is possible that as $\sigma(b + w^T x)$ approaches 0, $\log \sigma(b + w^T x)$ goes to $-\infty$. In order to prevent such a case, bound the value of $\sigma(b + w^T x)$ using the small constant 1e-16. You can use the following logic in your code:

tmp = sigma(b + w^T x)
if tmp < 1e-16 then tmp = 1e-16
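Putting the two together, a minimal NumPy sketch of the regularized cross-entropy with this bound might look as follows. The penalty term $\lambda \|w\|_2^2$ (no factor of 1/2, bias excluded) matches the cross-entropy function above; note that clipping on both sides also protects $\log(1 - \sigma(\cdot))$, a symmetric case the text does not spell out.

    import numpy as np

    def cross_entropy(w, b, X, y, lam=0.0, eps=1e-16):
        # X is (n, d), y is (n,) with entries in {0, 1}.
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid(b + w^T x_i)
        p = np.clip(p, eps, 1.0 - eps)           # the 1e-16 bound from above
        ce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        return ce + lam * np.dot(w, w)           # bias b is not regularized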
3.1 Batch Gradient Descent
(Q2) Please write down the update equations for w and b, for both unregularized and regularized logistic regression. In particular, at iteration $t$, using data points $X = [x_1, x_2, \ldots, x_n]$, where $x_i = [x_{i1}, \ldots, x_{id}]$ and $y_i \in \{0, 1\}$ is the label, how do we compute $w^{t+1}$ and $b^{t+1}$ from $w^t$ and $b^t$?
(Q3) For step sizes $\eta \in \{0.001, 0.01, 0.05, 0.1, 0.5\}$ and without regularization, implement batch gradient descent (without cross-validation; use the whole training data for the gradient calculation). A minimal sketch of the update loop follows the table below.
- Plot the cross-entropy function value with respect to the number of steps ($T = 1, \ldots, 50$) on the training data for each step size. Note: you need to make two plots, one for each dataset.
- Report the L2 norm of the vector w after 50 iterations for each step size $\eta_i$ (fill in the table below).

L2 norm (without regularization) | 0.001 | 0.01 | 0.05 | 0.1 | 0.5
Ionosphere |  |  |  |  | 
EmailSpam |  |  |  |  | 
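The sketch referenced in (Q3) is given below. It reuses the cross_entropy helper from Section 3, and the gradient expressions it uses ($X^T(\sigma - y) + 2\lambda w$ for w and $\sum_i (\sigma_i - y_i)$ for b) are one standard form consistent with the $\lambda \|w\|_2^2$ penalty; treat them as a reference, not as a substitute for your own derivation in (Q2).

    import numpy as np

    def batch_gradient_descent(X, y, eta, lam=0.0, n_iters=50):
        n, d = X.shape
        w = np.zeros(d)   # initialization specified in Section 3
        b = 0.1
        losses = []
        for _ in range(n_iters):
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # predicted probabilities
            w = w - eta * (X.T @ (p - y) + 2.0 * lam * w)  # bias not regularized
            b = b - eta * np.sum(p - y)
            losses.append(cross_entropy(w, b, X, y, lam))  # helper from Section 3
        return w, b, losses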
(Q4) For step sizes $\eta \in \{0.001, 0.01, 0.05, 0.1, 0.5\}$ and regularization coefficients $\lambda \in \{0, 0.05, 0.1, 0.15, \ldots, 0.5\}$, do the following:
- Given $\lambda = 0.1$, plot the cross-entropy function value with respect to the number of steps ($T = 1, \ldots, 50$) on the training data, with one curve per step size. Note: you need to make two plots, one for each dataset.
- Given $\eta = 0.01$, report the L2 norm of the vector w after 50 iterations for each regularization coefficient $\lambda_i$ (fill in the table below).
- Plot the cross-entropy function value at $T = 50$ for the different regularization coefficients, for both the training and test data. The x-axis is the regularization coefficient $\lambda$ and the y-axis is the cross-entropy function value after 50 iterations. Each plot should contain two curves, and you should make 2 (two datasets) × 5 (five different step sizes) = 10 plots.
L2 norm (with regularization, $\eta = 0.01$) | 0 | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 | 0.45 | 0.5
Ionosphere |  |  |  |  |  |  |  |  |  |  | 
EmailSpam |  |  |  |  |  |  |  |  |  |  | 
3.2 Newton's Method
Optimization can also be done using a second-order technique called Newton's method. We define $H$ as the Hessian matrix and $\nabla_t$ as the gradient of the objective function at iteration $t$.
(Q5) Using the notation above, write down the update equations for w and b at iteration $t$, for both unregularized and regularized logistic regression.
(Q6) Implement Newton's method for logistic regression without regularization, and run it for 50 iterations on both datasets. A minimal sketch of the update appears after the bullet points below.
- Plot the cross-entropy function value with respect to the number of steps ($T = 1, \ldots, 50$) for the training data. Note: you need to make two plots, one for each dataset.
- Report the L2 norm of vector w after 50 iterations.
- Report the cross-entropy function value for the test data.
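The sketch referenced in (Q6) is given below. It stacks the bias and weights as theta = [b, w] over an augmented design matrix and uses the same $\lambda \|w\|_2^2$ convention as the earlier sketches; the Hessian $A^T R A$ with $R = \mathrm{diag}(\sigma_i(1 - \sigma_i))$ is the standard form for logistic regression, but treat this as an illustration, not as the required answer to (Q5).

    import numpy as np

    def newtons_method(X, y, lam=0.0, n_iters=50, w0=None, b0=0.1):
        n, d = X.shape
        A = np.hstack([np.ones((n, 1)), X])   # augmented design matrix [1, X]
        theta = np.zeros(d + 1)
        theta[0] = b0
        if w0 is not None:                    # e.g. weights from 5 GD iterations,
            theta[1:] = w0                    # as specified in Section 3
        reg = 2.0 * lam * np.eye(d + 1)
        reg[0, 0] = 0.0                       # do not regularize the bias
        for _ in range(n_iters):
            p = 1.0 / (1.0 + np.exp(-(A @ theta)))
            grad = A.T @ (p - y) + reg @ theta
            H = A.T @ (A * (p * (1 - p))[:, None]) + reg
            theta -= np.linalg.solve(H, grad)   # theta <- theta - H^{-1} grad
        return theta[1:], theta[0]              # (w, b)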
(Q7) Repeat (Q6) for the regularized case using $\lambda \in \{0, 0.05, 0.1, 0.15, \ldots, 0.5\}$ (report values for each setting).
3.3 Analysis and Comparison of Gradient Descent and Newton's Method
(Q8) Briefly (in no more than 4 sentences) explain your results from (Q3) and (Q4). You should discuss the rate of convergence as a function of the step size $\eta$, the change in the magnitude of w as a function of $\lambda$, the value of the cross-entropy function for different values of $\eta$ and $\lambda$, and any other interesting trends you have observed.
(Q9) Briefly (in no more than 4 sentences) discuss the differences between gradient descent and Newton's method based on the results from (Q4) and (Q7). In particular, discuss the difference between gradient descent and Newton's method in terms of convergence and computation time.