- Let {(x1,y1),,(xn,yn)} be a given training set where xi Rd,yi {0,1}. We consider the following regularized logistic regression objective function:
,
where > 0 is a constant. Let w be the global minimizer of the objective, and let kwk2 c, for some known constant c > 0.
- Clearly show and explain the steps of the projected gradient descent algorithm for optimizing the regularized logistic regression objective function. The steps should include an exact expression for the gradient.
- Is the objective function strongly convex? Clearly explain your answer by stating and using the definition of strong convexity.[1]
- Is the objective function smooth? Clearly explain your answer by stating and using the definition of smoothness.1
- Let wT be the iterate after T steps of the projected gradient descent algorithm. What is a bound on the difference f(wT)f(w)? Clearly explain all quantities in the bound (hint:please consider step size).
- Let X = {x1,,xn},xi Rd be a set of n samples drawn i.i.d. from a mixture of k multivariate Gaussian distributions in Rd. For component Gh,h = 1,,k, let h,h,h respectively denote the prior probability, mean, and covariance matrix of Gh. We will focus on the expectation maximization (EM) algorithm for learning the mixture model, in particular for estimating the parameters {(h,h,h),h = 1,,k} as well as the posterior probabilities p(Gh|xi).
- In your own words, describe the EM algorithm for mixture of Gaussians, highlighting the two key steps (E- and M-), illustrating the methods used in the steps on a high level, and what information they need.
- Assuming the posterior probabilities p(Gh|xi) are known, show the estimates of the component prior, mean, and covariance h,h,h,h = 1,,k given by the Mstep (you do not need to show the derivation).
- Assuming the component prior, mean, and covariance h,h,h,h = 1,,k are known, show how the posterior probabilities p(Gh|xi) are computed in the E-step.
Programming assignments:
The next two problems involve programming. For Question 3, we will be using the 2-class classification datasets from Boston50 and Boston75, and for Question 4, we will be using the the Digits dataset. For Q3, we will develop code for 2-class logistic regression with only one set of parameters (w,w0). For Q4, we will develop code for PCA.
- We will develop code for 2-class logistic regression with one set of parameters (w,w0) where w Rd, w0 R. Assuming the two classes are {0,1}, and the data x Rd, the posterior probability of class C1 is given by
T + w0)
,
1 + exp(w x + w0)
and P(0|x) = 1 P(1|x). f We will develop code for MyLogisticReg2 with corresponding
MyLogisticReg2.fit(X,y) and MyLogisticReg2.predict(X) functions. Parameters for the model can be initialized following suggestions in the textbook. In the fit function, the parameters will be estimated using gradient descent as described in the textbook and in class.
We will compare the performance of MyLogisticReg2 with LogisticRegression[2] on two datasets: Boston50 and Boston75. Using my cross val with 5-fold cross-validation, report the error rates in each fold as well as the mean and standard deviation of error rates across all folds for the two methods: MyLogisticReg2 and LogisticRegression, applied to the two 2-class classification datasets: Boston50 and Boston75.
You will have to submit (a) code and (b) summary of results:
(a) Code: You will have to submit code for MyLogisticReg2() as well as a wrapper code q3().
For MyLogisticReg2(), you are encouraged to consult the code for MultiGaussClassify() from HW2. You need to make sure you have init , fit, and predict implemented in MyLogisticReg2. init (d) will initialize the parameters (w,w0) and will take the data dimensionality d as input. fit(X,y) will take the data features X and labels y and will use gradient descent to estimate the parameters w,w0. Convergence of gradient descent can be determined by checking the difference between subsequent iterates
) and () and making sure the change is below a pre-specified (small)
threshold. You can also specify tmax, the maximum number of iterations gradient descent can run, and set it to a suitably large value. predict(X) will take a feature matrix corresponding to the test set and return the predicted labels. Your class MyLogisticReg2() will not inherit any base class in sklearn.
The wrapper code (main file q3.py) has no input and is used to prepare the datasets, and make calls to my cross val(method,X,y,k) to generate the error rate results for each dataset and each method. The code for my cross val(method,X,y,k) must be yours (e.g., code you wrote in HW1 with modifications as needed) and you cannot use cross val score() in sklearn. The results should be printed to terminal (not generating an additional file in the folder). Make sure the calls to my cross val(method,X,y,k) are made in the following order and add a print to the terminal before each call to show which method and dataset is being used:
- MyLogisticReg2 with Boston50;
- MyLogisticReg2 with Boston75;
- LogisticRegression with Boston50;
- LogisticRegression with Boston75.
One should be able to run your wrapper code q3.py by python q3.py in command line window.
(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 5 folds, the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (4 tables in total). Each column of the table represents a fold and add two columns at the end to show the overall mean error rate and standard deviation over the k folds. For example:
Error rates for MyLogisticReg2 with Boston50 | ||||||
Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean | SD |
# | # | # | # | # | # | # |
- We will modify the provided ipython notebook generate digits.ipynb to add our own code for the parts where PCA is getting used. The notebook shows how to learn and subsequently generate data which look like the data from the Digits dataset. Let the data matrix be denoted as X RDn where D is the number of features and n is the number of samples. Each column of X corresponds to a data point xj RD. For Digits, we have D = 64 and n 1800. The feature covariance matrix of the data can be computed as:
, (1)
where is the mean of the data points. The notebook has the following steps:
- Compute a PCA projection Z Rdn,d D of the original data X RDn so that % of the variance is preserved in the projected space. The notebook has details for = 90, where d = 21 was sufficient. The notebook also has details for = 99, where d = 41 was sufficient. The projection can be represented by a matrix W RdD.
In the homework, instead of using sklearns functionality for determining d and W, we will develop our own function myPCA (see below) using numpy and scipy as needed. This will be the only major change you will have to do in the notebook.
- Consider the dataset Z = {z1,,zn} of n points in Rd, and fit a mixture of Gaussians (MoG) model where the number of mixture components is determined by a suitable model selection technique. The notebook already implements this part, and we will not change it.
- Sample a set of m new points Z = {z1,,zm},zi Rd from the MoG model in the projected space. Let Z Rdm be the matrix corresponding to the sampled data Z, where the columns are zi. The notebook already implements this part with m = 100, and we will not change it.
- Inverse transform the new points Z in the d-dimensional space to the original Ddimensional space. The notebook currently uses sklearns functionality for doing the inverse transformation. Instead we will use our own code to do the inverser transformation. In particular, the inverse transformation can be computed using W and which we will get as outputs from myPCA (see below). The inverse transform is a simple linear algebraic operation (say, 1-2 lines of code), so we will not write a separate function for this, but just inline the code.
The main function we will write is myPCA(X,), which takes in the original data X RDn and the percentage [0,100] of variance to be preserved, and returns three things as a tuple:
- the PCA projection dimensionality d which is sufficient to preserve % of the original variance in the projected space,
- the projection matrix W RdD, and
- the estimated mean RD of the original data.
Add myPCA as a function in the notebook and update the rest of the notebook to use myPCA instead of sklearns PCA. Please make sure the current functionality in the notebook is maintained. In particular, your notebook should not import PCA but still maintain the same functionality. You will have to submit your notebook my generate digits.ipynb.
[1] Since the function is twice differentiable, you can answer the question by considering the second derivative.
[2] You should use LogisticRegression from scikit-learn, similar to HW1 and HW2.
Reviews
There are no reviews yet.