EE-556 Homework Exercise 1 (for Lectures 3–5)

This homework covers Lectures 3–6, in which we consider binary classification by the linear support vector machine (SVM). You will compute the SVM classifier by the gradient descent and accelerated gradient methods, and explore how algorithmic enhancements, such as backtracking line search and adaptive restart strategies, improve numerical efficiency. In the second part, you will implement the Newton and quasi-Newton methods to compute the SVM classifier. Finally, you will implement three versions of stochastic gradient descent methods.
1 Problem statement
The support vector machine is one of the most popular approaches to linear binary classification. We are given a training data set of $n$ points of the form
$$(a_1, b_1), \ldots, (a_n, b_n),$$
where $a_i \in \mathbb{R}^p$ and $b_i \in \{-1, +1\}$ indicates the class to which $a_i$ belongs. The goal of linear binary classification is to learn a hyperplane in $\mathbb{R}^p$, based on the training data set, which separates the points with label $+1$ from those with label $-1$. Notice that this goal cannot be achieved perfectly if the labels in the training data set are noisy, e.g., if each label is incorrect with probability $\varepsilon \in (0, 1/2)$ independently, and the ratio of incorrectly classified points in a separate test data set then becomes an important performance measure.
The empirical risk minimization (ERM) principle in Lecture 2 suggests that we can consider the hyperplane that minimizes the empirical 0-1 loss on the training set. That is, we can compute $(x^\star, \mu^\star) \in \mathbb{R}^p \times \mathbb{R}$ defined by
$$(x^\star, \mu^\star) \in \arg\min_{(x,\mu)\in\mathbb{R}^p\times\mathbb{R}} \left\{ \ell_{0-1}(x,\mu) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{\{b_i(a_i^T x+\mu)\le 0\}}(x,\mu) \right\}, \qquad (1)$$
where
$$\mathbb{1}_{\{b_i(a_i^T x+\mu)\le 0\}}(x,\mu) = \begin{cases} 1 & \text{if } b_i(a_i^T x + \mu) \le 0,\\ 0 & \text{otherwise.} \end{cases}$$
Then $\{y : y^T x^\star + \mu^\star = 0,\ y \in \mathbb{R}^p\}$ is a separating hyperplane. We note that $\ell_{0-1}$ is not convex. To overcome this computational issue, $\ell_{0-1}$ is usually replaced by an approximation called the hinge loss:
$$\ell_H(x,\mu) = \frac{1}{n}\sum_{i=1}^{n} \max\left\{1 - b_i(a_i^T x + \mu),\, 0\right\},$$
and (1) can be approximated by the hinge loss minimization problem, plus a regularization term $\frac{\lambda}{2}\|x\|^2$. This formulation is known as the linear support vector machine. Note that $\ell_H$ is not continuously differentiable, so we want to consider a slightly different loss function. One possible replacement is the smoothed hinge loss:
$$\ell_{SH}(x,\mu) = \frac{1}{n}\sum_{i=1}^{n} g_i(x,\mu)$$
such that
$$g_i(x,\mu) = \begin{cases} \frac{1}{2} - b_i(a_i^T x + \mu), & b_i(a_i^T x + \mu) \le 0,\\ \frac{1}{2}\left(1 - b_i(a_i^T x + \mu)\right)^2, & 0 < b_i(a_i^T x + \mu) \le 1,\\ 0, & 1 \le b_i(a_i^T x + \mu). \end{cases}$$
Finally, define the objective function $f$ as
$$f(x,\mu) := \ell_{SH}(x,\mu) + \frac{\lambda}{2}\|x\|^2. \qquad (2)$$
LIONS @ EPFL Prof. Volkan Cevher
In this homework, we will learn a separating homogeneous half-space by smoothed hinge loss minimization. Specifically, we consider the case $\mu = 0$, solve the following SVM problem with smoothed hinge loss
$$x^\star \in \arg\min_{x} f(x, 0), \qquad (3)$$
and use $\{y : y^T x^\star = 0,\ y \in \mathbb{R}^p\}$ as the separating hyperplane. We write $f(x)$ for $f(x, 0)$ for simplicity.
Note: unless stated otherwise, matrix norms in the following are understood as Frobenius norms, and $\mathbf{0}$ and $\mathbf{1}$ denote all-zeros and all-ones vectors or matrices.
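To make the objective concrete, here is a minimal NumPy sketch of the smoothed hinge objective (2) with $\mu = 0$. The function names (`smoothed_hinge`, `objective`) and the synthetic data are illustrative only, not the handout's API:

```python
import numpy as np

def smoothed_hinge(z):
    """Per-sample smoothed hinge loss g_i as a function of z = b_i * a_i^T x."""
    return np.where(z <= 0, 0.5 - z,
                    np.where(z <= 1, 0.5 * (1.0 - z) ** 2, 0.0))

def objective(x, A, b, lam):
    """f(x) = (1/n) * sum_i g_i(x, 0) + (lam/2) * ||x||^2."""
    z = b * (A @ x)                      # elementwise b_i * a_i^T x
    return smoothed_hinge(z).mean() + 0.5 * lam * np.dot(x, x)

# Illustrative synthetic data (n = 50 samples, p = 10 features).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = np.sign(rng.standard_normal(50))
x = rng.standard_normal(10)
val = objective(x, A, b, 1e-4)
```

At $x = 0$ every sample sits at the kink $z_i = 0$, so each $g_i = 1/2$ and the objective equals $1/2$, which is a handy sanity check for your own implementation.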
PROBLEM 1: (10 POINTS) – GEOMETRIC PROPERTIES OF THE OBJECTIVE FUNCTION f
Given $x \in \mathbb{R}^p$, we denote by $I_L$, $I_Q$ and $I_0$ the $n \times n$ diagonal matrices such that $I_L(i,i) = 1$ if $b_i(a_i^T x + \mu) \le 0$, $I_Q(i,i) = 1$ if $0 < b_i(a_i^T x + \mu) \le 1$, $I_0(i,i) = 1$ if $b_i(a_i^T x + \mu) \ge 1$, and all other entries are $0$. Set $A := [a_1, \cdots, a_n]^T$ and $b := [b_1, \cdots, b_n]^T$. For convenience, we also set $\tilde{A} := [b_1 a_1, \cdots, b_n a_n]^T$.
(a) (5 points) Do the necessary calculations to show that
$$\nabla f(x) = \lambda x + \frac{1}{n}\tilde{A}^T I_Q[\tilde{A}x - \mathbf{1}] - \frac{1}{n}\tilde{A}^T I_L \mathbf{1},$$
and then explain why $\nabla f$ is $L$-Lipschitz continuous with $L = \lambda + \frac{1}{n}\|A^T\| \cdot \|A\|$.
Hint: Observe that $L = 0$ for the linear part, i.e., when $b_i(a_i^T x + \mu) \le 0$. Hence, it is sufficient to compute $L$ for the quadratic region, i.e., $0 < b_i(a_i^T x + \mu) \le 1$, by assuming that all the samples $a_i$ lie in that region. Use the facts that $\|A\| = \|\tilde{A}\|$, $\|A^T\| = \|\tilde{A}^T\|$, and that in this case $I_Q = I$, so
$$\lambda x + \frac{1}{n}\tilde{A}^T I_Q[\tilde{A}x - \mathbf{1}] = \lambda x + \frac{1}{n}\tilde{A}^T[\tilde{A}x - \mathbf{1}].$$
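As a sanity check on part (a), the gradient formula can be sketched in NumPy and compared against a central finite difference of the objective. With $\mu = 0$, the array `At` plays the role of $\tilde{A}$; all names here are illustrative:

```python
import numpy as np

def objective(x, A, b, lam):
    z = b * (A @ x)
    g = np.where(z <= 0, 0.5 - z, np.where(z <= 1, 0.5 * (1 - z) ** 2, 0.0))
    return g.mean() + 0.5 * lam * x @ x

def gradient(x, A, b, lam):
    At = b[:, None] * A                            # A-tilde: rows b_i a_i^T
    z = At @ x                                     # z_i = b_i a_i^T x
    n = A.shape[0]
    in_quad = ((z > 0) & (z <= 1)).astype(float)   # diagonal of I_Q
    in_lin = (z <= 0).astype(float)                # diagonal of I_L
    return lam * x + At.T @ (in_quad * (z - 1.0)) / n - At.T @ in_lin / n

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 6))
b = np.sign(rng.standard_normal(40))
x = rng.standard_normal(6)
g = gradient(x, A, b, 1e-4)

# Central finite-difference check of the first coordinate.
eps = 1e-6
e0 = np.zeros(6); e0[0] = eps
fd = (objective(x + e0, A, b, 1e-4) - objective(x - e0, A, b, 1e-4)) / (2 * eps)
```

Since $f$ is piecewise quadratic with a continuous gradient, the finite difference matches the analytical gradient to high accuracy away from the kinks.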
(b) (3 points) Suppose that $I_Q = I$, where $I$ is the identity matrix. Explain why $f$ is twice differentiable at $x$ and do the necessary calculations to show that
$$\nabla^2 f(x) = \lambda I + \frac{1}{n} A^T A. \qquad (4)$$
Note that $f$ is not twice differentiable at $x$ if $I_Q = I$ does not hold. However, we can still use
$$\nabla^2 f(x) = \begin{cases} \lambda I + \frac{1}{n} A^T A, & \text{if } f \text{ is twice differentiable at } x,\\ \lambda I, & \text{otherwise,} \end{cases}$$
as the Hessian in Newton's method.
(c) (2 points) Show that f is λ-strongly convex.
2 Numerical methods for linear support vector machine
Our aim in the remainder of this homework is to implement different optimization algorithms for solving the regularized support
vector machine problem (3). We will use the breast-cancer dataset from the UCI Machine Learning Repository [?] consisting of n = 546
samples, each with p = 10 features. In all experiments, λ = 0.0001. You can write the codes either in Matlab or Python. Here are the
descriptions of the codes:
Matlab Python Description
compute_error.m commons.compute_error compute errors
Oracles.m commons.Oracles return function values, gradients, stochastic gradients
GD.m algorithms.GD Gradient Descent
GDstr.m algorithms.GDstr Gradient Descent (strongly convex)
AGD.m algorithms.AGD Accelerated Gradient Descent
AGDstr.m algorithms.AGDstr Accelerated Gradient Descent (strongly convex)
LSGD.m algorithms.LSGD Gradient Descent with line search
LSAGD.m algorithms.LSAGD Accelerated Gradient Descent with line search
AGDR.m algorithms.AGDR Accelerated Gradient Descent with restart
LSAGDR.m algorithms.LSAGDR Accelerated Gradient Descent (line search + restart)
AdaGrad.m algorithms.AdaGrad Adaptive Gradient Method
ADAM.m algorithms.ADAM Adaptive Moment estimation algorithm
SGD.m algorithms.SGD Stochastic Gradient Descent
SAG.m algorithms.SAG Stochastic Averaging Gradient
SVR.m algorithms.SVR Stochastic Gradient Descent (Variance Reduction)
SVM.m SVM.py Main code
General guides:
• You are required to complete the missing parts in the codes, indicated by the markers YOUR CODES HERE.
• The scripts SVM.m (Matlab) and SVM.py (Python) test your results. For instance, change GD_option = 1 in Python (or GD = 1 in Matlab) if you want to test GD, and similarly for the other algorithms. By default these values are 0.
• Put the resulting figures in your report and record the resulting ratios of incorrectly classified points in the test data set.
• For Python coders: in the provided codes, a 1D vector of length m is represented in Matlab as an array of size m × 1, while in Python it is a NumPy array of shape (m, ). If you choose to code in Python, you might need some reshape operations when working with the quasi-Newton method.
2.1 First order methods for linear support vector machine
The optimality condition of (3) is
$$\nabla f(x^\star) = \lambda x^\star + \frac{1}{n}\tilde{A}^T I_Q[\tilde{A}x^\star - \mathbf{1}] - \frac{1}{n}\tilde{A}^T I_L \mathbf{1} = 0. \qquad (5)$$
Condition (5) is in fact necessary and sufficient for $x^\star$ to be optimal for (3). We can equivalently write this condition as
$$x^\star = x^\star - B\nabla f(x^\star) \qquad (6)$$
for any symmetric, positive definite matrix $B$. This is a fixed-point formulation of (3), which will be used to develop gradient and Newton-type methods.
PROBLEM 2: (65 POINTS) – FIRST ORDER METHODS FOR SVM
Notice that choosing $B = \alpha I$ with $\alpha > 0$ in the formulation (6) suggests using the gradient descent method to solve (3). In this problem, you will implement several variants of the gradient descent algorithm. We have defined 3 input arguments (see the documentation in Oracles.m and commons.Oracles to learn how to use them in your codes):
1. fx: the function that characterizes the objective to be minimized;
2. gradf: the function that characterizes the gradient of the objective function fx;
3. parameter: the structure (Matlab) or the dictionary (Python) that includes the fields maxit (number of iterations), x0 (initial estimate), strcnvx (strong convexity constant) and Lips (Lipschitz constant).
(a) (5 points) The main ingredients of the gradient descent algorithm are the descent direction (or search direction) $d^k$ and the step size $\alpha_k$. In this part, we consider the gradient descent algorithm with constant step size $1/L$, i.e., $\alpha_k = 1/L$ for all $k = 0, 1, \ldots$; recall the $L$ computed in Problem 1(a).
Implement this algorithm by completing GD.m or algorithms.GD.
Note that the objective function is strongly convex. Therefore, we can select the constant step size $\alpha = 2/(L + \lambda)$ to get a faster convergence rate. Implement this variant of the gradient descent method by completing the code in GDstr.m or algorithms.GDstr.
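As a sketch of what the two variants should do (not the handout's exact interface), both constant-step-size loops can be sanity-checked on a toy quadratic $f(x) = \frac{1}{2}x^T Q x$, where $L$ and the strong convexity constant are known exactly:

```python
import numpy as np

def GD(gradf, x0, L, maxit):
    """Gradient descent with constant step size 1/L."""
    x = x0.copy()
    alpha = 1.0 / L
    for _ in range(maxit):
        x = x - alpha * gradf(x)
    return x

def GDstr(gradf, x0, L, lam, maxit):
    """Gradient descent with the faster step 2/(L + lam) for strongly convex f."""
    x = x0.copy()
    alpha = 2.0 / (L + lam)
    for _ in range(maxit):
        x = x - alpha * gradf(x)
    return x

# Toy quadratic: strong convexity constant lam = 1, Lipschitz constant L = 10.
Q = np.diag([1.0, 10.0])
gradf = lambda x: Q @ x
x0 = np.array([1.0, 1.0])
x_gd = GD(gradf, x0, L=10.0, maxit=200)
x_str = GDstr(gradf, x0, L=10.0, lam=1.0, maxit=200)
```

On this example the strongly convex step size drives the iterates to the minimizer noticeably faster, matching the rates from the lectures.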
(b) (10 points) We can accelerate the above gradient descent algorithm as follows (with $t_0 = 1$ and $y^0 = x^0$):
$$\begin{cases} x^{k+1} := y^k - \alpha_k \nabla f(y^k),\\[2pt] t_{k+1} := \frac{1}{2}\left(1 + \sqrt{1 + 4t_k^2}\right),\\[2pt] y^{k+1} := x^{k+1} + \frac{t_k - 1}{t_{k+1}}\left(x^{k+1} - x^k\right). \end{cases}$$
Details of this algorithm can be found in Lecture 4.
Implement this algorithm with constant step size $\alpha = 1/L$ by completing the missing parts in AGD.m or the function AGD in algorithms.AGD.
Note that the objective function is strongly convex. Therefore, we can use the accelerated gradient algorithm for strongly convex objectives to get faster convergence. This variant can be summarized as follows:
$$\begin{cases} x^{k+1} := y^k - \alpha_k \nabla f(y^k),\\[2pt] y^{k+1} := x^{k+1} + \frac{\sqrt{L} - \sqrt{\lambda}}{\sqrt{L} + \sqrt{\lambda}}\left(x^{k+1} - x^k\right). \end{cases}$$
Implement this variant of the accelerated gradient method by completing the code in AGDstr.m or algorithms.AGDstr.
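A minimal sketch of both accelerated schemes on the same toy quadratic used above (illustrative names, not the handout's API):

```python
import numpy as np

def AGD(gradf, x0, L, maxit):
    """Accelerated gradient descent with the t_k momentum sequence."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(maxit):
        x_next = y - (1.0 / L) * gradf(y)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x

def AGDstr(gradf, x0, L, lam, maxit):
    """Accelerated variant with constant momentum for strongly convex f."""
    x, y = x0.copy(), x0.copy()
    beta = (np.sqrt(L) - np.sqrt(lam)) / (np.sqrt(L) + np.sqrt(lam))
    for _ in range(maxit):
        x_next = y - (1.0 / L) * gradf(y)
        y = x_next + beta * (x_next - x)
        x = x_next
    return x

Q = np.diag([1.0, 10.0])                 # lam = 1, L = 10
gradf = lambda x: Q @ x
x0 = np.array([1.0, 1.0])
x_agd = AGD(gradf, x0, L=10.0, maxit=300)
x_str = AGDstr(gradf, x0, L=10.0, lam=1.0, maxit=300)
```

The strongly convex variant converges linearly on this quadratic, while the $t_k$ scheme is guaranteed the $O(1/k^2)$ rate from Lecture 4.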
(c) (15 points) We can obtain better performance by considering a line-search procedure, which adapts the step size $\alpha_k$ to the local geometry. The line-search strategy to determine the step size $\alpha_k$ for the standard GD iteration
$$x^{k+1} = x^k - \alpha_k \nabla f(x^k)$$
is the following. At step $k$, let $x^k$ be the current iterate and $d^k = -\nabla f(x^k)$ be the given descent direction, and perform:
• Set $L_0 = L$.
• At each iteration, set $L_{k,0} = \frac{1}{2} L_{k-1}$, where $k$ is the iteration counter.
• Using a for loop, find the minimum integer $i \ge 0$ that satisfies
$$f\left(x^k + \frac{1}{2^i L_{k,0}} d^k\right) \le f(x^k) - \frac{1}{2^{i+1} L_{k,0}} \|d^k\|^2.$$
• Set $L_k = 2^i L_{k,0}$ and use the step size $\alpha_k := \frac{1}{L_k}$ (i.e., use the new estimate that you have computed in the line search: $x^k + \frac{1}{2^i L_{k,0}} d^k$).
Complete the missing parts in the files LSGD.m or algorithms.LSGD in order to implement gradient descent with line search.
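One backtracking step of the strategy above can be sketched as follows; a toy quadratic stands in for the SVM objective, and all names are illustrative:

```python
import numpy as np

def ls_step(f, gradf, x, L_prev):
    """One LSGD step: backtrack from L_{k,0} = L_{k-1}/2, return (x_next, L_k)."""
    d = -gradf(x)
    Lk0 = 0.5 * L_prev
    i = 0
    while True:
        step = 1.0 / (2 ** i * Lk0)                 # candidate 1 / (2^i L_{k,0})
        x_trial = x + step * d
        # Sufficient-decrease test: f(x_trial) <= f(x) - ||d||^2 / (2^{i+1} L_{k,0})
        if f(x_trial) <= f(x) - 0.5 * step * np.dot(d, d):
            return x_trial, (2 ** i) * Lk0          # L_k = 2^i L_{k,0}
        i += 1

Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
gradf = lambda x: Q @ x
x = np.array([1.0, 1.0])
Lk = 10.0                                           # initialize L_0 = L
for _ in range(100):
    x, Lk = ls_step(f, gradf, x, Lk)
```

Because each accepted step already satisfies the sufficient-decrease condition, the trial point `x_trial` can be reused directly as the next iterate, as the text suggests.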
We now incorporate a line search into the accelerated gradient method of (b) as follows: at step $k$, one has the current iterate $x^k$ together with the intermediate variable $y^k$ and its corresponding direction $d^k = -\nabla f(y^k)$. Note that the intermediate variable is the one used in the gradient step, so the line search is performed on it to determine the step size.
• Perform the line-search strategy above with respect to $y^k$ and the direction $d^k$ to determine the step size $\alpha_k$:
– Set $L_0 = L$.
– At each iteration, set $L_{k,0} = \frac{1}{2} L_{k-1}$, where $k$ is the iteration counter.
– Using a for loop, find the minimum integer $i \ge 0$ that satisfies
$$f\left(y^k + \frac{1}{2^i L_{k,0}} d^k\right) \le f(y^k) - \frac{1}{2^{i+1} L_{k,0}} \|d^k\|^2.$$
– Set $L_k = 2^i L_{k,0}$ and use the step size $\alpha_k := \frac{1}{L_k}$.
• Update the next iterates:
$$\begin{cases} x^{k+1} := y^k - \alpha_k \nabla f(y^k),\\[2pt] t_{k+1} := \frac{1}{2}\left(1 + \sqrt{1 + 4 \frac{L_k}{L_{k-1}} t_k^2}\right),\\[2pt] y^{k+1} := x^{k+1} + \frac{t_k - 1}{t_{k+1}}\left(x^{k+1} - x^k\right). \end{cases}$$
Complete the missing parts in the files LSAGD.m or algorithms.LSAGD in order to implement accelerated gradient descent with line search.
(d) (15 points) The accelerated gradient method is non-monotone, so it can be oscillatory, i.e., $f(x^{k+1}) \le f(x^k)$ need not hold for all $k \ge 0$. To prevent such behavior, we can use the so-called adaptive restart strategy. Briefly, one such strategy works as follows: at each iteration, whenever $x^{k+1}$ is computed, we evaluate $f(x^{k+1})$ and compare it with $f(x^k)$:
• If $f(x^k) < f(x^{k+1})$, restart the iteration, i.e., recompute $x^{k+1}$ after setting $y^k := x^k$ and $t_k := 1$;
• Otherwise, let the algorithm continue.
This strategy requires evaluating the function value at each iteration, which increases the computational complexity of the overall algorithm. Implement the adaptive restart strategy based on function values for the accelerated gradient algorithm with constant step size $\alpha_k = 1/L$ by completing the files AGDR.m or the function AGDR in algorithms.AGDR.
Incorporate line search, acceleration and the function-value restart by completing LSAGDR.m or the function LSAGDR in algorithms.LSAGDR.
Hint: While the restart is executed on $x^k$, the line-search strategy is executed on $y^k$ and the direction $d^k = -\nabla f(y^k)$ to determine the step size; hence, use the line search each time you encounter an intermediate variable $y^k$.
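A sketch of the function-value restart on top of the constant-step accelerated scheme, again on a toy quadratic (illustrative names):

```python
import numpy as np

def AGDR(f, gradf, x0, L, maxit):
    """Accelerated gradient with function-value adaptive restart."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(maxit):
        x_next = y - (1.0 / L) * gradf(y)
        if f(x_next) > f(x):                        # oscillation detected
            y, t = x.copy(), 1.0                    # restart: y^k := x^k, t_k := 1
            x_next = y - (1.0 / L) * gradf(y)       # recompute x^{k+1}
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x

Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
gradf = lambda x: Q @ x
x_r = AGDR(f, gradf, np.array([1.0, 1.0]), L=10.0, maxit=300)
```

The restart branch replaces the momentum step with a plain $1/L$ gradient step from $x^k$, which guarantees the function value never increases from one iterate to the next.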
(e) (10 points) We can also apply an optimization technique that does not exploit knowledge of the Lipschitz constant, and instead adapts to the local geometry by making use of past gradient information. AdaGrad adapts the step size using the inverse square root of the accumulated squared gradient norms. Starting with $Q_0 = 0$, it iterates as follows:
$$\begin{cases} Q_k = Q_{k-1} + \|\nabla f(x^k)\|^2,\\[2pt] H_k = \left(\sqrt{Q_k} + \delta\right) I,\\[2pt] x^{k+1} = x^k - \alpha H_k^{-1} \nabla f(x^k). \end{cases}$$
Complete the missing parts in the files AdaGrad.m or algorithms.AdaGrad in order to implement the above adaptive gradient method using $\alpha = 1$, $\delta = 10^{-5}$.
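A sketch of this scalar-accumulator AdaGrad on a toy quadratic (illustrative names; note this variant accumulates squared gradient *norms* into a single scalar $Q_k$, unlike the per-coordinate AdaGrad common in deep learning):

```python
import numpy as np

def AdaGrad(gradf, x0, maxit, alpha=1.0, delta=1e-5):
    x = x0.copy()
    acc = 0.0                            # Q_k: accumulated squared gradient norms
    for _ in range(maxit):
        g = gradf(x)
        acc += np.dot(g, g)              # Q_k = Q_{k-1} + ||grad f(x^k)||^2
        H = np.sqrt(acc) + delta         # H_k = (sqrt(Q_k) + delta) I, kept as a scalar
        x = x - (alpha / H) * g          # x^{k+1} = x^k - alpha H_k^{-1} grad f(x^k)
    return x

Qm = np.diag([1.0, 10.0])                # toy quadratic f(x) = 0.5 x^T Qm x
gradf = lambda x: Qm @ x
x_ada = AdaGrad(gradf, np.array([1.0, 1.0]), maxit=2000)
```

Since $H_k$ is a multiple of the identity here, its inverse is just a scalar division; no matrix inversion is needed.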
(f) (10 points) Another famous adaptive optimization method is the ADAptive Moment estimation algorithm, a.k.a. ADAM:
$$\begin{cases} g_k = \nabla f(x^k),\\[2pt] m_k = \beta_1 m_{k-1} + (1 - \beta_1) g_k & \leftarrow \text{momentum},\\[2pt] v_k = \beta_2 v_{k-1} + (1 - \beta_2) g_k^2 & \leftarrow \text{adaptive term},\\[2pt] \hat{m}_k = m_k / (1 - \beta_1^k),\\[2pt] \hat{v}_k = v_k / (1 - \beta_2^k) & \leftarrow \text{scaling for removing bias},\\[2pt] H_k = \sqrt{\hat{v}_k} + \epsilon,\\[2pt] x^{k+1} = x^k - \alpha \hat{m}_k / H_k. \end{cases}$$
Note that all operations shown above, when applied to vectors, are applied component-wise. In particular, $g_k^2$ is a vector of the same size as $g_k$ with all elements squared.
Complete the missing parts in the files ADAM.m or algorithms.ADAM in order to implement the above adaptive gradient method using $\alpha = 0.1$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.
It has been shown that ADAM can fail to converge to the global minimum of a convex problem [1]. The authors provided a variant of ADAM, called AMSGrad, in order to fix this convergence issue. However, in practice it is not clear which method performs best. (You are not required to implement this method, but you are advised to have a look at it out of personal interest.)
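A sketch of the update above in NumPy (toy quadratic; illustrative names). Note that with a constant step size ADAM typically hovers near the minimizer rather than converging exactly, so the check at the end only verifies a large decrease in the objective:

```python
import numpy as np

def ADAM(gradf, x0, maxit, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for k in range(1, maxit + 1):        # k starts at 1 for the bias correction
        g = gradf(x)
        m = beta1 * m + (1 - beta1) * g          # momentum
        v = beta2 * v + (1 - beta2) * g ** 2     # adaptive term (component-wise square)
        m_hat = m / (1 - beta1 ** k)             # bias correction
        v_hat = v / (1 - beta2 ** k)
        H = np.sqrt(v_hat) + eps                 # component-wise
        x = x - alpha * m_hat / H
    return x

Qm = np.diag([1.0, 10.0])                # toy quadratic f(x) = 0.5 x^T Qm x
gradf = lambda x: Qm @ x
x_adam = ADAM(gradf, np.array([1.0, 1.0]), maxit=2000)
```

All vector operations above are component-wise, matching the description in the text.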
2.2 Stochastic gradient method for SVM
In this problem, you will implement three versions of the stochastic gradient descent method to solve the SVM problem. We have defined 4 input arguments (see the documentation in Oracles.m and commons.Oracles to learn how to use them in your codes):
1. fx: the function that characterizes the objective to be minimized;
2. gradf: the function that characterizes the gradient of the objective function fx;
3. gradfsto: the function that characterizes a stochastic gradient of the objective function fx;
4. parameter: the structure (Matlab) or the dictionary (Python) that includes the fields maxit (number of iterations), x0 (initial estimate), strcnvx (strong convexity constant), Lmax (see definition below) and no0functions (number of functions).
PROBLEM 4: (25 POINTS) – STOCHASTIC GRADIENT METHODS FOR SVM
To use stochastic gradient descent, we recast (3) as
$$f(x) = \frac{1}{n}\sum_{i=1}^{n} \underbrace{\left( g_i(x, 0) + \frac{\lambda}{2}\|x\|^2 \right)}_{f_i(x)}. \qquad \text{(SVM)}$$
Let $\mathbb{1}\{\text{predicate}\} = 1$ if the predicate is true, and $0$ otherwise. Similarly to Problem 1, we can deduce that
$$\nabla f_i(x) = \lambda x + \mathbb{1}\{0 < b_i(a_i^T x) \le 1\}\, a_i (a_i^T x - b_i) - \mathbb{1}\{b_i a_i^T x \le 0\}\, b_i a_i.$$
Consider the following stochastic gradient update: at iteration $k$, pick $i_k \in \{1, \ldots, n\}$ uniformly at random and define
$$x^{k+1} := x^k - \alpha_k \nabla f_{i_k}(x^k). \qquad \text{(SGD)}$$
(a) (5 points) Show that $\nabla f_{i_k}(x)$ is an unbiased estimator of $\nabla f(x)$. Explain why $\nabla f_{i_k}$ is Lipschitz continuous with $L(f_{i_k}) = \|a_{i_k}\|^2 + \lambda$. As a hint, recall how we upper bounded $L$ in Problem 1. Set $L_{\max} = \max_{i \in \{1, \ldots, n\}} L(f_i)$.
(b) (5 points) We can use the standard stochastic gradient descent (SGD) to solve (SVM). Complete the codes SGD.m (Matlab) or algorithms.SGD (Python) using the step-size rule $\alpha_k = \frac{1}{k}$.
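A sketch of the per-sample gradient and the (SGD) recursion with $\mu = 0$; the helper names and synthetic data are illustrative. The block also averages $\nabla f_i$ over all $i$ and reproduces the full gradient, which is exactly the unbiasedness identity of part (a):

```python
import numpy as np

def grad_fi(x, A, b, lam, i):
    """nabla f_i(x) for the smoothed hinge SVM with mu = 0."""
    z = b[i] * (A[i] @ x)
    g = lam * x
    if 0 < z <= 1:                       # quadratic region
        g = g + A[i] * (A[i] @ x - b[i])
    elif z <= 0:                         # linear region
        g = g - b[i] * A[i]
    return g

rng = np.random.default_rng(2)
n, p, lam = 200, 5, 1e-4
A = rng.standard_normal((n, p))
b = np.sign(A @ rng.standard_normal(p)) # separable toy labels
x = np.zeros(p)
for k in range(1, 5001):                 # step-size rule alpha_k = 1/k
    i = rng.integers(n)
    x = x - (1.0 / k) * grad_fi(x, A, b, lam, i)

# Unbiasedness: the average of grad_fi over all i equals the full gradient.
xr = rng.standard_normal(p)
g_avg = np.mean([grad_fi(xr, A, b, lam, i) for i in range(n)], axis=0)
At = b[:, None] * A
z = At @ xr
g_full = (lam * xr
          + At.T @ (((z > 0) & (z <= 1)).astype(float) * (z - 1.0)) / n
          - At.T @ (z <= 0).astype(float) / n)
```

Since $i_k$ is drawn uniformly, the expectation of $\nabla f_{i_k}(x)$ is exactly this average, i.e., the full gradient.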
(c) (7 points) Consider the following stochastic averaging gradient method (SAG) to solve (SVM):
$$\begin{cases} \text{pick } i_k \in \{1, \ldots, n\} \text{ uniformly at random},\\[2pt] x^{k+1} := x^k - \frac{\alpha_k}{n} \sum_{i=1}^{n} v_i^k, \end{cases}$$
where
$$v_i^k = \begin{cases} \nabla f_i(x^k) & \text{if } i = i_k,\\ v_i^{k-1} & \text{otherwise.} \end{cases}$$
Complete the codes SAG.m (Matlab) or algorithms.SAG (Python) using the step size $\alpha_k = \frac{1}{16 L_{\max}}$ and $v^0 = 0$.
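A sketch of SAG as specified, maintaining the running sum $\sum_i v_i^k$ incrementally so each iteration costs a single gradient evaluation (synthetic data and names are illustrative; the final check compares the objective against $f(0) = 1/2$):

```python
import numpy as np

def grad_fi(x, A, b, lam, i):
    z = b[i] * (A[i] @ x)
    g = lam * x
    if 0 < z <= 1:
        g = g + A[i] * (A[i] @ x - b[i])
    elif z <= 0:
        g = g - b[i] * A[i]
    return g

def objective(x, A, b, lam):
    z = b * (A @ x)
    g = np.where(z <= 0, 0.5 - z, np.where(z <= 1, 0.5 * (1 - z) ** 2, 0.0))
    return g.mean() + 0.5 * lam * x @ x

rng = np.random.default_rng(3)
n, p, lam = 100, 5, 1e-4
A = rng.standard_normal((n, p))
b = np.sign(A @ rng.standard_normal(p))
Lmax = (A ** 2).sum(axis=1).max() + lam  # L(f_i) = ||a_i||^2 + lam
alpha = 1.0 / (16 * Lmax)                # step size from the text

x = np.zeros(p)
v = np.zeros((n, p))                     # v^0 = 0
v_sum = np.zeros(p)                      # running sum of the v_i
for _ in range(3000):
    i = rng.integers(n)
    g_new = grad_fi(x, A, b, lam, i)
    v_sum += g_new - v[i]                # replace v_i inside the sum in O(p)
    v[i] = g_new
    x = x - (alpha / n) * v_sum

f_final = objective(x, A, b, lam)
```

The incremental update of `v_sum` is the standard trick that keeps SAG's per-iteration cost comparable to SGD despite the sum over all $n$ stored gradients.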
(d) (8 points) We can get a faster convergence rate for SGD by using the following version of SGD with variance reduction:
$$\begin{cases} \tilde{x} = x^k, \quad v^k = \nabla f(\tilde{x}), \quad \tilde{x}^0 = \tilde{x}\\[2pt] \text{For } l = 0, \ldots, q-1:\\[2pt] \quad \text{pick } i_l \in \{1, \ldots, n\} \text{ uniformly at random}\\[2pt] \quad v^l = \nabla f_{i_l}(\tilde{x}^l) - \nabla f_{i_l}(\tilde{x}) + v^k\\[2pt] \quad \tilde{x}^{l+1} := \tilde{x}^l - \gamma v^l\\[2pt] x^{k+1} = \frac{1}{q} \sum_{l=0}^{q-1} \tilde{x}^{l+1}. \end{cases}$$
Complete the codes SVR.m (Matlab) or algorithms.SVR (Python) with the following rules: $\gamma = 0.01/L_{\max}$ and $q = \lfloor 1000 \cdot L_{\max} \rfloor$, i.e., $q$ is the integer part of $1000 \cdot L_{\max}$.
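A sketch of the variance-reduced loop above (illustrative names; the synthetic data are scaled down so that $q = \lfloor 1000 \cdot L_{\max} \rfloor$ stays small enough for a quick run):

```python
import numpy as np

def grad_fi(x, A, b, lam, i):
    z = b[i] * (A[i] @ x)
    g = lam * x
    if 0 < z <= 1:
        g = g + A[i] * (A[i] @ x - b[i])
    elif z <= 0:
        g = g - b[i] * A[i]
    return g

def grad_full(x, A, b, lam):
    n = A.shape[0]
    At = b[:, None] * A
    z = At @ x
    iq = ((z > 0) & (z <= 1)).astype(float)
    il = (z <= 0).astype(float)
    return lam * x + At.T @ (iq * (z - 1.0)) / n - At.T @ il / n

def objective(x, A, b, lam):
    z = b * (A @ x)
    g = np.where(z <= 0, 0.5 - z, np.where(z <= 1, 0.5 * (1 - z) ** 2, 0.0))
    return g.mean() + 0.5 * lam * x @ x

rng = np.random.default_rng(4)
n, p, lam = 100, 5, 1e-4
A = rng.standard_normal((n, p)) / 3.0    # scaled so Lmax (hence q) is small
b = np.sign(A @ rng.standard_normal(p))
Lmax = (A ** 2).sum(axis=1).max() + lam
gamma = 0.01 / Lmax
q = int(1000 * Lmax)                     # integer part of 1000 * Lmax

x = np.zeros(p)
for _ in range(5):                       # outer iterations
    x_tilde = x.copy()
    vk = grad_full(x_tilde, A, b, lam)   # full gradient at the snapshot
    xl = x_tilde.copy()
    x_sum = np.zeros(p)
    for _ in range(q):
        i = rng.integers(n)
        vl = grad_fi(xl, A, b, lam, i) - grad_fi(x_tilde, A, b, lam, i) + vk
        xl = xl - gamma * vl
        x_sum += xl
    x = x_sum / q                        # average of the inner iterates

f_final = objective(x, A, b, lam)
```

The correction term $-\nabla f_{i_l}(\tilde{x}) + v^k$ keeps the stochastic direction unbiased while shrinking its variance as the iterates approach the snapshot, which is what permits the constant step size $\gamma$.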
3 Guidelines for the preparation and the submission of the homework
Work on your own. Do not copy or distribute your codes to other students in the class. Do not reuse any other code related to this homework. Here are a few warnings and suggestions for preparing and submitting your homework.
• This homework is due at 4:00PM, 1st of November, 2019.
• Submit your work before the due date. Late submissions are not allowed, and you will get 0 points for this homework if you submit it after the deadline.
• Your final report should include detailed answers and it needs to be submitted in PDF format. The PDF file can be a scan or
a photo. Make sure that it is readable. The results of your simulations should also be presented in the final report with clear
explanations.
• You can use MATLAB or Python for the coding exercises. We provide some incomplete MATLAB and Python scripts. You need
to complete the missing parts to implement the algorithms. Depending on your implementation, you might also want to change
some written parts and parameters in the provided codes. In this case, indicate clearly your modifications on the original code,
both inside the codes with comments, and inside your written report. Note that you are responsible for the entire code you
submit.
• Your codes should be well-documented and they should work properly. Make sure that your code runs without errors. If the
code you submit does not run, you will not be able to get any credits from the related exercises.
• Compress your codes and your final report into a single ZIP file, name it as ee556_2019_hw1_NameSurname.zip, and submit
it through the moodle page of the course.
• Discussing concepts with other students is OK; however, each homework exercise should be attempted and completed individually. You should write your report on your own, in your own words and reflecting your own understanding. Your reports and your codes will be checked thoroughly. Reports with overly similar unjustified statements will be considered copying, and copying and cheating will not be tolerated. The first offense results in zero points for the corresponding homework, and the second in zero points for the whole course.
References
[1] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237, 2019.