1 MULTICLASS CLASSIFICATION – 70 POINTS
In Lecture 2 (slide 19) you were introduced to binary logistic regression [3] and learned how to derive the expression of its maximum
likelihood estimator. We briefly recall the procedure and then guide you through extending these notions to multiple classes.
We are working with data represented by pairs (a_i^T, b_i), where a_i^T is the i-th row of matrix A ∈ R^{n×d} and b_i is the i-th entry of b ∈ R^n. Each b_i denotes the class label associated to sample a_i^T, and we have a total of C = 2 classes, which we encode as {−1, 1}. We denote by n the number of data samples and by d the problem dimension. Lastly, we aim to model our data with a feature vector x ∈ R^d. For clarity, refer to the visual representations below.
A = [← a_1^T →; ← a_2^T →; ... ; ← a_n^T →] ∈ R^{n×d},   b = (b_1, b_2, ..., b_n)^T ∈ {−1, 1}^n,   x = (x_1, x_2, ..., x_d)^T ∈ R^d.   (1)
The logistic model assumes there is a linear relationship between x and the logarithm of the odds of b_i = 1^1, given a_i. Formally, we can write this as

log [ P(b_i = 1 | a_i, x) / P(b_i = −1 | a_i, x) ] = a_i^T x,

where the left-hand side is the log of the odds of b_i = 1 to b_i = −1. Since C = 2, this is equivalent to

log [ P(b_i = 1 | a_i, x) / (1 − P(b_i = 1 | a_i, x)) ] = a_i^T x,

where the quantity inside the logarithm is called a logit (hence the name logistic regression), which in turn is equivalent to

P(b_i = 1 | a_i, x) = e^{a_i^T x} / (1 + e^{a_i^T x}),

the right-hand side being the logistic function.
Assuming the data samples are statistically independent, we can write down the probability of observing vector b as:

P(b | A, x) = P(b_1 ∧ b_2 ∧ ... ∧ b_n | A, x) = ∏_{i=1}^n P(b_i | a_i, x).   (2)
Expression (2) is called the likelihood of observing b, given data A and parameters x. The goal of logistic regression is to find a feature vector x̂_ML that maximizes the likelihood of observing b.
We can find x̂_ML using the numerical optimization methods you have seen in the lectures. The objective function is chosen as the negative logarithm of the likelihood^2, as minimizing it is equivalent to maximizing P(b | A, x). The expression of the maximum likelihood estimator is therefore given by:

x̂_ML ∈ arg min_x { −log P(b | A, x) = ∑_{i=1}^n log[1 + exp(−b_i a_i^T x)] : x ∈ R^d }.
We now extend the binary setting to C > 2 classes, which naturally leads to having correspondingly many feature vectors. Note that one may use the same technique as in the binary case, by setting one of the classes as the pivot (for example class C) and writing the log-odds against it. This effectively leaves us with C − 1 feature vectors, because we fix x_C by construction. We will instead take a different approach that is (arguably) more natural and easier to implement in PART II of this exercise. This approach uses C feature vectors represented as a matrix X ∈ R^{d×C}. For clarity, refer to the visual representation below.
X = [x_1 x_2 ... x_C] ∈ R^{d×C},   x_i = (x_{1i}, x_{2i}, ..., x_{di})^T ∈ R^d.   (3)
^1 The choice of class label does not matter – we could have just as well chosen −1.
^2 Operating with sums is easier, hence the logarithm. The minus sign is added to make the function convex.
LIONS @ EPFL Prof. Volkan Cevher
To arrive at our probabilistic model, we start from the following equations:

log(P(b_i = 1 | a_i, X) · S) = a_i^T x_1  ⟹  log P(b_i = 1 | a_i, X) = a_i^T x_1 − log S  ⟹  P(b_i = 1 | a_i, X) = e^{a_i^T x_1} / S,
log(P(b_i = 2 | a_i, X) · S) = a_i^T x_2  ⟹  log P(b_i = 2 | a_i, X) = a_i^T x_2 − log S  ⟹  P(b_i = 2 | a_i, X) = e^{a_i^T x_2} / S,
  ⋮
log(P(b_i = C | a_i, X) · S) = a_i^T x_C  ⟹  log P(b_i = C | a_i, X) = a_i^T x_C − log S  ⟹  P(b_i = C | a_i, X) = e^{a_i^T x_C} / S.
In contrast to the binary case, the probability in the denominator is replaced by a factor S that acts as a normalizing constant. We set S = ∑_{k=1}^C e^{a_i^T x_k} in order for the probabilities to sum to 1. We therefore have:

P(b_i = j | a_i, X) = e^{a_i^T x_j} / ∑_{k=1}^C e^{a_i^T x_k}.
PART I – THEORY (35 POINTS)
EXERCISE 1.I.1: (7 POINTS) – Following the same reasoning as for the binary case, show that the maximum likelihood estimator for the multinomial logistic regression problem has the following expression:

X̂_ML ∈ arg min_X { f(X) | f : R^{d×C} → R, f(X) = ∑_{i=1}^n [ log ∑_{k=1}^C e^{a_i^T x_k} − a_i^T x_{b_i} ], b_i ∈ {1, 2, ..., C} }.   (4)
EXERCISE 1.I.2: (7 POINTS) – Show that the gradient with respect to X of the loss function f from Exercise I.1 has the following matrix form (this will come in useful in the second part of the exercise, where you have to implement the algorithms):

∇f(X) = A^T ( Z exp(AX) − Y ),

where Z is the n × n diagonal matrix with

Z_{ij} = 1 / ∑_{k=1}^C e^{a_i^T x_k}  if i = j,  and  Z_{ij} = 0  otherwise,

Y is an n × C matrix whose rows y_i^T are the one-hot encodings^3 corresponding to the labels b_i, and exp is applied entry-wise.
HINT: It is helpful to first compute the gradient of f with respect to a single entry x_{ij} of X.
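Since Z exp(AX) is exactly the n × C matrix of softmax probabilities, the diagonal matrix Z never needs to be formed explicitly. A sketch of the matrix form in NumPy (hypothetical names; the template's gradfx may differ):

```python
import numpy as np

def grad_f(X, A, b):
    """Gradient of f in the matrix form A^T (Z exp(AX) - Y)."""
    AX = A @ X
    AX = AX - AX.max(axis=1, keepdims=True)  # stabilized; cancels in the ratio
    E = np.exp(AX)
    P = E / E.sum(axis=1, keepdims=True)     # rows of Z exp(AX)
    Y = np.zeros_like(P)
    Y[np.arange(len(b)), b] = 1.0            # one-hot rows y_i^T (0-based labels)
    return A.T @ (P - Y)
```

Forming P row-wise keeps the memory cost at O(nC) instead of the O(n^2) an explicit Z would require.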
EXERCISE 1.I.3: (SELF-STUDY, 0 POINTS) – In this exercise, we are interested in showing that f is convex. We provide you with the expression of the Hessian of f below^4:

∇²f(X) = ∑_{i=1}^n Σ_i ⊗ a_i a_i^T,   (5)

where ⊗ is the Kronecker product, and Σ_i is given by:

Σ_i =
[ σ_{i1}(1 − σ_{i1})   −σ_{i1}σ_{i2}      ...   −σ_{i1}σ_{iC}     ]
[ −σ_{i2}σ_{i1}        σ_{i2}(1 − σ_{i2}) ...   −σ_{i2}σ_{iC}     ]
[   ⋮                    ⋮                 ⋱      ⋮                ]
[ −σ_{iC}σ_{i1}        −σ_{iC}σ_{i2}      ...   σ_{iC}(1 − σ_{iC}) ],

with σ_{ij} = e^{a_i^T x_j} / ∑_{k=1}^C e^{a_i^T x_k}.   (6)
Knowing that a function of multiple variables is convex iff its Hessian is PSD, use the expression of ∇²f above to prove that f is convex.
HINT:
• Start by showing that both Σ_i ⪰ 0 and a_i a_i^T ⪰ 0, ∀i. You can use the following: D ⪰ 0 ⇐⇒ z^T D z ≥ 0, ∀z.
^3 The one-hot encoding of class k ≤ C is given by the vector y ∈ R^C with y_k = 1 and y_j = 0 for j ≠ k. Precisely, if b_i = 2, then y_i has a 1 in position 2 and 0 everywhere else.
^4 The interested student can refer to the Appendix for details of the derivation.
• Then apply the following result [9]: D ⪰ 0, E ⪰ 0 ⟹ D ⊗ E ⪰ 0.
• Finally, argue that given two sum-compatible matrices D ⪰ 0 and E ⪰ 0, you have D + E ⪰ 0.
EXERCISE 1.I.4: (8 POINTS) – We now seek to estimate the Lipschitz constant L of ∇f, as we will need it for PART II of this exercise. We know that L is such that λ_max(∇²f) ≤ L < ∞, where λ_max(∇²f) is the largest eigenvalue of ∇²f(X).
a) Knowing that a_i a_i^T is a rank-1 matrix, show that λ_max(a_i a_i^T) = ||a_i||_2^2.
b) Starting from (5), show that we can choose L = ||A||_F^2 / 2.
HINT: You can use the following statements:
• For D, E ⪰ 0, we have λ_max(D ⊗ E) = λ_max(D) λ_max(E).
• Given D ⪰ 0, we have λ_max(D) ≤ max_i ∑_j D_{ij} (the top eigenvalue is upper-bounded by the maximum ℓ1 norm of the rows).
• 1 − σ_{ij} = ∑_{k=1, k≠j}^C σ_{ik}. This comes from having a probability distribution over the C classes.
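The resulting bound from b) is a one-liner in NumPy; the function name below is ours:

```python
import numpy as np

def lipschitz_constant(A):
    """Lipschitz constant estimate L = ||A||_F^2 / 2 for the gradient of f."""
    return 0.5 * np.linalg.norm(A, 'fro') ** 2
```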
EXERCISE 1.I.5: (8 POINTS) – Sometimes we wish to enforce certain properties onto our solution (e.g. sparsity). This can be achieved by adding a regularizer to our objective:

F(X) = f(X) + λ g(X).

As we saw in Lecture 9, such a function can be minimized using proximal gradient algorithms, provided that g is 'proximable' (i.e., its proximal operator is efficient to evaluate). We recall that the proximal operator of g is defined as the solution to the following convex problem:

prox_g(z) := arg min_{y ∈ R^d} { g(y) + (1/2) ||y − z||_2^2 }.
Given g : R^d → R, g(x) := ||x||_1, show that its proximal operator can be written as

prox_{λg}(z) = max(|z| − λ, 0) ∘ sign(z),  z ∈ R^d,   (7)

where the operators max, sign and |·| are applied coordinate-wise to the vector z, and ∘ stands for the entry-wise product (x ∘ y)_i = x_i y_i. Such a regularizer imposes sparsity on the solutions.
HINT: Show that relation (7) holds for z ∈ R. Then use the fact that ||y||_1 + (1/2)||y − z||_2^2 = ∑_{i=1}^d [ |y_i| + (1/2)(y_i − z_i)^2 ] to conclude that the operators can be applied coordinate-wise.
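In NumPy, relation (7) is the familiar soft-thresholding operator. A minimal sketch (the signature is our assumption; the template's l1_prox may differ):

```python
import numpy as np

def l1_prox(z, lam):
    """Soft-thresholding from (7): prox of lam * ||.||_1, coordinate-wise."""
    return np.maximum(np.abs(z) - lam, 0.0) * np.sign(z)
```

Every coordinate with |z_i| ≤ λ is set exactly to zero, which is precisely the source of the sparsity.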
EXERCISE 1.I.6: (5 POINTS) – Given g(x) := (1/2)||x||_2^2, show that its proximal operator can be written as

prox_{λg}(z) = z / (1 + λ).   (8)

Such a regularizer bounds the growth of X's components, thus helping with numerical stability.
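Relation (8) likewise reduces to a one-line shrinkage (signature assumed as for l1_prox above):

```python
import numpy as np

def l2_prox(z, lam):
    """Prox of lam * (1/2)||.||_2^2 from (8): uniform shrinkage of z."""
    return np.asarray(z) / (1.0 + lam)
```

It solves the optimality condition λy + (y − z) = 0 of the prox problem in closed form.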
PART II – HANDWRITTEN DIGIT CLASSIFICATION (35 POINTS)
In this section you will implement algorithms based on your theoretical derivations from PART I and test them on the well-known
MNIST dataset [4] consisting of 28 × 28 grayscale images of handwritten digits. Figure 1 below depicts a subset of samples pertaining
to each of the 10 classes.
Figure 1: MNIST selected digit images.
Source: https://wikipedia.org/wiki/MNIST_database
The problem constants become C = 10, d = 28 × 28 = 784, and we choose n = 5000. The rows a_i^T of our data matrix A are the vectorized^5 representations of the digit images. Finally, the labels b_i are integers of the set {0, 1, ..., 9}. The data is preprocessed and given to you in compressed format in the file mnist_train_test_split.npz. The code for loading it is already provided in the main.py file.
For this exercise you will implement algorithms which solve the following regularized versions of multiclass Logistic Regression:
• ℓ1-regularized (or sparse) Logistic Regression:

F(X) = f(X) + λ ∑_{k=1}^C ||x_k||_1.   (9)

• ℓ2-regularized Logistic Regression:

F(X) = f(X) + (λ/2) ∑_{k=1}^C ||x_k||_2^2.   (10)

Using the information in PART I of this exercise, as well as the contents of Lecture 9, you are asked to fill in the code gaps, run the experiments and interpret the results.
EXERCISE 1.II.1: (20 POINTS) – In the file named algorithms.py you can find the incomplete methods whose code you need to fill in.
a) In the file operators.py fill in the methods l1_prox and l2_prox with the proximal operator expressions from PART I, adapted to problems (9) and (10), respectively. Also fill in the method gradfx with the expression of ∇f from PART I.
b) Using the information in Lecture 9, fill in the code of the methods ista, fista (adapted to also handle restarting) and gradient_scheme_restart_condition from the file algorithms.py. In the latter method, you need to implement the so-called 'gradient restart condition', described in depth in papers such as [2, 7]. Formally, the criterion is:

⟨Y^t − X^{t+1}, X^{t+1} − X^t⟩ > 0,

where t is the iteration index. The term Y^t − X^{t+1} can be seen as a gradient, while X^{t+1} − X^t is the descent direction of the momentum term. Overall, this criterion resets the momentum of FISTA to 0 when it points in a bad descent direction. Note that we are working with matrices, so we need to adapt the inner product correspondingly: ⟨D, E⟩ = Tr(D^T E).
NOTE: The Y in the gradient restart condition refers to the FISTA iterate. Do not confuse it with the one-hot encoding matrix Y from Exercise 1.I.2.
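With the trace inner product, the criterion is one line of NumPy, since np.sum(D * E) equals Tr(D^T E). The signature below is hypothetical; the template's gradient_scheme_restart_condition may take different arguments.

```python
import numpy as np

def gradient_restart(Y_t, X_next, X_t):
    """True when <Y^t - X^{t+1}, X^{t+1} - X^t> > 0, i.e. when FISTA's
    momentum should be reset; np.sum(D * E) computes trace(D^T E)."""
    return np.sum((Y_t - X_next) * (X_next - X_t)) > 0
```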
c) Using Lecture 9 as a guideline, fill in the method prox_sg from the file algorithms.py and stocgradfx from operators.py. The latter method needs to return the minibatch stochastic gradient of f. It is not difficult to show that, given m_s stochastic samples of A, i.e. rows of A chosen uniformly at random,

∇̃_{m_s} f(X) = (n / m_s) ∑_{i=1}^{m_s} a_i ( e^{a_i^T X} / ∑_{k=1}^C e^{a_i^T x_k} − y_i^T )
^5 A vectorized representation of a matrix Y ∈ R^{n×d} is given by the vector y ∈ R^{nd} obtained by stacking the columns of Y on top of one another.
is an unbiased estimate of ∇f(X). The minibatch size and learning-rate decrease function are given to you in the params variable of the main method – please make sure to use those.
Note: For the prox_sg method you need to record convergence with respect to the ergodic iterate X̂^t = (∑_{j=1}^t γ_j X^j) / (∑_{j=1}^t γ_j), where γ_j is the learning rate at step j.
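A possible sketch of the minibatch stochastic gradient (hypothetical signature; the template's stocgradfx may differ). Here rows are sampled uniformly with replacement, and the n/m_s factor makes the estimate unbiased:

```python
import numpy as np

def stoc_grad_f(X, A, b, ms, rng):
    """Minibatch stochastic gradient of f from ms uniformly sampled rows."""
    n = A.shape[0]
    idx = rng.integers(0, n, size=ms)        # uniform sampling with replacement
    As, bs = A[idx], b[idx]
    AX = As @ X
    AX = AX - AX.max(axis=1, keepdims=True)  # stabilized softmax
    E = np.exp(AX)
    P = E / E.sum(axis=1, keepdims=True)     # softmax probability rows
    Y = np.zeros_like(P)
    Y[np.arange(ms), bs] = 1.0               # one-hot rows y_i^T
    return (n / ms) * (As.T @ (P - Y))       # rescale for unbiasedness
```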
EXERCISE 1.II.2: (10 POINTS) – Running the script provided in main.py will produce plots of the convergence for your implementation of
the 4 methods above.
a) Insert the convergence plots into your report.
b) Run the deterministic methods for 2000 iterations and comment on how the convergence rates observed in your plots compare to the theoretical
ones (refer to the lecture notes). Are the practical rates better, worse or as expected?
c) For the deterministic methods, compare the convergence speed of FISTA relative to ISTA, and of FISTA-RESTART with respect to FISTA.
Explain your observations.
d) Run Prox-SG for 50000 iterations. Comment on the accuracy and convergence speed of deterministic methods versus Prox-SG. Note that
Prox-SG is plotted with respect to the number of epochs (full passes through the data), where an epoch is comparable to a full gradient.
Explain your observations.
e) In the main.py file, you have calls to the method plot_digit_features. This method takes as input the feature matrices X obtained
from running FISTA with restart for 2000 iterations for both problems (9) and (10). Insert the plots produced by calling these methods into
the report. Do you observe any similarity between the feature vectors and their corresponding classes?
EXERCISE 1.II.3: (5 POINTS) – In the file called train_and_test_mnist_nn.py you are provided with the code for training and evaluating a fully-connected three-layer neural net with ReLU activation functions. As with Logistic Regression, the network is trained on a dataset with 5000 MNIST images. We now compare the test-set accuracies corresponding to the solutions obtained by the neural net and the ℓ1- and ℓ2-regularized Logistic Regression. These accuracies are computed for a test set of 10000 MNIST images.
• Run train_and_test_mnist_nn.py until completion and insert the final test set accuracy printed on the screen.
• In the main.py template you are provided with code 6
that reports the test-set accuracy of the solution obtained by FISTA with gradient
scheme restart for the `1 and `2 regularized problems. Run the method for 2000 iterations and insert the two accuracies in your report.
• Is there a difference between the accuracy of the neural net solution and those obtained by FISTA with gradient scheme restart? Does the
Logistic regression model have the same expressive power as the neural net? Why / why not?
HINT: Think of linear versus nonlinear models.
2 IMAGE RECONSTRUCTION – 30 POINTS
In this exercise an image should be understood as the matrix equivalent of a digital grayscale image, where an entry represents the
intensity of the corresponding pixel.
2.1 Wavelets
A widely used multi-scale localized representation in signal and image processing is the wavelet transform.^7 This representation is constructed so that piecewise polynomial signals have sparse wavelet expansions [5]. Since many real-world signals are composed of piecewise smooth parts, it follows that they have sparse or compressible wavelet expansions.
Scale. In wavelet analysis, we frequently refer to a particular scale of analysis for a signal. In particular, we consider dyadic scaling,
such that the supports from one scale to the next are halved along each dimension. For images, which can be represented with
matrices, we can imagine the highest scale as consisting of individual pixels. At other scales, each region corresponds to the union of four neighboring regions at the next higher scale, as illustrated in Figure 2.
The Wavelet transform. The wavelet transform offers a multi-scale decomposition of an image into coefficients related to different
dyadic regions and orientations. In essence, each wavelet coefficient is given by the scalar product of the image with a wavelet function
which is concentrated approximately on some dyadic square and has a given orientation (vertical, horizontal or diagonal). Wavelets
^6 The section which calls the method compute_accuracy.
^7 This section is an adaptation from Signal Dictionaries and Representations by M. Wakin (see http://bit.ly/1wXDDbG).
Figure 2: Dyadic partitioning of the unit square at scales j = 0, 1, 2. The partitioning induces a coarse-to-fine parent/child relationship
that can be modeled using a tree structure.
are essentially bandpass functions that detect abrupt changes in a signal. The scale of a wavelet, which controls its support both in
time and in frequency, also controls its sensitivity to changes in the signal.
An important property of the wavelet functions is that they form an orthonormal basis W. Formally, this means W W^T = W^T W = I, where I is the identity operator. In the discrete case, W is just an orthonormal matrix.
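This orthonormality can be checked on a toy example. The single-level 1D Haar matrix below is a hypothetical stand-in for the provided 2D operators, not the actual implementation:

```python
import numpy as np

def haar_matrix(n):
    """Single-level orthonormal 1D Haar analysis matrix W for even n.

    The first n/2 rows are scaling (average) filters, the last n/2 rows
    are wavelet (difference) filters; W @ W.T = W.T @ W = I.
    """
    W = np.zeros((n, n))
    s = 1.0 / np.sqrt(2.0)
    for k in range(n // 2):
        W[k, 2 * k] = W[k, 2 * k + 1] = s    # coarse averages
        W[n // 2 + k, 2 * k] = s             # detail differences
        W[n // 2 + k, 2 * k + 1] = -s
    return W
```

Applying W.T to W @ x recovers x exactly, mirroring the roles of the provided W(im) and WT(im_wav).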
Implementation. We provide the codes that efficiently compute the 2D Wavelet transform W and its inverse W^{−1} = W^T. The operators W and W^T receive a vectorized image as input and output a vectorized transform. We base our implementation on the PyWavelets library^8. You can find a usage example in the file example_wavelet.py. The interaction with the pywt library happens through the class RepresentationOperator available in the utilities file.
• The method W(im) computes the transformation from an image im to its wavelet counterpart im_wav.
• The method WT(im_wav) computes the transformation from wavelet coefficients im_wav back to the image domain im.
Remark 1. Should the provided codes not work after you have installed the required packages, directly contact one of the teaching assistants.
2.2 Total Variation
The Total Variation (TV) was proposed by Rudin, Osher and Fatemi [8] as a regularizer for solving inverse problems. There is compelling empirical evidence showing that this regularization does not blur the sharp edges of images and as a consequence, significant
research efforts went into developing efficient algorithms for computing the TV proximal operator. In this exercise we use the approach developed by Chambolle [1].
Computing the TV of an image relies on the discrete gradient operator ∇x ∈ R^{(m×m)×2}, formally given by (∇x)_{i,j} = ((∇x)^1_{i,j}, (∇x)^2_{i,j}), with

(∇x)^1_{i,j} = x_{i+1,j} − x_{i,j} if i < m,  and 0 if i = m,
(∇x)^2_{i,j} = x_{i,j+1} − x_{i,j} if j < m,  and 0 if j = m.

Then, the discrete Total Variation (TV) is defined as

||x||_TV := ∑_{1≤i,j≤m} ||(∇x)_{i,j}||_2   (called isotropic), or
||x||_TV := ∑_{1≤i,j≤m} |(∇x)^1_{i,j}| + |(∇x)^2_{i,j}|   (called anisotropic).
Even though the isotropic case presents a slightly more difficult computational challenge, it has the advantage of being unaffected by
the direction of the edges in images.
The TV-norm can be seen as the ℓ1-norm of the absolute values of the gradients evaluated at each pixel. Analogous to the case of the ℓ1-norm, minimizing the TV-norm subject to data fidelity leads to sparse gradients, which implies an image with many flat regions and few sharp transitions; see Figure ??.
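The discrete gradient and both TV variants translate directly into NumPy. A sketch with hypothetical names (note that the provided code computes the TV proximal operator via Chambolle's algorithm [1]; this is only the direct evaluation of the definitions):

```python
import numpy as np

def discrete_gradient(x):
    """Forward differences with zero boundary, as in the definitions above."""
    g1 = np.zeros_like(x)
    g2 = np.zeros_like(x)
    g1[:-1, :] = x[1:, :] - x[:-1, :]   # (grad x)^1_{i,j} = x_{i+1,j} - x_{i,j}
    g2[:, :-1] = x[:, 1:] - x[:, :-1]   # (grad x)^2_{i,j} = x_{i,j+1} - x_{i,j}
    return g1, g2

def tv_norm(x, isotropic=True):
    """Discrete TV of an m x m image, isotropic or anisotropic variant."""
    g1, g2 = discrete_gradient(x)
    if isotropic:
        return np.sum(np.sqrt(g1 ** 2 + g2 ** 2))
    return np.sum(np.abs(g1) + np.abs(g2))
```

A constant image has TV equal to zero, and a single vertical edge contributes once per row, matching the "few sharp transitions" intuition.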
8https://pywavelets.readthedocs.io/en/latest/index.html
2.3 Image in-painting
Image in-painting consists in reconstructing the missing parts of an image. By exploiting the structures such as sparsity over the
wavelet coefficients, it is possible to in-paint images with a large portion of their pixels missing. In this part of the homework, we are
going to study the convergence of different methods related to FISTA, as well as different restart strategies. We consider a subsampled
image b = PΩx, where PΩ ∈ R
n×p
is an operator that selects only few, n p := m
2
, pixels from the vectorized image x ∈ R
p
. We can try
to reconstruct the original image x by solving the following problems.
min_{α ∈ R^p}  (1/2) ||b − P_Ω W^T α||_2^2 + λ_{ℓ1} ||α||_1,   (11)

where the quadratic term is f_{ℓ1}(α) and the ℓ1 term is g_{ℓ1}(α), and

min_{x ∈ R^p}  (1/2) ||b − P_Ω x||_2^2 + λ_TV ||x||_TV,   (12)

where the quadratic term is f_TV(x) and the TV term is g_TV(x). Here, the operator W^T is vectorized to be dimension-compatible with P_Ω. An example of image in-painting with this model is given in Figure 3.
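In code, P_Ω and its adjoint reduce to array indexing. A minimal sketch, under the assumption that Ω is stored as the index set of kept pixels (all names are ours):

```python
import numpy as np

def make_omega(p, keep_ratio, rng):
    """Index set Omega keeping a fraction keep_ratio of the p pixels."""
    return rng.choice(p, size=int(keep_ratio * p), replace=False)

def P_omega(x, omega):
    """b = P_Omega x: select the observed pixels."""
    return x[omega]

def P_omega_T(b, omega, p):
    """Adjoint P_Omega^T b: scatter observations back, zeros elsewhere."""
    x = np.zeros(p)
    x[omega] = b
    return x
```

The adjoint identity ⟨P_Ω x, b⟩ = ⟨x, P_Ω^T b⟩ is what appears when writing the gradients in Exercise 2.1.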
EXERCISE 2.1: (5 POINTS) –
a) Write the gradients of f_{ℓ1}(α) in (11) and f_TV(x) in (12).
b) Find the Lipschitz constant of ∇_α f_{ℓ1}(α) for (11) and of ∇_x f_TV(x) for (12) analytically.
Figure 3: An example of image in-painting with the three models described in this homework. The regularization parameter has been
set to 0.01 for all methods.
EXERCISE 2.2: (10 POINTS) – Adapt the FISTA algorithm from Exercise II.1 to solve the problems (11) and (12).
a) Take a 256 × 256 natural image or, better, a picture of you, and subsample 40% of its pixels uniformly at random (i.e. remove 60% of the pixels).
b) Perform a parameter sweep over λ_{ℓ1} and λ_TV. You should choose a meaningful range for the parameters, one that allows you to observe the behavior of the PSNR as a function of the regularization parameters. Use 200 iterations for your algorithm.
c) Plot the obtained PSNR against the regularization parameter values.
d) Provide your codes with comments, and include the reconstructed images.
EXERCISE 2.3: (10 POINTS) – You will now study the convergence to x*, the optimal solution of problem (11). We use λ = 0.01 and again uniformly subsample 40% of the pixels.
a) Adapt (or reuse) the ISTA variations you implemented in Exercise II.1:
• ISTA
• FISTA
• FISTA with gradient scheme restart (see Exercise 1.II.1)
Figure 4: Illustration of the unrolled proximal method. The function g(·) represents the gradient step x^k − α_k ∇f(x^k), y represents the measurements b, while the function P_ψ(·) (corresponding to g_θ in the text) is the learned proximal step. The values x_i are then the results of the different iterations. Taken from [6].
b) The optimal solution x* of problem (11) and the corresponding optimal value F* := f(x*) + g(x*) are not known a priori. You need to estimate them by running FISTA with gradient scheme restart for 5000 iterations, using the relative error rerr_k < 10^{−15} as the stopping criterion. The error at iteration k is defined as

err_k = ||y^{k+1} − y^k||_2,   rerr_k := err_k / err_0.

Write the value that you obtained for F* in the report.
c) Study the convergence of ISTA, FISTA, and FISTA with gradient scheme restart over 2000 iterations. Adapt the tolerance criterion to stop iterating when |F(x^k) − F*| / F* < 10^{−15}. Plot a graph of log(|F(x^k) − F*| / F*) against k (cf. the examples in Lecture 10) and comment on the different rates and behaviors observed.
HINT: You can use the function semilogy for the plot.
d) Then replace F* with F̄ := F(x̄), where x̄ is your ground-truth image. Plot a graph of log(|F(x^k) − F̄| / F̄) against k using the same algorithms as in c) for 1000 iterations. Comment on the differences.
e) Provide your codes with comments, as well as both convergence plots with discussion on the results.
Comparison with algorithm unrolling. In recent years, interest has grown in deep learning methods that learn to imitate and improve upon algorithms like ISTA or FISTA for composite convex minimization. Given an update rule for a composite minimization problem of the form of (12), i.e. x^{k+1} = prox_g(x^k − α_k ∇f(x^k)), the approach of [6] proposes to learn the proximal operator with a neural network.
They fix the number of iterations to T and alternate between the inner updates (x^k − α_k ∇f(x^k)) and the proximal operator parametrized by a neural network (we will call it g_θ), as shown in Figure 4. One starts from the undersampled image x_0 and iterates T times according to the rule x^{k+1} = prox_{g_θ}(x^k − α_k ∇f(x^k)). The network g_θ is then trained to minimize E_x[||x_T − x||], i.e. to minimize the average ℓ2 distance between natural images x and reconstructed images x_T, starting from the measured data x_0 = b = P_Ω x. Their results show a drastic improvement upon some classically used methods, such as Wavelet minimization (Problem (11)).
We have implemented and pretrained the method for in-painting images of size 256 × 256, and we would like you to compare your implementations to it. An example of how to use it is provided in the script example_unrolled_nn.py, which implements T = 5 unrolled steps with an operator parametrized by a simple residual network.
EXERCISE 2.4: (5 POINTS) – Compare the reconstruction performance and speed of reconstruction for FISTA with gradient scheme restart for
`1 and TV minimization.
a) Compare the performance after 500 reconstruction steps using `1 and TV minimization against the unrolled method and report the PSNR
and reconstruction time, as well as the images reconstructed and error maps (|x − xˆ|).
b) Compare the performance after T = 5 iterations of all three methods, and report the PSNR and reconstruction time as well as the images
reconstructed and error maps (|x − xˆ|). Comment on your results.
c) Comment on the results. Which image appears the best to you, and why? Identify a tradeoff between the unrolled method and the classical
iterative approaches.
N.B. If your image yields poor reconstruction results with the unrolled CNN, try running it with one of the provided images, such as gandalf.jpg.
3 Guidelines for the preparation and the submission of the homework
Work on your own. Do not copy or distribute your codes to other students in the class. Do not reuse any other code related to this
homework. Here are few warnings and suggestions for you to prepare and submit your homework.
• This homework is due at 4:00PM, 6 December, 2019.
• Submit your work before the due date. Late submissions are not allowed and you will get 0 points for this homework if you submit it after the deadline.
• Questions worth 0 points are for self-study. You do not need to answer them in the report.
• Your final report should include detailed answers and it needs to be submitted in PDF format.
• The PDF file can be a scan or a photo. Make sure that it is legible.
• The results of your simulations should be presented in the final report with clear explanations and comparative evaluation.
• We provide some Python scripts that you can use to implement the algorithms, but you can implement them from scratch using
any other convenient programming tool (in this case, you should also write the codes to time your algorithm and to evaluate
their efficiency by plotting necessary graphs).
• Even if you use the Python scripts that we provide, you are responsible for the entire code you submit. Apart from completing
the missing parts in the scripts, you might need to change some written parts and parameters as well, depending on your
implementations.
• The codes should be well-documented and should work properly. Make sure that your code runs without error. If the code you
submit does not run, you will not be able to get any credits from the related exercises.
• Compress your codes and your final report into a single ZIP file, name it as ee556hw3_NameSurname.zip, and submit it
through the moodle page of the course.
References
[1] CHAMBOLLE, A. An algorithm for total variation minimization and applications. J. Math. Imaging Vis. 20 (2004), 89–97.
[2] GISELSSON, P., AND BOYD, S. Monotonicity and restart in fast gradient methods. In Decision and Control (CDC), 2014 IEEE 53rd
Annual Conference on (2014), IEEE, pp. 5058–5063.
[3] JORDAN, M. I., ET AL. Why the logistic function? a tutorial discussion on probabilities and neural networks, 1995.
[4] LECUN, Y., AND CORTES, C. MNIST handwritten digit database.
[5] MALLAT, S. A wavelet tour of signal processing. Academic press, 1999.
[6] MARDANI, M., SUN, Q., DONOHO, D., PAPYAN, V., MONAJEMI, H., VASANAWALA, S., AND PAULY, J. Neural proximal gradient
descent for compressive imaging. In Advances in Neural Information Processing Systems (2018), pp. 9573–9583.
[7] O’DONOGHUE, B., AND CANDES, E. Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics
15, 3 (2015), 715–732.
[8] RUDIN, L., OSHER, S., AND FATEMI, E. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena
60, 1-4 (1992), 259–268.
[9] SCHACKE, K. On the Kronecker product.
APPENDIX
A Deriving the Hessian of the multinomial logistic function
Differentiating the objective f from (4) a second time with respect to X gives us a 3D tensor as the Hessian, which is cumbersome to work with. Luckily, we can avoid this by differentiating f with respect to a vectorized version of X, thus imposing a regular matrix form on ∇²f. Our Hessian will now belong to R^{dC×dC}. The vectorized X and the corresponding gradient ∇_{X_vec} f are given below:
X_vec = (x_{11}, ..., x_{d1}, ..., x_{1C}, ..., x_{dC})^T ∈ R^{dC},

∇_{X_vec} f(X) = ∑_{i=1}^n [ 1/(∑_{k=1}^C e^{a_i^T x_k}) · (a_{i1} e^{a_i^T x_1}, ..., a_{id} e^{a_i^T x_1}, ..., a_{i1} e^{a_i^T x_C}, ..., a_{id} e^{a_i^T x_C})^T
        − (a_{i1} 1_1(b_i), ..., a_{id} 1_1(b_i), ..., a_{i1} 1_C(b_i), ..., a_{id} 1_C(b_i))^T ]
    = ∑_{i=1}^n [ (e^{a_i^T x_1} / ∑_{k=1}^C e^{a_i^T x_k}, ..., e^{a_i^T x_C} / ∑_{k=1}^C e^{a_i^T x_k})^T ⊗ (a_{i1}, ..., a_{id})^T − y_i ⊗ (a_{i1}, ..., a_{id})^T ]
    = ∑_{i=1}^n [ (σ_{i1}, ..., σ_{iC})^T ⊗ a_i − y_i ⊗ a_i ],   (13)
where 1_j(b_i) = 1 if b_i = j and 0 otherwise is the indicator function. We can now differentiate ∇_{X_vec} f(X) with respect to X_vec to obtain the Hessian. Note that the second term within the sum does not depend on X_vec and will therefore not appear in ∇²_{X_vec} f(X). We make a further simplification and only consider one data point, as the final expression will be the summation over the data. To ease notation, we drop the i-index from a_i and denote σ_j := e^{a^T x_j} / ∑_{k=1}^C e^{a^T x_k}. Finally, we observe that ∂σ_j/∂x_{lj} = a_l σ_j(1 − σ_j) and ∂σ_j/∂x_{lp} = −a_l σ_j σ_p.
Using the new conventions and differentiating a second time with respect to X_vec, we have:

∂[ (σ_1, ..., σ_C)^T ⊗ a ] / ∂X_vec = [ ∂(σ_1, ..., σ_C)^T / ∂X_vec ] ⊗ a.   (14)
We address the first term of the Kronecker product separately:

∂(σ_1, ..., σ_C)^T / ∂X_vec =
[ a_1σ_1(1−σ_1) ... a_dσ_1(1−σ_1)   −a_1σ_1σ_2 ... −a_dσ_1σ_2   ...   −a_dσ_1σ_C ]
[ −a_1σ_2σ_1 ... −a_dσ_2σ_1   a_1σ_2(1−σ_2) ... a_dσ_2(1−σ_2)   ...   −a_dσ_2σ_C ]
[   ⋮                                 ⋮                          ⋱      ⋮         ]
[ −a_1σ_Cσ_1 ... −a_dσ_Cσ_1   −a_1σ_Cσ_2 ... −a_dσ_Cσ_2   ...   a_dσ_C(1−σ_C) ]   (15)

= Σ ⊗ a^T,   (16)

where Σ is the C × C matrix with entries Σ_{jj} = σ_j(1 − σ_j) on the diagonal and Σ_{jp} = −σ_jσ_p off the diagonal.
Finally, we have:

∂[ (σ_1, ..., σ_C)^T ⊗ a ] / ∂X_vec = (Σ ⊗ a^T) ⊗ a   (17)
                                    = Σ ⊗ (a^T ⊗ a)   (18)
                                    = Σ ⊗ a a^T.      (19)
A simple summation over the data retrieves the expression of the Hessian given in (5).