EE-556 Homework 4 (for weeks 13–14)


1 Projection-free convex low-rank matrix optimization
Consider the matrix optimization problem:
$$X^\star \in \arg\min_X \left\{\, f(X) \;:\; \operatorname{rank}(X) \le r,\ X \in \mathbb{R}^{p\times m} \,\right\},$$
for some convex objective function $f$, where $r < \min\{p, m\}$ forces the solution to be low-rank. Such an optimization problem appears in
many real-world applications—you will see two examples in the second and third parts of this homework. However, the matrix rank
function is discrete-valued and non-convex, so solving this optimization problem is hard (NP-hard in general).
Convex relaxation methods are extensively studied for solving low-rank matrix optimization problems. As pointed out in [5], the
nuclear norm is an effective convex surrogate to obtain (approximately) low-rank solutions:
$$X^\star \in \arg\min_X \left\{\, f(X) \;:\; \|X\|_* \le \kappa,\ X \in \mathbb{R}^{p\times m} \,\right\}.$$
When projection onto the nuclear norm ball is easy to perform, we can solve this problem by projected gradient methods. However,
as the problem dimensions increase, the projection becomes a computational bottleneck.
PROBLEM 1.1: PROJECTION ONTO THE NUCLEAR NORM BALL (15 Points)
(a) (10 Points) Let $Z = U\Sigma V^\top$ be the singular value decomposition of $Z \in \mathbb{R}^{p\times m}$. Denote the diagonal of $\Sigma \in \mathbb{R}^{s\times s}$ by a vector $\sigma \in \mathbb{R}^s$, where $s = \min\{p, m\}$. Let $\sigma^{\ell_1}$ be the projection of $\sigma$ onto the $\ell_1$-norm ball $\{x \in \mathbb{R}^s : \|x\|_1 \le \kappa\}$. Show that the projection of this matrix onto the nuclear norm ball $\mathcal{X} = \{X \in \mathbb{R}^{p\times m} : \|X\|_* \le \kappa\}$ can be computed by projecting $\sigma$ onto the $\ell_1$-norm ball, i.e.,
$$\operatorname{proj}_{\mathcal{X}}(Z) = U \Sigma^{\ell_1} V^\top,$$
where $\Sigma^{\ell_1} \in \mathbb{R}^{s\times s}$ denotes the diagonal matrix with diagonal $\sigma^{\ell_1}$.
(Hint: Use Mirsky's inequality: $\|X - Z\|_F \ge \|\Sigma_X - \Sigma_Z\|_F$, where $\Sigma_X, \Sigma_Z \in \mathbb{R}^{s\times s}$ are the diagonal matrices of the singular values of $X$ and $Z$, respectively.)
(b) (5 Points) Implement the projection operator as a function called projNuc. You can use the provided file projL1.
Set κ = 5000 and measure the computation time of the projection operator with the 100k and 1M MovieLens datasets. You can do
this by running the script Pr11b, which loads the datasets, constructs the data matrix, and times the evaluation of the projection
operator. Write the values you get in your report. Run your script at least 5 times and report the average timings.
These datasets consist of the ratings given by MovieLens users to movies in a given list. Of course, a user may not rate all of the
movies. Modeling the ratings as entries of a low-rank matrix, where one dimension corresponds to different users, and the other
corresponds to different movies, the goal of low-rank matrix completion is to predict the users’ ratings on unrated movies. This
is essential to building a movie recommendation system.
The 100k dataset consists of 100,000 ratings from 1000 users on 1700 movies. The 1M dataset consists of 1 million ratings from
6000 users on 4000 movies. Note that both of these datasets are relatively small; for instance, the famous Netflix competition
data consists of 100,480,507 ratings that 480,189 users gave to 17,770 movies. Projecting a matrix of this size would require a huge
amount of time.
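For Problem 1.1(b), the following is a minimal NumPy sketch of how projNuc could look, assuming a dense input matrix. The projL1 routine shown here is one standard implementation included only to keep the sketch self-contained; the homework already provides its own projL1 file, whose interface may differ.

```python
import numpy as np

def projL1(x, kappa):
    # Euclidean projection of a vector onto the l1-ball of radius kappa
    # (one standard implementation; the homework provides its own projL1).
    if np.sum(np.abs(x)) <= kappa:
        return x
    u = np.sort(np.abs(x))[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, x.size + 1) > (cssv - kappa))[0][-1]
    theta = (cssv[rho] - kappa) / (rho + 1.0)
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def projNuc(Z, kappa):
    # Projection onto the nuclear-norm ball: SVD, project the singular
    # values onto the l1-ball, then recompose (cf. Problem 1.1(a)).
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(projL1(sigma, kappa)) @ Vt
```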
Problem 1.1 shows that projection onto the nuclear norm ball requires computing the singular value decomposition. The computational complexity of the singular value decomposition is $O(\min(m^2 p,\, m p^2))$, and this can easily become a computational bottleneck if $m$ and $p$ are large. This has increased the popularity of algorithms that leverage the linear minimization oracle (lmo) instead (e.g., [2, 7]):
$$\operatorname{lmo}_{\mathcal{X}}(Z) = \arg\min_{X \in \mathcal{X}} \langle X, Z\rangle \quad \text{where} \quad \langle X, Z\rangle = \operatorname{Tr}(Z^\top X).$$
Note that $\operatorname{lmo}_{\mathcal{X}}(Z)$ is not single-valued in general. With an abuse of terminology, when we say that we compute the lmo, we actually mean that we compute an instance $X$ such that $X \in \operatorname{lmo}_{\mathcal{X}}(Z)$.
PROBLEM 1.2: LMO OF NUCLEAR NORM (15 Points)
(a) (10 Points) Show that the lmo, when $\mathcal{X}$ is the nuclear norm ball $\mathcal{X} = \{X \in \mathbb{R}^{p\times m} : \|X\|_* \le \kappa\}$, admits the following output:
$$-\kappa\, u v^\top \in \operatorname{lmo}_{\mathcal{X}}(Z),$$
where $u$ and $v$ are the left and right singular vectors that correspond to the largest singular value of $Z$.
(Hint: By definition $-\kappa\, u v^\top \in \mathcal{X}$. You just need to show $\langle X, Z\rangle \ge \langle -\kappa\, u v^\top, Z\rangle$ for all $X \in \mathcal{X}$.)
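(For reference, one possible line of argument, sketched here as an expansion of the hint; it relies on the duality between the nuclear and spectral norms:
$$\langle X, Z\rangle \;\ge\; -\,|\langle X, Z\rangle| \;\ge\; -\,\|X\|_*\,\sigma_1(Z) \;\ge\; -\,\kappa\,\sigma_1(Z) \;=\; \langle -\kappa\, u v^\top, Z\rangle,$$
since $|\langle X, Z\rangle| \le \|X\|_*\,\sigma_1(Z)$ and $\langle u v^\top, Z\rangle = u^\top Z v = \sigma_1(Z)$.)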
(b) (5 Points) Implement the lmo over $\mathcal{X}$ as a function called lmoNuc. Set κ = 5000 and measure the computation time for the 100k and 1M MovieLens datasets. You can do so by running the script Pr12b, which loads the datasets, constructs the data matrix, and times the evaluation of the lmo.
Write the values you get in your report. Run your script at least 5 times and report the average timings. Compare these values with the computation time of the projection operator from Problem 1.1.b.
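A minimal sketch of lmoNuc, assuming SciPy is available; the provided Pr12b script may compute the leading singular pair differently.

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmoNuc(Z, kappa):
    # lmo for the nuclear-norm ball of radius kappa: -kappa * u v^T, where
    # (u, v) is the top singular vector pair of Z (cf. Problem 1.2(a)).
    # Only the leading singular triplet is needed, so a truncated SVD suffices.
    u, s, vt = svds(Z, k=1)
    return -kappa * np.outer(u[:, 0], vt[0, :])
```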
2 Hands-on experiment 1: Crime Scene Investigation with Blind Deconvolution
You are working with the police. You are asked to identify the license plate of a car involved in a crime scene investigation. Unfortunately, the CCTV image of the car is blurry. In this part, we simulate this scene by trying to deblur a plate image found on the internet¹.
This is an instance of the blind deconvolution problem. The set-up of the problem is as follows: given two unknown vectors $x, w \in \mathbb{R}^L$, we observe their circular convolution $y = w \ast x$, i.e.,
$$y_\ell = \sum_{\ell'=1}^{L} w_{\ell'}\, x_{\ell - \ell' + 1},$$
where the index $\ell - \ell' + 1$ in the sum above is understood to be modulo $L$. The goal of blind deconvolution is to separate $w$ and $x$.
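As a quick sanity check of this forward model, the circular convolution can be computed either directly from the definition or through the DFT (convolution theorem). The snippet below is a small illustration with synthetic 1D signals; it is not part of the provided code.

```python
import numpy as np

L = 8
rng = np.random.default_rng(0)
w, x = rng.standard_normal(L), rng.standard_normal(L)

# Circular convolution straight from the definition
# (0-indexed: y_l = sum_k w_k x_{(l - k) mod L}).
y_direct = np.array([sum(w[k] * x[(l - k) % L] for k in range(L)) for l in range(L)])

# Same thing via the DFT: F(w * x) = (Fw) .* (Fx).
y_fft = np.real(np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)))

print(np.allclose(y_direct, y_fft))  # True
```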
The keyword blind comes from the fact that we do not have much prior information about the signals. We simply assume $w$ and $x$ belong to known subspaces of $\mathbb{R}^L$ of dimensions $K$ and $N$, i.e., we can write
$$w = Bh,\quad h \in \mathbb{R}^K, \qquad x = Cm,\quad m \in \mathbb{R}^N,$$
for some $L \times K$ matrix $B$ and $L \times N$ matrix $C$. The columns of these matrices provide bases for the subspaces in which $w$ and $x$ live.
In the image deblurring example, $x$ corresponds to the image we want to recover, and $w$ to a 2D blur kernel. We assume that we know (or have an estimate of) the support (i.e., the locations of the nonzeros) of the blur kernel. In real applications, the support can be estimated by an expert using physical information such as the distance of the object to the focus and the camera, the speed of the camera and/or the object, the camera shutter speed, etc.
In this experiment, we use a very rough estimate for the support and simply use a box at the center of the domain. We roughly
tuned the size of the box. Interestingly, it is possible to make the plate readable even with this very rough estimate.
The 2D convolution $y = w \ast x$ produces a blurred image. As we have seen in the last homework, natural images have sparse wavelet expansions. Hence, the image $x$ can be expressed as $x = Cm$, where $C$ is the matrix formed by a subset of the columns of the wavelet transform matrix. In addition, the blur kernel $w$ can be written as $w = Bh$, where $B$ is the matrix formed by a subset of the columns of the identity matrix.
¹ https://plus.maths.org/issue37/features/budd/blurredplate.jpg
Figure: Blurred image of the license plate y (left) and the estimate of the support of the blur kernel w (right).
PROBLEM 2.1: FRANK-WOLFE FOR BLIND DECONVOLUTION (25 POINTS)
Let $b$ be the $L$-point normalized discrete Fourier transform (DFT) of the observation $y$, i.e., $b = Fy$, where $F$ is the DFT matrix. Then $b$ can be written as $b = \mathcal{A}(X)$, where $X = hm^\top$ and $\mathcal{A}$ is a linear operator. The explicit expression of this linear operator $\mathcal{A}$ is beyond the scope of this homework; cf. [1] for further details. This reformulation allows us to express $y$, which is a nonlinear combination of the coefficients of $h$ and $m$, as a linear combination of the entries of their outer product $X = hm^\top$. Note that given $B$ and $C$, recovering $m$ and $h$ from $b$ is the same as recovering $x$ and $w$ from $y$.
Since X is a rank one matrix, we can use the nuclear norm to enforce approximately low-rank solutions. Then, we can formulate
the blind deconvolution problem as follows:
$$X^\star \in \arg\min_X \left\{\, \tfrac{1}{2}\,\|\mathcal{A}(X) - b\|_2^2 \;:\; \|X\|_* \le \kappa,\ X \in \mathbb{R}^{p\times m} \,\right\}. \qquad (1)$$
Here κ > 0 is a tuning parameter.
We will apply the Frank-Wolfe algorithm to solve the optimization problem given in (1). The Frank-Wolfe algorithm is one of the
earliest algorithms that avoids the proximal operator. Instead, it leverages the lmo [2]:
$$\operatorname{lmo}(\nabla f(Z)) = \arg\min_{X\in\mathcal{X}} \langle \nabla f(Z), X\rangle,$$
where $\mathcal{X} = \{X \in \mathbb{R}^{p\times m} : \|X\|_* \le \kappa\}$ as in Part 1. It applies to the generic constrained minimization template with a smooth objective function, $\min_X\{f(X) : X \in \mathcal{X}\}$, as follows:
Frank-Wolfe's algorithm
1. Choose $X^0 \in \mathcal{X}$.
2. For $k = 0, 1, \ldots$ perform:
$$\hat{X}^k := \operatorname{lmo}(\nabla f(X^k)), \qquad X^{k+1} := (1 - \gamma_k)\, X^k + \gamma_k\, \hat{X}^k,$$
where $\gamma_k := 2/(k + 2)$.
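A minimal sketch of this generic loop, with grad_f and lmo passed in as functions; the FrankWolfe function in the handout has its own signature and bookkeeping, so treat this only as a reference for the two update lines. For problem (1), grad_f(X) would be $\mathcal{A}^*(\mathcal{A}(X) - b)$ and lmo would be the nuclear-norm lmo from Problem 1.2.

```python
def frank_wolfe(grad_f, lmo, X0, maxit=1000):
    # Generic Frank-Wolfe: X^{k+1} = (1 - gamma_k) X^k + gamma_k * lmo(grad f(X^k)),
    # with the standard open-loop step size gamma_k = 2 / (k + 2).
    X = X0
    for k in range(maxit):
        X_hat = lmo(grad_f(X))
        gamma = 2.0 / (k + 2.0)
        X = (1.0 - gamma) * X + gamma * X_hat
    return X
```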
(a) (5 Points) Recall that the Frank-Wolfe algorithm applies only to smooth objectives. Show that the objective function is smooth, in the sense that its gradient is Lipschitz continuous.
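(Sketch of one possible argument: with $f(X) = \tfrac{1}{2}\|\mathcal{A}(X) - b\|_2^2$,
$$\nabla f(X) = \mathcal{A}^*\big(\mathcal{A}(X) - b\big), \qquad \|\nabla f(X) - \nabla f(Y)\|_F = \|\mathcal{A}^*\mathcal{A}(X - Y)\|_F \le \|\mathcal{A}\|^2\,\|X - Y\|_F,$$
so $f$ is $L$-smooth with $L = \|\mathcal{A}\|^2$, the squared operator norm of $\mathcal{A}$.)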
(b) (20 Points) Complete the missing lines in the FrankWolfe function to implement the algorithm. We provide you with the linear operators that you need to compute the lmo in the code. Note that we do not need to store and use the linear operator $\mathcal{A}$ in the ambient dimensions. In fact, for storage and arithmetic efficiency, we should avoid forming $\mathcal{A}$ explicitly. You can find more details about this aspect in the comments in the code.
Test your implementation using the script Test_deblur. Tune the parameter κ until the license plate number becomes readable.
What is the license plate number? You can also tune the support of the blur kernel and try to get better approximations.
3 Hands-on experiment 2: k-means Clustering by Semidefinite Programming
Clustering is an unsupervised machine learning problem in which we try to partition a given data set into k subsets based on the distance between data points or the similarity among them. The goal is to find k centers and to assign each data point to one of the centers such that the sum of the squared distances between them is minimal [4]. This problem is known to be NP-hard.
Problem Definition: Given a set of $n$ points in a $d$-dimensional Euclidean space, denoted by
$$S = \{\, s_i = (s_{i1}, \cdots, s_{id})^\top \in \mathbb{R}^d \;:\; i = 1, \cdots, n \,\},$$
find an assignment of the $n$ points into $k$ disjoint clusters $\mathcal{S} = (S_1, \cdots, S_k)$ with centers $c_j$ ($j = 1, \cdots, k$), based on the total sum of squared Euclidean distances from each point to its assigned cluster centroid, i.e.,
$$f(\mathcal{S}, S) = \sum_{j=1}^{k} \sum_{i=1}^{|S_j|} \| s_i^j - c_j \|^2,$$
where $|S_j|$ is the number of points in $S_j$, and $s_i^j$ is the $i$-th point in $S_j$.
[4] proposes an SDP relaxation to approximately solve the above-mentioned model-free k-means clustering problem. The resulting optimization problem² takes the standard semidefinite programming form
$$X^\star \in \arg\min_X \Big\{\, \langle C, X\rangle \;:\; \underbrace{X\mathbf{1} = \mathbf{1}}_{\mathcal{A}_1(X)=b_1},\;\; \underbrace{X^\top\mathbf{1} = \mathbf{1}}_{\mathcal{A}_2(X)=b_2},\;\; \underbrace{X \ge 0}_{\mathcal{B}(X)\in\mathcal{K}},\;\; \underbrace{\operatorname{Tr}(X) \le \kappa,\; X \in \mathbb{R}^{p\times p},\; X \succeq 0}_{\mathcal{X}} \,\Big\}, \qquad (2)$$
where $C \in \mathbb{R}^{p\times p}$ is the Euclidean distance matrix between the data points. $\operatorname{Tr}(X) \le \kappa$ enforces approximately low-rank solutions, the linear inclusion constraint $X \ge 0$ is element-wise nonnegativity of $X$, the linear equality constraints $X\mathbf{1} = \mathbf{1}$ and $X^\top\mathbf{1} = \mathbf{1}$ require the row and column sums of $X$ to equal 1, and $X \succeq 0$ means that $X$ is positive semidefinite. Recall that $\operatorname{Tr}(X) = \|X\|_*$ for any positive semidefinite matrix $X$.
Algorithm: The program in (2) can be reformulated as
$$\min_{x\in\mathcal{X}} \; f(x) + g_1(\mathcal{A}_1(x)) + g_2(\mathcal{A}_2(x)) \quad \text{subject to} \quad \mathcal{B}(x) \in \mathcal{K}, \qquad (3)$$
where $f(x) = \langle C, x\rangle$ is a smooth convex function, $g_1$ is the indicator function of the singleton $\{b_1\}$, $g_2$ is the indicator function of the singleton $\{b_2\}$, and $\mathcal{K}$ is the positive orthant, onto which computing the projection is easy.
Note that the classical Frank-Wolfe method does not apply to this problem due to the nonsmooth terms $g_1$, $g_2$. In the sequel, we will attempt to solve this problem with the HomotopyCGM method proposed in [7], which handles such non-smooth problems within a conditional-gradient-based method. The algorithm to solve the problem in (3) is given below.
Conditional Gradient Method (CGM) for composite problems
1. Choose $x^0 \in \mathcal{X}$ and $\beta_0 > 0$.
2. For $k = 1, 2, \ldots$ perform:
$$\gamma_k = 2/(k+1), \qquad \beta_k = \beta_0/\sqrt{k+1},$$
$$v_k = \beta_k \nabla f(x^k) + \mathcal{A}_1^\top(\mathcal{A}_1 x^k - b_1) + \mathcal{A}_2^\top(\mathcal{A}_2 x^k - b_2) + \big(x^k - \operatorname{proj}_{\mathcal{K}}(x^k)\big),$$
$$\hat{x}^k := \arg\min_{x\in\mathcal{X}} \langle v_k, x\rangle, \qquad x^{k+1} := (1 - \gamma_k)\, x^k + \gamma_k\, \hat{x}^k.$$
3. Output: $x^k$.
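A minimal sketch of this loop for problem (3), assuming $\mathcal{A}_1(X) = X\mathbf{1}$ (row sums), $\mathcal{A}_2(X) = X^\top\mathbf{1}$ (column sums), $\mathcal{K}$ the nonnegative orthant, and the lmo from the hint in Problem 3.1(d); the HomotopyCGM function in the handout has its own interface, so treat this only as a reference for the update steps.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def homotopy_cgm(C, kappa, beta0=1.0, maxit=1000):
    # CGM for composite problems applied to (3): f(X) = <C, X>, so grad f = C;
    # A1(X) = X 1, A2(X) = X^T 1 with b1 = b2 = 1; K = {X >= 0 entrywise}.
    p = C.shape[0]
    ones = np.ones(p)
    X = np.zeros((p, p))
    for k in range(1, maxit + 1):
        gamma = 2.0 / (k + 1.0)
        beta = beta0 / np.sqrt(k + 1.0)
        r1 = X.sum(axis=1) - ones          # A1(X) - b1 (row-sum residual)
        r2 = X.sum(axis=0) - ones          # A2(X) - b2 (column-sum residual)
        V = (beta * C                      # beta_k * grad f(X)
             + np.outer(r1, ones)          # A1^T (A1 X - b1)
             + np.outer(ones, r2)          # A2^T (A2 X - b2)
             + (X - np.maximum(X, 0.0)))   # X - proj_K(X)
        V = 0.5 * (V + V.T)                # symmetrize before the eigensolver
        # lmo over {X psd, Tr(X) <= kappa}: kappa * u u^T, with u the eigenvector
        # of the smallest eigenvalue of V (following the hint in Problem 3.1(d)).
        _, u = eigsh(V, k=1, which='SA')
        X_hat = kappa * np.outer(u[:, 0], u[:, 0])
        X = (1.0 - gamma) * X + gamma * X_hat
    return X
```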
Dataset: We use a setup similar to the one described in [3]. We use the Fashion-MNIST data in [6], which was released as a possible replacement for the MNIST handwritten digits. Each data point is a 28×28 grayscale image, associated with a label from 10 classes. Figure 1 depicts 3 sample rows from each class. Classes are labeled from 0 to 9. First, we extract meaningful features from this dataset using a simple 2-layer neural network with a sigmoid activation. Then, we apply the neural network to 1000 test samples from the same dataset, which gives us for each sample a vector $\mu \in \mathbb{R}^{10}$ whose entries represent the probability of belonging to each class. Then, we form the pairwise distance matrix $C$ using these probability vectors.³
PROBLEM 3.1: CONDITIONAL GRADIENT METHOD FOR CLUSTERING FASHION-MNIST DATA (45 POINTS)
(a) (10 Points) Show that the domain $\mathcal{X} = \{X : \operatorname{Tr}(X) \le \kappa,\; X \in \mathbb{R}^{p\times p},\; X \succeq 0\}$ is a convex set.
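(One possible route, sketched for reference: $\mathcal{X}$ is the intersection of the positive semidefinite cone with the set $\{X : \operatorname{Tr}(X) \le \kappa\}$, and both are convex, since
$$\lambda X + (1-\lambda)Y \succeq 0 \ \text{ for } X, Y \succeq 0,\ \lambda\in[0,1], \qquad \operatorname{Tr}\big(\lambda X + (1-\lambda)Y\big) = \lambda\operatorname{Tr}(X) + (1-\lambda)\operatorname{Tr}(Y) \le \kappa,$$
and an intersection of convex sets is convex.)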
² See Section 2 of [4] for details of this relaxation.
³ In the code, you do not need to worry about any of the processing details mentioned here. You are directly given the matrix C.
Figure 1: Fashion-MNIST data (the classes are T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot).
(b) (10 Points) Given a linear inclusion constraint $Tx \in \mathcal{Y}$, the corresponding quadratic penalty function is given by
$$\mathrm{QP}_{\mathcal{Y}}(x) = \operatorname{dist}^2(Tx, \mathcal{Y}) = \min_{y\in\mathcal{Y}} \|y - Tx\|^2.$$
Write down the constraints in (3) in the quadratic penalty form and show that the penalized objective takes the form
$$f(x) + \frac{1}{2\beta}\,\|\mathcal{A}_1(x) - b_1\|^2 + \frac{1}{2\beta}\,\|\mathcal{A}_2(x) - b_2\|^2 + \frac{1}{2\beta}\,\operatorname{dist}^2(x, \mathcal{K}),$$
and show that the gradient of the penalized objective is equal to $v_k/\beta$ in the algorithm.
(Hint: You can write $\operatorname{dist}^2(Tx, \mathcal{Y}) = \|y^* - Tx\|^2$, where $y^* = \arg\min_{y\in\mathcal{Y}} \|y - Tx\|^2$, and take the derivative with respect to $x$ without worrying about $y^*$ depending on $x$, thanks to Danskin's theorem⁴.)
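(For reference, a sketch of the gradient computation under the $1/(2\beta)$ penalty weights written above, where these weights are themselves reconstructed from the claim that the gradient equals $v_k/\beta$:
$$\nabla\Big[\tfrac{1}{2\beta}\|\mathcal{A}_i(x) - b_i\|^2\Big] = \tfrac{1}{\beta}\,\mathcal{A}_i^\top(\mathcal{A}_i x - b_i), \qquad \nabla\Big[\tfrac{1}{2\beta}\operatorname{dist}^2(x, \mathcal{K})\Big] = \tfrac{1}{\beta}\,\big(x - \operatorname{proj}_{\mathcal{K}}(x)\big),$$
so that at $x = x^k$ with $\beta = \beta_k$ the full gradient is $\nabla f(x^k) + \tfrac{1}{\beta_k}\big[\mathcal{A}_1^\top(\mathcal{A}_1 x^k - b_1) + \mathcal{A}_2^\top(\mathcal{A}_2 x^k - b_2) + (x^k - \operatorname{proj}_{\mathcal{K}}(x^k))\big] = v_k/\beta_k$.)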
(c) (5 Points) Write down vk explicitly by deriving the gradient and projection specific to Problem (3).
(d) (20 Points) Complete the missing lines in the function definition of HomotopyCGM, and run the Test_clustering script to solve the k-means clustering problem. Using the function value_kmeans, compute and report the k-means value before and after running the algorithm. Plot the results and the SDP solution X after 5000 iterations. What is the final objective value? Is it below the optimal value provided to you? If yes, explain the reason.
(Hint: Note that when $\mathcal{X}$ is as given in (2), $\kappa\, u u^\top \in \operatorname{lmo}_{\mathcal{X}}(Z)$, where $u$ is the eigenvector corresponding to the smallest eigenvalue of $Z$.)
⁴ https://en.wikipedia.org/wiki/Danskin's_theorem
4 Guidelines for the preparation and the submission of the homework
Work on your own. Do not copy or distribute your code to other students in the class. Do not reuse any other code related to this homework. Here are a few warnings and suggestions for preparing and submitting your homework.
• This homework is due at 09:00 AM, 6 January 2020.
• Submit your work before the due date. Late submissions are not allowed, and you will get 0 points for this homework if you submit it after the deadline.
• Questions worth 0 points are for self-study. You do not need to answer them in the report.
• Your final report should include detailed answers, and it needs to be submitted in PDF format.
• The PDF file can be a scan or a photo. Make sure that it is legible.
• The results of your simulations should be presented in the final report with clear explanations and comparative evaluation.
• We provide some Python scripts that you can use to implement the algorithms, but you may implement them from scratch using any other convenient programming tool (in that case, you should also write the code to time your algorithms and to evaluate their efficiency by plotting the necessary graphs).
• Even if you use the Python scripts that we provide, you are responsible for the entire code you submit. Apart from completing the missing parts in the scripts, you might need to change some of the written parts and parameters as well, depending on your implementation.
• The code should be well documented and should work properly. Make sure that your code runs without errors. If the code you submit does not run, you will not be able to get any credit for the related exercises.
• Compress your codes and your final report into a single ZIP file, name it as ee556_2019_hw4_NameSurname.zip, and submit
it through the moodle page of the course.
References
[1] AHMED, A., RECHT, B., AND ROMBERG, J. Blind deconvolution using convex programming. IEEE Transactions on Information
Theory 60, 3 (2014), 1711–1732.
[2] JAGGI, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proc. 30th Int. Conf. Machine Learning (2013).
[3] MIXON, D. G., VILLAR, S., AND WARD, R. Clustering subgaussian mixtures by semidefinite programming. arXiv preprint
arXiv:1602.06612 (2016).
[4] PENG, J., AND WEI, Y. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization 18, 1 (2007), 186–205.
[5] RECHT, B., FAZEL, M., AND PARRILO, P. A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm
minimization. SIAM Rev. 52, 3 (2010), 471–501.
[6] XIAO, H., RASUL, K., AND VOLLGRAF, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
[7] YURTSEVER, A., FERCOQ, O., LOCATELLO, F., AND CEVHER, V. A conditional gradient framework for composite convex minimization with applications to semidefinite programming. arXiv preprint arXiv:1804.08544 (2018).
