
CS 189 Introduction to Machine Learning, Fall 2017, HW4

This homework is due Friday, September 22 at 10pm.

1 Getting Started

You may typeset your homework in LaTeX or submit neatly handwritten and scanned solutions. Please make sure to start each question on a new page, as grading (with Gradescope) is much easier that way! Deliverables:

1. Submit a PDF of your writeup to the assignment on Gradescope, “HW[n] Write-Up”.
2. Submit your test set evaluation results, “HW[n] Test Set”.

After you’ve submitted your homework, be sure to watch out for the self-grade form.

(a) Before you start your homework, write down your team. Who else did you work with on this homework? List names and email addresses. In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?

(b) Please copy the following statement and sign next to it:

I certify that all solutions are entirely in my words and that I have not looked at another student’s solutions. I have credited all external sources in this write up.


2 MLE of Multivariate Gaussian

In lecture, we discussed uses of the multivariate Gaussian distribution. We just assumed that we knew the parameters of the distribution (the mean vector μ and covariance matrix Σ). In practice, though, we will often want to estimate μ and Σ from data. (This will come up even beyond regression-type problems: for example, when we want to use Gaussian models for classification problems.) This problem asks you to derive the Maximum Likelihood Estimate for the mean and variance of a multivariate Gaussian distribution.

(a) Let X have a multivariate Gaussian distribution with mean μ ∈ Rd and covariance matrix Σ ∈ Rd×d. Write the log likelihood of drawing the n i.i.d. samples x1,…,xn ∈ Rd from X given Σ and μ.

Solution: First, we calculate the likelihood by separating out the product:

P(x_1, x_2, ..., x_n | μ, Σ) = P(x_1 | μ, Σ) P(x_2 | μ, Σ) ··· P(x_n | μ, Σ)

    = ∏_{i=1}^{n} 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp{ −(1/2) (x_i − μ)^T Σ^{−1} (x_i − μ) }

    = 1 / ((2π)^{nd/2} |Σ|^{n/2}) · exp{ −(1/2) ∑_{i=1}^{n} (x_i − μ)^T Σ^{−1} (x_i − μ) }     (1)

Then we take the log:

ln P(x_1, x_2, ..., x_n | μ, Σ) = −(nd/2) log 2π − (n/2) log|Σ| − (1/2) ∑_{i=1}^{n} (x_i − μ)^T Σ^{−1} (x_i − μ)     (2)
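As a quick sanity check (my addition, not part of the original solution), Equation (2) can be evaluated numerically and compared against scipy.stats.multivariate_normal; the parameters below are arbitrary.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d, n = 3, 50
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)                       # symmetric positive definite covariance
x = rng.multivariate_normal(mu, Sigma, size=n)    # n i.i.d. samples

# Equation (2), computed term by term.
diff = x - mu
quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
loglik = (-n * d / 2 * np.log(2 * np.pi)
          - n / 2 * np.linalg.slogdet(Sigma)[1]
          - 0.5 * quad.sum())

assert np.isclose(loglik, multivariate_normal(mu, Sigma).logpdf(x).sum())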

(b) Find MLE of μ and Σ. For taking derivatives with respect to matrices, you may use any formula in “The Matrix Cookbook” without proof. This is a reasonably involved problem part with lots of steps to get to the answer. We recommend students first do the one-dimensional case and then the two-dimensional case to warm up.

Solution: We find the MLE of the mean by differentiating with respect to μ, which requires using Equation 83 from The Matrix Cookbook.

Equation 83 tells us:

∂/∂μ (x_i − μ)^T Σ^{−1} (x_i − μ) = (Σ^{−1} + Σ^{−T}) (x_i − μ) (−1) = −2 Σ^{−1} (x_i − μ).


Therefore,

∂/∂μ ln P(x_1, ..., x_n | μ, Σ) = ∂/∂μ [ −(1/2) ∑_{i=1}^{n} (x_i − μ)^T Σ^{−1} (x_i − μ) ]

    = −(1/2) ∑_{i=1}^{n} ∂/∂μ (x_i − μ)^T Σ^{−1} (x_i − μ)

    = −(1/2) ∑_{i=1}^{n} −2 Σ^{−1} (x_i − μ)                                              (3)

    = ∑_{i=1}^{n} Σ^{−1} (x_i − μ)

We set the expression equal to 0:

∑_{i=1}^{n} Σ^{−1} (x_i − μ) = 0  ⟹  ∑_{i=1}^{n} (x_i − μ) = 0  ⟹  ∑_{i=1}^{n} x_i = nμ  ⟹  μ = (1/n) ∑_{i=1}^{n} x_i.

We find the MLE of the variance by differentiating with respect to Σ, which requires using Equations 57 and 61 from The Matrix Cookbook.

Equation 57 tells us:

∂/∂Σ log|Σ| = Σ^{−T}

Equation 61 tells us:

∂/∂Σ ( (x_i − μ)^T Σ^{−1} (x_i − μ) ) = −Σ^{−T} (x_i − μ)(x_i − μ)^T Σ^{−T}

Therefore,

∂/∂Σ ln P(x_1, ..., x_n | μ, Σ) = ∂/∂Σ ( −(n/2) log|Σ| − (1/2) ∑_{i=1}^{n} (x_i − μ)^T Σ^{−1} (x_i − μ) )

    = −(n/2) ∂/∂Σ log|Σ| − (1/2) ∑_{i=1}^{n} ∂/∂Σ ( (x_i − μ)^T Σ^{−1} (x_i − μ) )          (4)

    = −(n/2) Σ^{−T} − (1/2) ∑_{i=1}^{n} −Σ^{−T} (x_i − μ)(x_i − μ)^T Σ^{−T}

We set the expression equal to 0. Remember that Σ is symmetric, so Σ = Σ^T:

0 = −(n/2) Σ^{−T} + (1/2) ∑_{i=1}^{n} Σ^{−T} (x_i − μ)(x_i − μ)^T Σ^{−T}

  ⟹  0 = −n I + ∑_{i=1}^{n} (x_i − μ)(x_i − μ)^T Σ^{−1}        (multiplying on the left by 2Σ)

  ⟹  ∑_{i=1}^{n} (x_i − μ)(x_i − μ)^T = n Σ                    (multiplying on the right by Σ)

  ⟹  Σ = (1/n) ∑_{i=1}^{n} (x_i − μ)(x_i − μ)^T                                              (5)
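A quick numerical check of these MLE formulas (my addition, not part of the original solution): the closed-form estimates coincide with np.mean and with the biased sample covariance that np.cov returns when bias=True.

import numpy as np

rng = np.random.default_rng(0)
x = rng.multivariate_normal([15, 5], [[20, 0], [0, 10]], size=500)   # n x d samples
n = len(x)

mu_hat = x.sum(axis=0) / n                    # MLE of the mean
centered = x - mu_hat
Sigma_hat = centered.T @ centered / n         # MLE of the covariance (note the 1/n, not 1/(n-1))

assert np.allclose(mu_hat, np.mean(x, axis=0))
assert np.allclose(Sigma_hat, np.cov(x.T, bias=True))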


(c) Use the following code to sample from a two-dimensional multivariate Gaussian and plot the samples:

import numpy as np
import matplotlib.pyplot as plt

mu = [15, 5]
sigma = [[20, 0], [0, 10]]
samples = np.random.multivariate_normal(mu, sigma, size=100)
plt.scatter(samples[:, 0], samples[:, 1])
plt.show()

Try the following three values of Σ:
Σ = [[20, 0], [0, 10]]; Σ = [[20, 14], [14, 10]]; Σ = [[20, −14], [−14, 10]].

Calculate the mean and variance of these distributions from the samples (that is, implement part (b)). Report your results. Include your code in your write-up.

Solution: There is no “correct” answer for the sample mean and variance, since it depends on randomness, but hopefully you got values for the MLE mean and variance that were close to the true mean and variance. If they are very far off, this may indicate an error in your code. The following code works by computing the formula above directly (including the matrix multiplication):

sample_mean = np.average(samples, axis=0)
# MLE covariance: (1/n) * sum over i of the outer product (x_i - mean)(x_i - mean)^T
sample_variance = sum(
    np.outer(samples[i] - sample_mean, samples[i] - sample_mean)
    for i in range(100)
) / 100

3 Tikhonov Regularization and Weighted Least Squares

In this problem, we view Tikhonov regularization from a probabilistic standpoint.

In lecture, you have seen this worked out in one way. In an earlier homework, we introduced Tikhonov regularization as a generalization of ridge regression. In this problem, you are mainly going to use the computer to help deepen your understanding of how priors and thus the right regularization can help.

(a) Let X ∈ R^d be a d-dimensional random vector and Y ∈ R be a one-dimensional random variable. Assume a linear model between X and Y: Y = w^T X + z where z ∈ R and w ∈ R^d. Also assume that z ∼ N(0, 1) is a Gaussian random variable. Assume w ∼ N(0, Σ) where Σ is a symmetric positive definite matrix and this random variable is independent of the observation noise. What is the conditional distribution of Y given X and w?



Solution: A Gaussian random variable plus a constant is a Gaussian random variable with the mean shifted by that constant while the variance stays the same, so Y | X, w ∼ N(X^T w, 1). Concretely, we have

f(Y | X, w) = 1/√(2π) · exp{ −(1/2) (Y − X^T w)^2 },     (6)

the pdf of a Gaussian distribution.

(b) (Tikhonov regularization) Given n training data points {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} generated according to the model of (X, Y) in the previous part (the w is common, but the observation noise varies across the different training points; the X_i are distinct and arbitrary), derive the posterior distribution of w given the training data. Note: w and Y = (Y_1, Y_2, ..., Y_n) are jointly Gaussian in this problem given the X_i.

Hint: (You may or may not find this useful.) If the probability density function of a random variable is of the form

f(X) = C · exp{ −(1/2) X^T A X + b^T X },     (7)

where C is some constant that makes f(X) integrate to 1, then the mean of X is A^{−1} b. This can be used to help complete squares if you choose to go that way.

This case was derived in lecture, and you are free to replicate the derivation from lecture in your own language. You are also free to give a different derivation from first principles if you like.

Solution:

By Bayes’ Theorem, we have

f(w | X_1, Y_1, ..., X_n, Y_n) ∝ [ ∏_{i=1}^{n} f(Y_i | X_i, w) ] f(w)

    ∝ [ ∏_{i=1}^{n} exp{ −(1/2) (Y_i − X_i^T w)^2 } ] exp{ −(1/2) w^T Σ^{−1} w }

    ∝ [ exp{ −(1/2) ∑_{i=1}^{n} (Y_i − X_i^T w)^2 } ] exp{ −(1/2) w^T Σ^{−1} w }

    ∝ exp{ −(1/2) [ w^T (X^T X + Σ^{−1}) w − 2 (X^T Y)^T w ] },

with X ∈ R^{n×d} whose ith row is X_i^T and Y ∈ R^n whose ith entry is Y_i. In the above derivation, the first line follows from Bayes’ Theorem after omitting the constant factor f(X_1, Y_1, ..., X_n, Y_n) in the denominator, which does not depend on w. The second line follows from writing out the pdfs directly, and the last line follows from expanding out the squared terms and dropping factors that do not depend on w.

Therefore the posterior of w given X_1, Y_1, ..., X_n, Y_n is the multivariate Gaussian N( (X^T X + Σ^{−1})^{−1} X^T Y, (X^T X + Σ^{−1})^{−1} ), which follows by using the following lemma.

Lemma: If the probability density function of a random variable is of the form

f(X) = C · exp{ −(1/2) X^T A X + b^T X },     (8)

where C is some constant that makes f(X) integrate to 1, and A is a symmetric positive definite matrix, then X is distributed as N(A^{−1} b, A^{−1}).

The proof of the lemma is by completing the square:

f(X) = C · exp{ −(1/2) X^T A X + b^T X }

     = C · exp{ −(1/2) (X − A^{−1} b)^T A (X − A^{−1} b) + (1/2) b^T A^{−1} b }

     = C_2 · exp{ −(1/2) (X − A^{−1} b)^T A (X − A^{−1} b) },

which is the density of a Gaussian random variable with mean A^{−1} b and covariance matrix A^{−1}.
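As a small illustration (my addition, not from the original write-up), the posterior mean and covariance can be computed directly by applying the lemma; the data here are simulated with arbitrary parameters, and the posterior mean is exactly the Tikhonov-regularized least squares estimate.

import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
Sigma_prior = np.array([[1.0, 0.25], [0.25, 1.0]])        # prior covariance of w
w_true = rng.multivariate_normal(np.zeros(d), Sigma_prior)
X = rng.normal(size=(n, d))
Y = X @ w_true + rng.normal(size=n)                       # unit-variance observation noise

A = X.T @ X + np.linalg.inv(Sigma_prior)                  # the "A" of the lemma
b = X.T @ Y                                               # the "b" of the lemma
posterior_cov = np.linalg.inv(A)
posterior_mean = posterior_cov @ b                        # = (X^T X + Sigma^{-1})^{-1} X^T Y
print(posterior_mean, w_true)                             # reasonably close for n = 50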

(c) (BONUS but in-scope.) How would you extend your result from the previous part to the case where the observation noise Z was no longer N(0,In) but was instead distributed as N(μz,Σz) for some mean μz and some covariance Σz, but was independent of the parameter w. Give an argument based on appropriately changing coordinates.

Solution: By changing variables, our goal is to reduce to the previous case, with white observation noise.

Define V such that Z = μ_z + Σ_z^{1/2} V. Then V ∼ N(0, I) and V = Σ_z^{−1/2}(Z − μ_z). (Σ_z^{1/2} is the matrix such that Σ_z^{1/2} Σ_z^{T/2} = Σ_z, which can be found from the eigenvalue decomposition Σ_z = U∆U^T, exploiting the spectral theorem’s guarantee of orthonormal eigenvectors and all positive eigenvalues forming the diagonal matrix ∆. Taking Σ_z^{1/2} = U∆^{1/2} gives us what we want.)

Then Σ_z^{−1/2}(Y − μ_z) = Σ_z^{−1/2} X w + V with V ∼ N(0, I). This makes the reasonable assumption that Σ_z is invertible, since otherwise there would be some direction in which there is essentially no noise. So the problem reduces to the previous part by taking Σ_z^{−1/2} X as X and Σ_z^{−1/2}(Y − μ_z) as Y, which yields the posterior of w given X_1, Y_1, ..., X_n, Y_n as the multivariate Gaussian N( (X^T Σ_z^{−1} X + Σ^{−1})^{−1} X^T Σ_z^{−1} (Y − μ_z), (X^T Σ_z^{−1} X + Σ^{−1})^{−1} ).
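A minimal numpy sketch of this change of coordinates (my addition; the noise parameters here are made up): build Σ_z^{−1/2} from an eigendecomposition, whiten X and Y − μ_z, and then reuse the white-noise posterior formula from part (b).

import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 2
Sigma_prior = np.eye(d)

# Made-up correlated noise: mean mu_z and covariance Sigma_z over the n observations.
mu_z = 0.5 * np.ones(n)
B = rng.normal(size=(n, n))
Sigma_z = B @ B.T / n + np.eye(n)

w_true = rng.multivariate_normal(np.zeros(d), Sigma_prior)
X = rng.normal(size=(n, d))
Y = X @ w_true + rng.multivariate_normal(mu_z, Sigma_z)

# Sigma_z^{-1/2} from the eigendecomposition Sigma_z = U diag(lam) U^T.
lam, U = np.linalg.eigh(Sigma_z)
Sigma_z_inv_half = U @ np.diag(lam ** -0.5) @ U.T

# Whiten, then reuse the white-noise posterior formula from part (b).
X_w = Sigma_z_inv_half @ X
Y_w = Sigma_z_inv_half @ (Y - mu_z)
posterior_cov = np.linalg.inv(X_w.T @ X_w + np.linalg.inv(Sigma_prior))
posterior_mean = posterior_cov @ X_w.T @ Y_w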

(d) (Compare the effect of different priors) Do the following for Σ = Σ_1, Σ_2, Σ_3, Σ_4, Σ_5, Σ_6 respectively, where

Σ_1 = [[1, 0], [0, 1]]
Σ_2 = [[1, 0.25], [0.25, 1]]
Σ_3 = [[1, 0.9], [0.9, 1]]
Σ_4 = [[1, −0.25], [−0.25, 1]]
Σ_5 = [[1, −0.9], [−0.9, 1]]
Σ_6 = [[0.1, 0], [0, 0.1]]

In the above cases, the priors on the two parameters are independent with large variance, mildly positively correlated, strongly positively correlated, mildly negatively correlated, strongly negatively correlated, and then independent with small variances, respectively.

Generate 5, 50, and 500 data points from Y = X_1 + X_2 + Z where X_1, X_2 ∼ N(0, 5) and Z ∼ N(0, 1) as training data. Plot the contours of the posteriors on w for all 18 cases of assumed priors and numbers of data points. What do you observe?

Hint: Use matplotlib.mlab.bivariate_normal to generate the values of the pdf for bivariate normals.

Solution:

import matplotlib
import matplotlib.mlab
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

def generate_data(n):
    """Generate a training set of size n from Y = X1 + X2 + Z."""
    X = np.random.randn(n, 2) * np.sqrt(5)
    z = np.random.randn(n)
    y = np.sum(X, axis=1) + z
    return (X, y)

def tikhonov_regression(X, Y, Sigma):
    """Compute the posterior mean of w using the Tikhonov regression formula."""
    w = np.linalg.inv(X.T.dot(X) + np.linalg.inv(Sigma)).dot(X.T.dot(Y))
    return w

def compute_mean_var(X, y, Sigma):
    """Compute the mean and covariance parameters of the posterior on w."""
    mean = tikhonov_regression(X, y, Sigma)
    var = np.linalg.inv(X.T.dot(X) + np.linalg.inv(Sigma))
    mux, muy = mean
    sigmax = np.sqrt(var[0, 0])
    sigmay = np.sqrt(var[-1, -1])
    sigmaxy = var[0, -1]
    return mux, muy, sigmax, sigmay, sigmaxy

Sigmas = [np.array([[1, 0], [0, 1]]),
          np.array([[1, 0.25], [0.25, 1]]),
          np.array([[1, 0.9], [0.9, 1]]),
          np.array([[1, -0.25], [-0.25, 1]]),
          np.array([[1, -0.9], [-0.9, 1]]),
          np.array([[0.1, 0], [0, 0.1]])]
names = [str(i) for i in range(1, 6 + 1)]

for num_data in [5, 50, 500]:
    X, Y = generate_data(num_data)
    for i, Sigma in enumerate(Sigmas):
        # Compute the mean and covariance of the posterior.
        mux, muy, sigmax, sigmay, sigmaxy = compute_mean_var(X, Y, Sigma)
        x = np.arange(0.5, 1.5, 0.01)
        y = np.arange(0.5, 1.5, 0.01)
        X_grid, Y_grid = np.meshgrid(x, y)
        # Evaluate the bivariate normal pdf on the grid.
        Z = matplotlib.mlab.bivariate_normal(X_grid, Y_grid,
                                             sigmax, sigmay, mux, muy, sigmaxy)
        # Plot the contours of the posterior.
        plt.figure(figsize=(10, 10))
        CS = plt.contour(X_grid, Y_grid, Z,
                         levels=np.concatenate([np.arange(0, 0.05, 0.01),
                                                np.arange(0.05, 1, 0.05)]))
        plt.clabel(CS, inline=1, fontsize=10)
        plt.xlabel('X')
        plt.ylabel('Y')
        plt.title('Sigma' + names[i] + ' with num_data = {}'.format(num_data))
        plt.savefig('Sigma' + names[i] + '_num_data_{}.png'.format(num_data))
        plt.close()

We can observe that a good prior matters, but enough data will wash away the effect of the prior.

Figures 1–24: contours of the posterior on w for each of the priors Σ_1 through Σ_6, with n = 5, 50, and 500 training points.

(e) (Influence of Priors) Generate n training data samples from Y = X_1 + X_2 + Z where X_1, X_2 ∼ N(0, 5) and Z ∼ N(0, 1) as before. Notice that the true parameters w_1 = 1, w_2 = 1 are moderately large and positively correlated with each other. We want to quantitatively understand how the effect of the prior influences the mean square error as we get more training data. This should corroborate the qualitative results you saw in the previous part.

In this case, we could directly compute the “test error” once we see the estimated parameters. Namely, if we estimate after learning that the model is Y = ŵ_1 X_1 + ŵ_2 X_2 + Z, the average test error should theoretically be 5(ŵ_1 − 1)^2 + 5(ŵ_2 − 1)^2 + 1. (Make sure you understand why.)

However, it is good to also keep in mind what happens with a finite amount of test data. So, generate a test set of 100 data points from this model and use the test error on that to evaluate what happens as we use more training data.

If we just plotted the average test-error (or the theoretical average test error) with respect to the amount of training data, we still have the randomness in the specific training data that was used. Consequently, it is worth replicating this experiment a few times (say 20 times) to get an average of averages. (It is also insightful to look at the spread.)

Plot the mean square error between Ŷ_i and Y_i over the test data with respect to the size of the training data n (increase n from 5 to 200 by 5). Include what the theoretical mean square error should be for those parameter estimates. Compare what happens for different priors as the amount of training data increases.

Recall the MSE is defined by

(1/n) ∑_{i=1}^{n} (Y_i − Ŷ_i)^2.     (9)

Here Ŷ_i = ŵ^T X_i, where X_i ∈ R^2 and ŵ in your model is the solution to the least squares problem with Tikhonov regularization given the training data. You should observe that

• Enough data will wash away the effects of the prior.
• A good prior helps when there are not enough data points.

Solution:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

w = [1.0, 1.0]
n_test = 100
n_trains = np.arange(5, 205, 5)
Sigmas = [np.array([[1, 0], [0, 1]]),
          np.array([[1, 0.25], [0.25, 1]]),
          np.array([[1, 0.9], [0.9, 1]]),
          np.array([[1, -0.25], [-0.25, 1]]),
          np.array([[1, -0.9], [-0.9, 1]]),
          np.array([[0.1, 0], [0, 0.1]])]
names = ['Sigma{}'.format(i + 1) for i in range(6)]

def generate_data(n):
    """Generate a data set of size n from Y = X1 + X2 + Z."""
    X = np.random.randn(n, 2) * np.sqrt(5)
    z = np.random.randn(n)
    y = np.sum(X, axis=1) + z
    return (X, y)

def tikhonov_regression(X, Y, Sigma):
    """Compute w using the Tikhonov regression formula."""
    w = np.linalg.inv(X.T.dot(X) + np.linalg.inv(Sigma)).dot(X.T.dot(Y))
    return w

def compute_mse(X, Y, w):
    """Compute the MSE of the predictions X w against Y."""
    mse = np.mean((np.squeeze(X.dot(w)) - Y) ** 2)
    return mse

# Generate the test data.
X_test, y_test = generate_data(n_test)

mses = np.zeros((len(Sigmas), len(n_trains), 10))
theoretical_mses = np.zeros((len(Sigmas), len(n_trains), 10))

for seed in range(10):
    np.random.seed(seed)
    for i, Sigma in enumerate(Sigmas):
        for j, n_train in enumerate(n_trains):
            X, y = generate_data(n_train)              # Generate training data.
            w_hat = tikhonov_regression(X, y, Sigma)   # Estimate w.
            mse = compute_mse(X_test, y_test, w_hat)   # Compute the test MSE.
            theoretical_mses[i, j, seed] = sum((w_hat - 1) ** 2) * 5 + 1
            mses[i, j, seed] = mse

# Plot the test MSE.
plt.figure()
for i, _ in enumerate(Sigmas):
    plt.plot(n_trains, np.mean(mses[i], axis=-1), label=names[i])
plt.xlabel('Number of data')
plt.ylabel('MSE on Test Data')
plt.legend()
plt.savefig('MSE.png')

# Plot the theoretical MSE.
plt.figure()
for i, _ in enumerate(Sigmas):
    plt.plot(n_trains, np.mean(theoretical_mses[i], axis=-1), label=names[i])
plt.xlabel('Number of data')
plt.ylabel('MSE on Test Data')
plt.legend()
plt.savefig('theoretical_MSE.png')

# Log-log plot of the test MSE minus the noise floor of 1.
plt.figure()
for i, _ in enumerate(Sigmas):
    plt.plot(n_trains, np.mean(mses[i] - 1, axis=-1), label=names[i])
plt.xlabel('Number of data')
plt.ylabel('MSE on Test Data')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.savefig('log_MSE.png')

# Log-log plot of the theoretical MSE minus the noise floor of 1.
plt.figure()
for i, _ in enumerate(Sigmas):
    plt.plot(n_trains, np.mean(theoretical_mses[i] - 1, axis=-1), label=names[i])
plt.xlabel('Number of data')
plt.ylabel('MSE on Test Data')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.savefig('log_theoretical_MSE.png')

Figure 25: Average (over runs) MSE on Test Data. We can see this dropping, but it is hard to understand what exactly is going on.

Figure 26: Average (over runs) Theoretical MSE. This reveals that there are still some wiggles, so the wiggles are also coming from variance in our estimator (given that we are only taking the average over a finite number of runs), not random peculiarities of our test data.

We first plot the test MSE and the theoretical MSE on the usual (linear) scale. We observe that both decrease as n increases, but the curves for the different priors overlap and are hard to tell apart. We then use a log scale for both the x and y axes. Interestingly, we observe a linear relationship under the log scale, which follows from the fact that the MSE decreases as the reciprocal of n, the number of training samples, and the different priors become easier to distinguish.

Figure 27: Log-log plot of average (over runs) theoretical MSE. By looking at a log-log plot, where both axes are on a log scale, we can see the pattern more clearly: the quality of the prior impacts where we start with small amounts of training data, but the rate of convergence is given by a power law.

Three observations:

• Enough data will wash away the effect of the prior.
• A good prior helps when there are not enough data points.
• How you plot is IMPORTANT for scientific observation.

A quick numerical check of the 1/n rate is sketched below.
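This sketch is my addition, not part of the original solution: it re-estimates the theoretical excess MSE for the Σ_1 prior and fits the slope of log(MSE − 1) against log n, which should come out near −1 under the setup above.

import numpy as np

rng = np.random.default_rng(0)
Sigma_prior = np.eye(2)            # the Sigma_1 prior
n_trains = np.arange(5, 205, 5)

def excess_mse(n, n_runs=50):
    """Average theoretical excess test MSE, 5 * ||w_hat - (1, 1)||^2, over n_runs draws."""
    total = 0.0
    for _ in range(n_runs):
        X = rng.normal(size=(n, 2)) * np.sqrt(5)
        y = X.sum(axis=1) + rng.normal(size=n)
        w_hat = np.linalg.solve(X.T @ X + np.linalg.inv(Sigma_prior), X.T @ y)
        total += 5 * np.sum((w_hat - 1) ** 2)
    return total / n_runs

curve = np.array([excess_mse(n) for n in n_trains])
slope = np.polyfit(np.log(n_trains), np.log(curve), 1)[0]
print(slope)                       # roughly -1, i.e. the excess MSE decays like 1/n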

4 Total Least Squares

Recall that in the least squares problem, we want to solve for w in min_w ||Aw − y||. We measure the error as the difference between Aw and y, which can be viewed as adding an error term ŷ such that the equation Aw = y + ŷ has a solution:

min ||ŷ||_F ,   subject to   Aw = y + ŷ     (10)

Although this optimization formulation allows for errors in the measurements of y, it does not allow for errors in the feature matrix A that is measured from the data. In this problem, we will explore a method called total least squares that allows for both error in the matrix A and the vector y, represented by Â and ŷ, respectively. Specifically, the total least squares problem is to find the solution for w in the following minimization problem:

min ||[Â, ŷ]||_F ,   subject to   (A + Â)w = y + ŷ     (11)

where the matrix [Â, ŷ] is the concatenation of the columns of Â with the column vector ŷ. Intuitively, this equation is finding the smallest perturbation to the matrix of data points A and the outputs y such that the linear model can be solved exactly. This minimization problem can be rewritten as

min_{Â, ŷ} ||[Â, ŷ]||_F ,   subject to   [A + Â, y + ŷ] [w; −1] = 0     (12)

where [w; −1] denotes the vector in R^{d+1} obtained by appending −1 to w.

(a) Let the matrix A ∈ R^{n×d} and y ∈ R^n. Assuming that n > d and rank(A + Â) = d, explain why rank([A + Â, y + ŷ]) = d.

Solution:

Given that the constraint in the minimization problem from Equation (11) is satisfied, the last column of the matrix [A + Â, y + ŷ], which is y + ŷ, can be expressed as a linear combination of the remaining columns in the matrix. Therefore, the rank of the matrix is the rank of the remaining columns, which we previously assumed was d.

(b) We know that the matrix [A + Â, y + ŷ] is degenerate (has less than full rank). Therefore, by the Eckart–Young–Mirsky Theorem, which tells us what the closest lower-rank matrix in the Frobenius norm is, the matrix [A + Â, y + ŷ] that minimizes the problem is given by

[A, y] = U Σ V^T

and

[A + Â, y + ŷ] = U [ Σ_{1,...,d}  0 ;  0  0 ] V^T

where Σ_{1,...,d} is the diagonal matrix of the d largest singular values of [A, y]. Using this information, find a nonzero solution to [A + Â, y + ŷ] x = 0 and thus solve for w in Equation (12).

HINT: the last column of the product [A, y]V will be useful.

Solution:

Let’s start by expressing the SVD of [A, y] in terms of submatrices and vectors:

[A, y] = [ U_xx  u_xy ; u'_xy^T  u_yy ] [ Σ_{1,...,d}  0 ; 0  σ_{d+1} ] [ V_xx  v_xy ; v'_xy^T  v_yy ]^T

and

[A + Â, y + ŷ] = [ U_xx  u_xy ; u'_xy^T  u_yy ] [ Σ_{1,...,d}  0 ; 0  0 ] [ V_xx  v_xy ; v'_xy^T  v_yy ]^T

Here v_xy ∈ R^d and v_yy ∈ R are the top and bottom blocks of the last column of V (and similarly for u_xy, u_yy in U). We can subtract the second equation from the first to solve for [Â, ŷ]:

[Â, ŷ] = [A + Â, y + ŷ] − [A, y] = −[ U_xx  u_xy ; u'_xy^T  u_yy ] [ 0  0 ; 0  σ_{d+1} ] [ V_xx  v_xy ; v'_xy^T  v_yy ]^T

       = −[ u_xy ; u_yy ] σ_{d+1} [ v_xy ; v_yy ]^T

From the hint, we then look at the last column of the product [A, y]V, which is given by

[A, y] [ v_xy ; v_yy ] = [ u_xy ; u_yy ] σ_{d+1}

and we can substitute this into the equation above to get

[Â, ŷ] = −[A, y] [ v_xy ; v_yy ] [ v_xy ; v_yy ]^T

Therefore

[A + Â, y + ŷ] [ v_xy ; v_yy ] = [A, y] [ v_xy ; v_yy ] − [A, y] [ v_xy ; v_yy ] [ v_xy ; v_yy ]^T [ v_xy ; v_yy ] = 0,

since [ v_xy ; v_yy ] is a column of the orthogonal matrix V and hence has unit norm. This makes [ v_xy ; v_yy ] a nonzero solution to the equation, and scaling it so that its last entry equals −1 gives

w = −v_xy / v_yy

from Equation (12).
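A minimal numpy sketch of this recipe (my addition, on made-up synthetic data): take the SVD of the augmented matrix [A, y], read off the right-singular vector for the smallest singular value, and form w = −v_xy / v_yy.

import numpy as np

def tls_solve(A, y):
    """Total least squares: w = -v_xy / v_yy from the last right-singular vector of [A, y]."""
    _, _, Vt = np.linalg.svd(np.column_stack([A, y]))
    v = Vt[-1]                       # right-singular vector for the smallest singular value
    return -v[:-1] / v[-1]

# Synthetic check with noise in both A and y.
rng = np.random.default_rng(2)
n, d = 200, 3
w_true = np.array([1.0, -2.0, 0.5])
A_clean = rng.normal(size=(n, d))
A = A_clean + 0.5 * rng.normal(size=(n, d))        # noisy features
y = A_clean @ w_true + 0.5 * rng.normal(size=n)    # noisy observations

w_tls = tls_solve(A, y)
w_ols, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w_tls, w_ols)    # with errors in A, TLS is typically less biased toward zero than OLS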

(c) In this problem, we will use total least squares to approximately learn the lighting in a photograph, which we can then use to paste new objects into the image while still maintaining the realism of the image. You will be estimating the lighting coefficients for the interior of St. Peter’s Basilica, and you will then use these coefficients to change the lighting of an image of a tennis ball so that it can be pasted into the image. In Figure 28, we show the result of pasting the tennis ball in the image without adjusting the lighting on the ball. The ball looks too bright for the scene and does not look like it would fit in with other objects in the image.

Figure 28: Tennis ball pasted on top of an image of St. Peter’s Basilica without lighting adjustment (left) and with lighting adjustment (right)

Figure 29: Image of a spherical mirror inside of St. Peter’s Basilica

To start, we will represent environment lighting as a spherical function ⃗f (⃗n) where ⃗n is a 3 dimensional unit vector (||⃗n||2 = 1), and ⃗f outputs a 3 dimensional color vector, one component for red, green, and blue light intensities. Because ⃗f (⃗n) is a spherical function, the input must correspond to a point on a sphere. In this case, we are using normalized vectors in 3 dimensions to represent a point on a sphere. The function ⃗f (⃗n) represents the total incoming light from the direction ⃗n in the scene.

To convincingly add an object to an image, we need to apply the lighting from the environment onto the added object. In this case, we’ll be adding a tennis ball to the scene. Because the tennis ball is what’s considered a diffuse surface, the color of a point on the ball can be computed as a weighted sum of all of the incoming light at that point that has a positive dot product with the direction ⃗n. This weighted sum can be approximated by blurring the image used to estimate the lighting environment. However, we skip this blurring step and instead approximate the spherical function ⃗f(⃗n) in terms of the first 9 spherical harmonic basis functions, where spherical harmonics are like the Fourier transform but for spherical functions. Because we are using only 9 of the basis functions, most of the information in the image cannot be captured, which will effectively blur the image.

The first 9 unnormalized basis functions are given by:

L_{0,0} = 1
L_{1,−1} = y
L_{1,1} = x
L_{1,0} = z
L_{2,−2} = x · y
L_{2,−1} = y · z
L_{2,0} = 3z^2 − 1
L_{2,1} = x · z
L_{2,2} = x^2 − y^2

where ⃗n = [x, y, z]^T. The lighting function can then be approximated as

⃗f(⃗n) ≈ ∑_{i=1}^{9} ⃗γ_i L_i(⃗n)

where L_i(⃗n) is the ith basis function from the list above.

The function of incoming light ⃗f(⃗n) can be measured by photographing a spherical mirror placed in the scene of interest. In this case, we provide you with an image of the sphere as seen in Figure 29. In the code provided, there is a function extractNormals(img) that will extract the training pairs (⃗ni,⃗f(⃗ni)) from the image. An example using this function is in the code.

Use the spherical harmonic basis functions to create a 9 dimensional feature vector for each sample. Use this to formulate an ordinary least squares problem and solve for the unknown coefficients ⃗γ_i. Report the estimated values for ⃗γ_i and include a visualization of the approximation using the provided code. The provided code loads the images, extracts the training data, relights the tennis ball with incorrect coefficients, and saves the results. Your task is to compute the basis functions and solve the least squares problem to provide the code with the correct coefficients. To run the starter code, you will need to use Python with numpy and scipy. Because the resulting data set is large, we reduce it in the code by taking every 50th entry in the data. This is done for you in the starter code, but you can try using the entire data set or reduce it by a different amount.

Solution: The estimated coefficients ⃗γ_i (one row per basis function, columns for the red, green, and blue channels) are:

[ 202.31845431   162.41956802   149.07075034 ]
[ −27.66555164   −17.88905339   −12.92356688 ]
[  −1.08629293     0.42947012     1.15475569 ]
[  −5.15203925    −4.51375871    −4.24262639 ]
[  −3.14053107    −3.70269907    −3.74382934 ]
[  23.67671768    23.15698002    21.94638397 ]
[  −3.82167171     0.57606634     1.81637483 ]
[   4.7346737      1.4677692     −1.12253649 ]
[  −9.72739616    −5.75691108    −4.8395598  ]
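Here is a minimal sketch of how these coefficients might be computed (my addition, not the official starter code). It assumes normals (an n × 3 array of unit normals) and colors (an n × 3 array of RGB values) have already been obtained from the provided extractNormals(img) helper; the function names below are otherwise made up.

import numpy as np

def sh_features(normals):
    """First 9 unnormalized spherical harmonic basis functions, one row per normal."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.column_stack([np.ones_like(x), y, x, z,
                            x * y, y * z, 3 * z ** 2 - 1, x * z, x ** 2 - y ** 2])

def fit_lighting_ols(normals, colors):
    """Ordinary least squares fit of the 9 x 3 coefficient matrix (one column per RGB channel)."""
    B = sh_features(normals)
    gammas, *_ = np.linalg.lstsq(B, colors, rcond=None)
    return gammas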

(d) When we estimate the direction ⃗n to compute our training data (⃗ni,⃗f(⃗ni)), we make some approximations about how the light is captured on the image. We also assume that the spherical mirror is a perfect sphere, but in reality, there will always be small imperfections. Thus, our measurement for ⃗n contains some error, which makes this an ideal problem to apply total least squares. Solve this problem with total least squares by allowing perturbations in the matrix of basis functions. Report the estimated values for ⃗γi and include a visualization of the approximation. The output image will be visibly wrong, and we’ll explore how to fix this problem in the next part. Your implementation may only use the SVD and the matrix inverse functions from the linear algebra library in numpy.

Solution: The total least squares estimates of the coefficients ⃗γ_i are:

[  2.13318421e+02    1.70780299e+02    1.57126297e+02 ]
[ −3.23046362e+01   −2.02975310e+01   −1.45516114e+01 ]
[ −4.89811386e+00   −3.37684058e+00   −1.14207091e+00 ]
[ −4.31689131e+00   −3.80778081e+00   −4.83616306e+00 ]
[ −7.05901066e+03   −7.39934207e+03   −4.26448732e+03 ]
[ −3.05378224e+02   −1.56329401e+02    3.50285345e+02 ]
[ −9.76079364e+00   −5.33182216e+00   −1.55699782e+00 ]
[  7.30792588e+02    3.52130316e+02   −6.11683200e+02 ]
[ −9.08887079e+00   −3.84309477e+00   −4.16456437e+00 ]

(e) In the previous part, you should have noticed that the visualization is drastically different than the one generated using least squares. This difference is caused by the difference in scale between the inputs and the outputs. The inputs are all unit vectors, and the outputs are 3 dimensional vectors where each value is in [0,384]. Typically, values in an image will be in [0, 255], but the original image had a much larger range. We compressed the range to a smaller scale using tone mapping, but the effect of the compression is that relatively bright areas of the image become less bright. As a compromise, we scaled the image colors down to a maximum color value of 384 instead of 255. Explain why the difference in the range of values for the inputs Li(⃗n) and the outputs ⃗f (⃗ni) would create this difference in results when solving the total least squares problem. Propose a value by which to scale the outputs ⃗f (⃗ni) such that the values of the inputs and outputs are roughly on the same scale. Solve this scaled total least squares problem, report the computed spherical harmonic coefficients and provide a rendering of the relit sphere.

Solution:

Recall: total least squares assumes that the noise is the same in all directions. When we just have some data, it can be hard to think about what the noise model should be. The default is to take a “significant figures” mentality and assume that everything should be on the same scale, assuming that noise is roughly proportional to the size of the inputs. Because most of the basis functions lie within [−1, 1], we want to scale the image pixel values so that they lie in a similar range. For these results, we scaled the values by 1/384, but any reasonable value that scales the pixel values to a similar range is acceptable.

The resulting spherical harmonic coefficients are:

[ 209.41539449   169.06782937   155.39642541 ]
[ −30.28100667   −20.3163958    −15.21596685 ]
[  −1.05621451     0.46391495     1.19212042 ]
[  −5.7563859     −5.08161145    −4.78411908 ]
[  −7.95607504    −8.25078526    −8.0969764  ]
[  55.29299419    52.93568994    50.39069818 ]
[  −3.84934062     0.5565465      1.80231959 ]
[   7.35375998     3.85567243     1.0984583  ]
[ −10.91282516    −6.85792251    −5.88064457 ]
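A minimal sketch of this scaled fit (my addition, not the reference implementation): B is the n × 9 feature matrix and colors the n × 3 pixel values from the starter code, and the 1/384 factor is simply the scale proposed above; whether the coefficients are rescaled afterwards is an assumption on my part.

import numpy as np

def tls_solve(A, y):
    """Single-output total least squares via the SVD of [A, y], as derived in part (b)."""
    _, _, Vt = np.linalg.svd(np.column_stack([A, y]))
    v = Vt[-1]
    return -v[:-1] / v[-1]

def fit_lighting_tls_scaled(B, colors, scale=1.0 / 384.0):
    """Scale the outputs to roughly [0, 1], solve TLS per color channel, then undo the scale."""
    gammas = np.column_stack([tls_solve(B, scale * colors[:, c]) for c in range(3)])
    return gammas / scale      # coefficients that predict the original (unscaled) pixel values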


5 Your Own Question

Write your own question, and provide a thorough solution.

Writing your own problems is a very important way to really learn the material. The famous “Bloom’s Taxonomy” lists the levels of learning as: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g. by making yourself flashcards). But we don’t want the same to be true about the highest level.

As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.

Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don’t have to achieve this every week. But unless you try every week, it probably won’t happen ever.

