CS5487 Problem Set 7
Linear Dimensionality Reduction
Antoni of Computer Science
Copyright By Assignmentchef assignmentchef
City University of
Dimensionality
Problem 7.1 Shell of a hypersphere
Consider a hypersphere S1 of radius r in Rd, and its shell of thickness . The shell can be defined
as the region between S1 and another hypersphere S2 with radius r . Hence, the volume of the
V (shell) = V (S1) V (S2) =
V (S1), (7.1)
where the volume of a d-dimensional hypersphere of radius r is
(a) Show that the ratio between the volumes of the two sphere is
and hence, for any 0 < < r,= 0. (7.4)This shows that for high dimension V (shell) = V (S1). In other words, all the volume of thehypersphere is in its shell!. . . . . . . . .Principal Component AnalysisProblem 7.2 PCA as minimizing reconstruction errorIn this problem, we will prove that PCA finds a representation that minimizes the reconstructionerror of the data points. Let X = {x1, , xn} be the data points, with xi Rd. Let the newrepresentation (the PCA coefficients) be Z = {z1, , zn}, with zi Rk. Denote = [1, , k] Rdk as a matrix of basis vectors j , and c Rd as the offset. The basis vectors are constrained tobe orthonormal, i.e., Tj j = 1 and j i = 0 for i 6= j, or succinctly T = I. The reconstructionof a data point from its lower-dimensional representation isxi = c+ zi = c+jzij , (7.5)where zij is the jth element of zi.The total reconstruction error is given byJ(Z,, c) =xi xi2 . (7.6)Hence, the goal is to find the optimal coefficients Z, basis matrix , and offset c that minimize J ,{Z,, c} = argminJ(Z,, c) s.t. T = I. (7.7)(a) Note that only Z depends on X, hence we can first solve for the optimal Z as a function of{c,}. Show that the optimal coefficients Z that minimize J areT (xi c). (7.8)(b) Substitute (7.8) into J to obtain a new objective function J(c,),(xi c)T (I T )(xi c). (7.9)(c) Show that the offset c that minimizes J isxi, (7.10)i.e., the sample mean.(d) To find the optimal , first show that the optimization of J can be rewritten as = argminJ(c,) s.t. T = I (7.11)tr(T) s.t. T = I (7.12)Ti i s.t. T = I, (7.13)where = 1i=1(xi c)(xi c)T is the sample covariance matrix.(e) Use Langrange multipliers to show that the stationary point for j of the constrained optimiza-tion problem in (7.13) is an eigenvector of , i.e.j = jj , (7.14)where j is the corresponding eigenvalue. Finally, using this fact and (7.12), show that = argmaxtr() = argmaxi, (7.15)which shows that the optimal are the eigenvectors corresponding to the k largest eigenvalues. . . . . . . . .Problem 7.3 PCA as maximizing varianceIn this problem, we will show that PCA maximizes the variance of the projected data. Let X ={x1, , xn} be the data points, with xi Rd. Define the set of projection directions as =[1, , k] Rdk, where the i are orthonormal, i.e., Tj j = 1 and Tj i = 0 for i 6= j, orsuccinctly T = I. Let x = 1i xi be the sample mean of X.The projected data is Z = {z1, , zn}, with zi Rk, whereT (xi x). (7.16)The goal is to maximize the variance of the projected data Z by selecting the optimal , = argmax(zi z)(zi z)Ts.t. T = I, (7.17)where z = 1i zi is the sample mean of Z.(a) Show that z = 0.(b) Use (a) to show that (7.17) is equivalent to (7.12). Hence, maximizing the variance of theprojected data is equivalent to minimizing the reconstruction error, leading to the PCA formu-. . . . . . . . .Problem 7.4 PCA implementation using SVDIn this problem, we will consider implementing PCA using the singular value decomposition (SVD,see Problem 1.17). Let X = [x1, , xn] be the matrix of data points with sample mean andcovariance,(xi )(xi )T . (7.18)Let X be the matrix of mean-subtracted points,X = [x1 , , xn ], (7.19)and construct the SVD of X,X = USV T . (7.20)(a) Show that X can be written asX = X(I 111T ), (7.21)where 1 is the vector of ones.(b) Show that {ui, s} is an eigenvector/eigenvalue pair of , where ui is the ith column of U .(c) Rewrite the PCA training algorithm to use the SVD.. . . . . . . . .Problem 7.5 PCA and classificationConsider a classification problem with two classes y {1, 2} with class probabilities,p(y = 1) = p(y = 2) = 1/2, (7.22)and Gaussian class conditionals,p(x|y = j) = N (x|j ,j). (7.23)(a) The random variable x is not Gaussian, but we can still compute its mean and covariance.x = E[x] =(1 + 2), (7.24)x = E[(x x)(x x)T ] =(1 + 2) +(1 2)(1 2)T (7.25)(b) Consider the 2-dimensional case, where1 = 2 = =with > 0, and
1 = 2 = =
Use the principal component analysis of x to determine the best 1-dimensional subspace, i.e.
the transformation
z = Tx (7.28)
where is the eigenvector corresponding to the largest eigenvalue of x. For this problem,
is the PCA approach to dimensionality reduction a good one from the classification point of
view? Explain why.
. . . . . . . . .
Linear Discriminant Analysis
Problem 7.6 Fishers linear discriminant
Suppose we have feature vectors from two classes, where the sample mean and scatter matrix is
given by {1, S1} for class 1, and {2, S2} for class 2. The Fishers linear discriminant (FLD) finds
the optimal projection that maximizes the ratio of the between-class scatter and the within-
class scatter,
w = argmax
J(w), J(w) =
where SB and SW are the between- and within-class scatter matrices,
SB = (1 2)(1 2)T , SW = S1 + S2. (7.30)
In this problem, we will derive the optimal w.
(a) Because (7.29) is maximizing a ratio in J(w), it suffices to fix the denominator to some value
and maximize the numerator. Use Lagrange multipliers to rewrite (7.29) as a constrained
optimization problem where wTSWw = 1.
(b) Show that the stationary point of the constrained optimization problem is given by the gener-
alized eigenvalue problem,
SBw = SWw, (7.31)
where is the Lagrange multiplier.
(c) Finally, assuming that SW is invertible, show that the optimal w
is given by
w S1W (1 2). (7.32)
Hint: we are only interested in the direction of the best projection w to maximize J(w); the
length of w does not affect the ratio.
. . . . . . . . .
Problem 7.7 Fishers linear discriminant as least-squares regression
In this problem, we will show that Fishers linear discriminant is equivalent to least-squares regres-
sion for a particular set of target values. Let X = {x1, , xn} be the data points with xi Rd,
and C1 be the set of points in class 1 (with size n1), and likewise for class 2 (C2, n2). Define the
following sample statistics of the data for j {1, 2},
(xi j)(xi j)T . (7.33)
We want to learn a linear projection of the data to separate the two classes,
Txi + b, (7.34)
where zi R is the projected data point, w Rd is the projection vector and b R is the bias
term. We will formulate this as a regression problem. For each data point xi, we define the target
value as the reciprocal of the class probability,
The best projection is then obtained by minimizing the least-squares error between the projection
zi and the target ti,
{w, b} = argmin
E(w, b), (7.36)
(zi ti)2 =
(wTxi + b ti)2. (7.37)
This is a linear regression problem where the target values ti are based on the class of each point.
(a) Show that
i=1 ti = 0.
(b) Show that the optimal b can be expressed as a function of w,
b = wT, = 1
(n11 + n22). (7.38)
(c) Using (b), show that a stationary point for w satistfies
SB)w = n(1 2), (7.39)
where SW and SB are the within- and between-class matrices,
SW = S1 + S2, SB = (2 1)(2 1)T . (7.40)
(d) Finally, show that the direction of the optimal w can be written as,
w S1W (2 1) (7.41)
Hint: note that SBw is always in the direction of 2 1.
. . . . . . . . .
Probabilistic PCA and Factor Analysis
Problem 7.8 Probabilistic PCA: the model
In this problem, we will consider a more probabilistic version of PCA, called probabilistic PCA
(PPCA). Let x Rd be the observation and z Rk the lower-dimensional representation (called
a latent variable). We assume the observation x is generated from latent variable z using a linear
transformation and mean offset, with some Gaussian observation noise,
x = z + + , (7.42)
where Rdk is the linear transformation, Rd the mean vector, and N (0, 2I) the
observation noise. (Note that we do not impose an orthonormal constraint on the columns of ).
Hence, the conditional distribution of the observation x given z is
p(x|z) = N (x|z + , 2I). (7.43)
To complete the probabilistic model, we assume a Gaussian prior distribution on the latent variables,
p(z) = N (z|0, I). (7.44)
Note that this probabilistic PCA is a slightly different formulation than standard PCA, in that we
are directly modeling the generation of the data from the latent space representation. In contrast,
PCA learns the mapping from observations to latent space.
In this problem we will derive the marginal, joint likelihood, and posterior of the PPCA
model. In the next problem, we will consider learning the model using maximum likelihood.
(a) Show that the marginal distribution of x is
p(x|z)p(z)dz = N (x|,T + 2I). (7.45)
Hint: use Problem 1.9.
(b) Looking at p(x), what is the major difference between the models of PPCA and PCA?
(c) Show that the joint likelihood of x and z is
p(x, z) = N (
T + 2I T
Hint: define a = [xT , zT ]T and find its mean and covariance.
(d) Show that the posterior likelihood of z given x is
p(z|x) = N (z|z|x,z|x), (7.47)
T (T + 2I)1(x ), (7.48)
z|x = I T (T + 2I)1. (7.49)
Hint: complete the square.
(e) Use the matrix inverse identities in Problem 1.15 to show that
1T (x ), z|x = 2M1, M = T + 2I. (7.50)
Hence, calculation of p(z|x) requires inverting only a k k matrix, rather than a d d matrix.
. . . . . . . . .
Problem 7.9 Probabilistic PCA: EM learning
Given a set of data points X = {x1, , xn}, the goal is to learn the parameters {, , 2} of the
probabilistic PCA model (see Problem 7.8) that maximize the log-likelihood of the data,
log p(X) =
log p(xi) =
logN (xi|,T + 2I). (7.51)
In this case taking the derivative and solving for the parameters will be quite complex (it is pos-
sible though). Alternatively, because our model has latent variables Z = {z1, , zn} that are
unobserved, we can also find the maximum likelihood parameters using the EM algorithm.
(a) Show that the Q function can be written as
Q(, ) = E
Z|X,[log p(X,Z|)] (7.52)
xi 2
T (xi ) +
where the required expectations for the E-step are
zi = EZ|X,[zi] = M
1T (xi ) (7.54)
Pi = EZ|X,[ziz
2M1 + ziz
i , (7.55)
and the parameters used to calculate these expectations are from the old model .
(b) Show that the M-step is given by
(xi )zTi
xi 2 2zTi T (xi ) + tr(PiT )
(c) Taking 2 0 yields the standard PCA model. Let Z = [z1, , zn] be the matrix of estimated
latent variables, and X = [x1, , xn] the matrix of mean-subtracted data points. Show
that the resulting EM algorithm for 0 is,
PCA E step : Z = (T)1T X (7.58)
PCA M step : = XZT (ZZT )1 (7.59)
Intuitively, what are the E- and M-steps doing in this algorithm?
(d) Analyze the complexity of the EM algorithm for PCA versus using eigen-decomposition for
PCA. Recall that computing: k eigenvalues of a d d matrix is O(kd2); the inverse of a k k
matrix is O(k3); and the outer product of two vectors (Ra and Rb) is O(ab). When will it be
advantageous to use EM over the standard formulation?
. . . . . . . . .
Problem 7.10 Factor Analysis
Factor analysis (FA) is closely related to probabilistic PCA (Problem 7.8). The main difference is
that the observation noise is assumed to have a diagonal covariance matrix, rather than an isotropic
form, i.e N (0|,), where is a diagonal covariance matrix. The conditional distribution of x
p(x|z) = N (x|z + ,). (7.60)
In essence, FA is also capturing correlations in the data using , while modeling the observation
noise of individual features as independent, but with possibly different variance. In FA literature,
the columns of are called factor loadings, and the diagonal elements of are called uniquenesses.
Given a set of samples X = {x1, , xn}, the parameters of the FA model can be learned
using the EM algorithm, similar to probabilistic PCA.
(a) Show that the E-step for FA is to calculate:
G = (I + T1)1, (7.61)
T1(xi ), (7.62)
Pi = G+ ziz
i . (7.63)
(b) Show that the M-step for FA is:
(xi )zTi
(xi )(xi )T
zi(xi )T
where the diag operator sets all non-diagonal elements of a matrix to zero.
. . . . . . . . .
Canonical Correlation Analysis
Problem 7.11 Canonical correlation analysis (CCA)
Canonical correlation analysis (CCA) is a dimensionality-reduction method similar to PCA. While
PCA deals with only one data space, CCA is method for jointly reducing the dimension across two
spaces. Each dimension of the lower-dimensional representation corresponds to correlated directions
in the data spaces (i.e, the two directions describe the same thing).
Formally, let x Rd and y RD be a pair of observation random variables (with different
dimensions), with covariances matrices
xx = E[(x x)(x x)T ] (7.66)
yy = E[(y y)(y y)T ] (7.67)
xy = E[(x x)(y y)T ] = Tyx, (7.68)
where xy is the covariance between the two variables. The covariance matrices can be computed
using sample data pairs {xi, yi}ni=1. Let zx and zy be linear projections of x and y,
x x, zy = w
y y, (7.69)
where wx Rd and wy RD are the projection directions. The goal is to find {wx, wy} that
maximizes the correlation between zx and zy,
{wx, wy} = argmax
wx 6=0,wy 6=0
cov(zx, zy)
var(zx)var(zy)
In this problem, we will derive the solution using the SVD.
(a) Show that the covariances and variances in (7.71) are:
cov(zx, zy) = w
xxywy, (7.72)
var(zx) = w
xxxwx, (7.73)
var(zy) = w
y yywy. (7.74)
Hence, the correlation function is
(b) Perform a change of basis on (7.75),
xx wx, wy =
yy wy, (7.76)
where A1/2 is the matrix square-root (A = A1/2A1/2), and derive an equivalent optimization
problem to (7.70),
{u, v} = argmax
uT1/2xx xy
yy v, s.t. u
Tu = 1, vT v = 1. (7.77)
, wy =
(c) Consider the SVD decomposition
1/2xx xy
T . (7.79)
Show that (7.77) is maximized with u = u1 and v
= v1, where u1 and v1 are the left and
right singular vectors corresponding to the largest singular value (i.e., the first columns of U
. . . . . . . . .
CS: assignmentchef QQ: 1823890830 Email: [email protected]
Reviews
There are no reviews yet.