Wojciech Samek & Grégoire Montavon
About myself
1. Interpretability & Explainability
2. Neural Network Compression
3. Federated Learning
4. Applications of Deep Learning
ML1 Lecture 3: Dimensionality Reduction and Principal Component Analysis
This Lecture
1. Dimensionality reduction
2. Principal Component Analysis
   1. What are principal components?
   2. How to find/calculate them
      1. Lagrange multipliers
   3. What can we do with them? / Applications
Curse of dimensionality
In many machine learning problems, the data are high-dimensional.
[Figure: from Limb and Braun, 2008, University of Colorado / IEEE Spectrum]
Standard regression/classification techniques can become
- ill-defined for M >> N
- ill-conditioned/numerically unstable even for M < N
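To make this concrete, here is a minimal numpy sketch (synthetic Gaussian data, not the data from the figure; all names are illustrative) that tracks the rank and condition number of the matrix X^T X that least squares has to invert, as the number of features M grows relative to the number of samples N:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50                        # number of samples N
for m in (10, 50, 200):       # number of features M
    X = rng.standard_normal((n, m))
    gram = X.T @ X            # the M x M matrix that least squares has to invert
    # rank(X^T X) <= min(N, M): for M > N the matrix is singular, and even
    # for M close to N it becomes badly conditioned
    print(f"M={m:4d}  rank={np.linalg.matrix_rank(gram)}  "
          f"cond={np.linalg.cond(gram):.2e}")
```

For M > N the matrix is rank-deficient and cannot be inverted; already for M close to N its condition number explodes, which is one motivation for the regularization discussed below.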
For example, a linear classifier weight vector w = Σ^{-1}(μ_1 - μ_2) requires inverting the estimated covariance Σ, which becomes singular and ill-conditioned in high dimensions.

Regularization
Idea: impose constraints on the parameters to stabilize the solution. One way: introduce a prior probability.
Linear regression setting: y = Xβ + ε with ε ~ N(0, σ^2 I), and β̂ = (X^T X)^{-1} X^T y.
Prior: β ~ N(0, τ^2 I)  ⇒  favor a simple model.

Dimensionality reduction
Problem: we still have to invert an M-dimensional matrix, which is expensive.
Goal: reduce the data to the features most relevant for the learning task.
One way: a significance test for single features.
Another way: find relevant directions/subspaces in correlated data.

Why dimensionality reduction?

Principal Component Analysis (PCA)
Which line fits the data best? The line w that minimizes the noise and maximizes the signal [Pearson 1901]; this line is the first principal component.
Reference: Bishop, Pattern Recognition and Machine Learning, Chapter 12.1.

Problem Definition
Given x_1, ..., x_n ∈ R^d, find a k-dimensional subspace so that the data projected on that subspace
1) is as close to the original data as possible (minimum noise),
2) has maximum variance (maximum signal).
Two motivations: same result? Assume the data is centered: μ = (1/n) Σ_{i=1}^n x_i = 0.

Minimum distance
We are only interested in the direction of w, not its length, i.e. we set ||w|| = w^T w = 1.
(Reminder: the projection of a onto b is (a^T b / ||b||) · b/||b||.)
min_w Σ_{i=1}^n ||(w^T x_i) w - x_i||^2
= min_w Σ_{i=1}^n (w^T x_i x_i^T w · w^T w - 2 w^T x_i x_i^T w + x_i^T x_i)
= max_w Σ_{i=1}^n w^T x_i x_i^T w
= max_w w^T X X^T w

Maximum variance
max_w Var(w^T X) = E[(w^T X)^2]   (X is centered)
= max_w (1/n) Σ_{i=1}^n w^T x_i x_i^T w = max_w (1/n) w^T X X^T w
(the same as minimum distance)
Introduce the scatter matrix S = X X^T. This leads to the constrained optimization problem:
max_w w^T S w   s.t. ||w|| = 1
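As a quick sanity check that both motivations lead to the same objective, the following small numpy sketch (synthetic 2-D data and an arbitrary unit vector w; everything here is illustrative, not lecture code) verifies that the squared reconstruction error and the projected signal w^T S w add up to a constant that does not depend on w, so minimizing one is the same as maximizing the other:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy centered data: columns of X are the points x_i (d x n), as in the slides
X = rng.standard_normal((2, 500)) * np.array([[3.0], [0.5]])
X = X - X.mean(axis=1, keepdims=True)

w = rng.standard_normal(2)
w /= np.linalg.norm(w)                              # ||w|| = 1

proj = w @ X                                        # w^T x_i for all i
residual = np.sum((X - np.outer(w, proj)) ** 2)     # sum_i ||(w^T x_i) w - x_i||^2
signal = np.sum(proj ** 2)                          # sum_i (w^T x_i)^2 = w^T S w
total = np.sum(X ** 2)                              # sum_i ||x_i||^2, independent of w

# residual + signal == total, so minimizing the residual (noise)
# is the same as maximizing w^T S w (signal / variance)
print(np.allclose(residual + signal, total))
```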
Lagrange multipliers
Imagine you have a constrained optimization problem:
max_x f(x)        (objective function)
s.t. g(x) = 0     (constraint; read "s.t." as "subject to")
Example: max_x 1 - x_1^2 - x_2^2   s.t. x_1 + x_2 - 1 = 0.
Reference: Bishop, Pattern Recognition and Machine Learning, Appendix E.

Geometric intuition
max_x f(x)   s.t. g(x) = 0
Intuition 1: ∇g must be normal to the line g(x, y) = c.
Intuition 2: at an optimum, f(x*, y*) cannot increase in the direction of a neighboring point on g(x, y) = c. Thus, at (x*, y*), ∇f must be perpendicular to the line g(x, y) = c.
It follows that at (x*, y*) the gradients ∇g and ∇f are parallel!
[Figure: contour lines of f(x, y) and the constraint curve]

The Lagrange multiplier
At (x*, y*) the gradients ∇g and ∇f are parallel; thus, for a maximum, ∇f + λ∇g = 0 for some λ (the Lagrange multiplier).
Lagrangian function: L(x, λ) = f(x) + λ g(x).
∇L = 0 is a necessary (but not sufficient) condition for the solution of the optimization problem.
Thus, to solve a constrained optimization problem, we can define the Lagrangian and look for the points where its gradient vanishes.

Example: max_x 1 - x_1^2 - x_2^2   s.t. x_1 + x_2 - 1 = 0
1. Define the Lagrangian: L(x, λ) = 1 - x_1^2 - x_2^2 + λ(x_1 + x_2 - 1)
2. Set the gradient of the Lagrangian to zero:
   -2x_1 + λ = 0
   -2x_2 + λ = 0
   x_1 + x_2 - 1 = 0
3. Solve the equations: (x_1*, x_2*) = (1/2, 1/2), λ = 1.

PCA optimization problem
max_w w^T S w   s.t. ||w|| = 1
1. Define the Lagrangian: L(w, λ) = w^T S w + λ(1 - w^T w)
2. Compute the gradient: ∂L(w, λ)/∂w = 2Sw - 2λw
3. Set it to zero: Sw = λw
This is an eigenvalue problem!

Eigenvalue problems
Sw = λw   ⇔   (S - λI)w = 0
Solutions are found at the roots of the characteristic polynomial p(λ) := det(S - λI) = 0.
In general there are d roots (eigenvalues) λ_1, ..., λ_d with corresponding eigenvectors w_1, ..., w_d.
This leads to the decomposition W^T S W = Λ, i.e. S = W Λ W^T, where W = [w_1, ..., w_d] with W^T W = I is orthogonal and Λ is diagonal with Λ_ii = λ_i.

PCA algorithm
The i-th eigenvalue gives the variance in the direction of the i-th eigenvector (up to the factor 1/n):
Var(w_i^T x) = (1/n) w_i^T S w_i = (1/n) λ_i w_i^T w_i = (1/n) λ_i
The direction of largest variance corresponds to the largest eigenvector (= the eigenvector with the largest eigenvalue).

Solving PCA with SVD
A singular value decomposition factorizes a matrix as M = U Σ V^T, where
- U contains the eigenvectors of M M^T,
- V contains the eigenvectors of M^T M,
- the square roots of the eigenvalues of M M^T are on the diagonal of Σ.
Instead of calculating the eigenvalue decomposition of S, we can compute the SVD of X. This is computationally much more stable (e.g., for the Läuchli matrix). SVD in numpy!

Power iteration
The SVD has computational complexity O(min(n^2 d, d^2 n)). But we often only need a few of the largest principal components, not all of them. One option is the power iteration, which repeatedly applies A to a start vector and normalizes; it converges to the first eigenvector of A, which is the solution we need.

Power iteration intuition
Assume a diagonalizable matrix A = U Λ U^T. After step k, the power iteration has computed A^k x, where x is a random start vector.
We have A^k = U Λ^k U^T because U^T U = I. Transform x into the new coordinate system: x̃ = U^T x. Then
A^k x = U Λ^k U^T x = Σ_{i=1}^d λ_i^k x̃_i u_i = λ_1^k Σ_{i=1}^d (λ_i / λ_1)^k x̃_i u_i
Hence A^k x / ||A^k x|| → u_1 for k → ∞, because λ_i / λ_1 ≤ 1 when λ_1 ≥ λ_2 ≥ ... ≥ λ_d ≥ 0.

Deflation
To find the second eigenvector, remember that S = W Λ W^T = Σ_{i=1}^d λ_i w_i w_i^T. Thus, do the power iteration on S̃ = S - λ_1 w_1 w_1^T.
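A minimal numpy sketch of the power iteration with deflation described above, compared against numpy's symmetric eigensolver; the data, the helper name power_iteration, and its parameters are assumptions made for illustration:

```python
import numpy as np

def power_iteration(A, num_iter=500, seed=0):
    """Approximate the largest eigenvalue/eigenvector of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(A.shape[0])
    for _ in range(num_iter):
        w = A @ w                     # apply A ...
        w /= np.linalg.norm(w)        # ... and renormalize, so w approaches u_1
    return w @ A @ w, w               # Rayleigh quotient ~ lambda_1, eigenvector

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 200))
X -= X.mean(axis=1, keepdims=True)    # center the data
S = X @ X.T                           # scatter matrix

lam1, w1 = power_iteration(S)
# deflation: subtract the first component, then the second eigenvector dominates
lam2, w2 = power_iteration(S - lam1 * np.outer(w1, w1))

print("power iteration:", lam1, lam2)
print("numpy eigvalsh :", np.linalg.eigvalsh(S)[::-1][:2])   # two largest, for comparison
```

Each iteration only costs a matrix-vector product, which is why this is attractive when just a few leading components are needed.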
Application: Dimensionality reduction
We can reduce the dimensionality of our dataset by projecting it onto the first k principal components:
(w_1^T x, ..., w_k^T x)^T = [w_1, ..., w_k]^T x
How much signal are we going to lose?
Σ_{i=k+1}^d Var(w_i^T x) = (1/n) Σ_{i=k+1}^d λ_i
[Figure: signal/noise directions and retained variance as a function of the number of components]

Application: Eigenfaces
Idea: faces look very similar in comparison to other random images. How many principal components would we need to accurately describe all faces?
A 64×64 pixel image of a face can be represented as a 4096-dimensional vector in which each entry holds the pixel's grayscale value.
Reference: Turk, Matthew A. and Pentland, Alex P. Face recognition using eigenfaces. Computer Vision and Pattern Recognition, 1991.

Eigenfaces
The principal components are directions in our face space. Thus, each principal component is a face representation, too.
[Figures: principal components 1, 2, 3, and 350; variance in the first 50 directions; example eigenfaces]

Application: Denoising
We can use PCA to denoise data (see the numpy sketch at the end of this section):
Step 1: reduce the dimensionality to filter out noise directions: (w_1^T x, ..., w_k^T x)^T
Step 2: project back into the original space: Σ_{i=1}^k (w_i^T x) w_i
Step 3: undo the centering: Σ_{i=1}^k (w_i^T x) w_i + (1/n) Σ_{i=1}^n x_i
[Figure: denoising example from Mann et al., 2013]

Application: EEG artifacts
In electroencephalographic (EEG) recordings, eye blink artifacts can be stronger than the neuronal activity (data: BCI2000). Here it is reasonable to remove the first principal components.

How many principal components?
Heuristics, e.g. based on how much of the total variance the first k components retain.

How robust is PCA to outliers?
Not very robust!
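As referenced in the denoising application above, here is a compact numpy sketch of the three denoising steps on synthetic low-rank-plus-noise data (the data generation and all variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic data: columns of X are samples, mostly living in a k-dimensional
# subspace plus isotropic noise and a constant offset (made up for illustration)
d, n, k = 20, 300, 3
B, _ = np.linalg.qr(rng.standard_normal((d, k)))          # random k-dim basis
X = B @ (5.0 * rng.standard_normal((k, n))) \
    + 0.3 * rng.standard_normal((d, n)) + 1.0             # signal + noise + offset

mean = X.mean(axis=1, keepdims=True)
Xc = X - mean                                             # center the data

U, s, _ = np.linalg.svd(Xc, full_matrices=False)          # PCs = left singular vectors
W = U[:, :k]                                              # first k principal components

Z = W.T @ Xc                          # step 1: (w_1^T x, ..., w_k^T x) for every sample
X_denoised = W @ Z + mean             # step 2: project back, step 3: undo the centering

print("relative change after denoising:",
      np.linalg.norm(X - X_denoised) / np.linalg.norm(X))
```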