Wojciech Samek & Grégoire Montavon
About myself
1. Interpretability & Explainability
2. Neural Network Compression
3. Federated Learning
4. Applications of Deep Learning
ML1 Lecture 3: Dimensionality Reduction and Principal Component Analysis
This Lecture
1. Dimensionality reduction
2. Principal Component Analysis
   1. What are principal components?
   2. How to find/calculate them
1. Lagrange Multipliers
3. What can we do with them? / Applications
Curse of dimensionality
In many machine learning problems, the data are high-dimensional.
[Figure: from Limb and Braun, 2008, University of Colorado / IEEE Spectrum]
Standard regression/classification techniques can become
• ill-defined for M >> N (many more feature dimensions M than samples N)
• ill-conditioned/numerically unstable even for M < N
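A minimal numpy sketch (not from the slides, data and sizes are illustrative) of why the matrix a standard regression would have to invert becomes singular or ill-conditioned when the dimension exceeds the number of samples; the last line previews the regularization idea on the following slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                      # fewer samples (N) than dimensions (M)
X = rng.normal(size=(n, d))
G = X.T @ X                         # d x d matrix that least squares would invert

print(np.linalg.matrix_rank(G))     # at most n = 50 < d = 200, so G is singular
print(np.linalg.cond(G))            # numerically huge / effectively infinite

# Preview of the regularization slides below: a small ridge term lam * I
# restores invertibility and a moderate condition number.
lam = 0.1
print(np.linalg.cond(G + lam * np.eye(d)))
```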
Example: the weight vector of a linear classifier such as LDA, $w = \Sigma^{-1}(\mu_1 - \mu_2)$, requires inverting the estimated covariance matrix, which becomes singular and ill-conditioned in high dimensions.

Regularization
Idea: impose constraints on the parameters to stabilize the solution.
One way: introduce a prior probability.
Linear regression setting: $y = X\beta + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, $\hat{\beta} = (X^\top X)^{-1} X^\top y$
Prior: $\beta \sim \mathcal{N}(0, \tau^2 I)$
= favor a "simple" model

Dimensionality reduction
Problem: we still have to invert an M-dimensional matrix → expensive.
Goal: reduce the data to the features most relevant for the learning task.
One way: significance test for single features.
Another way: find 'relevant' directions/subspaces in correlated data.

Why dimensionality reduction? [Figure]

Principal Component Analysis (PCA)
Which line fits the data best? The line $w$ that minimizes the noise and maximizes the signal [Pearson 1901]: the 1st principal component.
[Figure: noise and signal directions]
Reference: Bishop, Pattern Recognition and Machine Learning, Chapter 12.1

Problem definition
Given $x_1, \dots, x_n \in \mathbb{R}^d$, find a $k$-dimensional subspace so that the data projected onto that subspace
1) is as close to the original data as possible (minimum noise),
2) has maximum variance (maximum signal).
Two motivations: same result?
Assume the data are centered: $\mu = \frac{1}{n}\sum_{i=1}^n x_i = 0$.

Minimum distance
We are only interested in the direction of $w$, not its length, i.e. set $\|w\|^2 = w^\top w = 1$.
Reminder: the projection of $a$ onto $b$ is $\frac{a^\top b}{\|b\|}\,\frac{b}{\|b\|}$.
$\min_w \sum_{i=1}^n \|(w^\top x_i)\,w - x_i\|^2 = \min_w \sum_{i=1}^n \left( w^\top x_i x_i^\top w\; w^\top w - 2\, w^\top x_i x_i^\top w + x_i^\top x_i \right) = \max_w \sum_{i=1}^n w^\top x_i x_i^\top w = \max_w w^\top X X^\top w$

Maximum variance
$\max_w \mathrm{Var}(w^\top X) = \mathrm{E}[(w^\top X)^2] = \max_w \frac{1}{n} \sum_{i=1}^n w^\top x_i x_i^\top w = \max_w w^\top X X^\top w$
(the factor $\frac{1}{n}$ does not change the maximizer; same as minimum distance; $X$ is centered)
Introduce the scatter matrix $S = X X^\top$. This leads to the constrained optimization problem:
$\max_w w^\top S w \quad \text{s.t.} \quad \|w\| = 1$

Lagrange multipliers
Imagine you have a constrained optimization problem:
$\max_x f(x) \quad \text{s.t.} \quad g(x) = 0$
(read "s.t." as "subject to"; $f$ is the objective function, $g$ the constraint)
Example: $\max_x\; 1 - x_1^2 - x_2^2 \quad \text{s.t.} \quad x_1 + x_2 - 1 = 0$
Reference: Bishop, Pattern Recognition and Machine Learning, Appendix E

Geometric intuition
$\max_x f(x) \quad \text{s.t.} \quad g(x) = 0$
Intuition 1: $\nabla g$ must be normal to the line $g(x, y) = c$.
Intuition 2: at an optimum, $f(x^*, y^*)$ cannot increase in the direction of a neighboring point where $g(x, y) = c$. Thus, at $(x^*, y^*)$, $\nabla f$ must be perpendicular to the line $g(x, y) = c$.
It follows: at $(x^*, y^*)$ the gradients $\nabla g$ and $\nabla f$ are parallel!
[Figure: contour lines of $f(x, y)$ and the constraint]
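A minimal numerical check (not from the slides) of this geometric intuition on the example problem above: walk along the constraint line, find the maximizer of $f$ there, and verify that $\nabla f$ and $\nabla g$ are parallel at that point. The grid and tolerances are illustrative; the Lagrangian solution of the same example follows on the next slides.

```python
import numpy as np

# f(x) = 1 - x1^2 - x2^2,  g(x) = x1 + x2 - 1
# Parameterize the constraint line as x = (t, 1 - t).
t = np.linspace(-2.0, 3.0, 200001)
f_on_line = 1.0 - t**2 - (1.0 - t)**2
t_star = t[np.argmax(f_on_line)]
x_star = np.array([t_star, 1.0 - t_star])

grad_f = np.array([-2.0 * x_star[0], -2.0 * x_star[1]])   # gradient of f at x*
grad_g = np.array([1.0, 1.0])                              # gradient of g (constant)

print(x_star)                                              # ~ [0.5, 0.5]
# In 2D, parallel vectors have zero "cross product":
print(grad_f[0] * grad_g[1] - grad_f[1] * grad_g[0])       # ~ 0
```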
The Lagrange multiplier
At $(x^*, y^*)$ the gradients $\nabla g$ and $\nabla f$ are parallel. Thus, for a maximum:
$\nabla f = \lambda \nabla g$ for some scalar $\lambda$, the Lagrange multiplier.
Lagrangian function: $L(x, \lambda) = f(x) + \lambda\, g(x)$
$\nabla L = 0$ is a necessary (but not sufficient) condition for the solution of the optimization problem.
Thus, to solve a constrained optimization problem, we can define the Lagrangian and look for the points where its gradient vanishes.

Example
$\max_x\; 1 - x_1^2 - x_2^2 \quad \text{s.t.} \quad x_1 + x_2 - 1 = 0$
1. Define the Lagrangian: $L(x, \lambda) = 1 - x_1^2 - x_2^2 + \lambda (x_1 + x_2 - 1)$
2. Set the gradient of the Lagrangian to zero: $-2x_1 + \lambda = 0$, $\; -2x_2 + \lambda = 0$, $\; x_1 + x_2 - 1 = 0$
3. Solve the equations: $(x_1^*, x_2^*) = (\tfrac{1}{2}, \tfrac{1}{2})$, $\lambda = 1$

PCA optimization problem
$\max_w w^\top S w \quad \text{s.t.} \quad \|w\| = 1$
1. Define the Lagrangian: $L(w, \lambda) = w^\top S w + \lambda (1 - w^\top w)$
2. Compute the gradient: $\frac{\partial L(w, \lambda)}{\partial w} = 2Sw - 2\lambda w$
3. Set it to zero: $Sw = \lambda w$
This is an eigenvalue problem!

Eigenvalue problems
$Sw = \lambda w \;\Leftrightarrow\; (S - \lambda I)\,w = 0$
Solutions are found at the roots of the characteristic polynomial $p(\lambda) := \det(S - \lambda I) = 0$.
In general: $d$ roots (eigenvalues) $\lambda_1, \dots, \lambda_d$ with corresponding eigenvectors $w_1, \dots, w_d$.
This leads to the decomposition $W^\top S W = \Lambda \;\Leftrightarrow\; S = W \Lambda W^\top$, where $W = [w_1, \dots, w_d]$ is orthogonal ($W^\top W = I$) and $\Lambda$ is diagonal with $\Lambda_{ii} = \lambda_i$.

PCA algorithm
The $i$-th eigenvalue is the variance in the direction of the $i$-th eigenvector:
$\mathrm{Var}(w_i^\top x) = \frac{1}{n}\, w_i^\top S w_i = \frac{1}{n}\, \lambda_i\, w_i^\top w_i = \frac{1}{n}\, \lambda_i$
The direction of largest variance corresponds to the largest eigenvector (= the eigenvector with the largest eigenvalue).

Solving PCA with SVD
A singular value decomposition factorizes a matrix as $M = U \Sigma V^\top$, where
• $U$ contains the eigenvectors of $M M^\top$,
• $V$ contains the eigenvectors of $M^\top M$,
• the square roots of the eigenvalues of $M M^\top$ are on the diagonal of $\Sigma$.
Instead of calculating the eigenvalue decomposition of $S$, we can compute the SVD of $X$. This is computationally much more stable (e.g. for the Läuchli matrix).
SVD in numpy!

Power iteration
The SVD has computational complexity $O(\min(n^2 d,\, d^2 n))$.
But: we often only need the few largest principal components, not all of them.
Solution: power iteration, i.e. repeatedly multiply a random start vector by $A$ and normalize. The solution is the first eigenvector of $A$.

Power iteration intuition
Assume a diagonalizable matrix $A = U \Lambda U^\top$.
After step $k$, the power iteration has computed $A^k x$, where $x$ is a random start vector.
We have $A^k = U \Lambda^k U^\top$ because $U^\top U = I$.
Transform $x$ into the new coordinate system: $\tilde{x} = U^\top x$. Then
$A^k x = U \Lambda^k U^\top x = U \Lambda^k \tilde{x} = \sum_{i=1}^d \left( \lambda_i^k \tilde{x}_i \right) u_i = \lambda_1^k \sum_{i=1}^d \left( \left(\tfrac{\lambda_i}{\lambda_1}\right)^k \tilde{x}_i \right) u_i$
Hence $\frac{A^k x}{\|A^k x\|} \to u_1$ as $k \to \infty$, because $\tfrac{\lambda_i}{\lambda_1} < 1$ for $i \ge 2$ when $\lambda_1 > \lambda_2 \ge \dots \ge \lambda_d \ge 0$.

Deflation
To find the second eigenvector, remember $S = W \Lambda W^\top = \sum_{i=1}^d \lambda_i w_i w_i^\top$.
Thus, do power iteration on $\hat{S} = S - \lambda_1 w_1 w_1^\top$.
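Going back to "Solving PCA with SVD" above, here is a minimal numpy sketch (not from the slides; data and names are illustrative) comparing the two routes: eigendecomposition of the scatter matrix versus SVD of the centered data. Note the convention assumed here: rows of X are samples, so the scatter matrix is $X_c^\top X_c$ (the slides stack samples as columns).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])
Xc = X - X.mean(axis=0)                 # center the data (mu = 0, as assumed above)

# Route 1: eigenvalue decomposition of the scatter matrix
S = Xc.T @ Xc
eigvals, eigvecs = np.linalg.eigh(S)    # returned in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix (numerically more stable)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(eigvals, s**2))                 # eigenvalues = squared singular values
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T))) # same directions, up to sign
```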
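And a minimal sketch (not from the slides) of power iteration plus one deflation step, checked against the full eigendecomposition. The test matrix is constructed with a known, well-separated spectrum so convergence is unambiguous; names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))                 # random orthogonal basis
S = Q @ np.diag([9.0, 5.0, 3.0, 1.0, 0.5, 0.1]) @ Q.T        # symmetric, known spectrum

def power_iteration(M, num_steps=1000, seed=0):
    """Approximate the top eigenvector/eigenvalue of a symmetric matrix M."""
    w = np.random.default_rng(seed).normal(size=M.shape[0])
    for _ in range(num_steps):
        w = M @ w
        w /= np.linalg.norm(w)               # keep ||w|| = 1
    return w, float(w @ M @ w)               # Rayleigh quotient ~ largest eigenvalue

w1, lam1 = power_iteration(S)
S_hat = S - lam1 * np.outer(w1, w1)          # deflation: remove the first component
w2, lam2 = power_iteration(S_hat)

eigvals, eigvecs = np.linalg.eigh(S)         # reference solution (ascending order)
print(np.allclose(lam1, eigvals[-1]))                     # largest eigenvalue
print(np.allclose(np.abs(w1), np.abs(eigvecs[:, -1])))    # first eigenvector, up to sign
print(np.allclose(lam2, eigvals[-2]))                     # second-largest eigenvalue
```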
Application: dimensionality reduction
We can reduce the dimensionality of our dataset by projecting onto the first $k$ principal components.
Projection onto the $k$ principal components: $(w_1^\top x, \dots, w_k^\top x)^\top = [w_1, \dots, w_k]^\top x$
How much signal are we going to lose?
$\sum_{i=k+1}^d \mathrm{Var}(w_i^\top x) = \frac{1}{n} \sum_{i=k+1}^d \lambda_i$
[Figure: signal and noise directions; retained variance vs. number of components]

Application: Eigenfaces
Idea: faces look very similar compared to other random images. How many principal components would we need to accurately describe all faces?
A 64×64 pixel image of a face can be represented as a 4096-dimensional vector in which each entry holds the pixel's grayscale value.
Reference: Turk, Matthew A. and Pentland, Alex P. Face recognition using eigenfaces. Computer Vision and Pattern Recognition, 1991.

Eigenfaces
The principal components are directions in our face space. Thus, each principal component is a face representation, too.
[Figures: principal components 1, 2, 3, and 350; variance captured by the first 50 directions]

Application: denoising
We can use PCA to denoise data:
Step 1: Reduce the dimensionality to filter out "noise" directions: $(w_1^\top x, \dots, w_k^\top x)^\top$
Step 2: Project back into the original space: $\sum_{i=1}^k (w_i^\top x)\, w_i$
Step 3: Undo the centering: $\sum_{i=1}^k (w_i^\top x)\, w_i + \frac{1}{n}\sum_{i=1}^n x_i$
[Figure: Mann et al., 2013]

Application: EEG artifacts
In electroencephalographic (EEG) recordings, eye blink artifacts can be stronger than the neuronal activity [BCI2000].
→ reasonable to remove the first principal components

How many principal components?
Heuristics, e.g. inspect the eigenvalue spectrum.

How robust is PCA to outliers?
Not very robust.
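A quick numerical illustration (not from the slides) of this last point: a single extreme outlier can flip the first principal component away from the true signal direction. Data and magnitudes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([rng.normal(size=n),          # true signal along the first axis
                     0.1 * rng.normal(size=n)])   # little variance along the second

def first_pc(data):
    centered = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[0]

print(first_pc(X))                                # ~ (+-1, 0): along the signal direction
X_out = np.vstack([X, [[0.0, 50.0]]])             # add one extreme off-axis point
print(first_pc(X_out))                            # ~ (0, +-1): dragged toward the outlier
```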
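Finally, looking back at the dimensionality-reduction and denoising slides above, a minimal sketch (not from the slides; data, sizes, and names are illustrative) of the three steps: project onto the first $k$ principal components, map back into the original space, and re-add the mean.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 300, 10, 2
latent = rng.normal(size=(n, k))                        # low-dimensional signal
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.1 * rng.normal(size=(n, d))     # embed in d dims + noise

mu = X.mean(axis=0)
Xc = X - mu                                             # centering
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:k].T                                            # d x k: first k principal directions

Z = Xc @ W                                              # Step 1: (w_1^T x, ..., w_k^T x) per sample
X_denoised = Z @ W.T + mu                               # Steps 2 + 3: back-project, undo centering

eigvals = s**2                                          # eigenvalues of the scatter matrix
print(eigvals[:k].sum() / eigvals.sum())                # fraction of variance kept (close to 1 here)
```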