CS373 Data Mining and Machine Learning
Lecture 12
Jean Honorio
Purdue University
Except for the first two and last two slides, the presentation was taken from
http://www.cs.cmu.edu/~bapoczos/Classes/ML10715_2015Fall/
Linear Algebra: Eigen Decomposition

Eigen decomposition of a symmetric matrix Σ: Σ = U Λ U^T, where U is an orthonormal matrix whose columns are the eigenvectors, and Λ is the diagonal matrix of eigenvalues.

Python:
import numpy as np
import numpy.linalg as la

# Symmetric matrix Sigma
Sigma = np.array([[ 5.2, 3.3, -2 ],
                  [ 3.3, 8.1,  4 ],
                  [ -2,  4,   6.5]])

# Eigen decomposition: eigenvalues lam, eigenvectors as columns of U
lam, U = la.eig(Sigma)
>>> lam
array([ 0.38,  7.7 , 11.71])
>>> U
array([[-0.61,  0.75,  0.25],
       [ 0.55,  0.18,  0.81],
       [-0.56, -0.64,  0.53]])
# U is orthonormal: U^T U = I
>>> np.dot(U.T, U)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
# Reconstruct Sigma = U Diag(lam) U^T
>>> np.dot(U, np.dot(np.diag(lam), U.T))
array([[ 5.2,  3.3, -2. ],
       [ 3.3,  8.1,  4. ],
       [-2. ,  4. ,  6.5]])
Σ = U Λ U^T,   U^T U = I
Linear Algebra: Singular Value Decomposition

Singular value decomposition of a matrix X: X = U S V^T, where U and V are orthonormal matrices and S is the diagonal matrix of singular values.

Python:
import numpy as np
import numpy.linalg as la

# Data matrix X (4 samples, 2 features)
X = np.array([[ 0.6, -0.7],
              [ 2.5,  1.9],
              [-1.6, -0.9],
              [-2.8,  0.8]])

# Thin SVD: X = U Diag(s) Vt
U, s, Vt = la.svd(X, False)
>>> U
array([[-0.09,  0.39],
       [-0.69, -0.54],
       [ 0.42,  0.2 ],
       [ 0.58, -0.72]])
>>> s
array([ 4.24,  2.13])
>>> Vt
array([[-0.96, -0.27],
       [ 0.27, -0.96]])
# V is orthonormal: V^T V = I
>>> np.dot(Vt, Vt.T)
array([[ 1.,  0.],
       [ 0.,  1.]])
# Reconstruct X = U Diag(s) Vt
>>> np.dot(U, np.dot(np.diag(s), Vt))
array([[ 0.6, -0.7],
       [ 2.5,  1.9],
       [-1.6, -0.9],
       [-2.8,  0.8]])
X = U S V^T,   U^T U = I,   V^T V = I
Motivation

PCA Applications:
Data Visualization
Data Compression
Noise Reduction
Data Visualization

Example: Given 53 blood and urine samples (features) from 65 people.
How can we visualize the measurements?
Matrix format (65 × 53); rows: instances, columns: features

     H-WBC   H-RBC   H-Hgb    H-Hct    H-MCV    H-MCH   H-MCHC
A1   8.0000  4.8200  14.1000  41.0000  85.0000  29.0000 34.0000
A2   7.3000  5.0200  14.7000  43.0000  86.0000  29.0000 34.0000
A3   4.3000  4.4800  14.1000  41.0000  91.0000  32.0000 35.0000
A4   7.5000  4.4700  14.9000  45.0000 101.0000  33.0000 33.0000
A5   7.3000  5.5200  15.4000  46.0000  84.0000  28.0000 33.0000
A6   6.9000  4.8600  16.0000  47.0000  97.0000  33.0000 34.0000
A7   7.8000  4.6800  14.7000  43.0000  92.0000  31.0000 34.0000
A8   8.6000  4.8200  15.8000  42.0000  88.0000  33.0000 37.0000
A9   5.1000  4.7100  14.0000  43.0000  92.0000  30.0000 32.0000
Difficult to see the correlations between the features
Data Visualization
Spectral format (65 curves, one for each person)

[Figure: one curve per person; x-axis: measurement, y-axis: value]
Difficult to compare the different patients
Data Visualization
Data Visualization

[Figures: bi-variate plot of C-LDH vs. C-Triglycerides; tri-variate plot of M-EPI vs. C-Triglycerides and C-LDH]
How can we visualize the other variables?
It is difficult to see in 4- or higher-dimensional spaces.
Data Visualization

Is there a representation better than the coordinate axes?
Is it really necessary to show all the 53 dimensions?
What if there are strong correlations between the features?
How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?
A solution: Principal Component Analysis
PCA Algorithms
PCA: Principal Component Analysis

Orthogonal projection of the data onto a lower-dimensional linear space that
maximizes the variance of the projected data (purple line)
minimizes the mean squared distance between each data point and its projection (sum of blue lines)
Principal Component Analysis

Idea: given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible.
Find the best planar approximation of 3-D data.
Find the best 12-D approximation of 104-D data.
In particular, choose the projection that minimizes the squared error in reconstructing the original data.
Principal Component Analysis

Properties:
PCA vectors originate from the center of mass.
Principal component #1 points in the direction of the largest variance.
Each subsequent principal component is orthogonal to the previous ones, and points in the direction of the largest variance of the residual subspace.
2D Gaussian dataset
1st PCA axis
2nd PCA axis
PCA algorithm I (sequential)

Given the centered data {x_1, ..., x_m}, compute the principal vectors:

1st PCA vector: to find w_1, maximize the variance of the projection of x,
w_1 = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} (w^T x_i)^2

2nd PCA vector: to find w_2, maximize the variance of the projection in the residual subspace,
w_2 = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \left[ w^T (x_i - w_1 w_1^T x_i) \right]^2

[Figure: a point x, its PCA reconstruction x̂ = w_1 (w_1^T x) along w_1, and the residual x - x̂ along w_2]
PCA algorithm I (sequential)

Given w_1, ..., w_{k-1}, we calculate the k-th principal vector as before, by maximizing the variance of the projection in the residual subspace:

k-th PCA vector:
w_k = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \left[ w^T \left( x_i - \sum_{j=1}^{k-1} w_j w_j^T x_i \right) \right]^2

[Figure: a point x, the directions w_1 and w_2, and the PCA reconstruction x̂ = w_1 (w_1^T x) + w_2 (w_2^T x)]
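A minimal NumPy sketch of this sequential procedure, intended only as an illustration: it uses power iteration to solve each maximization and deflation to form the residual subspace (the helper names first_pc and sequential_pca, and the choice of power iteration, are assumptions, not prescribed by the slides).

Python:
import numpy as np

def first_pc(R, n_iter=200):
    # Leading principal vector of the residual data R (features x samples),
    # found by power iteration on R R^T (illustrative choice of solver)
    w = np.random.default_rng(0).normal(size=R.shape[0])
    for _ in range(n_iter):
        w = R @ (R.T @ w)        # multiply by R R^T
        w /= np.linalg.norm(w)   # keep ||w|| = 1
    return w

def sequential_pca(X, k):
    # X: centered data, one column per point; returns w_1, ..., w_k as columns
    R = X.copy()
    W = []
    for _ in range(k):
        w = first_pc(R)
        W.append(w)
        # Deflate: remove the projection onto w, leaving the residual subspace
        R = R - np.outer(w, w) @ R
    return np.column_stack(W)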
PCA algorithm II (sample covariance matrix)

Given data {x_1, ..., x_m}, compute the covariance matrix Σ:
\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^T, where \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i

PCA basis vectors = the eigenvectors of Σ
Larger eigenvalue → more important eigenvector
PCA algorithm II (sample covariance matrix)

PCA algorithm(X, k): top k eigenvalues/eigenvectors
% X = N × m data matrix,
% each data point x_i = column vector, i = 1..m
x̄ ← (1/m) Σ_{i=1..m} x_i
X ← subtract mean x̄ from each column vector x_i in X
Σ ← X X^T   % covariance matrix of X
{ λ_i, u_i }_{i=1..N} ← eigenvectors/eigenvalues of Σ, with λ_1 ≥ λ_2 ≥ ... ≥ λ_N
Return { λ_i, u_i }_{i=1..k}   % top k PCA components
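A minimal NumPy sketch of this covariance-based algorithm; the function name pca_cov and the random test data are illustrative, and the 1/m normalization from the previous slide is included.

Python:
import numpy as np

def pca_cov(X, k):
    # X: N x m data matrix, one data point per column
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar                        # subtract the mean from each column
    Sigma = (Xc @ Xc.T) / X.shape[1]      # sample covariance matrix (N x N)
    lam, U = np.linalg.eigh(Sigma)        # eigh: Sigma is symmetric
    order = np.argsort(lam)[::-1][:k]     # sort eigenvalues in decreasing order
    return lam[order], U[:, order]        # top k eigenvalues/eigenvectors

# Example: project 3-D points onto the top-2 principal subspace
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))                        # 3 features, 100 points
lam, U = pca_cov(X, k=2)
Z = U.T @ (X - X.mean(axis=1, keepdims=True))        # 2 x 100 projections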
PCA algorithm III (SVD of the data matrix)

Singular value decomposition of the centered data matrix X:
X_{features × samples} = U S V^T

[Figure: block view of X = U S V^T, separating the significant (leading) singular components from the noise (trailing) components]
PCA algorithm III

Columns of U:
the principal vectors, { u^(1), ..., u^(k) }
orthogonal and of unit norm, so U^T U = I
can reconstruct the data using linear combinations of { u^(1), ..., u^(k) }

Matrix S:
diagonal
shows the importance of each eigenvector

Columns of V^T:
the coefficients for reconstructing the samples
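A minimal NumPy sketch of the SVD route; the function name pca_svd is illustrative, and X is assumed to be features × samples as on the slide.

Python:
import numpy as np

def pca_svd(X, k):
    # PCA via SVD of the centered data matrix X (features x samples)
    Xc = X - X.mean(axis=1, keepdims=True)            # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Columns of U: principal vectors; s: importance; rows of Vt: coefficients
    return U[:, :k], s[:k], Vt[:k, :]

# Rank-k reconstruction of the (centered) data from the top-k components
X = np.random.default_rng(1).normal(size=(5, 200))
U, s, Vt = pca_svd(X, k=2)
X_hat = U @ np.diag(s) @ Vt                           # approximates centered X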
Applications

Face Recognition

Want to identify a specific person, based on a facial image
Robust to glasses, lighting, ...
Can't just use the given 256 × 256 pixels
Applying PCA: Eigenfaces

Example data set: images of faces
Eigenface approach [Turk & Pentland], [Sirovich & Kirby]
Each face x is 256 × 256 values (luminance at each location)
x in R^(256×256) (view as a 64K-dimensional vector)
Form the centered data matrix X = [ x_1, ..., x_m ]
Compute Σ = X X^T
Problem: Σ is 64K × 64K, which is HUGE!
[Figure: the data matrix X = [ x_1, ..., x_m ], one column per face, each face 256 × 256 real values (m faces)]
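A rough sketch of how the eigenfaces could be computed in practice, following the SVD-based algorithm III so the 64K × 64K covariance matrix is never formed explicitly; the file faces.npy and the choice of 25 eigenfaces are assumptions for illustration.

Python:
import numpy as np

# faces: assumed m x 256 x 256 array of luminance values (hypothetical file)
faces = np.load("faces.npy")
m = faces.shape[0]

# Centered data matrix X (64K x m), one face per column
X = faces.reshape(m, -1).T.astype(float)
X -= X.mean(axis=1, keepdims=True)

# Thin SVD: columns of U are the eigenfaces (eigenvectors of X X^T),
# and U has only m columns instead of 64K
U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigenfaces = U[:, :25].T.reshape(-1, 256, 256)   # top 25 eigenfaces as images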
Happiness subspace
Disgust subspace
Facial Expression Recognition Movies
Image Compression
Original Image

Divide the original 372 × 492 image into patches:
Each patch is an instance that contains 12 × 12 pixels on a grid
Consider each patch as a 144-D vector
L2 error and PCA dim

PCA compression: 144-D → 60-D

60 most important eigenvectors
Looks like the discrete cosine bases of JPEG!
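A rough sketch of this patch-based compression, assuming the image is available as a 372 × 492 grayscale NumPy array; the file name image.npy and the variable names are illustrative.

Python:
import numpy as np

img = np.load("image.npy")          # hypothetical 372 x 492 grayscale image
H, W, P = 372, 492, 12

# Cut the image into non-overlapping 12 x 12 patches, one 144-D vector each
patches = (img[:H - H % P, :W - W % P]
           .reshape(H // P, P, W // P, P)
           .swapaxes(1, 2)
           .reshape(-1, P * P))      # (number of patches, 144)

# PCA on the patches, keeping the top 60 components
mean = patches.mean(axis=0)
U, s, Vt = np.linalg.svd(patches - mean, full_matrices=False)
V60 = Vt[:60]                        # 60 x 144 principal directions

# Compress (144-D -> 60-D), reconstruct, and measure the L2 error
codes = (patches - mean) @ V60.T     # (number of patches, 60)
recon = codes @ V60 + mean
l2_error = np.linalg.norm(patches - recon)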
PCA Shortcomings
PCA doesn't know about class labels!
Problematic Data Set for PCA
PCA vs. Fisher Linear Discriminant

PCA maximizes variance, independent of class (magenta line)
FLD attempts to separate the classes (green line)

PCA versus linear classifiers

A linear classifier attempts to separate the classes
Feature Selection and Cross-Validation

Incorrect way: DO NOT do feature selection (or dimensionality reduction) on the whole dataset and then cross-validation.
Feature selection and dimensionality reduction on the whole dataset destroy cross-validation:
the reduced training set would depend on the validation set;
thus, training is looking at the supposedly unseen data!
Feature Selection and Cross-Validation

Correct way: feature selection (or dimensionality reduction) inside cross-validation, only applied to the training set.
[Diagram: the training set and the validation set are reduced by feature selection / dimensionality reduction into a reduced training set and a reduced validation set]
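A minimal sketch of the correct protocol, fitting the dimensionality reduction on each training fold only; the helpers pca_fit and pca_transform, the placeholder data, and the fold count are illustrative assumptions.

Python:
import numpy as np

def pca_fit(X_train, k):
    # Fit PCA on the training fold only: mean and top-k directions
    mean = X_train.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:k]

def pca_transform(X, mean, components):
    return (X - mean) @ components.T

# Placeholder data: 100 samples, 53 features, binary labels
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 53)), rng.integers(0, 2, size=100)

n_folds = 5
folds = np.array_split(rng.permutation(len(X)), n_folds)
for i in range(n_folds):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
    # Correct: PCA is fit on the training fold only...
    mean, comps = pca_fit(X[train_idx], k=10)
    X_train_red = pca_transform(X[train_idx], mean, comps)
    # ...and merely applied to the validation fold
    X_val_red = pca_transform(X[val_idx], mean, comps)
    # Train any classifier on (X_train_red, y[train_idx]),
    # evaluate it on (X_val_red, y[val_idx])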