L19 Unsupervised Learning and Clustering

EECS 391
Intro to AI

Unsupervised Learning and Clustering

L19 Tue Nov 13

[Figure: Fisher's Iris data (unlabeled). Axes: petal length (cm) vs. petal width (cm).]

[Figure: Fisher's Iris data with class labels Iris setosa, Iris versicolor, and Iris virginica. Axes: petal length (cm) vs. petal width (cm).]

[Figure: Iris data (Iris setosa, Iris versicolor, Iris virginica) with decision tree boundaries labeled "Decision Tree: Dim 2" and "Decision Tree: Dim 1". Axes: petal length (cm) vs. petal width (cm).]

[Figure: Fisher's Iris data with classes Iris setosa, Iris versicolor, and Iris virginica. Axes: petal length (cm) vs. petal width (cm).]

In which example would
you be more confident

about the class?

Decision boundaries
provide a classification
but not uncertainty.

The general classification problem

Data:
$$D = \{\mathbf{x}_1, \ldots, \mathbf{x}_T\}, \qquad \mathbf{x}_i = \{x_1, \ldots, x_N\}_i$$
Input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous).

Desired output:
$$\mathbf{y} = \{y_1, \ldots, y_K\}$$
Output is a binary classification vector:
$$y_i = \begin{cases} 1 & \text{if } \mathbf{x}_i \in C_i \equiv \text{class } i, \\ 0 & \text{otherwise} \end{cases}$$

Model:
$$\theta = \{\theta_1, \ldots, \theta_M\}$$
Model (e.g. a decision tree) is defined by M parameters.

Given data, we want to learn a model that can correctly classify novel observations.

How do we approach this probabilistically?

The answer to all questions of uncertainty

Let's apply Bayes' rule to infer the most probable class given the observation:

$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{\sum_k p(\mathbf{x}|C_k)\,p(C_k)}$$

This is the answer, but what does it mean?
How do we specify the terms?

p(Ck) is the prior probability on the different classes.
p(x|Ck) is the data likelihood, i.e. the probability of x given class Ck.

How should we define this?
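As a rough illustration of Bayes' rule above, the sketch below computes the posterior over classes from a list of likelihoods p(x|Ck) and priors p(Ck). The function name `posterior` and the numbers are made up for illustration; they do not come from the iris data.

```python
def posterior(likelihoods, priors):
    """Return p(C_k | x) for every class k using Bayes' rule."""
    # Numerator for each class: p(x | C_k) * p(C_k)
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    # Denominator: p(x) = sum_k p(x | C_k) * p(C_k)
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical likelihoods of one observation x under three classes,
# with equal priors (an assumption made for this example).
likelihoods = [0.02, 0.30, 0.10]
priors = [1 / 3, 1 / 3, 1 / 3]
post = posterior(likelihoods, priors)
most_probable = post.index(max(post))  # index of the most probable class
```

Unlike a bare decision boundary, the returned vector also says how confident the classification is.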

What classifier would give optimal performance?

Consider the iris data.
How would we minimize the number

of future mis-classifications?

We would need to know the true
distribution of the classes.

Assume they follow a Gaussian
distribution.

The number of samples in each class
is the same (50), so (assume) p(Ck)
is equal for all classes.

Because p(x) is the same for all classes we have:

$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{p(\mathbf{x})} \propto p(\mathbf{x}|C_k)\,p(C_k)$$

[Figure: Iris data (petal length (cm) vs. petal width (cm)) with the class-conditional densities p(petal length | C2) and p(petal length | C3) overlaid.]

Where do we put the boundary?

[Figure: p(petal length | C2) and p(petal length | C3) plotted against petal length (cm).]

[Figure: p(petal length | C2) and p(petal length | C3) with a candidate decision boundary and the two error regions.]

R32 = C3 is misclassified as C2

R23 = C2 is misclassified as C3

Shifting the boundary trades off the two errors.

The misclassification error is defined by

$$p(\text{error}) = \int_{R_{32}} p(x|C_3)\,P(C_3)\,dx + \int_{R_{23}} p(x|C_2)\,P(C_2)\,dx$$

which in our case is proportional to the data likelihood.
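To make the trade-off concrete, here is a minimal sketch that evaluates the misclassification error for a few boundary positions, assuming one-dimensional Gaussian class densities. The means, standard deviations, and the two-class priors of 0.5 are hypothetical values chosen for illustration, not values fitted to the iris data.

```python
import math

def gauss_cdf(x, mu, sigma):
    """Cumulative distribution of a 1-D Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def p_error(boundary, mu2, sigma2, mu3, sigma3, prior2=0.5, prior3=0.5):
    """Error when x < boundary is labeled C2 and x >= boundary is labeled C3."""
    r32 = prior3 * gauss_cdf(boundary, mu3, sigma3)          # C3 labeled as C2
    r23 = prior2 * (1.0 - gauss_cdf(boundary, mu2, sigma2))  # C2 labeled as C3
    return r32 + r23

# Hypothetical class parameters (assumed, not estimated from data).
params = dict(mu2=4.3, sigma2=0.5, mu3=5.5, sigma3=0.5)
errors = {b: p_error(b, **params) for b in (4.5, 4.9, 5.3)}
```

With these symmetric parameters, moving the boundary away from the midpoint shrinks one integral but grows the other by more.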

[Figure: p(petal length | C2) and p(petal length | C3) with a boundary that leaves part of the overlap region on the C2 side.]

This region would yield p(C3|x) > p(C2|x), but we're still classifying this region as C2!

The optimal decision boundary

The minimal misclassification error is at the point where

$$p(C_3|x) = p(C_2|x)$$
$$\Rightarrow\; p(x|C_3)\,p(C_3)/p(x) = p(x|C_2)\,p(C_2)/p(x)$$
$$\Rightarrow\; p(x|C_3) = p(x|C_2)$$

[Figure: likelihoods p(petal length | C2) and p(petal length | C3), posteriors p(C2 | petal length) and p(C3 | petal length), and the optimal decision boundary.]

Note: this assumes we
have only two classes.
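The condition p(x|C3) = p(x|C2) can be solved numerically. The sketch below uses bisection between the two class means, assuming 1-D Gaussian densities, equal priors, and that each class is the more likely one at its own mean. The function names and parameter values are illustrative assumptions.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def optimal_boundary(mu_a, sigma_a, mu_b, sigma_b, tol=1e-9):
    """Find x between mu_a < mu_b where the two densities are equal."""
    lo, hi = mu_a, mu_b
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gauss_pdf(mid, mu_a, sigma_a) > gauss_pdf(mid, mu_b, sigma_b):
            lo = mid   # class a still more likely: crossing lies to the right
        else:
            hi = mid   # class b at least as likely: crossing lies to the left
    return 0.5 * (lo + hi)

# Hypothetical parameters for classes C2 and C3 (assumed for illustration).
boundary = optimal_boundary(4.3, 0.5, 5.5, 0.6)
# With equal spreads the crossing sits midway between the means.
midpoint = optimal_boundary(4.0, 0.5, 6.0, 0.5)
```

With unequal priors, the same search would compare p(x|Ck)p(Ck) instead of the likelihoods alone.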

Clustering: Classification without labels

In many situations we don't have labeled training data, only unlabeled data.
E.g., in the iris data set, what if we were just starting and didn't know any classes?

[Figure: unlabeled iris data. Axes: petal length (cm) vs. petal width (cm).]

Types of learning

supervised: world (or data) -> model {θ1, . . . , θn} -> desired output {y1, . . . , yn}

unsupervised: world (or data) -> model {θ1, . . . , θn}

reinforcement: world (or data) -> model {θ1, . . . , θn} -> model output, with a reinforcement signal

A different approach to classification

Nearby points are likely to be
members of the same class.

What if we used the points
themselves to classify?

classify x in Ck if x is similar to
a point we already know is in Ck.

E.g.: the unclassified point x is more
similar to Class 2 than to Class 1.

Issue: How to define "similar"?
Simplest is Euclidean distance:

$$d(\mathbf{x},\mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}$$

Could define other metrics
depending on application, e.g.
text documents, images, etc.

[Figure: Nearest neighbor classification on the iris dataset. Axes x1 and x2, with an unclassified point x among points from Class 1, Class 2, and Class 3.]

Potential advantages:
don't need an explicit model
the more examples the better
might handle more complex classes
easy to implement
"no brain" on part of the designer
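A minimal sketch of nearest neighbor classification with Euclidean distance follows. The three labeled points are hypothetical stand-ins for (x1, x2) measurements, and `nearest_neighbor` is an illustrative name, not code from the lecture.

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def nearest_neighbor(x, examples):
    """Return the label of the stored example closest to x.

    examples is a list of (point, label) pairs.
    """
    _, label = min(examples, key=lambda ex: euclidean(x, ex[0]))
    return label

# Hypothetical labeled points, one per class, for illustration only.
examples = [
    ((1.4, 0.2), "Class 1"),
    ((4.5, 1.5), "Class 2"),
    ((6.0, 2.2), "Class 3"),
]
label = nearest_neighbor((4.8, 1.6), examples)
```

No model is fitted here: the stored examples themselves do the classifying, which is why more examples tend to help.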

Example: Handwritten digits

Use Euclidean distance to see which
known digit is closest to each class.

But not all neighbors are the same:

k-nearest neighbors:
look at k-nearest neighbors and
choose most frequent.

Caution: it can get expensive to find
neighbors.

from LeCun et al., 1998
digit data available at:
http://yann.lecun.com/exdb/mnist/
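The k-nearest-neighbor rule (look at the k closest examples and choose the most frequent label) can be sketched as below. The data are invented so that one stray "B" point sits inside the "A" cluster; this shows why a single neighbor can mislead. Sorting all stored examples per query is the simplest approach and also shows why finding neighbors gets expensive.

```python
import math
from collections import Counter

def knn_classify(x, examples, k=3):
    """Majority vote among the k examples closest to x (Euclidean distance)."""
    by_distance = sorted(examples, key=lambda ex: math.dist(x, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D examples; the point (1.05, 1.05) is a stray "B".
examples = [
    ((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.9, 1.3), "A"),
    ((1.05, 1.05), "B"),
    ((3.0, 3.0), "B"), ((3.2, 2.9), "B"),
]
query = (1.1, 1.1)
label_k1 = knn_classify(query, examples, k=1)  # decided by the stray point
label_k3 = knn_classify(query, examples, k=3)  # outvoted by two "A" neighbors
```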

Error Bounds for NN

Amazing fact: asymptotically, err(1-NN) < 2 err(Bayes):

$$e_B \le e_{1NN} \le 2e_B - \frac{M}{M-1}\,e_B^2$$

This is a tight upper bound, achieved in the "zero-information" case when the classes have identical densities.

For K-NN there are also bounds, e.g. for two classes and odd K:

$$e_B \le e_{KNN} \le \sum_{i=0}^{(K-1)/2} \binom{K}{i} \left[ e_B^{\,i+1}(1-e_B)^{K-i} + e_B^{\,K-i}(1-e_B)^{i+1} \right]$$

For more on these bounds, see the book A Probabilistic Theory of Pattern Recognition, by L. Devroye, L. Gyorfi & G. Lugosi (1996).

Example: USPS Digits

Take 16×16 grayscale images (8bit) of handwritten digits.
Use Euclidean distance in raw pixel space (dumb!) and 7-nn.
Classification error (leave-one-out): 4.85%.

[Figure: Example, 7 Nearest Neighbours.]

Nonparametric (Instance-Based) Models

Q: What are the parameters in K-NN? What is the complexity?
A: the scalar K and the entire training set.
Models which need the entire training set at test time but (hopefully) have very few other parameters are known as nonparametric, instance-based or case based.
What if we want a classifier that uses only a small number of parameters at test time? (e.g. for speed or memory reasons)
Idea 1: single linear boundary, of arbitrary orientation
Idea 2: many boundaries, but axis-parallel & tree structured

[Figure: two panels with axes x1 and x2: a single linear boundary, and axis-parallel splits at thresholds t1 to t5 forming regions A to F.]

Linear Classification for Binary Output

Goal: find the line (or hyperplane) which best separates two classes:

$$c(\mathbf{x}) = \mathrm{sign}[\,\mathbf{x}^\top \underbrace{\mathbf{w}}_{\text{weight}} - \underbrace{w_0}_{\text{threshold}}\,]$$

w is a vector perpendicular to the decision boundary.
This is the opposite of non-parametric: only d + 1 parameters!
Typically we augment x with a constant term 1 ("bias unit") and then absorb w0 into w, so we don't have to treat it specially.

[Figure: linearly separated classes, axes x1 and x2.]

Digits are just represented as a vector.

[Figure: an example digit and its nearest neighbors; example from Sam Roweis. Digit data: http://yann.lecun.com/exdb/mnist/]

The problem of using templates (i.e. Euclidean distance)

Which of these is more like the example? A or B?
Euclidean distance only cares about how many pixels overlap.
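The overlap problem shows up even in a toy one-dimensional "image". In this sketch (all pixel patterns are invented for illustration), a shifted copy of the template ends up farther away in Euclidean distance than a differently shaped pattern that happens to overlap it.

```python
import math

# Hypothetical binary "images" flattened to vectors of 8 pixels.
template = [0, 1, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 1, 0]   # same 3-pixel bar, moved 3 pixels right
blob = [1, 1, 1, 1, 1, 0, 0, 0]      # wider shape that covers the template

d_shifted = math.dist(template, shifted)  # no pixels overlap
d_blob = math.dist(template, blob)        # differs in only two pixels

# Which candidate a template matcher would call "more similar".
closest = "blob" if d_blob < d_shifted else "shifted"
```

The shifted bar is the same shape, yet pixel-wise comparison ranks it as the worse match.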
Could try to define a distance metric that is insensitive to small deviations in position, scale, rotation, etc.

[Figure: an example digit with candidates A and B; from Simard et al., 1998.]

Digit example:
- 60,000 training images
- 10,000 test images
- no preprocessing

Performance results of various classifiers (from http://yann.lecun.com/exdb/mnist/)

Classifier                                        error rate on test data (%)
linear                                            12
k=3 nearest neighbor (Euclidean distance)         5
2-layer neural network (300 hidden units)         4.7
nearest neighbor (Euclidean distance)             3.1
k-nearest neighbor (improved distance metric)     1.1
convolutional neural net                          0.95
best (the conv. net with elastic distortions)     0.4
humans                                            0.2 – 2.5

A real example: clustering electrical signals from neurons

An application of PCA: Spike sorting

[Figure: recording setup with electrode, amplifier, filters, oscilloscope, A/D, and software analysis.]

Principal Component Analysis, Apr 23, 2001 / Michael S. Lewicki, CMU

An extracellular waveform with many different spikes

[Figure: extracellular waveform; time axis in msec.]

How do we sort the different spikes?

Basic problem: only information is signal.
The true classes are always unknown.

Sorting with level detection on an oscilloscope

Sorting with level detection

[Figure: spike waveforms aligned on a level crossing; time axis in msec.]

Level detection doesn't always work.

Why level detection doesn't work

[Figure: amplitude distributions for the background, peak amplitude of neuron 1, and peak amplitude of neuron 2, with threshold levels A and B.]

One dimension is not sufficient to separate the spikes.
Idea: try more features

Using multiple features

[Figure: spike waveforms with max amplitude and min amplitude marked; time axis in msec.]

What other features could we use?

Maximum vs minimum

[Figure: spike maximum (µV) vs. spike minimum (µV).]

This allows better discrimination than max alone, but is it optimal?

Features using Principal Components (not covered)

[Figure: 1st PC score vs. 2nd PC score.]

k-means clustering

Idea: try to estimate k cluster centers by minimizing "distortion".
Define distortion as:

$$D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$$

$$r_{nk} = 1 \text{ if } \mathbf{x}_n \in \text{cluster } k, \; 0 \text{ otherwise.}$$

r_nk is 1 for the closest cluster mean to x_n.
Each point x_n is the minimum distance from its closest center.
How do we learn the cluster means?
Need to derive a learning rule.

Deriving a learning rule for the cluster means

Our objective function is:

$$D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$$

Differentiate w.r.t. the mean (the parameter we want to estimate):

$$\frac{\partial D}{\partial \boldsymbol{\mu}_k} = -2 \sum_{n=1}^{N} r_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k)$$

We know the optimum is when

$$\frac{\partial D}{\partial \boldsymbol{\mu}_k} = -2 \sum_{n=1}^{N} r_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k) = 0$$

Here, we can solve for the mean:

$$\boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\, \mathbf{x}_n}{\sum_n r_{nk}}$$

This is simply a weighted mean for each cluster.
Thus we have a simple estimation algorithm (k-means clustering):
1. select k points at random
2. estimate (update) means
3. repeat until converged
Convergence (to a local minimum) is guaranteed.

k-means clustering example

[Figure sequence: 1st PC score vs. 2nd PC score, one panel per step.]

Select 3 points at random for cluster means.
Then update them using the estimate.
And iterate…
Stop when converged, i.e. no change.

An example of a local minimum

[Figure: 1st PC score vs. 2nd PC score.]

There can be multiple local minima.

A probabilistic interpretation: Gaussian mixture models

We've already seen a one-dimensional version.
This example has three classes: neuron 1, neuron 2, and background noise.
Each can be modeled as a Gaussian.
Any given data point comes from just one Gaussian.
The whole set of data is modeled by a mixture of three Gaussians.
How do we model this?

R58 M S Lewicki

[Figure: amplitude distributions for the background, peak amplitude of neuron 1, and peak amplitude of neuron 2, with threshold levels A and B.]

Figure 4. The figure illustrates the distribution of amplitudes for the background activity and the peak amplitudes of the spikes from two units. Amplitude is along the horizontal axis. Setting the threshold level to the position at A introduces a large number of spikes from unit 1. Increasing the threshold to B reduces the number of spikes that are misclassified, but at the expense of many missed spikes.

3.2. Types of detection errors

Very often it is not possible to separate the desired spikes from the background noise with perfect accuracy.
The threshold level determines the trade-off between missed spikes (false negatives) and the number of background events that cross threshold (false positives), which is illustrated in figure 4. If the threshold is set to the level at A, all of the spikes from unit 1 are detected, but there is a very large number of false positives due to the contamination of spikes from unit 2. If the threshold is increased to the level at B, only spikes from unit 1 are detected, but a large number fall below threshold. Ideally, the threshold should be set to optimize the desired ratio of false positives to false negatives. If the background noise level is small compared to the amplitude of the spikes and the amplitude distributions are well separated, then both of these errors will be close to zero and the position of the threshold hardly matters.

3.3. Misclassification error due to overlaps

In addition to the background noise, which, to first approximation, is Gaussian in nature (we will have more to say about that below), the spike height can vary greatly if there are other neurons in the local region that generate action potentials of significant size. If the peak of the desired unit and the dip of a background unit line up, a spike will be missed. This is illustrated in figure 5.

How frequently this will occur depends on the firing rates of the units involved. A rough estimate for the percentage of error due to overlaps can be calculated as follows. The percentage of missed spikes, like the one shown in figure 5(b), is determined by the probability that the peak of the isolated spike will occur during the negative phase of the background spike, which is expressed as

$$\%\,\text{missed spikes} = 100\,rd/1000 \qquad (1)$$

where r is the firing rate in hertz and d is the duration of the negative phase in milliseconds. Thus if the background neuron is firing at 20 Hz and the duration of the negative phase is approximately 0.5 ms, then approximately 1% of the spikes will be missed.
Note that this is only a problem when the negative phase of the background spikes is sufficiently large to cause the spikes of interest to drop below threshold.

The Gaussian mixture model density

The likelihood of the data given a particular class c_k is given by

$$p(\mathbf{x}|c_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

x is the spike waveform, µ_k and Σ_k are the mean and covariance for class c_k.
The marginal likelihood is computed by summing over the likelihood of the K classes:

$$p(\mathbf{x}|\theta_{1:K}) = \sum_{k=1}^{K} p(\mathbf{x}|c_k, \theta_k)\, p(c_k)$$

θ_{1:K} defines the parameters for all of the classes, θ_{1:K} = {µ_1, Σ_1, . . . , µ_K, Σ_K}.
p(c_k) is the probability of the kth class, with Σ_k p(c_k) = 1.
What does this mean in this example?

Bayesian classification with multivariate Gaussian mixtures

How do we determine the class c_k from the data x?
Again use Bayes' rule:

$$p(c_k|\mathbf{x}^{(n)}, \theta_{1:K}) = p_{n,k} = \frac{p(\mathbf{x}^{(n)}|c_k, \theta_k)\, p(c_k)}{\sum_k p(\mathbf{x}^{(n)}|c_k, \theta_k)\, p(c_k)}$$

This tells us the probability that waveform x^(n) came from class c_k.

Estimating the parameters: fitting the model density to the data

The objective of density estimation is to maximize the likelihood of the data.
If we assume the samples are independent, the data likelihood is just the product of the marginal likelihoods:

$$p(\mathbf{x}_{1:N}|\theta_{1:K}) = \prod_{n=1}^{N} p(\mathbf{x}_n|\theta_{1:K})$$

The class parameters are determined by optimization.
It is far more practical to optimize the log-likelihood.
One elegant approach to this is the EM algorithm.

The Gaussian mixture

EM stands for Expectation-Maximization, and involves two steps that are iterated.
For the case of a Gaussian mixture model:
1. E-step: Compute p_{n,k} = p(c_k|x^(n), θ_{1:K}). Let p_k = Σ_n p_{n,k}.
2. M-step: Compute new mean, covariance, and class prior for each class:

$$\boldsymbol{\mu}_k \leftarrow \sum_n p_{n,k}\, \mathbf{x}^{(n)} / p_k$$
$$\boldsymbol{\Sigma}_k \leftarrow \sum_n p_{n,k}\, (\mathbf{x}^{(n)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(n)} - \boldsymbol{\mu}_k)^\top / p_k$$
$$p(c_k) \leftarrow p_k / N$$

This is just the sample mean and covariance, weighted by the class conditional probabilities p_{n,k}.
Derived by setting the log-likelihood gradient to zero (i.e. the maximum).

Four cluster solution with decision boundaries

[Figure: 1st component score vs. 2nd component score.]

But wait! Here's a nine cluster solution

R64 M S Lewicki

[Figure: panels (a) and (b), each 1st component score vs. 2nd component score.]

Figure 9. Application of Gaussian clustering to spike sorting. (a) The ellipses show the three-sigma error contours of the four clusters. The lines show the Bayesian decision boundaries separating the larger clusters. (b) The same data modelled with nine clusters. The elliptical line extending across the bottom is the three-sigma error contour of the largest cluster.

Classification is performed by calculating the probability that a data point belongs to each of the classes, which is obtained with Bayes' rule

$$p(c_k|\mathbf{x}, \theta_{1:K}) = \frac{p(\mathbf{x}|c_k, \theta_k)\, p(c_k)}{\sum_k p(\mathbf{x}|c_k, \theta_k)\, p(c_k)}. \qquad (5)$$

This implicitly defines the Bayesian decision boundaries for the model. Because the cluster membership is probabilistic, the cluster boundaries can be computed as a function of confidence level. This will yield better classification, because if the model is accurate the boundaries will be optimal, i.e. the fewest number of misclassifications.

The class parameters are optimized by maximizing the likelihood of the data

$$p(\mathbf{x}_{1:N}|\theta_{1:K}) = \prod_{n=1}^{N} p(\mathbf{x}_n|c_k, \theta_{1:K}). \qquad (6)$$

For the examples shown here, the cluster parameters were obtained using the publicly available software package AutoClass (Cheeseman and Stutz 1996). This package uses the Bayesian methods described above to determine both the means and the covariance matrices as well as the class probabilities, p(c_k).

The ellipses (or circles) in figure 9 show the three-sigma (three standard deviations) error contours of the Gaussian model for each cluster. The figure illustrates two different models of the data, one with four clusters and one with nine.
In this case, the clusters corresponding to the two large spikes appear in both solutions, but this illustrates that choosing the number of clusters is not always an easy task. This issue will be discussed further below. Note that the cluster in the middle, corresponding to the background spikes, is not modelled by a single class, but by two or more overlapping Gaussians.

The lines in figure 9(a) show the Bayesian decision boundaries that separate the three larger clusters. The decision boundary for the smaller circular cluster is not shown, but it is in roughly the same position as the cluster's error contour.

If the Gaussian cluster model were accurate, most of the data would fall within the three-sigma error boundary. In this case, three of the contours match the variability in the data, but the upper-right contour is significantly larger than the cluster itself. The reason […]

Uh oh… How many clusters are there really?

How do we choose k?

Increasing k will always decrease our distortion. This will overfit the data.
- How can we avoid this?
- Or how do we choose the best k?

One way: cross validation.
Use our distortion metric:

$$D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$$

Then just measure the distortion on a test data set, and stop when we reach a minimum.

[Figure: distortion vs. k = number of clusters, showing training set error and test set error, with the overfitting region marked.]

[Figure: 1st PC score vs. 2nd PC score with k=10 clusters.]
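The k-means updates and the held-out distortion check can be sketched together as follows. This is a minimal illustration on synthetic 2-D data: the three cluster centers, the spread of 0.5, the sample sizes, and the fixed random seed are all assumptions made for the example, and a single random start per k is used, so individual runs may land in a local minimum.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def distortion(points, means):
    """D = sum_n min_k ||x_n - mu_k||^2 (r_nk selects the closest mean)."""
    return sum(min(sq_dist(p, m) for m in means) for p in points)

def kmeans(points, k, rng, max_iter=100):
    means = rng.sample(points, k)                 # 1. select k points at random
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assign each point to its closest mean
            j = min(range(k), key=lambda i: sq_dist(p, means[i]))
            clusters[j].append(p)
        new_means = [                             # 2. update each mean (keep it if cluster is empty)
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[i]
            for i, cl in enumerate(clusters)
        ]
        if new_means == means:                    # 3. stop when nothing changes
            break
        means = new_means
    return means

rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]    # hypothetical true clusters

def sample(n):
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in centers for _ in range(n)]

train, test = sample(30), sample(30)
results = {}                                      # k -> (training D, test D)
for k in range(1, 7):
    means = kmeans(train, k, rng)
    results[k] = (distortion(train, means), distortion(test, means))
```

Comparing the two entries of `results[k]` across k is the cross-validation idea from the slide: the training distortion keeps rewarding extra clusters, while the test distortion is the one to watch for a minimum.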

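The E-step and M-step for a Gaussian mixture can be sketched in one dimension, where the covariance reduces to a scalar variance. Everything specific here is an assumption for illustration: the synthetic data (two Gaussians at 0 and 6), the starting parameters, the fixed number of iterations, and the small variance floor added for numerical safety. The class prior is normalized by N so that the priors sum to one.

```python
import math
import random

def gauss_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, mus, variances, priors, n_iter=50):
    K, N = len(mus), len(xs)
    for _ in range(n_iter):
        # E-step: p[n][k] = p(c_k | x_n, theta) via Bayes' rule
        p = []
        for x in xs:
            joint = [priors[k] * gauss_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(joint)
            p.append([j / total for j in joint])
        # M-step: weighted mean, weighted variance, and class prior
        pk = [sum(p[n][k] for n in range(N)) for k in range(K)]
        mus = [sum(p[n][k] * xs[n] for n in range(N)) / pk[k] for k in range(K)]
        variances = [
            max(sum(p[n][k] * (xs[n] - mus[k]) ** 2 for n in range(N)) / pk[k], 1e-6)
            for k in range(K)
        ]
        priors = [pk[k] / N for k in range(K)]
    return mus, variances, priors

rng = random.Random(1)
# Hypothetical data: 200 samples from each of two unit-variance Gaussians.
xs = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(6.0, 1.0) for _ in range(200)]
mus, variances, priors = em_gmm_1d(xs, [1.0, 5.0], [1.0, 1.0], [0.5, 0.5])
```

Unlike k-means, each point contributes to every class in proportion to its posterior, so the assignments are soft rather than all-or-nothing.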