Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
January 26, 2015
Today:
• Bayes Classifiers
• Conditional Independence
• Naïve Bayes
Readings: Mitchell:
“Naïve Bayes and Logistic Regression”
(available on class website)

Two Principles for Estimating Parameters
• Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data
• Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data
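In symbols (a standard way to write these two principles; here D denotes the observed training data):

\hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta)
\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \frac{P(D \mid \theta)\, P(\theta)}{P(D)}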

Maximum Likelihood Estimate
P(X = 1) = θ
P(X = 0) = 1 − θ   (Bernoulli)
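The resulting MLE for this Bernoulli parameter, writing n_1 and n_0 for the observed counts of X=1 and X=0 (count notation assumed here), is:

\hat{\theta}_{MLE} = \frac{n_1}{n_1 + n_0}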

Maximum A Posteriori (MAP) Estimate
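A sketch of the corresponding MAP estimate under a Beta(\beta_1, \beta_0) prior (hyperparameter notation assumed here, matching the "imaginary examples" idea later in the lecture):

\hat{\theta}_{MAP} = \frac{n_1 + \beta_1 - 1}{(n_1 + \beta_1 - 1) + (n_0 + \beta_0 - 1)}

The prior acts like \beta_1 - 1 imaginary observations of X=1 and \beta_0 - 1 of X=0.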

Let’s learn classifiers by learning P(Y|X)
Consider Y = Wealth, X = <Gender, HoursWorked>:

Gender   HrsWorked   P(rich | G,HW)   P(poor | G,HW)
F        <40.5       .09              .91
F        >40.5       .21              .79
M        <40.5       .23              .77
M        >40.5       .38              .62

How many parameters must we estimate?
Suppose X = <X1, …, Xn>, where the Xi and Y are boolean RVs.
To estimate P(Y | X1, X2, …, Xn) we need one parameter for each of the 2^n possible settings of <X1, …, Xn> (the value for Y=0 follows from the value for Y=1), i.e., 2^n parameters.
If we have 30 boolean Xi's: P(Y | X1, X2, …, X30) requires 2^30 ≈ 10^9 parameters.

Bayes Rule
P(Y | X) = P(X | Y) P(Y) / P(X)
Which is shorthand for:
P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / P(X = x_j)
Equivalently:
P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / Σ_k P(X = x_j | Y = y_k) P(Y = y_k)

Can we reduce params using Bayes Rule?
Suppose X = <X1, …, Xn>, where the Xi and Y are boolean RVs.
How many parameters to define P(X1, …, Xn | Y)?  For each of the 2 values of Y we need 2^n − 1 parameters, i.e., 2(2^n − 1) in total.
How many parameters to define P(Y)?  Just 1.

Naïve Bayes
Naïve Bayes assumes
P(X1, …, Xn | Y) = Π_i P(Xi | Y)
i.e., that Xi and Xj are conditionally independent given Y, for all i ≠ j.

Conditional Independence
Definition: X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z:
(∀ i, j, k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)
Which we often write P(X | Y, Z) = P(X | Z).
E.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naïve Bayes uses the assumption that the Xi are conditionally independent, given Y.
E.g., P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y) = P(X1 | Y) P(X2 | Y)
Given this assumption, in general:
P(X1, …, Xn | Y) = Π_i P(Xi | Y)
How many parameters to describe P(X1 … Xn | Y)? P(Y)?
• Without the conditional indep assumption: 2(2^n − 1) for P(X1 … Xn | Y), plus 1 for P(Y)
• With the conditional indep assumption: 2n for P(X1 … Xn | Y), plus 1 for P(Y)

Naïve Bayes in a Nutshell
Bayes rule:
P(Y = y_k | X1, …, Xn) = P(Y = y_k) P(X1, …, Xn | Y = y_k) / Σ_j P(Y = y_j) P(X1, …, Xn | Y = y_j)
Assuming conditional independence among the Xi's:
P(Y = y_k | X1, …, Xn) = P(Y = y_k) Π_i P(Xi | Y = y_k) / Σ_j P(Y = y_j) Π_i P(Xi | Y = y_j)
So, to pick the most probable Y for Xnew = <X1, …, Xn>:
Ynew ← argmax_{y_k} P(Y = y_k) Π_i P(Xi_new | Y = y_k)

Naïve Bayes Algorithm – discrete Xi
• Train Naïve Bayes (examples)
    for each* value y_k
        estimate π_k ≡ P(Y = y_k)
    for each* value x_ij of each attribute Xi
        estimate θ_ijk ≡ P(Xi = x_ij | Y = y_k)
• Classify (Xnew)
    Ynew ← argmax_{y_k} π_k Π_i θ_ijk
* probabilities must sum to 1, so need estimate only n−1 of these…
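Below is a minimal Python sketch of this train/classify loop, assuming examples are (attribute-tuple, label) pairs; the add-one (Laplace) smoothing constant is an illustrative choice, not part of the slide.

from collections import Counter, defaultdict
import math

def train_naive_bayes(examples):
    # examples: list of (x, y) pairs, where x is a tuple of discrete attribute values
    class_counts = Counter(y for _, y in examples)
    value_counts = defaultdict(Counter)   # value_counts[(i, y)][v] = count of X_i = v among class y
    attr_values = defaultdict(set)        # attr_values[i] = set of values seen for attribute X_i
    for x, y in examples:
        for i, v in enumerate(x):
            value_counts[(i, y)][v] += 1
            attr_values[i].add(v)
    priors = {y: c / len(examples) for y, c in class_counts.items()}   # pi_k
    return priors, class_counts, value_counts, attr_values

def classify_naive_bayes(model, x_new):
    priors, class_counts, value_counts, attr_values = model
    best_label, best_score = None, -math.inf
    for y, prior in priors.items():
        score = math.log(prior)           # sum logs instead of multiplying, to avoid underflow
        for i, v in enumerate(x_new):
            num = value_counts[(i, y)][v] + 1               # add-one (Laplace) smoothing
            den = class_counts[y] + len(attr_values[i])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# tiny usage example: boolean attributes <G, D, M>, boolean class S
data = [((1, 1, 0), 1), ((1, 0, 1), 1), ((0, 0, 0), 0), ((0, 1, 0), 0)]
model = train_naive_bayes(data)
print(classify_naive_bayes(model, (1, 1, 1)))   # prints the most probable class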

Estimating Parameters: Y, Xi discrete-valued
Maximum likelihood estimates (MLE's):
P̂(Y = y_k) = #D{Y = y_k} / |D|     (#D{Y = y_k} is the number of items in dataset D for which Y = y_k)
P̂(Xi = x_ij | Y = y_k) = #D{Xi = x_ij ∧ Y = y_k} / #D{Y = y_k}

Example: Live in Sq Hill? P(S|G,D,M)
• S=1 iff live in Squirrel Hill
• D=1 iff Drive to CMU
• G=1 iff shop at SH Giant Eagle
• M=1 iff Rachel Maddow fan
What probability parameters must we estimate?

Example: Live in Sq Hill? P(S|G,D,M)
• S=1 iff live in Squirrel Hill
• D=1 iff Drive to CMU
• G=1 iff shop at SH Giant Eagle
• M=1 iff Rachel Maddow fan

P(S=1) :            P(S=0) :
P(D=1 | S=1) :      P(D=0 | S=1) :
P(D=1 | S=0) :      P(D=0 | S=0) :
P(G=1 | S=1) :      P(G=0 | S=1) :
P(G=1 | S=0) :      P(G=0 | S=0) :
P(M=1 | S=1) :      P(M=0 | S=1) :
P(M=1 | S=0) :      P(M=0 | S=0) :
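Counting the parameters listed above: each complementary pair sums to 1, so Naïve Bayes needs only 1 + 2·3 = 7 independent parameters here, versus 2^3 = 8 values of P(S=1 | G, D, M) if we estimated the conditional directly; with n features the gap grows to 2n + 1 versus 2^n.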

Example: Live in Sq Hill? P(S|G,D,B)
• S=1 iff live in Squirrel Hill
• D=1 iff Drive or carpool to CMU
• G=1 iff shop at SH Giant Eagle
• B=1 iff Birthday is before July 1
What probability parameters must we estimate?

Example: Live in Sq Hill? P(S|G,D,B)
• S=1 iff live in Squirrel Hill
• D=1 iff Drive or Carpool to CMU
• G=1 iff shop at SH Giant Eagle
• B=1 iff Birthday is before July 1

P(S=1) :            P(S=0) :
P(D=1 | S=1) :      P(D=0 | S=1) :
P(D=1 | S=0) :      P(D=0 | S=0) :
P(G=1 | S=1) :      P(G=0 | S=1) :
P(G=1 | S=0) :      P(G=0 | S=0) :
P(B=1 | S=1) :      P(B=0 | S=1) :
P(B=1 | S=0) :      P(B=0 | S=0) :

Naïve Bayes: Subtlety #1
Often the Xi are not really conditionally independent.
• We use Naïve Bayes in many cases anyway, and it often works pretty well
  – often the right classification, even when not the right probability (see [Domingos & Pazzani, 1996])
• What is the effect on estimated P(Y|X)?
  – Extreme case: what if we add two copies: Xi = Xk? The duplicated factor P(Xi | Y) then enters the product twice, so the estimated posterior P(Y|X) is pushed toward 0 or 1 (overconfident), even though the argmax classification may not change.

Naïve Bayes: Subtlety #2
If unlucky, our MLE estimate for P(Xi | Y) might be zero. (For example, Xi = birthdate, and the value Jan_25_1992 never occurs in the training data for some class.)
• Why worry about just one parameter out of many? A single zero factor makes the whole product Π_i P(Xi | Y) zero, vetoing that class regardless of the other evidence.
• What can be done to address this? Use MAP estimates with priors ("imaginary" examples), as in the following slides.

Estimating Parameters
• Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data
• Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data

Estimating Parameters: Y, Xi discrete-valued
Maximum likelihood estimates:
P̂(Y = y_k) = #D{Y = y_k} / |D|
P̂(Xi = x_ij | Y = y_k) = #D{Xi = x_ij ∧ Y = y_k} / #D{Y = y_k}
MAP estimates (Beta, Dirichlet priors):
P̂(Y = y_k) = (#D{Y = y_k} + (β_k − 1)) / (|D| + Σ_m (β_m − 1))
P̂(Xi = x_ij | Y = y_k) = (#D{Xi = x_ij ∧ Y = y_k} + (β_j − 1)) / (#D{Y = y_k} + Σ_m (β_m − 1))
Only difference: "imaginary" examples (the β − 1 hallucinated counts added to each observed count)
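A small worked example (illustrative numbers, assuming β = 2 for both values of Xi): if #D{Xi=1 ∧ Y=1} = 0 and #D{Y=1} = 10, the MLE is 0/10 = 0, while the MAP estimate is (0 + 1)/(10 + 2) = 1/12 ≈ 0.08, which avoids the zero-probability problem raised in Subtlety #2.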

Learning to classify text documents
• Classify which emails are spam?
• Classify which emails promise an attachment?
• Classify which web pages are student home pages?
How shall we represent text documents for Naïve Bayes?

Baseline: Bag of Words Approach
aardvark  0
about     2
all       2
Africa    1
apple     0
anxious   0
…
gas       1
…
oil       1
…
Zaire     0

Learning to classify documents: P(Y|X), the "Bag of Words" model
• Y discrete valued, e.g., Spam or not
• X = <X1, …, Xn> = document
• Xi is a random variable describing the word at position i in the document
• possible values for Xi: any word w_k in English
• Document = bag of words: the vector of counts for all w_k's
  – like #heads, #tails, but we have many more than 2 values
  – assume word probabilities are position independent (i.i.d. rolls of a 50,000-sided die)
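A minimal Python sketch of the bag-of-words representation described above; the tokenization here (lowercasing and splitting on whitespace) is a simplifying assumption.

from collections import Counter

def bag_of_words(document):
    # lowercase and split on whitespace; real tokenization would be more careful
    return Counter(document.lower().split())

counts = bag_of_words("Oil prices rose as Zaire oil exports fell")
print(counts["oil"])    # 2
print(counts["gas"])    # 0: words that never appear simply have count 0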

Naïve Bayes Algorithm – discrete Xi
• Train Naïve Bayes (examples)
    for each value y_k
        estimate π_k ≡ P(Y = y_k)
    for each value x_j of each attribute Xi
        estimate θ_jk ≡ P(Xi = x_j | Y = y_k), the prob that word x_j appears in position i, given Y = y_k
• Classify (Xnew)
    Ynew ← argmax_{y_k} π_k Π_i P(Xi_new | Y = y_k)
* Additional assumption: word probabilities are position independent (θ_jk is the same for every position i)

MAP estimates for bag of words
MAP estimate for the multinomial word distribution P(Xi = w_j | Y = y_k):
What β’s should we choose?
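One common way to write this smoothed estimate (a sketch, with assumed notation: n_{jk} is the count of word w_j across all documents of class y_k, V is the vocabulary size, and \beta_j are the Dirichlet hyperparameters):

\hat{\theta}_{jk} = P(X = w_j \mid Y = y_k) = \frac{n_{jk} + \beta_j - 1}{\sum_{j'=1}^{V} \left( n_{j'k} + \beta_{j'} - 1 \right)}

Choosing every \beta_j = 2 gives the familiar add-one (Laplace) smoothing.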

For code and data, see
www.cs.cmu.edu/~tom/mlbook.html
click on “Software and Data”

What you should know:
• Training and using classifiers based on Bayes rule
• Conditional independence
  – What it is
  – Why it's important
• Naïve Bayes
  – What it is
  – Why we use it so much
  – Training using MLE, MAP estimates
  – Discrete variables and continuous (Gaussian)

Questions:
• How can we extend Naïve Bayes if just 2 of the Xi's are dependent?
• What does the decision surface of a Naïve Bayes classifier look like?
• What error will the classifier achieve if the Naïve Bayes assumption is satisfied and we have infinite training data?
• Can you use Naïve Bayes for a combination of discrete and real-valued Xi?

What if we have continuous Xi? E.g., image classification: Xi is the ith pixel, Y = mental state
Still have:
Ynew ← argmax_{y_k} P(Y = y_k) Π_i P(Xi_new | Y = y_k)
Just need to decide how to represent P(Xi | Y)

What if we have continuous Xi? E.g., image classification: Xi is the ith pixel
Gaussian Naïve Bayes (GNB): assume
P(Xi = x | Y = y_k) = (1 / (σ_ik √(2π))) exp( −(x − μ_ik)² / (2σ_ik²) )
Sometimes we additionally assume the variance σ_ik
• is independent of Y (i.e., σ_i),
• or independent of Xi (i.e., σ_k),
• or both (i.e., σ)

Gaussian Naïve Bayes Algorithm – continuous Xi (but still discrete Y)
• Train Naïve Bayes (examples)
    for each value y_k
        estimate* π_k ≡ P(Y = y_k)
    for each attribute Xi
        estimate the class conditional mean μ_ik and variance σ_ik
• Classify (Xnew)
    Ynew ← argmax_{y_k} π_k Π_i N(Xi_new; μ_ik, σ_ik)
* probabilities must sum to 1, so need estimate only n−1 parameters…
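A minimal Python sketch of GNB along these lines; the small variance floor (1e-9) is an assumption added to keep the example numerically safe, not part of the algorithm on the slide.

import math
from collections import defaultdict

def train_gnb(examples):
    # examples: list of (x, y) pairs, where x is a tuple of real-valued features
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append(x)
    total = len(examples)
    model = {}
    for y, xs in by_class.items():
        n = len(xs)
        prior = n / total                                        # pi_k
        means = [sum(col) / n for col in zip(*xs)]               # mu_ik
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9   # sigma_ik^2, with a small floor
                     for col, m in zip(zip(*xs), means)]
        model[y] = (prior, means, variances)
    return model

def log_gaussian(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def classify_gnb(model, x_new):
    best_label, best_score = None, -math.inf
    for y, (prior, means, variances) in model.items():
        score = math.log(prior) + sum(log_gaussian(x, m, v)
                                      for x, m, v in zip(x_new, means, variances))
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# usage: two classes, two continuous features
data = [((1.0, 2.1), 'a'), ((0.9, 1.9), 'a'), ((3.0, 3.2), 'b'), ((3.2, 2.9), 'b')]
print(classify_gnb(train_gnb(data), (1.1, 2.0)))   # expected: 'a'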

Estimating Parameters: Y discrete, Xi continuous
Maximum likelihood estimates:
μ̂_ik = ( Σ_j X_i^j δ(Y^j = y_k) ) / ( Σ_j δ(Y^j = y_k) )
σ̂_ik² = ( Σ_j (X_i^j − μ̂_ik)² δ(Y^j = y_k) ) / ( Σ_j δ(Y^j = y_k) )
where X_i^j is the ith feature of the jth training example, y_k is the kth class, and δ(z) = 1 if z is true, else 0

GNB Example: Classify a person’s cognitive activity, based on brain image
• are they reading a sentence or viewing a picture?
• reading the word "Hammer" or "Apartment"?
• viewing a vertical or horizontal line?
• answering the question, or getting confused?

Stimuli for our study (e.g., "ant"): 60 distinct exemplars, presented 6 times each

fMRI voxel means for "bottle": the per-voxel means defining P(Xi | Y = "bottle")
[Figure: mean fMRI activation over all stimuli; "bottle" minus mean activation; color scale from below-average to above-average activation]

Rank Accuracy Distinguishing among 60 words

Tools vs Buildings: where does brain encode their word meanings?
Accuracies of cubical 27-voxel Naïve Bayes classifiers centered at each voxel [0.7-0.8]

Expected values
Given a discrete random variable X, the expected value of X, written E[X], is
E[X] = Σ_x x P(X = x)
We can also talk about the expected value of functions of X:
E[f(X)] = Σ_x f(x) P(X = x)

Covariance
Given two random vars X and Y, we define the covariance of X and Y as
Cov(X, Y) = E[ (X − E[X]) (Y − E[Y]) ]
e.g., X = gender, Y = playsFootball, or X = gender, Y = leftHanded
Remember: Cov(X, Y) = E[XY] − E[X] E[Y]
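For binary (0/1) variables such as the gender/football example, E[XY] = P(X=1, Y=1) and E[X] = P(X=1), so this reduces to:

\mathrm{Cov}(X, Y) = P(X{=}1, Y{=}1) - P(X{=}1)\,P(Y{=}1)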
