Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
Today:
• The Big Picture
• Overfitting
• Review: probability
January 14, 2015
Readings:
Decision trees, overfitting
• Mitchell, Chapter 3
Probability review
• Bishop Ch. 1 thru 1.2.3
• Bishop, Ch. 2 thru 2.2
• Andrew Moore's online tutorial

Function Approximation: Problem Setting
• Set of possible instances X
• Unknown target function f : X → Y
• Set of function hypotheses H = { h | h : X → Y }
Input:
• Training examples {<x(i), y(i)>} of unknown target function f
Output:
• Hypothesis h ∈ H that best approximates the target function f

Function Approximation: Decision Tree Learning
Problem Setting:
• Set of possible instances X
  – each instance x in X is a feature vector x = <x1, x2, …, xn>
• Unknown target function f : X → Y
  – Y is discrete-valued
• Set of function hypotheses H = { h | h : X → Y }
  – each hypothesis h is a decision tree
Input:
• Training examples {<x(i), y(i)>} of unknown target function f
Output:
• Hypothesis h ∈ H that best approximates the target function f
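As a concrete illustration of this setting, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (my choice of library; it implements CART-style splitting rather than ID3) on a tiny made-up dataset:

```python
# Instances x in X are feature vectors, the target f: X -> Y is discrete-valued,
# and the hypothesis h is a decision tree fit to training examples {<x(i), y(i)>}.
# Note: scikit-learn's DecisionTreeClassifier is CART, not ID3; it stands in here
# only to show the workflow of the problem setting.
from sklearn.tree import DecisionTreeClassifier

# Made-up training examples (PlayTennis-style attributes encoded as integers).
X_train = [[0, 1, 0], [0, 1, 1], [1, 1, 0], [2, 0, 0], [2, 0, 1], [1, 0, 1]]
y_train = [0, 0, 1, 1, 0, 1]

h = DecisionTreeClassifier(criterion="entropy")  # entropy-based (information-gain-style) splits
h.fit(X_train, y_train)                          # search H for a tree that fits the data
print(h.predict([[0, 0, 0]]))                    # apply the learned hypothesis h to a new x
```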

Information Gain (also called mutual information) between input attribute A and target variable Y
Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A
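Written out, this definition is Gain(S, A) = H_S(Y) − Σ_v (|S_v|/|S|) · H_{S_v}(Y), where S_v is the subset of S with A = v. A minimal from-scratch sketch of that computation, with made-up sample data:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) * log2 p(y) over the labels in a sample."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A) = H_S(Y) - sum_v |S_v|/|S| * H_{S_v}(Y), where S_v is the
    subset of the sample with attribute A = v."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [y for a, y in zip(values, labels) if a == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Made-up sample: attribute A and target Y for six examples.
A = ["sunny", "sunny", "rain", "rain", "rain", "sunny"]
Y = ["no",    "no",    "yes",  "yes",  "no",   "no"]
print(information_gain(A, Y))  # expected reduction in entropy of Y from sorting on A
```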

Function approximation as Search for the best hypothesis
• ID3 performs a heuristic search through the space of decision trees

Function Approximation: The Big Picture

Which Tree Should We Output?
• ID3 performs a heuristic search through the space of decision trees
• It stops at the smallest acceptable tree. Why?
Occam’s razor: prefer the simplest hypothesis that fits the data


Why Prefer Short Hypotheses? (Occam’s Razor)
Argument in favor:
• There are fewer short hypotheses than long ones
  → a short hypothesis that fits the data is less likely to be a statistical coincidence
Argument opposed:
• There are also fewer hypotheses containing a prime number of nodes and attributes beginning with "Z"
• What's so special about "short" hypotheses, instead of "prime number of nodes and edges"?


Overfitting
Consider a hypothesis h and its
• Error rate over the training data: error_train(h)
• True error rate over all data: error_true(h)
We say h overfits the training data if error_true(h) > error_train(h)
Amount of overfitting = error_true(h) − error_train(h)

• Split the data into a training set and a validation set
• Create a tree that classifies the training set correctly
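To see the overfitting that motivates this split, one can compare training and validation error of a fully grown tree against a depth-limited one; a minimal sketch, assuming scikit-learn and synthetic data (both are my choices, not part of the lecture):

```python
# A fully grown tree typically fits the training set (near-)perfectly but does
# worse on held-out data; the gap is the "amount of overfitting" defined above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

full  = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)               # grown until pure
small = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)  # crude stand-in for pruning

for name, tree in [("full", full), ("depth-3", small)]:
    print(name,
          "train err:", 1 - tree.score(X_tr, y_tr),
          "val err:",   1 - tree.score(X_val, y_val))
```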

Decision Tree Learning, Formal Guarantees


Supervised Learning or Function Approximation
[Diagram: a Data Source with distribution D on X generates instances; an Expert/Oracle labels them with the target c* : X → Y; the Learning Algorithm receives the labeled examples (x1, c*(x1)), …, (xm, c*(xm)) and outputs a hypothesis h : X → Y, e.g., a small decision tree.]
• The algorithm sees a training sample S: (x1, c*(x1)), …, (xm, c*(xm)), with the xi drawn i.i.d. from D.
• It does optimization over S and finds a hypothesis h (e.g., a decision tree).
• Goal: h has small error over D:
  err(h) = Pr_{x ~ D}( h(x) ≠ c*(x) )
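Since D is unknown, err(h) is in practice estimated on a fresh i.i.d. sample; a small sketch with toy, made-up choices of D, c*, and h:

```python
import random

def empirical_error(h, c_star, draw_x, m=10000):
    """Estimate err(h) = Pr_{x ~ D}[h(x) != c*(x)] by drawing m fresh examples from D."""
    xs = [draw_x() for _ in range(m)]
    return sum(h(x) != c_star(x) for x in xs) / m

# Toy instantiation: D is uniform on [0, 10]^2, c* and h are threshold rules.
draw_x = lambda: (random.uniform(0, 10), random.uniform(0, 10))
c_star = lambda x: +1 if x[0] > 5 else -1       # the unknown target
h      = lambda x: +1 if x[0] > 4.5 else -1     # the learned hypothesis
print(empirical_error(h, c_star, draw_x))       # ≈ Pr[4.5 < x1 <= 5] = 0.05
```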

Two Core Aspects of Machine Learning
• Computation (algorithm design): How to optimize? Automatically generate rules that do well on observed data.
• (Labeled) Data (confidence bounds, generalization): Confidence for rule effectiveness on future data.


The generalization side is very well understood: Occam's bound, VC theory, etc.
• Decision trees: if we can find a small decision tree that explains the data well, then we get good generalization guarantees.
• But finding such a tree is NP-hard [Hyafil-Rivest '76].

Top Down Decision Trees Algorithms
• Decision trees: if we can find a small decision tree consistent with the data, then we get good generalization guarantees.
• But finding the smallest consistent tree is NP-hard [Hyafil-Rivest '76].
• Very nice practical heuristics exist: top-down algorithms, e.g., ID3.
• These are natural greedy approaches: we grow the tree from the root to the leaves by repeatedly replacing an existing leaf with an internal node (a minimal sketch of this loop follows below).

Key point: the splitting criterion.
ID3: split the leaf that decreases the entropy the most.
Why not split according to error rate (this is what we care about, after all)?
• There are examples where we can get stuck in local minima!!! (see the next two slides)
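Here is a minimal ID3-style sketch of this greedy top-down growth (split on the attribute of highest information gain, then recurse on each child); this is my own illustrative code, not the course's reference implementation:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, labels, attributes):
    """Greedy top-down growth: pick the attribute whose split decreases entropy
    the most (maximum information gain), then recurse on each child subset."""
    if len(set(labels)) == 1 or not attributes:        # pure leaf, or no attributes left
        return Counter(labels).most_common(1)[0][0]    # return the majority label
    def gain(a):
        n, rem = len(labels), 0.0
        for v in set(x[a] for x in examples):
            sub = [y for x, y in zip(examples, labels) if x[a] == v]
            rem += (len(sub) / n) * entropy(sub)
        return entropy(labels) - rem
    best = max(attributes, key=gain)                   # ID3's splitting criterion
    tree = {"split_on": best, "children": {}}
    for v in set(x[best] for x in examples):
        sub_x = [x for x in examples if x[best] == v]
        sub_y = [y for x, y in zip(examples, labels) if x[best] == v]
        tree["children"][v] = id3(sub_x, sub_y, [a for a in attributes if a != best])
    return tree

# Tiny made-up example: target is x0 AND x1 over three Boolean attributes.
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
Y = [int(a and b) for a, b, c in X]
print(id3(X, Y, [0, 1, 2]))
```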

Entropy as a better splitting measure
Instances (three Boolean attributes): 000, 001, 010, 011, 100, 101, 110, 111; consider splitting on the first attribute.
Initial error rate is 1/4 (25% positive, 75% negative).
Error rate after the split is ½·0 + ½·½ = 1/4 (the left leaf is 100% negative; the right leaf is 50/50).
Overall error doesn't decrease!

Entropy as a better splitting measure
Same instances and the same split on the first attribute.
Initial entropy is H(1/4) = −¼·log2(¼) − ¾·log2(¾) ≈ 0.811.
Entropy after the split is ½·H(0) + ½·H(½) = ½·0 + ½·1 = 0.5.
Entropy decreases!
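A quick numeric check of the two slides above. The slides do not name the target function, so for concreteness I assume the two positive instances are the ones with the first two bits set (y = x1 AND x2), which matches the stated class proportions:

```python
import math

H   = lambda q: 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)
err = lambda q: min(q, 1 - q)

# 8 instances over 3 bits; assume y = x1 AND x2 (2 of 8 positive, as on the slide).
data = [((a, b, c), int(a and b)) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
pos = lambda rows: sum(y for _, y in rows) / len(rows)   # fraction of positives

left  = [d for d in data if d[0][0] == 0]   # split on the first attribute
right = [d for d in data if d[0][0] == 1]
w = len(left) / len(data)

print(err(pos(data)), w * err(pos(left)) + (1 - w) * err(pos(right)))  # 0.25 -> 0.25: no progress
print(H(pos(data)),   w * H(pos(left))   + (1 - w) * H(pos(right)))    # 0.811 -> 0.5: entropy drops
```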



[Kearns-Mansour '96]: if the measure of progress is entropy, we can always guarantee success under some formal relationship between the class of splits and the target (the class of splits can weakly approximate the target function).
• This provides a way to think about the effectiveness of various top-down algorithms.

Top Down Decision Trees Algorithms
• Key: strong concavity of the splitting criterion.
[Diagram: at node v we consider a split h; Pr[c* = 1] = q at v, Pr[h = 0] = u and Pr[h = 1] = 1 − u; the children v1 (h = 0) and v2 (h = 1) have Pr[c* = 1 | h = 0] = p and Pr[c* = 1 | h = 1] = r.]
• So q = u·p + (1 − u)·r. We want to lower-bound the drop in the splitting criterion G: G(q) − [u·G(p) + (1 − u)·G(r)].
• If G(q) = min(q, 1 − q) (error rate), then we can have G(q) = u·G(p) + (1 − u)·G(r), i.e., the split makes no progress.
• If G(q) = H(q) (entropy), then G(q) − [u·G(p) + (1 − u)·G(r)] > 0 whenever r − p > 0 and u ≠ 1, u ≠ 0 (which holds under the weak learning assumption).
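A small numeric sketch of the quantity on this slide, G(q) − [u·G(p) + (1 − u)·G(r)] with q = u·p + (1 − u)·r, comparing error rate and entropy as the criterion G; the particular values of u, p, r are mine:

```python
import math

H   = lambda q: 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)
err = lambda q: min(q, 1 - q)

def split_progress(G, u, p, r):
    """G(q) - [u*G(p) + (1-u)*G(r)], where q = u*p + (1-u)*r."""
    q = u * p + (1 - u) * r
    return G(q) - (u * G(p) + (1 - u) * G(r))

# With q, p, r all at most 1/2, min(q, 1-q) is linear there, so the error-rate
# criterion shows zero progress; entropy is strictly concave, so it shows
# strictly positive progress whenever p != r.
u, p, r = 0.5, 0.0, 0.5
print(split_progress(err, u, p, r))  # 0.0
print(split_progress(H,   u, p, r))  # > 0 (about 0.311)
```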


What you should know:
• Well-posed function approximation problems:
  – Instance space, X
  – Sample of labeled training data {<x(i), y(i)>}
  – Hypothesis space, H = { f : X → Y }
• Learning is a search/optimization problem over H
  – Various objective functions
    • minimize training error (0-1 loss)
    • among hypotheses that minimize training error, select the smallest (?)
  – But inductive learning without some bias is futile!
• Decision tree learning
  – Greedy top-down learning of decision trees (ID3, C4.5, …)
  – Overfitting and tree post-pruning
  – Extensions…

Extra slides
extensions to decision tree learning

Questions to think about (1)
•ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?

Questions to think about (2)
• Consider a target function f : <x1, x2> → y, where x1 and x2 are real-valued and y is Boolean. What is the set of decision surfaces describable with decision trees that use each attribute at most once?

Questions to think about (3)
•Why use Information Gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?

Questions to think about (4)
• What is the relationship between learning decision trees and learning IF-THEN rules?

Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
Today:
• Review: probability
(many of these slides are derived from William Cohen, Andrew Moore, Aarti Singh, Eric Xing. Thanks!)
Readings:
Probability review
• Bishop Ch. 1 thru 1.2.3
• Bishop, Ch. 2 thru 2.2
• Andrew Moore's online tutorial
January 14, 2015

Probability Overview
• Events
  – discrete random variables, continuous random variables, compound events
• Axioms of probability
  – What defines a reasonable theory of uncertainty
• Independent events
• Conditional probabilities
• Bayes rule and beliefs
• Joint probability distribution
• Expectations
• Independence, conditional independence

Random Variables
• Informally, A is a random variable if
  – A denotes something about which we are uncertain
  – perhaps the outcome of a randomized experiment
• Examples
  A = True if a randomly drawn person from our class is female
  A = The hometown of a randomly drawn person from our class
  A = True if two randomly drawn persons from our class have the same birthday
• Define P(A) as "the fraction of possible worlds in which A is true", or
  "the fraction of times A holds, in repeated runs of the random experiment"
  – the set of possible worlds is called the sample space, S
  – A random variable A is a function defined over S:
    A : S → {0, 1}

A little formalism
More formally, we have
• a sample space S (e.g., the set of students in our class)
  – aka the set of possible worlds
• a random variable is a function defined over the sample space
  – Gender: S → {m, f}
  – Height: S → Reals
• an event is a subset of S
  – e.g., the subset of S for which Gender = f
  – e.g., the subset of S for which (Gender = m) AND (eyeColor = blue)
• we're often interested in probabilities of specific events
• and of specific events conditioned on other specific events

Visualizing A
[Diagram: the sample space of all possible worlds, drawn with total area 1; P(A) is the area of the oval of worlds in which A is true, and the remaining area is the worlds in which A is false.]

The Axioms of Probability
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) − P(A and B)
[di Finetti 1931]: when gambling based on "uncertainty formalism A" you can be exploited by an opponent iff your uncertainty formalism A violates these axioms.

Elementary Probability in Pictures
• P(~A) + P(A) = 1

A useful theorem
• From 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) − P(A and B), it follows that
  P(A) = P(A ^ B) + P(A ^ ~B)
Proof: A = [A and (B or ~B)] = [(A and B) or (A and ~B)], so
  P(A) = P(A and B) + P(A and ~B) − P((A and B) and (A and ~B))
       = P(A and B) + P(A and ~B) − P(A and B and A and ~B)
       = P(A and B) + P(A and ~B)   (the subtracted event is impossible, so its probability is 0)

Elementary Probability in Pictures
• P(A) = P(A ^ B) + P(A ^ ~B)

Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)

Bayes Rule
• Let's write two expressions for P(A ^ B):
  P(A ^ B) = P(A|B) P(B) = P(B|A) P(A)
  ⇒ P(A|B) = P(B|A) P(A) / P(B)
We call P(A) the "prior" and P(A|B) the "posterior".

Bayes' rule: "…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…"
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

Other Forms of Bayes Rule
P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]
P(A | B ∧ X) = P(B | A ∧ X) P(A ∧ X) / P(B ∧ X)

Applying Bayes Rule
P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]
A = you have the flu, B = you just coughed
Assume: P(A) = 0.05, P(B|A) = 0.80, P(B|~A) = 0.2
What is P(flu | cough) = P(A|B)?
(And what does all this have to do with function approximation?)
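To answer the question on the "Applying Bayes Rule" slide, a minimal sketch that plugs the slide's numbers into the two-hypothesis form of Bayes rule (variable names are mine):

```python
# Plugging the slide's numbers into the two-hypothesis form of Bayes rule.
p_flu       = 0.05   # P(A):    prior probability of flu
p_cough_flu = 0.80   # P(B|A):  probability of coughing given flu
p_cough_no  = 0.20   # P(B|~A): probability of coughing given no flu

p_cough = p_cough_flu * p_flu + p_cough_no * (1 - p_flu)   # P(B) by total probability
p_flu_given_cough = p_cough_flu * p_flu / p_cough          # P(A|B), Bayes rule

print(p_flu_given_cough)  # ≈ 0.174: higher than the 0.05 prior, but still small
```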
The Joint Distribution
Recipe for making a joint distribution of M variables (example: Boolean variables A, B, C, shown below):
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

A  B  C   Prob
0  0  0   0.30
0  0  1   0.05
0  1  0   0.10
0  1  1   0.05
1  0  0   0.05
1  0  1   0.10
1  1  0   0.25
1  1  1   0.10

Using the Joint Distribution
Once you have the joint distribution you can ask for the probability of any logical expression E involving your attributes:
P(E) = Σ P(row), summed over the rows matching E
Example (using a larger joint distribution from the original slides, not reproduced here):
P(Poor ∧ Male) = 0.4654
P(Poor) = 0.7604

Inference with the Joint
P(E1 | E2) = P(E1 ∧ E2) / P(E2) = [Σ P(row) over rows matching E1 and E2] / [Σ P(row) over rows matching E2]
P(Male | Poor) = 0.4654 / 0.7604 = 0.612

You should know
• Events
  – discrete random variables, continuous random variables, compound events
• Axioms of probability
  – What defines a reasonable theory of uncertainty
• Conditional probabilities
• Chain rule
• Bayes rule
• Joint distribution over multiple random variables
  – how to calculate other quantities from the joint distribution

Expected values
Given a discrete random variable X, the expected value of X, written E[X], is
  E[X] = Σ_x x · P(X = x)
We can also talk about the expected value of functions of X:
  E[f(X)] = Σ_x f(x) · P(X = x)

Covariance
Given two discrete r.v.'s X and Y, we define the covariance of X and Y as
  Cov(X, Y) = E[ (X − E[X]) (Y − E[Y]) ]
e.g., X = gender, Y = playsFootball; or X = gender, Y = leftHanded
Remember: Cov(X, Y) = E[XY] − E[X]·E[Y].
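A small sketch tying the last few slides together: computing P(E), P(E1 | E2), and a covariance directly from the A, B, C joint table above (the Poor/Male table is not reproduced here, so the event predicates below are my own examples):

```python
# Joint distribution over Boolean A, B, C, copied from the table above.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E) = sum of P(row) over the rows matching event E."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(e1, e2):
    """P(E1 | E2) = P(E1 and E2) / P(E2)."""
    return prob(lambda r: e1(r) and e2(r)) / prob(e2)

def expect(f):
    """E[f] = sum over rows of f(row) * P(row)."""
    return sum(f(row) * p for row, p in joint.items())

A = lambda r: r[0] == 1
B = lambda r: r[1] == 1

print(prob(A))          # P(A)   = 0.05 + 0.10 + 0.25 + 0.10 = 0.50
print(cond_prob(A, B))  # P(A|B) = 0.35 / 0.50 = 0.70
ea, eb = expect(lambda r: r[0]), expect(lambda r: r[1])
print(expect(lambda r: (r[0] - ea) * (r[1] - eb)))  # Cov(A, B) = E[(A-E[A])(B-E[B])] = 0.10
```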
