Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
January 12, 2015
Today:
• What is machine learning?
• Decision tree learning
• Course logistics
Readings:
• “The Discipline of ML”
• Mitchell, Chapter 3
• Bishop, Chapter 14.4
Machine Learning:
Study of algorithms that
• improve their performance P
• at some task T
• with experience E

well-defined learning task: <P, T, E>
Learning to Predict Emergency C-Sections
[Sims et al., 2000]
9714 patient records, each with 215 features
Learning to classify text documents
spam vs. not spam
Learning to detect objects in images
(Prof. H. Schneiderman)
Example training images for each orientation
Learn to classify the word a person is thinking about, based on fMRI brain activity
Learning prosthetic control from neural implant
[R. Kass, L. Castellanos, A. Schwartz]
Machine Learning – Practice
Mining Databases
Text analysis
Speech Recognition
Control learning
Object recognition
• Support Vector Machines
• Bayesian networks
• Hidden Markov models
• Deep neural networks
• Reinforcement learning
• …
Machine Learning – Theory
PAC Learning Theory (supervised concept learning) relates:
• # examples (m)
• error rate (ε)
• representational complexity (H)
• failure probability (δ)

… also relating:
• # of mistakes during learning
• learner’s query strategy
• convergence rate
• asymptotic performance
• bias, variance

Other theories for:
• Reinforcement skill learning
• Semi-supervised learning
• Active student querying
• …
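A representative PAC-style result ties the first four quantities together. The following is a sketch of the standard sample-complexity bound for a consistent learner over a finite hypothesis class, not a result stated on the slide itself:

```latex
% For a consistent learner and a finite hypothesis class H: with probability
% at least 1 - \delta, every hypothesis consistent with m training examples
% has true error at most \epsilon, provided
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```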
Machine Learning in Computer Science
• Machine learning already the preferred approach to
– Speech recognition, Natural language processing
– Computer vision
– Medical outcomes analysis
– Robot control
– …
•This ML niche is growing (why?)
[Figure: ML apps. as a growing niche within all software apps.]
•This ML niche is growing
–Improved machine learning algorithms
–Increased volume of online data
–Increased demand for self-customizing software
Tom’s prediction: ML will be fastest-growing part of CS this century
[Figure: disciplines surrounding machine learning — Computer science, Statistics, Economics and Organizational Behavior, Evolution, Animal learning (Cognitive science, Psychology, Neuroscience), Adaptive Control Theory]
What You’ll Learn in This Course
•The primary Machine Learning algorithms
–Logistic regression, Bayesian methods, HMMs, SVMs, reinforcement learning, decision tree learning, boosting, unsupervised clustering, …
•How to use them on real data
–text, image, structured data
–your own project
•Underlying statistical and computational theory
•Enough to read and understand ML research papers
Course logistics
Machine Learning 10-601
website: www.cs.cmu.edu/~ninamf/courses/601sp15
Faculty
• Maria Balcan
• Tom Mitchell
TA’s
• Travis Dick
• Kirstin Early
• Ahmed Hefny
• Micol Marchetti-Bowick
• Willie Neiswanger
• Abu Saparov
Course assistant
• Sharon Cavlovich
See webpage for
• Office hours
• Syllabus details
• Recitation sessions
• Grading policy
• Honesty policy
• Late homework policy
• Piazza pointers
• …
Highlights of Course Logistics
On the wait list?
• Hang in there for the first few weeks
Homework 1
• Available now, due Friday
Grading:
•30% homeworks (~5-6)
•20% course project
•25% first midterm (March 2)
•25% final midterm (April 29)
Academic integrity:
• Cheating → fail class, be expelled from CMU
Late homework:
•full credit when due
•half credit next 48 hrs
•zero credit after that
•we’ll delete your lowest HW score
•must turn in at least n-1 of the n homeworks, even if late
Being present at exams:
•You must be there – plan now.
•Two in-class exams, no other final
Maria-Florina Balcan: Nina
• Foundations for Modern Machine Learning
– E.g., interactive, distributed, life-long learning
• Theoretical Computer Science, especially connections between learning theory & other fields
[Figure: Machine Learning Theory and connected fields — Approx. Algorithms, Control Theory, Game Theory, Mechanism Design, Discrete Optimization, Matroid Theory]
Travis Dick
• When can we learn many concepts from mostly unlabeled data by exploiting relationships between concepts?
• Currently: geometric relationships
Kirstin Early
• Analyzing and predicting energy consumption
• Reduce costs/usage and help people make informed decisions
Predicting energy costs from features of home and occupant behavior
Energy disaggregation: decomposing total electric signal into individual appliances
Ahmed Hefny
• How can we learn to track and predict the state of a dynamical system only from noisy observations?
• Can we exploit supervised learning methods to devise a flexible, local-minima-free approach?
[Figure: observations (oscillating pendulum) and extracted 2D state trajectory]
Micol Marchetti-Bowick
How can we use machine learning for biological and medical research?
• Using genotype data to build personalized models that can predict clinical outcomes
• Integrating data from multiple sources to perform cancer subtype analysis
• Structured sparse regression models for genome-wide association studies
[Figure: structured sparse regression with sample weights from genetic relatedness; gene expression data with dendrogram]
Willie Neiswanger
• If we want to apply machine learning algorithms to BIG datasets…
• How can we develop parallel, low-communication machine learning algorithms?
• Such as embarrassingly parallel algorithms, where machines work independently, without communication.
Abu Saparov
•How can knowledge about the world help computers understand natural language?
•What kinds of machine learning tools are needed to understand sentences?
“Carolyn ate the cake with a fork.” “Carolyn ate the cake with vanilla.”
person_eats_food(consumer: Carolyn, food: cake, instrument: fork)
person_eats_food(consumer: Carolyn, food: cake, topping: vanilla)
Tom Mitchell
How can we build never-ending learners? Case study: never-ending language learner
(NELL) runs 24×7 to learn to read the web
see http://rtw.ml.cmu.edu
[Figures: # of beliefs vs. time (5 years); reading accuracy (mean avg. precision of top 1000 beliefs) vs. time (5 years)]
Function Approximation and Decision tree learning
Function approximation
Problem Setting:
• Set of possible instances X
• Unknown target function f : X → Y
• Set of function hypotheses H = { h | h : X → Y }

Input:
• Training examples { <x(i), y(i)> } of unknown target function f (superscript (i): ith training example)

Output:
• Hypothesis h ∈ H that best approximates target function f
Simple Training Data Set
Day Outlook Temperature Humidity Wind PlayTennis?
A Decision tree for f : <Outlook, Temperature, Humidity, Wind> → PlayTennis?
Each internal node: test one discrete-valued attribute Xi
Each branch from a node: selects one value for Xi
Each leaf node: predict Y (or P(Y | x ∈ leaf))
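A tree like this can be written down concretely as nested attribute tests. The dict-based encoding below is an illustrative assumption, not the course's notation; the tree itself follows the PlayTennis example in Mitchell, Chapter 3:

```python
# A decision tree as nested structures:
# internal node = (attribute, {value: subtree}), leaf = predicted label.
def predict(tree, x):
    """Sort example x (a dict of attribute -> value) down the tree to a leaf."""
    while isinstance(tree, tuple):      # internal node: test one attribute Xi
        attribute, branches = tree
        tree = branches[x[attribute]]   # follow the branch matching x's value
    return tree                         # leaf node: predict Y

tennis_tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

print(predict(tennis_tree, {"Outlook": "Sunny", "Humidity": "High"}))  # -> No
```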
Decision Tree Learning
Problem Setting:
• Set of possible instances X
– each instance x in X is a feature vector
– e.g.,
• Unknown target function f : X → Y
– Y = 1 if we play tennis on this day, else 0
• Set of function hypotheses H = { h | h : X → Y }
– each hypothesis h is a decision tree
– trees sort x to a leaf, which assigns y
Decision Tree Learning
Problem Setting:
• Set of possible instances X
– each instance x in X is a feature vector x = < x1, x2 … xn >
• Unknown target function f : X → Y
– Y is discrete-valued
• Set of function hypotheses H = { h | h : X → Y }
– each hypothesis h is a decision tree

Input:
• Training examples { <x(i), y(i)> } of unknown target function f

Output:
• Hypothesis h ∈ H that best approximates target function f
Decision Trees
Suppose X = <X1, … Xn>
where Xi are boolean-valued variables

How would you represent Y = X2 X5 ?   Y = X2 ∨ X5 ?
How would you represent X2 X5 ∨ X3 X4 (¬X1) ?
node = Root
[ID3, C4.5, Quinlan]
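The greedy top-down loop that ID3 and C4.5 share can be sketched as follows. This is a simplified illustration (dict-based examples, ties broken by attribute-list order), not Quinlan's exact implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, labels, attributes):
    """Greedy top-down tree induction: pick the best attribute, recurse."""
    if len(set(labels)) == 1:                    # pure node -> leaf
        return labels[0]
    if not attributes:                           # no tests left -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    def cond_entropy(a):                         # H(Y | A = a), estimated on the sample
        total = 0.0
        for v in set(x[a] for x in examples):
            sub = [y for x, y in zip(examples, labels) if x[a] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total

    best = min(attributes, key=cond_entropy)     # max information gain
    branches = {}
    for v in set(x[best] for x in examples):
        sub_x = [x for x in examples if x[best] == v]
        sub_y = [y for x, y in zip(examples, labels) if x[best] == v]
        branches[v] = id3(sub_x, sub_y, [a for a in attributes if a != best])
    return (best, branches)

examples = [{"Humidity": "High",   "Wind": "Weak"},
            {"Humidity": "High",   "Wind": "Strong"},
            {"Humidity": "Normal", "Wind": "Weak"},
            {"Humidity": "Normal", "Wind": "Strong"}]
labels = ["No", "No", "Yes", "Yes"]
tree = id3(examples, labels, ["Humidity", "Wind"])
print(tree[0])  # -> Humidity
```

On this toy sample the label is fully determined by Humidity, so the sketch picks it at the root and stops with two pure leaves.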
Sample Entropy
Entropy
Entropy H(X) of a random variable X:
H(X) = − Σi P(X = i) log2 P(X = i)   (sum over the n possible values of X)
H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)
Why? Information theory:
• The most efficient possible code assigns −log2 P(X = i) bits to encode the message X = i
• So, the expected number of bits to code one random X is:
Σi P(X = i) · (−log2 P(X = i)) = H(X)
Entropy
Entropy H(X) of a random variable X:
H(X) = − Σi P(X = i) log2 P(X = i)
Specific conditional entropy H(X | Y = v) of X given Y = v:
H(X | Y = v) = − Σi P(X = i | Y = v) log2 P(X = i | Y = v)
Conditional entropy H(X | Y) of X given Y:
H(X | Y) = Σv P(Y = v) H(X | Y = v)
Mutual information (aka Information Gain) of X and Y:
I(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)
Information Gain is the mutual information between input attribute A and target variable Y.
Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A:
Gain(S, A) = H_S(Y) − H_S(Y | A)
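Worked numerically with the counts from Mitchell's PlayTennis sample (9 positive and 5 negative examples; splitting on Wind gives Weak = 6+/2− and Strong = 3+/3−). The `entropy` helper is an assumption defined here for the sketch:

```python
import math

def entropy(pos, neg):
    """Entropy (bits) of a boolean-labeled sample with pos/neg counts."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:                       # 0 * log2(0) contributes nothing
            p = c / total
            h -= p * math.log2(p)
    return h

h_s = entropy(9, 5)                 # entropy of the full sample S

# Gain(S, Wind) = H_S(Y) - H_S(Y | Wind)
gain_wind = h_s - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)

print(round(h_s, 3), round(gain_wind, 3))  # -> 0.94 0.048
```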
Final Decision Tree for f : <Outlook, Temperature, Humidity, Wind> → PlayTennis?
Each internal node: test one discrete-valued attribute Xi
Each branch from a node: selects one value for Xi
Each leaf node: predict Y
Which Tree Should We Output?
• ID3 performs heuristic search through the space of decision trees
• It stops at the smallest acceptable tree. Why?
Occam’s razor: prefer the simplest hypothesis that fits the data
Why Prefer Short Hypotheses? (Occam’s Razor) Arguments in favor:
Arguments opposed:
Why Prefer Short Hypotheses? (Occam’s Razor)
Argument in favor:
• Fewer short hypotheses than long ones
→ a short hypothesis that fits the data is less likely to be a statistical coincidence
→ highly probable that a sufficiently complex hypothesis will fit the data

Argument opposed:
• Also fewer hypotheses with a prime number of nodes and attributes beginning with “Z”
• What’s so special about “short” hypotheses?
Overfitting
Consider a hypothesis h and its
• Error rate over training data: error_train(h)
• True error rate over all data: error_true(h)

We say h overfits the training data if error_train(h) < error_true(h)
Amount of overfitting = error_true(h) − error_train(h)
Reduced-Error Pruning:
Split data into training and validation set
Create tree that classifies training set correctly
Do until further pruning is harmful:
1. Evaluate impact on validation set of pruning each possible node (plus the subtree below it)
2. Greedily remove the node that most improves validation set accuracy
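A minimal sketch of this train/validation pruning idea, using a dict-based tree encoding. The `predict` helper, the tree shape, and passing a single `majority` label are simplifying assumptions for illustration (real implementations use the majority class of the training examples at each node):

```python
def predict(tree, x):
    """Sort example x down the tree; leaves are plain labels."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[x[attribute]]
    return tree

def accuracy(tree, validation):
    """Fraction of (example, label) pairs the tree classifies correctly."""
    return sum(predict(tree, x) == y for x, y in validation) / len(validation)

def prune(tree, validation, majority):
    """Bottom-up: replace a subtree with a majority-label leaf whenever
    that does not hurt accuracy on the held-out validation set."""
    if not isinstance(tree, tuple):
        return tree
    attribute, branches = tree
    tree = (attribute, {v: prune(sub, validation, majority)
                        for v, sub in branches.items()})
    if accuracy(majority, validation) >= accuracy(tree, validation):
        return majority        # pruning does not hurt: prefer the smaller tree
    return tree

# A tree that fit noise: on validation data, "Strong" wind days are also "Yes".
overfit = ("Wind", {"Weak": "Yes", "Strong": "No"})
validation = [({"Wind": "Weak"}, "Yes"), ({"Wind": "Strong"}, "Yes")]
print(prune(overfit, validation, "Yes"))  # -> Yes
```

When the validation set instead agrees with the tree's split, the same call leaves the tree intact, so pruning only removes structure the held-out data cannot justify.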
You should know:
• Well-posed function approximation problems:
– Instance space, X
– Sample of labeled training data { <x(i), y(i)> }
– Hypothesis space, H = { f : X → Y }
• Learning is a search/optimization problem over H
– Various objective functions
• minimize training error (0-1 loss)
• among hypotheses that minimize training error, select smallest (?)
• Decision tree learning
– Greedy top-down learning of decision trees (ID3, C4.5, …)
– Overfitting and tree/rule post-pruning
– Extensions…
Questions to think about (1)
•ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?
Questions to think about (2)
•Consider target function f:
Questions to think about (3)
•Why use Information Gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?
Questions to think about (4)
• What is the relationship between learning decision trees and learning IF-THEN rules?