Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
January 12, 2015
Today:
What is machine learning?
Decision tree learning
Course logistics
Readings:
"The Discipline of ML" (Mitchell)
Mitchell, Chapter 3
Bishop, Chapter 14.4
Machine Learning:
Study of algorithms that
improve their performance P at some task T
with experience E
well-defined learning task: <P, T, E>
Learning to Predict Emergency C-Sections
[Sims et al., 2000]
9714 patient records, each with 215 features
Learning to classify text documents
spam vs. not spam
Learning to detect objects in images
(Prof. H. Schneiderman)
Example training images for each orientation
Learn to classify the word a person is thinking about, based on fMRI brain activity
Learning prosthetic control from neural implant
[R. Kass, L. Castellanos, A. Schwartz]
Machine Learning Practice
Mining Databases
Text analysis
Speech Recognition
Control learning
Object recognition
Support Vector Machines Bayesian networks
Hidden Markov models
Deep neural networks
Reinforcement learning, ...
Machine Learning Theory
PAC Learning Theory (supervised concept learning) relates:
  # of training examples (m)
  error rate (ε)
  representational complexity (H)
  failure probability (δ)
(see the sample-complexity bound below)
Other theories for:
  Reinforcement skill learning
  Semi-supervised learning
  Active student querying
also relating:
  # of mistakes made during learning
  learner's query strategy
  convergence rate
  asymptotic performance
  bias, variance
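The slide relates these quantities without reproducing the formula here; as a reminder, the standard PAC sample-complexity bound for a consistent learner over a finite hypothesis space H (the textbook form, stated for orientation rather than taken from the slide) is:

    m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

i.e., with probability at least 1 − δ, every h in H that is consistent with m such training examples has true error at most ε.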
Machine Learning in Computer Science
Machine learning is already the preferred approach to:
Speech recognition, natural language processing
Computer vision
Medical outcomes analysis
Robot control
This ML niche is growing (why?)
[Figure: ML applications as a growing niche within all software applications]
This ML niche is growing
Improved machine learning algorithms
Increased volume of online data
Increased demand for self-customizing software
Tom's prediction: ML will be the fastest-growing part of CS this century
[Diagram: Machine Learning at the intersection of Computer Science and Statistics, drawing on Economics and Organizational Behavior, Evolution, Animal Learning (Cognitive Science, Psychology, Neuroscience), and Adaptive Control Theory]
What You'll Learn in This Course
The primary Machine Learning algorithms
Logistic regression, Bayesian methods, HMMs, SVMs, reinforcement learning, decision tree learning, boosting, unsupervised clustering, ...
How to use them on real data: text, image, structured data
Your own project
Underlying statistical and computational theory
Enough to read and understand ML research papers
Course logistics
Machine Learning 10-601
website: www.cs.cmu.edu/~ninamf/courses/601sp15
Faculty
Maria Balcan, Tom Mitchell
TAs
Travis Dick
Kirstin Early
Ahmed Hefny
Micol Marchetti-Bowick
Willie Neiswanger
Abu Saparov
Course assistant
Sharon Cavlovich
See webpage for
Office hours
Syllabus details
Recitation sessions
Grading policy
Honesty policy
Late homework policy
Piazza pointers
Highlights of Course Logistics
On the wait list?
Hang in there for the first few weeks
Homework 1:
Available now, due Friday
Grading:
30% homeworks (~5-6)
20% course project
25% first midterm (March 2)
25% final midterm (April 29)
Academic integrity:
Cheating: fail the class, possibly be expelled from CMU
Late homework:
full credit when due
half credit next 48 hrs
zero credit after that
we'll delete your lowest HW score
must turn in at least n-1 of the n homeworks, even if late
Being present at exams:
You must be there, so plan now.
Two in-class exams, no other final
Maria-Florina Balcan: Nina
Foundations for Modern Machine Learning
E.g., interactive, distributed, life-long learning
Theoretical Computer Science, especially connections between learning theory & other fields
Approx. Algorithms
Control Theory
Game Theory
Machine Learning Theory
Mechanism Design
Discrete Optimization
Matroid Theory
Travis Dick
When can we learn many concepts from mostly unlabeled data by exploiting relationships between concepts?
Currently: geometric relationships
Kirstin Early
Analyzing and predicting energy consumption
Reduce costs/usage and help people make informed decisions
Predicting energy costs
from features of home and occupant behavior
Energy disaggregation:
decomposing total electric signal into individual appliances
Ahmed Hefny
How can we learn to track and predict the state of a dynamical system only from noisy observations?
Can we exploit supervised learning methods to devise a flexible, local-minima-free approach?
[Figure: observations of an oscillating pendulum and the extracted 2D state trajectory]
Micol Marchetti-Bowick
How can we use machine learning for biological and medical research?
Using genotype data to build personalized models that can predict clinical outcomes
Integrating data from multiple sources to perform cancer subtype analysis
Structured sparse regression models for genome-wide association studies
[Figures: per-sample weights based on genetic relatedness; gene expression data with dendrogram]
Willie Neiswanger
If we want to apply machine learning algorithms to BIG datasets...
How can we develop parallel, low-communication machine learning algorithms?
Such as embarrassingly parallel algorithms, where machines work independently, without communication.
Abu Saparov
How can knowledge about the world help computers understand natural language?
What kinds of machine learning tools are needed to understand sentences?
Carolyn ate the cake with a fork. Carolyn ate the cake with vanilla.
person_eats_food(consumer: Carolyn, food: cake, instrument: fork)
person_eats_food(consumer: Carolyn, food: cake, topping: vanilla)
Tom Mitchell
How can we build never-ending learners? Case study: never-ending language learner
(NELL) runs 24/7 to learn to read the web
see http://rtw.ml.cmu.edu
[Plots: # of beliefs vs. time (5 years); reading accuracy (mean avg. precision of top 1000) vs. time (5 years)]
Function Approximation and Decision tree learning
Function approximation
Problem Setting:
  Set of possible instances X
  Unknown target function f: X → Y
  Set of function hypotheses H = {h | h: X → Y}
Input:
  Training examples {<x(i), y(i)>} of unknown target function f (superscript (i): ith training example)
Output:
  Hypothesis h ∈ H that best approximates target function f
Simple Training Data Set
(the 14 PlayTennis training examples from Mitchell, Chapter 3)
Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
A Decision Tree for
f: <Outlook, Temperature, Humidity, Wind> → PlayTennis?
Each internal node: tests one discrete-valued attribute Xi
Each branch from a node: selects one value for Xi
Each leaf node: predicts Y (or P(Y | X ∈ leaf))
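To make this structure concrete, here is a minimal Python sketch (not from the slides) of such a tree stored as nested dictionaries, together with the sorting-to-a-leaf step; the particular tree and attribute values are illustrative.

# Each internal node tests one discrete-valued attribute Xi; each branch
# selects one value of Xi; each leaf predicts Y.
tree = {"attr": "Outlook",
        "branches": {"Sunny":    {"attr": "Humidity",
                                  "branches": {"High": "No", "Normal": "Yes"}},
                     "Overcast": "Yes",
                     "Rain":     {"attr": "Wind",
                                  "branches": {"Strong": "No", "Weak": "Yes"}}}}

def classify(node, x):
    """Sort instance x (a dict of attribute values) down the tree to a leaf."""
    while isinstance(node, dict):          # internal node: follow the branch for x's value
        node = node["branches"][x[node["attr"]]]
    return node                            # leaf: the predicted label

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))   # -> "No"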
Decision Tree Learning
Problem Setting:
  Set of possible instances X
    each instance x in X is a feature vector
    e.g., the attribute values <Outlook, Temperature, Humidity, Wind> describing one day
  Unknown target function f: X → Y
    Y = 1 if we play tennis on this day, else 0
  Set of function hypotheses H = {h | h: X → Y}
    each hypothesis h is a decision tree
    a tree sorts x to a leaf, which assigns y
Decision Tree Learning
Problem Setting:
  Set of possible instances X
    each instance x in X is a feature vector x = <x1, x2 ... xn>
  Unknown target function f: X → Y
    Y is discrete-valued
  Set of function hypotheses H = {h | h: X → Y}
    each hypothesis h is a decision tree
Input:
  Training examples {<x(i), y(i)>} of unknown target function f
Output:
  Hypothesis h ∈ H that best approximates target function f
Decision Trees
Suppose X = <X1, ..., Xn>
where each Xi is a boolean-valued variable

How would you represent Y = X2 ∧ X5?  Y = X2 ∨ X5?
How would you represent Y = X2 ∧ X5 ∨ X3 ∧ X4(¬X1)?
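One way to see the answer for the first two cases is to write the trees as nested attribute tests; the following Python sketch (mine, not the slide's figure) encodes Y = X2 ∧ X5 and Y = X2 ∨ X5 as depth-2 trees over binary attributes.

def tree_and(x):               # Y = X2 AND X5: test X2 at the root, X5 below its "1" branch
    if x["X2"] == 1:
        return 1 if x["X5"] == 1 else 0
    return 0                   # the X2 = 0 branch is already a 0 leaf

def tree_or(x):                # Y = X2 OR X5: X5 only needs testing on the X2 = 0 branch
    if x["X2"] == 1:
        return 1
    return 1 if x["X5"] == 1 else 0

print(tree_and({"X2": 1, "X5": 0}), tree_or({"X2": 1, "X5": 0}))   # prints: 0 1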
Top-Down Induction of Decision Trees [ID3, C4.5, Quinlan]
node = Root
Main loop: pick the "best" attribute to test at the current node, create a descendant node for each value of that attribute, sort the training examples down to the descendants, and repeat at each descendant until the training examples at a node are (acceptably) classified.
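A minimal, self-contained Python sketch of this greedy top-down loop (illustrative only: it omits C4.5-style refinements such as pruning, continuous attributes, and missing values, and the helper names are mine; entropy and information gain are defined formally on the following slides).

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the empirical label distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, target):
    """Expected reduction in entropy of `target` from splitting on `attr`."""
    remainder = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([ex[target] for ex in examples]) - remainder

def id3(examples, attributes, target):
    """Greedy top-down induction in the spirit of ID3; returns a nested-dict tree or a leaf label."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                    # node is pure: make a leaf
        return labels[0]
    if not attributes:                           # nothing left to test: majority-label leaf
        return max(set(labels), key=labels.count)
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    node = {"attr": best, "branches": {}}
    for v in set(ex[best] for ex in examples):   # one branch per observed value of `best`
        subset = [ex for ex in examples if ex[best] == v]
        node["branches"][v] = id3(subset, [a for a in attributes if a != best], target)
    return node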
Sample Entropy
Entropy(S) = -p+ log2 p+ - p- log2 p-
where S is a sample of training examples, and p+ and p- are the proportions of positive and negative examples in S.
Entropy
Entropy H(X) of a random variable X:
H(X) = - Σi=1..n P(X=i) log2 P(X=i),   where n = # of possible values for X
H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)
Why? Information theory:
the most efficient possible code assigns -log2 P(X=i) bits to encode the message X=i
so the expected number of bits to code one random X is:
Σi P(X=i) (-log2 P(X=i)) = - Σi P(X=i) log2 P(X=i)
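A small numeric check of this definition (a sketch; the example distributions are mine, not from the slides):

from math import log2

def entropy(probs):
    """H(X) = -sum_i P(X=i) * log2 P(X=i): expected code length in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # 1.0 bit: a fair coin needs one bit per outcome
print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits: optimal code lengths are 1, 2, 2
print(entropy([1.0]))               # 0.0 bits: a certain outcome needs no bits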
Entropy
Entropy H(X) of a random variable X:
H(X) = - Σi P(X=i) log2 P(X=i)
Specific conditional entropy H(X | Y=v) of X given Y=v:
H(X | Y=v) = - Σi P(X=i | Y=v) log2 P(X=i | Y=v)
Conditional entropy H(X | Y) of X given Y:
H(X | Y) = Σv P(Y=v) H(X | Y=v)
Mutual information (aka Information Gain) of X and Y:
I(X, Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
Information Gain is the mutual information between input attribute A and target variable Y.
Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A:
Gain(S, A) = H_S(Y) - H_S(Y | A)
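A minimal Python sketch of these quantities estimated from a data sample S (the toy sample and the function names are illustrative, not from the slides):

from collections import Counter
from math import log2

def entropy(values):
    """Empirical entropy of a list of observed values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(S, A, Y):
    """H(Y | A) = sum_v P(A=v) * H(Y | A=v), estimated from sample S."""
    total = 0.0
    for v in set(ex[A] for ex in S):
        Sv = [ex[Y] for ex in S if ex[A] == v]
        total += len(Sv) / len(S) * entropy(Sv)
    return total

def information_gain(S, A, Y):
    """Gain(S, A) = H(Y) - H(Y | A): mutual information of A and Y in S."""
    return entropy([ex[Y] for ex in S]) - conditional_entropy(S, A, Y)

# Toy sample in the spirit of the PlayTennis data (values made up for illustration):
S = [{"Wind": "Weak", "Play": "Yes"}, {"Wind": "Weak", "Play": "Yes"},
     {"Wind": "Strong", "Play": "No"}, {"Wind": "Strong", "Play": "Yes"}]
print(information_gain(S, "Wind", "Play"))   # positive: knowing Wind reduces uncertainty about Play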
Simple Training Data Set
(the 14-example PlayTennis table shown earlier)
Final Decision Tree for
f: <Outlook, Temperature, Humidity, Wind> → PlayTennis?
Each internal node: tests one discrete-valued attribute Xi
Each branch from a node: selects one value for Xi
Each leaf node: predicts Y
Which Tree Should We Output?
ID3 performs a heuristic search through the space of decision trees.
It stops at the smallest acceptable tree. Why?
Occam's razor: prefer the simplest hypothesis that fits the data.
Why Prefer Short Hypotheses? (Occam's Razor)
Arguments in favor:
Arguments opposed:
Why Prefer Short Hypotheses? (Occam's Razor)
Argument in favor:
Fewer short hypotheses than long ones
  a short hypothesis that fits the data is less likely to be a statistical coincidence
  it is highly probable that a sufficiently complex hypothesis will fit the data
Argument opposed:
There are also fewer hypotheses with a prime number of nodes and attributes beginning with "Z"
What's so special about short hypotheses?
Overfitting
Consider a hypothesis h and its
  error rate over training data: error_train(h)
  true error rate over all data: error_true(h)
We say h overfits the training data if error_true(h) > error_train(h)
Amount of overfitting = error_true(h) - error_train(h)
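A minimal sketch of these quantities in Python, approximating the true error on held-out data (the function names are mine, not the slide's notation):

def error_rate(h, data):
    """0-1 loss of hypothesis h on labeled examples (x, y)."""
    return sum(h(x) != y for x, y in data) / len(data)

def overfitting_amount(h, train_data, heldout_data):
    """Estimate of error_true(h) - error_train(h); h overfits when this is positive."""
    return error_rate(h, heldout_data) - error_rate(h, train_data)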
Reduced-error pruning (to avoid overfitting):
Split data into a training set and a validation set.
Create a tree that classifies the training set correctly, then greedily prune nodes as long as validation-set accuracy does not decrease.
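A minimal Python sketch of such a post-pruning pass (reduced-error pruning in spirit), reusing the nested-dict tree representation from the earlier sketches; it assumes every attribute value reached in the validation data has a branch in the tree, and the helper names are mine.

def classify(node, x):
    """Sort instance x down the tree to a leaf label."""
    while isinstance(node, dict):
        node = node["branches"][x[node["attr"]]]
    return node

def majority_label(examples, target):
    labels = [ex[target] for ex in examples]
    return max(set(labels), key=labels.count)

def prune(node, train, validation, target):
    """Bottom-up: replace a subtree by a majority-label leaf whenever the leaf is
    at least as accurate as the subtree on the validation examples reaching it."""
    if not isinstance(node, dict) or not train or not validation:
        return node
    attr = node["attr"]
    for v in list(node["branches"]):                      # prune children first
        node["branches"][v] = prune(node["branches"][v],
                                    [ex for ex in train if ex[attr] == v],
                                    [ex for ex in validation if ex[attr] == v],
                                    target)
    leaf = majority_label(train, target)
    subtree_acc = sum(classify(node, ex) == ex[target] for ex in validation) / len(validation)
    leaf_acc = sum(leaf == ex[target] for ex in validation) / len(validation)
    return leaf if leaf_acc >= subtree_acc else node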
You should know:
Well-posed function approximation problems:
  instance space, X
  sample of labeled training data {<x(i), y(i)>}
  hypothesis space, H = {f: X → Y}
Learning is a search/optimization problem over H
  various objective functions:
    minimize training error (0-1 loss)
    among hypotheses that minimize training error, select the smallest (?)
Decision tree learning:
  Greedy top-down learning of decision trees (ID3, C4.5, ...)
  Overfitting and tree/rule post-pruning
  Extensions...
Questions to think about (1)
ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?
Questions to think about (2)
Consider target function f: <x1, x2> → y, where x1 and x2 are real-valued and y is boolean. What is the set of decision surfaces describable by decision trees over these attributes?
Questions to think about (3)
Why use Information Gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?
Questions to think about (4)
What is the relationship between learning decision trees and learning IF-THEN rules?