Classification

What is classification?
Examples:
Analyze data to learn which loan applicants are risky and which are safe
Analyze breast cancer data to predict which of three specific treatments is most effective
The classification task is aimed at predicting the class (category or label) of data
It is a prediction problem
Classes are discrete values that are not characterized by an order relationship

Classification: a two-step process
Learning (training) step
Each tuple/sample is assumed to belong to a predefined class, as determined by the
class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Classification step
Estimate accuracy of the model
The known label of a test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
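The two steps can be exercised end to end with standard tooling. A minimal sketch, assuming scikit-learn is available and using its bundled iris data purely as a stand-in for a labeled training set (neither the library nor the dataset comes from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set (otherwise overfitting).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)    # learning (training) step

y_pred = model.predict(X_test)                            # classification step
print("accuracy rate:", accuracy_score(y_test, y_pred))   # fraction of test samples correctly classified

# If the accuracy is acceptable, model.predict can be applied to new, unseen tuples.
```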

Learning step: example
(Figure: training data with attributes name, rank, and years and class label tenured is fed to a classification algorithm, producing a classifier (model). One of the learned classification rules is shown below.)
IF rank = professor OR years > 6
THEN tenured = yes

Classification step: example
(Figure: the classifier is first applied to testing data to estimate its accuracy, and is then used on unseen data. For example, the unseen tuple (Jeff, Professor, 4) is classified as tenured = yes.)

Decision tree (1)
Flow-chart like structure
Each internal node denotes a test on an attribute
Each branch represents an outcome of the test
Each leaf node represents a label/class
Given a tuple X, its attribute values are tested against the decision tree, visiting a path in the tree
The leaf node reached holds the predicted class for X
Decision trees can easily be converted into classification rules

Decision tree (2)
Why are decision tree classifiers so popular?
Their construction does not require domain knowledge or parameter setting
They can handle multidimensional data
Their representation is intuitive
They usually have good accuracy

Decision tree: example
(Figure: a decision tree for the buys_computer data. The root tests age; the <=30 branch leads to a test on student?, the 31..40 branch leads directly to yes, and the >40 branch leads to a test on credit rating? with outcomes excellent and fair.)

Decision tree induction: algorithm
Principle
Greedy algorithm
Tree is constructed in a top-down recursive divide-and-conquer manner
Algorithm
At start, all the training examples are at the root
Tuples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure
(e.g., information gain)
Stopping conditions
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning (in this case, majority voting is
employed for classifying the leaf)
There are no tuples left in a branch
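A compact sketch of this greedy, top-down, divide-and-conquer induction for categorical attributes, using information gain (defined in the following slides) as the selection measure. Tuples are plain dicts here; a fuller implementation would also handle the empty-branch case by falling back to the parent's majority class:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by partitioning the rows on attribute attr."""
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    expected = sum(len(ls) / len(labels) * entropy(ls) for ls in parts.values())
    return entropy(labels) - expected

def build_tree(rows, labels, attrs):
    # Stopping condition: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition: no remaining attributes -> majority voting.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: select the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return node
```

Calling build_tree(rows, labels, attribute_names) on a labeled table returns a nested dict representing the tree; leaves are class labels.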

Partitioning scenarios
Discrete values
Continuous values
Discrete values producing a binary tree

Decision tree induction: example (1)
(Figure: the buys_computer training table, with attributes age, income, student, and credit_rating, sits at the root; splitting on age partitions the tuples into three branches, each shown with its own sub-table.)

Decision tree induction: example (2)
(Figure: the partial tree after the first split; the <=30 branch is further split on student?, the 31..40 branch is already pure, and the >40 branch is split on credit rating? with outcomes excellent and fair.)

Decision tree induction: example (3)
(Figure: the final decision tree, the same tree shown in the earlier "Decision tree: example" slide.)

Attribute Selection Measure
Heuristic for selecting the splitting criterion that best separates a given data partition of class-labeled training tuples into individual classes
Ideally each partition would be pure, that is, all tuples in a given partition belong to the same class
Provides a ranking of attributes
Select the attribute with the highest score for the measure
Determine a split point or a splitting subset
Split tuples according to the selected criterion
Popular measures: information gain, gain ratio, Gini index

Information gain: entropy
Entropy measures the uncertainty associated with a random variable
For a discrete random variable Y taking m distinct values {y1,…,ym}:
H(Y) = -Σ_{i=1}^{m} pi log(pi)
where pi = P(Y=yi)
Higher entropy → higher uncertainty; lower entropy → lower uncertainty
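A minimal sketch of the entropy computation, taking the distribution of Y as a list of probabilities (base-2 logarithm, as used for Info(D) below):

```python
from math import log2

def entropy(probs):
    """H(Y) = -sum(p_i * log2(p_i)); terms with p_i = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 -> highest uncertainty for two equally likely values
print(entropy([1.0, 0.0]))   # 0.0 -> no uncertainty
```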

Information gain
Choose the attribute with highest information gain
Minimizes the information needed to classify the tuples in the resulting
partitions
Reflects least randomness or impurity in partitions
Information gain minimizes the expected number of tests for classification

Information gain: Info(D)
Expected information needed to classify a tuple in D is
Info(D) = -Σ_{i=1}^{m} pi log2(pi)
m: number of classes
pi: probability that an arbitrary tuple in D belongs to class Ci, estimated as
|Ci,D|/|D| where Ci,D is the set of tuples in class Ci in D
log2: base 2 is used since information is encoded in bits
Info(D) is the entropy of D
Info(D) measures the average amount of information needed to identify the class label of a tuple in D

Info(D): example
(Training data: the 14-tuple buys_computer dataset with attributes age, income, student, and credit_rating; the class attribute is buys_computer.)
C1 represents yes, |C1| = 9
C2 represents no, |C2| = 5
Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Information gain: InfoA(D)
If A is a discrete attribute with v values {a1,…,av}, D is split into v partitions {D1,…,Dv}
Measure the amount of information still needed to obtain an exact classification after attribute A has been used for partitioning
InfoA(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)
It measures the expected information required to classify a tuple in D based on the partitioning of A
The smaller the expected information still required, the greater the purity of the partitions

InfoA(D): example
(Same buys_computer training data as above.)
Consider attribute age
D1 represents youth (2 yes, 3 no)
D2 represents middle aged (4 yes, 0 no)
D3 represents senior (3 yes, 2 no)
Infoage(D) = (5/14) × (-(2/5) log2(2/5) - (3/5) log2(3/5)) + (4/14) × (-(4/4) log2(4/4)) + (5/14) × (-(3/5) log2(3/5) - (2/5) log2(2/5)) = 0.694

Information gain: Gain(A)
Information gain for attribute A is the difference between the original information requirement and the one obtained after partitioning according to A:
Gain(A) = Info(D) - InfoA(D)
The attribute with the highest gain is chosen as the splitting attribute at node N

Gain(A): example
Info(D) = 0.940
Infoage(D) = 0.694
Gain(age) = 0.940 - 0.694 = 0.246
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
Age has the highest gain and is selected

Continuous valued attributes
If A is a continuous attribute, it is necessary to find the best split point:
order the values of A in increasing order
the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1)/2 is the split point between ai and ai+1
evaluate InfoA(D) for each split point and choose the one with minimum InfoA(D)
Once the split point is identified:
D1: set of tuples with A <= split_point
D2: set of tuples with A > split_point
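The Gain(age) computation above can be reproduced with a short sketch (a minimal check of the values 0.940, 0.694, and 0.246, using only the class counts per age partition):

```python
from math import log2

def info(counts):
    """Info of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = info([9, 5])                                       # 9 yes, 5 no
partitions = [[2, 3], [4, 0], [3, 2]]                       # youth, middle aged, senior
n = sum(sum(p) for p in partitions)                         # 14 tuples
info_age = sum(sum(p) / n * info(p) for p in partitions)
gain_age = info_D - info_age

# prints 0.940 0.694 0.247 (the slide's 0.246 comes from subtracting the rounded values)
print(f"{info_D:.3f} {info_age:.3f} {gain_age:.3f}")
```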

Gain ratio
Information gain measure is biased towards attributes with a large number of values
Gain ratio applies a normalization to information gain using a split information value computed as
SplitInfoA(D) = -Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)
GainRatio(A) = Gain(A) / SplitInfoA(D)

GainRatio(A): example
(Same buys_computer training data as above.)
Consider attribute income
D1 represents high (4 tuples)
D2 represents medium (6 tuples)
D3 represents low (4 tuples)
SplitInfoincome(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
GainRatio(income) = 0.029 / 1.557 = 0.019
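SplitInfoincome(D) and the resulting gain ratio can be reproduced in the same way (the value 0.029 is the Gain(income) reported earlier):

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfoA(D) computed from the sizes of the partitions induced by A."""
    n = sum(partition_sizes)
    return -sum(s / n * log2(s / n) for s in partition_sizes)

si = split_info([4, 6, 4])           # income: high, medium, low
print(f"{si:.3f} {0.029 / si:.3f}")  # 1.557 0.019
```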

Overfitting and tree pruning
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a fully grown tree to obtain a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the best pruned tree
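Most libraries implement postpruning. A hedged sketch using scikit-learn's cost-complexity pruning (one concrete postpruning technique, assumed here rather than prescribed by the slides): the pruning path yields a sequence of progressively pruned trees, and a held-out set, different from the training data, picks the best one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled dataset used only as a stand-in for labeled training data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Each ccp_alpha along the path corresponds to a progressively more pruned tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)          # evaluated on data not used for training
    if score > best_score:
        best_alpha, best_score = alpha, score

print("best ccp_alpha:", best_alpha, "validation accuracy:", best_score)
```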

Tree pruning: example
(Figure: an unpruned decision tree and the corresponding pruned tree.)

Bayesian classification
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: Based on Bayes' theorem
Performance: A simple Bayesian classifier, naive Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes Theorem
X is a tuple, called evidence
H is a hypothesis, such as the fact that X belongs to class C
Classification problem: determine P(H|X)
P(H|X) is the probability that H holds given evidence of tuple X (posterior
probability)
According to Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
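A toy numeric illustration of the theorem (the numbers are invented for the sketch, not taken from the slides):

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical values: P(H) = 0.5, P(X|H) = 0.3, P(X) = 0.2
print(posterior(0.5, 0.3, 0.2))   # 0.75
```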

Probabilities
P(H) is the prior probability that H holds, independently of X
the probability that a customer will buy a computer (H), regardless of age, income, or
any other information
P(X) is the probability that tuple X is observed
the probability that a person from our set of customers is 35 years old and earns 40k (X)
P(H|X) is the posterior probability of H conditioned on X
the probability that a customer will buy a computer (H) given that we know the customer's
age, which is 35, and income, which is 40k (X)
P(X|H) is the posterior probability of X conditioned on H
the probability that a customer is 35 years old and earns 40k (X), given that we know that the customer will buy a computer (H)

Naive Bayesian Classification (1)
D: training set of tuples, each represented by an n-dimensional vector X = (x1,…,xn) over attributes A1,…,An
C1,…,Cm: set of m classes
Given a tuple X, it belongs to the class having the highest posterior probability conditioned on X:
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i
Note that P(Ci|X) = P(X|Ci) P(Ci) / P(X) and P(X) is constant for all classes,
thus we need to maximize P(X|Ci) P(Ci)

Naive Bayesian Classification (2)
If class prior probabilities are not known, it is common to assume that P(C1) = P(C2) = … = P(Cm), so we only need to maximize P(X|Ci)
For datasets with many attributes, it is expensive to compute P(X|Ci)
Naive assumption: class conditional independence, hence
P(X|Ci) = Π_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × … × P(xn|Ci)
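A minimal sketch of this decision rule for categorical attributes, estimating each P(xk|Ci) by simple counting (as described in the next slide); the tiny dataset is hypothetical, only there to make the sketch runnable:

```python
from collections import Counter, defaultdict

def train(tuples, labels):
    """Count class priors and per-class attribute-value frequencies."""
    priors = Counter(labels)
    cond = defaultdict(Counter)                 # (class, attribute) -> value counts
    for x, c in zip(tuples, labels):
        for attr, value in x.items():
            cond[(c, attr)][value] += 1
    return priors, cond

def classify(x, priors, cond):
    n = sum(priors.values())
    scores = {}
    for c, c_count in priors.items():
        score = c_count / n                     # P(Ci)
        for attr, value in x.items():           # naive independence assumption
            # A zero count zeroes the whole product (see the 0-probability problem below).
            score *= cond[(c, attr)][value] / c_count
        scores[c] = score                       # proportional to P(Ci|X)
    return max(scores, key=scores.get), scores

tuples = [{"student": "yes", "credit": "fair"},
          {"student": "no", "credit": "excellent"},
          {"student": "yes", "credit": "excellent"},
          {"student": "no", "credit": "fair"}]
labels = ["yes", "no", "yes", "no"]

priors, cond = train(tuples, labels)
print(classify({"student": "yes", "credit": "fair"}, priors, cond))
```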

Estimating P(xk|Ci)
P(xk|Ci) is estimated from training tuples
Categorical attributes: number of tuples in class Ci in D having value xk for attribute Ak, divided by |Ci,D|
Continuous valued attributes: assuming that Ak has a Gaussian (normal) distribution with mean μ and standard deviation σ
g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x-μ)² / (2σ²))
P(xk|Ci) = g(xk, μCi, σCi)
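For a continuous attribute, a minimal sketch of the Gaussian density used as the class-conditional estimate; the mean and standard deviation below are hypothetical values that would normally be computed from the training tuples of class Ci:

```python
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = exp(-(x - mu)**2 / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)"""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Hypothetical: within class Ci, attribute age has mean 38 and standard deviation 12
p_age_given_ci = gaussian(35, 38, 12)   # used as P(age = 35 | Ci)
print(p_age_given_ci)
```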

Naive Bayesian Classifier: example (1)
C1: buys_computer = yes
C2: buys_computer = no
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naive Bayesian Classifier: example (2)
P(Ci):
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
Compute P(xk|Ci) for each class:
P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
P(age <= 30 | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
Compute P(X|Ci) [X = (age <= 30, income = medium, student = yes, credit_rating = fair)]:
P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
Compute P(X|Ci) * P(Ci):
P(X | buys_computer = yes) * P(buys_computer = yes) = 0.028
P(X | buys_computer = no) * P(buys_computer = no) = 0.007
X belongs to class (buys_computer = yes)

0-probability problem
Naive Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero
Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
Use Laplacian correction (or Laplacian estimator): add 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The corrected probability estimates are close to their uncorrected counterparts

Naive Bayesian Classifier: pros and cons
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption of class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by a Naive Bayes classifier

Accuracy and error measures
How can we measure the accuracy of a classifier?
