Useful formulas
Decision Trees
Entropy of a set S with elements from C classes: H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i
Gini impurity of a set S with elements from C classes: G(S) = 1 - \sum_{i=1}^{C} p_i^2
Gini/Information gain: G(S, F) = M(S) - \sum_{f \in values(F)} \tfrac{|S_f|}{|S|} M(S_f), where M is either the Gini impurity or the entropy.
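These formulas translate directly into code. A minimal Python sketch, assuming the class distribution of a set is given as a list of counts (the function names are illustrative):

```python
from math import log2

def entropy(counts):
    # H(S) = -sum_i p_i * log2(p_i), with p_i estimated from the class counts.
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gini(counts):
    # G(S) = 1 - sum_i p_i^2
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gain(parent_counts, children_counts, impurity=entropy):
    # G(S, F) = M(S) - sum_f |S_f|/|S| * M(S_f), with M either the entropy or the Gini impurity.
    n = sum(parent_counts)
    return impurity(parent_counts) - sum(
        sum(child) / n * impurity(child) for child in children_counts
    )

# e.g. entropy([5, 5]) == 1.0 and gini([5, 5]) == 0.5
```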
1. What is the entropy of a dataset? How do you compute it?
The entropy is a measure of how much surprise I can expect if I extract an element from a set. If the set contains only elements of the same type, my surprise will be minimal. If the set contains an equal number of elements of each type, I don't know what to expect, and my surprise will be maximal.
Technically, it is the expected value of the information of a message that can take n values, each with probability p_i: H = E[I] = -\sum_{i=1}^{n} p_i \log_2 p_i.
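For example, a set of 8 elements split evenly between two classes has maximal entropy, while a pure set has zero entropy: H = -\tfrac{4}{8}\log_2\tfrac{4}{8} - \tfrac{4}{8}\log_2\tfrac{4}{8} = 1 bit, whereas H = -\tfrac{8}{8}\log_2\tfrac{8}{8} = 0.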
2. How does the algorithm ID3 decide what the next feature to split on is?
It computes the information gain of each feature, and splits with respect to the feature with highest information gain.
3. What does ID3 do when there are no more features left to split on?
It assigns to that leaf the class of the majority of the training points whose feature values correspond to that leaf.
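The answers to questions 2 and 3 can be summarized in a minimal Python sketch of the ID3 recursion (the data layout, with rows as dictionaries of feature values, and the helper names are illustrative assumptions, not the exact assignment code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    # Partition the labels by the value this feature takes in each row.
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[feature], []).append(y)
    return entropy(labels) - sum(
        len(p) / len(labels) * entropy(p) for p in parts.values()
    )

def id3(rows, labels, features):
    if len(set(labels)) == 1:        # pure node: stop
        return labels[0]
    if not features:                 # no features left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    node = {}                        # maps (feature, value) to a subtree or a class
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[(best, value)] = id3(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != best],
        )
    return node
```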
4. What type of data can decision trees classify which MLPs cannot?
Non-metric data, that is, data points whose features are not numbers.
5. What is a random forest? What are the sources of randomness that diversify the trees in the forest?
A random forest is a collection of decision trees created from the same dataset. Randomness is obtained by using only a random fraction of the features, or a random fraction of the data.
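For instance, scikit-learn's RandomForestClassifier exposes exactly these two sources of randomness; the parameter values below are just an example:

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree is trained on a bootstrap sample of the rows (bootstrap=True) and, at
# every split, considers only a random subset of the features (max_features="sqrt").
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
```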
6. What is the main difference between the CART and the ID3 algorithms?
CART uses the Gini impurity as its splitting criterion, while ID3 uses the entropy (information gain).
7. Consider the following dataset, where data have two features, each of which has three values (A, B, or C), and the last element is the class: , , , , , ,
The information gain for a set S and feature F is: G(S, F) = H(S) - \sum_{f \in values(F)} \tfrac{|S_f|}{|S|} H(S_f)
So, for Feature 1:
G(S, F_1) = H(S) - \sum_{f \in \{A, B, C\}} \tfrac{|S_f|}{|S|} H(S_f) =

= 1 - 2\cdot\tfrac{3}{10}\left(-\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3}\right) - \tfrac{4}{10}\left(-\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}\right) = 0.125

while, for Feature 2:

G(S, F_2) = 1 - \tfrac{1}{10}\cdot 0 - \tfrac{6}{10}\left(-\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6}\right) - \tfrac{3}{10}\left(-\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3}\right) = 0.174
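These two gains can also be checked numerically from the class counts appearing in the calculation (one consistent assignment of counts to the values A, B, C is used below; swapping classes within a subset does not change the gain):

```python
from math import log2

def H(counts):
    # Entropy from a list of class counts.
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

S = [5, 5]  # the 10 points are split 5/5 between the two classes, hence H(S) = 1
splits = {
    "Feature 1": [[2, 1], [2, 1], [1, 3]],  # subsets of size 3, 3 and 4
    "Feature 2": [[1, 0], [2, 4], [2, 1]],  # subsets of size 1, 6 and 3
}
for name, children in splits.items():
    g = H(S) - sum(sum(c) / sum(S) * H(c) for c in children)
    print(name, round(g, 3))  # Feature 1 -> 0.125, Feature 2 -> 0.174
```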
Since the second feature has a larger information gain, ID3 splits with respect to it. The full tree is as follows:
8. We want to learn a classifier for car diagnosis. The classes are: OK (O); go to a garage (G); severe failure, don't drive (F). The features are: makes a strange noise (N) or not (nN); emits black smoke (S) or not (nS); going straight, the car drifts to one side (D), or doesn't (nD). We ask a mechanic, and build the following (very extensive) dataset:
The mechanic classifies the fault based on three binary features. If a client calls on the phone, which question should the mechanic ask first? We choose the first feature to split on by computing the information gain, as before (note that in this case we have three classes!):
H(S) = -\tfrac{1}{8}\log_2\tfrac{1}{8} - \tfrac{4}{8}\log_2\tfrac{4}{8} - \tfrac{3}{8}\log_2\tfrac{3}{8} = 1.406

For the Noise feature (first term: the 4 noisy cars, 3 garage and 1 don't drive; second term: the 4 quiet cars, 1 OK, 2 don't drive and 1 garage):

G(S, Noise) = 1.406 - \tfrac{4}{8}\left(-\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}\right) - \tfrac{4}{8}\left(-\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}\right) = 0.25

G(S, Smoke) = 1.406 - \tfrac{3}{8}\cdot 0 - \tfrac{5}{8}\left(-\tfrac{4}{5}\log_2\tfrac{4}{5} - \tfrac{1}{5}\log_2\tfrac{1}{5}\right) = 0.955

G(S, Drifting) = 1.406 - \tfrac{3}{8}\left(-\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3}\right) - \tfrac{5}{8}\left(-\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{1}{5}\log_2\tfrac{1}{5}\right) = 0.11
The most informative feature is Smoke, and therefore the first question the mechanic should ask is: is the car emitting smoke?
The first split is with respect to Smoke, and all cars that smoke must stop, so this is what we know for now:
Now we need to focus on all the points that have nS as the value of the Smoke feature. This
is the set SnS ={
H(S_{nS}) = -\tfrac{4}{5}\log_2\tfrac{4}{5} - \tfrac{1}{5}\log_2\tfrac{1}{5} = 0.722

G(S_{nS}, Noise) = 0.722 - \tfrac{3}{5}\cdot 0 - \tfrac{2}{5}\cdot 1 = 0.322

G(S_{nS}, Drifting) = 0.722 - \tfrac{2}{5}\cdot 0 - \tfrac{3}{5}\left(-\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3}\right) = 0.171
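As a numeric check of both levels of the split, using the per-class counts (OK, garage, don't drive) recovered from the fractions in the calculations above:

```python
from math import log2

def H(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    return H(parent) - sum(sum(c) / sum(parent) * H(c) for c in children)

# Class counts are (OK, garage, don't drive) for the 8 cars.
S = [1, 4, 3]
print(round(gain(S, [[0, 3, 1], [1, 1, 2]]), 3))  # Noise    -> 0.25
print(round(gain(S, [[0, 0, 3], [1, 4, 0]]), 3))  # Smoke    -> 0.954 (0.955 above, up to rounding)
print(round(gain(S, [[0, 2, 1], [1, 2, 2]]), 3))  # Drifting -> 0.11

# Second level: the no-smoke cars form S_nS, with 1 OK and 4 garage.
S_nS = [1, 4]
print(round(gain(S_nS, [[0, 3], [1, 1]]), 3))     # Noise    -> 0.322
print(round(gain(S_nS, [[0, 2], [1, 2]]), 3))     # Drifting -> 0.171
```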
The most informative of the two features left is Noise, and this is the resulting tree:
9. Same as the last two questions, but with CART.
CART uses the Gini Impurity. The previous calculations maintain the same structure, but we need to substitute entropy with Gini impurity, and see what happens. Starting with question 7:
Gini impurity of the whole set: G(S) = 1 - \left(\tfrac{1}{2}\right)^2 - \left(\tfrac{1}{2}\right)^2 = 0.5

G(S, F_1) = 0.5 - 2\cdot\tfrac{3}{10}\left(1 - \left(\tfrac{2}{3}\right)^2 - \left(\tfrac{1}{3}\right)^2\right) - \tfrac{4}{10}\left(1 - \left(\tfrac{3}{4}\right)^2 - \left(\tfrac{1}{4}\right)^2\right) = 0.083

G(S, F_2) = 0.5 - \tfrac{1}{10}\cdot 0 - \tfrac{6}{10}\left(1 - \left(\tfrac{4}{6}\right)^2 - \left(\tfrac{2}{6}\right)^2\right) - \tfrac{3}{10}\left(1 - \left(\tfrac{2}{3}\right)^2 - \left(\tfrac{1}{3}\right)^2\right) = 0.1
In this case as well, feature 2 is the better one, since the average impurity after the split is lower. The resulting tree is the same as for ID3.
Let's now look at Question 8:
Gini impurity of the whole set: G(S) = 1 - \left(\tfrac{1}{8}\right)^2 - \left(\tfrac{4}{8}\right)^2 - \left(\tfrac{3}{8}\right)^2 = 0.594

For the same splits as before:

G(S, Noise) = 0.594 - \tfrac{4}{8}\left(1 - \left(\tfrac{3}{4}\right)^2 - \left(\tfrac{1}{4}\right)^2\right) - \tfrac{4}{8}\left(1 - \left(\tfrac{1}{4}\right)^2 - \left(\tfrac{2}{4}\right)^2 - \left(\tfrac{1}{4}\right)^2\right) = 0.094

G(S, Smoke) = 0.594 - \tfrac{3}{8}\cdot 0 - \tfrac{5}{8}\left(1 - \left(\tfrac{4}{5}\right)^2 - \left(\tfrac{1}{5}\right)^2\right) = 0.394

G(S, Drifting) = 0.594 - \tfrac{3}{8}\left(1 - \left(\tfrac{1}{3}\right)^2 - \left(\tfrac{2}{3}\right)^2\right) - \tfrac{5}{8}\left(1 - 2\left(\tfrac{2}{5}\right)^2 - \left(\tfrac{1}{5}\right)^2\right) = 0.027

Again, Smoke is the first feature. Let's see if the second split is also the same:
For the no-smoke subset, G(S_{nS}) = 1 - \left(\tfrac{4}{5}\right)^2 - \left(\tfrac{1}{5}\right)^2 = 0.32, and:

G(S_{nS}, Noise) = 0.32 - \tfrac{3}{5}\cdot 0 - \tfrac{2}{5}\cdot 0.5 = 0.12

G(S_{nS}, Drifting) = 0.32 - \tfrac{2}{5}\cdot 0 - \tfrac{3}{5}\left(1 - \left(\tfrac{2}{3}\right)^2 - \left(\tfrac{1}{3}\right)^2\right) = 0.053

Yes: this tree is also the same as the previous one. This happens quite often; entropy and Gini impurity tend to give the same splits.
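The same numeric check with the Gini impurity gives the same ranking of the features, which is why the trees coincide; a quick sketch, reusing the class counts (OK, garage, don't drive) from the entropy check above:

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent, children):
    return gini(parent) - sum(sum(c) / sum(parent) * gini(c) for c in children)

S = [1, 4, 3]
print(round(gini_gain(S, [[0, 3, 1], [1, 1, 2]]), 3))  # Noise    -> 0.094
print(round(gini_gain(S, [[0, 0, 3], [1, 4, 0]]), 3))  # Smoke    -> 0.394
print(round(gini_gain(S, [[0, 2, 1], [1, 2, 2]]), 3))  # Drifting -> 0.027

S_nS = [1, 4]
print(round(gini_gain(S_nS, [[0, 3], [1, 1]]), 3))     # Noise    -> 0.12
print(round(gini_gain(S_nS, [[0, 2], [1, 2]]), 3))     # Drifting -> 0.053
```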