AI Homework 2: Random Forests


The objective of this assignment is for you to experiment with supervised learning using random forests. We will focus on the classification task. The following explains what you need to implement.

You need to first implement a decision tree. Your implementation will include the induction part (for building the tree from the training data) and the inference part (for classifying new samples). We will use only datasets with real-valued attributes. Use CART as the base tree, which is a binary tree. No pruning is required.
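One possible skeleton for the induction and inference parts is sketched below. This is an illustration, not the required design: the names Node, build, and classify are assumptions, and best_split is the split-search helper described in the procedure below (and sketched after it).

```python
import numpy as np

class Node:
    """A CART node: either a leaf holding a class label, or an internal
    node holding an attribute index, a threshold, and two children."""
    def __init__(self, label=None, attr=None, thr=None, left=None, right=None):
        self.label, self.attr, self.thr = label, attr, thr
        self.left, self.right = left, right

def build(X, y, attributes):
    """Induction: recursively split until a node is pure (no pruning)."""
    if np.unique(y).size == 1:                 # pure node -> leaf
        return Node(label=y[0])
    # with attribute bagging, `attributes` would be resampled at every node
    attr, thr = best_split(X, y, attributes)   # see the procedure below
    if attr is None:                           # no usable split -> majority leaf
        vals, counts = np.unique(y, return_counts=True)
        return Node(label=vals[np.argmax(counts)])
    mask = X[:, attr] <= thr
    return Node(attr=attr, thr=thr,
                left=build(X[mask], y[mask], attributes),
                right=build(X[~mask], y[~mask], attributes))

def classify(node, x):
    """Inference: follow the thresholds down to a leaf."""
    while node.label is None:
        node = node.left if x[node.attr] <= node.thr else node.right
    return node.label
```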

When attempting to split a node, you need to select an attribute as well as a threshold. The procedure is as follows:

  • To select a threshold for a given attribute:
    • Let the values of the given attribute be $v_1, v_2, \ldots, v_n$, where n is the number of samples associated with the node.
    • For threshold selection, the easiest way is to have the values sorted such that $v_1 < v_2 < \cdots < v_n$. Now you only need to test the following threshold values: $(v_1+v_2)/2, (v_2+v_3)/2, \ldots, (v_{n-1}+v_n)/2$.
    • For each possible threshold, divide the n samples into two groups according to the threshold.
    • For each group, compute its Gini impurity, where the summation is over the output classes and $p_k$ is the fraction of the group's samples belonging to class $k$:

      $$G = 1 - \sum_k p_k^2$$

    • For each threshold, compute the total remaining impurity as $n_A G_A + n_B G_B$, where A and B denote the two groups of samples and $n_A$, $n_B$ their sizes. Select the threshold with the lowest total impurity; this is the threshold for that attribute.
  • Repeat the above for all the attributes (only those considered when splitting the current node; see attribute bagging below), and select the attribute with the lowest total remaining impurity for splitting the node. A sketch of this split search follows the list.
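The split search above might look like the following sketch (an illustration, not the required implementation; it assumes X and y are NumPy arrays, with X an n-by-d array of real-valued attributes, y the class labels, and attributes the indices considered at this node):

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_k p_k^2 of one group of samples."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, attributes):
    """Test the midpoint thresholds of each candidate attribute and
    return the (attribute, threshold) pair minimizing n_A*G_A + n_B*G_B."""
    best_attr, best_thr, best_impurity = None, None, np.inf
    for a in attributes:
        values = np.unique(X[:, a])                   # sorted, distinct
        for thr in (values[:-1] + values[1:]) / 2.0:  # candidate midpoints
            mask = X[:, a] <= thr
            y_a, y_b = y[mask], y[~mask]
            total = len(y_a) * gini(y_a) + len(y_b) * gini(y_b)
            if total < best_impurity:
                best_attr, best_thr, best_impurity = a, thr, total
    return best_attr, best_thr
```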

The next task is to build a forest. For this, you need to implement tree bagging (bagging of samples) and attribute bagging.
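A minimal sketch of both kinds of bagging follows (the helper names and the default subset size of roughly the square root of d are assumptions, not requirements of the assignment):

```python
import numpy as np

rng = np.random.default_rng()

def bootstrap_sample(X, y):
    """Tree bagging: grow each tree on n samples drawn with
    replacement from the n training samples."""
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx], set(idx.tolist())  # indices kept for OOB use

def sample_attributes(d, m=None):
    """Attribute bagging: at each node split, consider only a random
    subset of m of the d attributes (m ~ sqrt(d) is a common choice)."""
    m = m or max(1, int(np.sqrt(d)))
    return rng.choice(d, size=m, replace=False)
```

The indices returned by sample_attributes would be passed as the attributes argument of the split search sketched earlier.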

You need to do some experiments with your trees/forest. Three datasets are provided (as text files) for you to test with. In addition, you can check out the UCI Machine Learning Repository for other datasets. (There are numerous datasets there.)

You need to divide a dataset into a training subset (for tree induction) and a validation subset (for evaluation). Be sure to do the division randomly. You can also try cross-validation, but this is not required. Report the correct classification rates for both the training and validation subsets.
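For example, the random division and the two classification rates might be computed as follows (a sketch; the 70/30 split fraction and the predict callable are assumptions):

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.3, seed=0):
    """Randomly permute the sample indices, then cut off a validation subset."""
    perm = np.random.default_rng(seed).permutation(len(y))
    n_val = int(len(y) * val_fraction)
    return (X[perm[n_val:]], y[perm[n_val:]],  # training subset
            X[perm[:n_val]], y[perm[:n_val]])  # validation subset

def correct_rate(predict, X, y):
    """Correct classification rate: fraction of samples classified correctly."""
    return np.mean([predict(x) == label for x, label in zip(X, y)])
```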

Go to the UCI Machine Learning Repository to get datasets for your experiments. To start, choose from the following datasets that have been used a lot in the literature: Iris, Breast Cancer, Glass, Ionosphere, Optical Recognition of Handwritten Digits, and Wine. You are not required to use all of them. You can go through the list of hundreds of datasets to find others to play with. In addition, I will provide two two-attribute, two-class datasets, ellipse.txt and cross.txt. These two are useful for visualizing the decision boundaries.

Regarding the experiments, below are a few things you can try. You are not required to try them all.

  • Relative sizes of the training and validation subsets.
  • Number of trees in the forest.
  • Parameters used during tree induction, such as how many attributes to consider at each node split.
  • Methods that limit a tree's size, such as a minimum number of samples per node or an upper bound on the tree's depth.
  • Comparisons of out-of-bag errors and validation-set errors (a sketch of the out-of-bag computation follows this list).
  • Extremely random forest: at each node split, just randomly select an attribute.
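For the out-of-bag comparison, one way to compute the OOB error is sketched below (assuming each tree keeps the set of bootstrap indices it was grown from, as in the bagging sketch above, and exposes a hypothetical predict method):

```python
def oob_error(trees, bootstrap_sets, X, y):
    """Out-of-bag error: each sample is voted on only by the trees whose
    bootstrap sample did not contain it; return the fraction misclassified."""
    wrong, counted = 0, 0
    for i, (x, label) in enumerate(zip(X, y)):
        votes = [tree.predict(x)  # hypothetical per-tree inference method
                 for tree, used in zip(trees, bootstrap_sets)
                 if i not in used]
        if votes:  # the sample was out of bag for at least one tree
            counted += 1
            wrong += max(set(votes), key=votes.count) != label
    return wrong / counted if counted else float("nan")
```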
