In this assignment, you will be required to implement the Decision Tree algorithm from scratch using both (a) Information Gain and (b) Gini Index, to decide on the splitting attribute.

In Part 1, you will implement it on a small toy dataset to gain confidence in your implementation. In Part 2, you will be given a larger real-world dataset. For both parts, you will be asked to compare the accuracy of your implemented model with that of scikit-learn.

Part 1

(Description) Suppose that you want to buy a car and have collected information on four attributes: price, maintenance, capacity and airbag. You are trying to predict whether a given car is profitable or not. Assume all four attributes are categorical, with discrete values.

(Dataset) Download the training and test data here. The sheet labelled training data contains the data to be trained on. The sheet named test data contains the data on which you have to test your model.

(Tasks)

(A) Train your decision tree classifier on the train-data (where you will use profitable as the class label), using the impurity measure:
   a. Information Gain
   b. Gini Index

Test your model on test-data (where the profitable label is unseen). After prediction, report the individual accuracies on the test data obtained using (a) and (b). Note that the profitable field should not be used in the classification process.

For both cases, write your program such that it prints out the decision tree in a particular format. For example, assume that your decision tree looks like the following: the attribute price is the root node, maintenance is the 2nd level node, and capacity is the third level node (under maintenance = low). yes and no specify the final value of profitable.
Then the program should print out the decision tree as follows:

price = low
|   maintenance = low
|       capacity = 4 : yes
|   maintenance = high : no

where subsequent levels are at increasing indentations from the left.

(B) Repeat the experiment using the decision tree algorithm implemented in scikit-learn, using both Information Gain and Gini Index. Report the accuracies on test data.

(Deliverables) Your report should contain:
1. The decision tree in the format provided in (A)
2. The value of Information Gain and Gini Index of the root node using:
   a. Your model
   b. scikit-learn
3. The labels generated on the test data and accuracy on the test data using:
   a. Your model
   b. scikit-learn

Part 2

(Description) In this part, you will implement the decision tree algorithm to learn a classifier that can assign a topic (science, sports, atheism etc.) to any news article.

(Dataset) Train and test your algorithms with a subset of the 20 newsgroup dataset. More precisely, you will use the documents of the alt.atheism and comp.graphics newsgroups. To simplify your implementation, these articles have been pre-processed and converted to the bag of words model. Each article is converted to a vector of binary values such that each entry indicates whether the document contains a specific word or not.

Download the training set (traindata.txt) and test set (testdata.txt) of articles with their correct newsgroup label (trainlabel.txt, testlabel.txt) here.

Each line of the files traindata.txt and testdata.txt is formatted as "docId wordId", which indicates that word wordId is present in document docId. The files trainlabel.txt and testlabel.txt indicate the category (1=alt.atheism or 2=comp.graphics) for each document (docId = line number). The file words.txt indicates which word corresponds to each wordId (denoted by the line number).

(Tasks)

(A) Implement the decision tree learning algorithm.
Here, each decision node corresponds to a word feature, which is selected by maximizing the information gain.

Design your algorithm to take as input a maximum depth. If a branch of the decision tree reaches this specified depth, it should not be grown further.

Experiment with your algorithm by building trees with increasing maximum depth until a full tree is obtained.

Report the training and testing accuracy (i.e., percentage of correctly classified articles) of each tree by producing a graph with two curves: one curve for training accuracy and one curve for testing accuracy, as a function of the maximum depth of the tree.

Report also the tree that achieved the highest testing accuracy.

(B) Use the decision tree algorithm implemented in scikit-learn, using Information Gain, to classify the same data. Report the accuracies on test data.

(Deliverables) Your report should contain:
1. A graph showing the training and testing accuracy as the maximum depth increases.
2. Does overfitting occur? If yes, after what maximum depth does overfitting occur?
3. A brief discussion of the word features selected by the decision tree that achieved the highest testing accuracy. In your opinion, did all the word features selected make sense?

Submission Instructions

1. Submit separate codes for Part 1 and Part 2
2. Submit a README file which will contain the instructions on how to execute your codes
3. Submit separate report files for Part 1 and Part 2

All source codes, result files and the final report must be uploaded via the course Moodle page, as a single compressed file (.tar.gz or .zip). The compressed file should be named as: {ROLL_NUMBER}_ML_A2.zip or {ROLL_NUMBER}_ML_A2.tar.gz

Example: If your roll number is 16CS60R00, then your submission file should be named as 16CS60R00_ML_A2.tar.gz or 16CS60R00_ML_A2.zip

You can use C / C++ / Java / Python for writing the codes; no other programming language is allowed.
You cannot use any library/module meant for Machine Learning for implementing your models (except where mentioned). You can use libraries for other purposes, such as generation and formatting of data. Also, you should not use any code available on the Web.

Submissions found to be plagiarised or having used ML libraries will be awarded zero marks for all the students concerned.

*** Note that the evaluators can deduct marks if the deliverables are not found in the way that has been asked for the assignment.

Submission deadline: March 15, 2019, 23:59 IST [hard deadline]

For any questions about the assignment, contact the following TAs:
1. Abhisek Dash (assignmentad @ gmail . com)
2. Paheli Bhattacharya (paheli.cse.iitkgp @ gmail . com)
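
For reference, the two impurity measures named in the tasks (Information Gain and Gini Index) can be sketched in plain Python as below. This is only a minimal illustration under stated assumptions, not a prescribed solution: the function names (entropy, gini, impurity_reduction, best_attribute), the dictionary-per-row data layout, and the toy rows are all hypothetical and are not taken from the assignment's dataset. It also assumes a multiway split, with one child per distinct attribute value.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def impurity_reduction(rows, labels, attribute, impurity):
    """Parent impurity minus the size-weighted impurity of the children.

    Assumes a multiway split: one child per distinct value of `attribute`.
    With impurity=entropy this quantity is the information gain.
    """
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted = sum(len(g) / len(labels) * impurity(g) for g in groups.values())
    return impurity(labels) - weighted


def best_attribute(rows, labels, attributes, impurity):
    """Attribute with the largest impurity reduction (first one wins ties)."""
    return max(
        attributes,
        key=lambda a: impurity_reduction(rows, labels, a, impurity),
    )


# Made-up toy rows for illustration only (NOT the assignment's data).
rows = [
    {"price": "low", "airbag": "yes"},
    {"price": "low", "airbag": "no"},
    {"price": "high", "airbag": "yes"},
    {"price": "high", "airbag": "no"},
]
labels = ["yes", "yes", "no", "no"]

print(best_attribute(rows, labels, ["price", "airbag"], entropy))  # price
print(best_attribute(rows, labels, ["price", "airbag"], gini))     # price
```

On these made-up rows, price separates the two classes perfectly while airbag does not separate them at all, so both measures select price. For the comparison tasks, scikit-learn's DecisionTreeClassifier accepts criterion="entropy" or criterion="gini". It builds binary trees on numerically encoded features, so its root-node values need not match those of a multiway split like the one sketched here.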
Machine Learning (CS60050) Assignment 2: Decision Trees