Objective
- Data Input
- Data Preprocessing
- Transform the data format and shape so your model can process it.
- Shuffle the data.
- Transform the label format so you can do the two required tasks described in the Data section below.
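The preprocessing steps above could be sketched as follows. This is a minimal Python sketch, not the required implementation: the helper names (`shuffle_rows`, `split_features_label`, `to_binary_label`) are hypothetical, rows are assumed to be dictionaries keyed by column name, and `pass_mark` is left as a parameter because the pass threshold is not stated in this section.

```python
import random

def shuffle_rows(rows, seed=0):
    """Return a shuffled copy of rows; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled

def split_features_label(rows, label_key="G3"):
    """Separate the label column (G3, the last column) from the feature columns."""
    X = [{k: v for k, v in r.items() if k != label_key} for r in rows]
    y = [int(r[label_key]) for r in rows]
    return X, y

def to_binary_label(g3, pass_mark):
    """Map a numeric grade to 'pass' or 'fail'; pass_mark is supplied by the caller."""
    return "pass" if g3 >= pass_mark else "fail"

# Toy rows for illustration only (not taken from the real data set).
rows = [
    {"sex": "F", "age": "15", "G3": "12"},
    {"sex": "M", "age": "17", "G3": "8"},
    {"sex": "F", "age": "16", "G3": "15"},
]
X, y = split_features_label(shuffle_rows(rows, seed=42))
# pass_mark=10 is only a placeholder value for illustration.
y_binary = [to_binary_label(g, pass_mark=10) for g in y]
```

Shuffling before splitting off the label keeps each feature row aligned with its grade.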
- Model Construction
- For all the models, you need to do the two tasks described in the Data section below.
- The data consists of both categorical and numerical features, and you have to treat them differently.
- Three models must be constructed: Decision Tree, Random Forest, and K-Nearest Neighbor (KNN).
- For the Decision Tree model, you may use the following ID3 algorithm pseudocode. 10%
- ID3 (Examples, Target_Attribute, Attributes)
  - Create a root node for the tree.
  - If all examples are positive, return the single-node tree Root, with label = +.
  - If all examples are negative, return the single-node tree Root, with label = -.
  - If the set of predicting attributes is empty, return the single-node tree Root, with label = most common value of the target attribute in the examples.
  - Otherwise begin
    - A ← the attribute that best classifies the examples.
    - Decision tree attribute for Root = A.
    - For each possible value, vi, of A:
      - Add a new tree branch below Root, corresponding to the test A = vi.
      - Let Examples(vi) be the subset of examples that have the value vi for A.
      - If Examples(vi) is empty, then below this new branch add a leaf node with label = most common target value in the examples.
      - Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes - {A}).
  - End
  - Return Root
- Note that you may implement any decision tree algorithm, not only ID3, but you must state which algorithm you used.
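As one possible reading of the ID3 pseudocode, here is a minimal Python sketch that picks the "best" attribute by information gain (entropy reduction). The function names and the nested-dictionary tree layout are assumptions made for illustration. The sketch handles categorical attributes only, so numeric features would need to be binned first. It also branches only on values observed in the training rows, so the "Examples(vi) is empty" case is handled at prediction time through a stored majority label.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting the examples on attr."""
    total = len(labels)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    """Build a tree: a leaf is a label, an inner node is a dict."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the examples are pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    node = {"attr": best, "default": majority, "children": {}}
    remaining = [a for a in attributes if a != best]
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = id3(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node

def predict(tree, row):
    """Follow branches until a leaf; unseen values fall back to the node majority."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

# Toy examples for illustration only.
rows = [
    {"higher": "yes", "internet": "yes"},
    {"higher": "yes", "internet": "no"},
    {"higher": "no", "internet": "yes"},
    {"higher": "no", "internet": "no"},
]
labels = ["pass", "pass", "fail", "fail"]
tree = id3(rows, labels, ["higher", "internet"])
```

Because every inner node records the attribute it tests, the path taken by `predict` can be printed to explain the reasoning behind a single prediction.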
- For the Random Forest model, you must construct multiple Decision Tree models from randomly selected data (from the training subset) and perform voting for prediction. 10%
- For data selection, any of the following methods is acceptable; choose one to implement:
- Randomly select features
- Randomly select samples
- Both
- The number of trees must be greater than or equal to 3. You need to try at least 3 different numbers of trees and compare the results.
- Understand the difference between K-fold cross-validation and Random Forest. If you confuse one with the other, you won't get the score for this part.
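The "randomly select samples" option with voting could look roughly like the sketch below. It is an illustration under stated assumptions: `fit_tree` is a hypothetical callable that trains one model and returns a function mapping a row to a label, and `fit_majority` is only a stand-in base learner so the snippet runs on its own; a real solution would pass in its own Decision Tree trainer.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples with replacement from the training subset."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def fit_forest(rows, labels, fit_tree, n_trees=3, seed=0):
    """Train n_trees models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
        forest.append(fit_tree(sample_rows, sample_labels))
    return forest

def predict_forest(forest, row):
    """Majority vote over the predictions of the individual models."""
    votes = Counter(model(row) for model in forest)
    return votes.most_common(1)[0][0]

def fit_majority(rows, labels):
    """Stand-in base learner (assumption): always predicts the most common label."""
    top = Counter(labels).most_common(1)[0][0]
    return lambda row: top

# Toy data for illustration only.
rows = [{"a": i} for i in range(10)]
labels = ["pass"] * 8 + ["fail"] * 2
forest = fit_forest(rows, labels, fit_majority, n_trees=5, seed=1)
prediction = predict_forest(forest, {"a": 0})
```

Note the contrast with K-fold cross-validation: the bootstrap samples here overlap and are used to train ensemble members, whereas K-fold partitions the data into disjoint folds in order to estimate how well a model generalizes.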
- For the KNN model, you need to convert categorical features so that distances can be calculated, and you may need to normalize every feature for the model to work as expected. 10%
- You need to try at least 3 different K values and compare their results.
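One way to handle mixed features for KNN is sketched below: categorical features are one-hot encoded, numeric features are min-max scaled with the training minimum and maximum, and Euclidean distance is used for voting. The function names and the choice of one-hot encoding and Euclidean distance are assumptions for illustration; other encodings and metrics are equally possible.

```python
import math
from collections import Counter

def fit_encoder(rows, categorical, numeric):
    """Record category values and numeric (min, max) from the training rows."""
    cats = {c: sorted({r[c] for r in rows}) for c in categorical}
    ranges = {
        n: (min(float(r[n]) for r in rows), max(float(r[n]) for r in rows))
        for n in numeric
    }
    return cats, ranges

def encode(row, cats, ranges):
    """One-hot encode categorical features; scale numeric ones by the training range."""
    vec = []
    for c, values in cats.items():
        vec.extend(1.0 if row[c] == v else 0.0 for v in values)
    for n, (lo, hi) in ranges.items():
        vec.append((float(row[n]) - lo) / (hi - lo) if hi > lo else 0.0)
    return vec

def knn_predict(train_vecs, train_labels, query_vec, k=3):
    """Vote among the k training points closest to the query (Euclidean distance)."""
    nearest = sorted(
        zip(train_vecs, train_labels), key=lambda p: math.dist(p[0], query_vec)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy data for illustration only.
rows = [
    {"sex": "F", "age": 15},
    {"sex": "F", "age": 16},
    {"sex": "M", "age": 21},
    {"sex": "M", "age": 22},
]
labels = ["pass", "pass", "fail", "fail"]
cats, ranges = fit_encoder(rows, ["sex"], ["age"])
train_vecs = [encode(r, cats, ranges) for r in rows]
prediction = knn_predict(train_vecs, labels, encode({"sex": "F", "age": 15}, cats, ranges), k=3)
```

Without scaling, a wide-range feature such as absences would dominate the distance, which is why normalization matters here. Repeating the call with different `k` values gives the comparison the assignment asks for.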
- Validation 5%
- Two validation methods need to be implemented.
- Holdout validation with the ratio
- K-fold cross-validation with
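Both validation methods can be expressed as index splits, as in the sketch below. The holdout ratio and the number of folds are not given in this section, so `train_ratio` and `k` are plain parameters and the values used in the example are placeholders. The function names are hypothetical.

```python
import random

def holdout_split(n_samples, train_ratio, seed=0):
    """Shuffle the indices once, then cut them into train and validation parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = round(n_samples * train_ratio)
    return idx[:cut], idx[cut:]

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Placeholder settings for illustration only.
train_idx, val_idx = holdout_split(10, train_ratio=0.8)
splits = list(k_fold_splits(10, k=3))
```

For K-fold, the model is retrained on each `train` part and scored on the matching validation fold, and the K scores are then averaged.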
- Show the prediction and reasoning of 1-samples in the validation set. 10%
- Random Forest
- Describe the difference between boosting and bagging. 10%
- KNN
- Pick 2 features, draw and describe the KNN decision boundaries. 10%
- You can either retrain the model on the 2 chosen features or fix the values of all other features.
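For the decision boundary, one option is to classify every node of a regular grid over the two chosen features and then colour the grid by predicted label. The sketch below covers only the grid computation, assuming a model retrained on two already-normalized numeric features; `knn_label` and `boundary_grid` are hypothetical names.

```python
import math
from collections import Counter

def knn_label(points, labels, query, k):
    """Majority label among the k training points nearest to query."""
    nearest = sorted(zip(points, labels), key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def boundary_grid(points, labels, k=3, steps=50):
    """Classify each node of a steps x steps grid spanning the two features."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    grid_x = [x_lo + (x_hi - x_lo) * i / (steps - 1) for i in range(steps)]
    grid_y = [y_lo + (y_hi - y_lo) * i / (steps - 1) for i in range(steps)]
    # grid[row][col] is the predicted label at (grid_x[col], grid_y[row]).
    grid = [[knn_label(points, labels, (x, y), k) for x in grid_x] for y in grid_y]
    return grid_x, grid_y, grid

# Toy 2-feature training points for illustration only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 9), (9, 10)]
labels = ["fail"] * 3 + ["pass"] * 3
grid_x, grid_y, grid = boundary_grid(points, labels, k=3, steps=5)
```

The resulting grid could then be drawn, for example with matplotlib's `contourf` after mapping the labels to integers; the boundary is where neighbouring grid nodes change label.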
- Finish during class 20%
- Submit your report and source code to the newE3 system before class ends.
- Finish time will be determined by the submission time.
Data: Student Performance Data Set
- Data can be downloaded here:
- Please note that the last column (G3) is the label.
- Two datasets are provided (Mathematics and Portuguese language); either is acceptable, so choose one to analyze.
- Following this paper, you will have to do 2 classification tasks:
- Binary classification pass if
- school: student's school (binary: GP = Gabriel Pereira or MS = Mousinho da Silveira)
- sex: student's sex (binary: F = female or M = male)
- age: student's age (numeric: from 15 to 22)
- address: student's home address type (binary: U = urban or R = rural)
- famsize: family size (binary: LE3 = less than or equal to 3 or GT3 = greater than 3)
- Pstatus: parents' cohabitation status (binary: T = living together or A = apart)
- Medu: mother's education (numeric: 0 = none, 1 = primary education (4th grade), 2 = 5th to 9th grade, 3 = secondary education, or 4 = higher education)
- Fedu: father's education (numeric: 0 = none, 1 = primary education (4th grade), 2 = 5th to 9th grade, 3 = secondary education, or 4 = higher education)
- Mjob: mother's job (nominal: teacher, health care related, civil services (e.g. administrative or police), at_home, or other)
- Fjob: father's job (nominal: teacher, health care related, civil services (e.g. administrative or police), at_home, or other)
- reason: reason to choose this school (nominal: close to home, school reputation, course preference, or other)
- guardian: student's guardian (nominal: mother, father, or other)
- traveltime: home to school travel time (numeric: 1 = <15 min., 2 = 15 to 30 min., 3 = 30 min. to 1 hour, or 4 = >1 hour)
- studytime: weekly study time (numeric: 1 = <2 hours, 2 = 2 to 5 hours, 3 = 5 to 10 hours, or 4 = >10 hours)
- failures: number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup: extra educational support (binary: yes or no)
- famsup: family educational support (binary: yes or no)
- paid: extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities: extra-curricular activities (binary: yes or no)
- nursery: attended nursery school (binary: yes or no)
- higher: wants to take higher education (binary: yes or no)
- internet: Internet access at home (binary: yes or no)
- romantic: with a romantic relationship (binary: yes or no)
- famrel: quality of family relationships (numeric: from 1 = very bad to 5 = excellent)
- freetime: free time after school (numeric: from 1 = very low to 5 = very high)
- goout: going out with friends (numeric: from 1 = very low to 5 = very high)
- Dalc: workday alcohol consumption (numeric: from 1 = very low to 5 = very high)
- Walc: weekend alcohol consumption (numeric: from 1 = very low to 5 = very high)
- health: current health status (numeric: from 1 = very bad to 5 = very good)
- absences: number of school absences (numeric: from 0 to 93)
- G1: first period grade (numeric: from 0 to 20)
- G2: second period grade (numeric: from 0 to 20)
- G3: final grade (numeric: from 0 to 20, output target)