- Please answer the following questions related to Machine Learning concepts:
- [3 points] Explain what the bias-variance trade-off is.
- [3 points] Describe a few techniques for reducing bias and for reducing variance, respectively.
- [4 points] Describe the following regression measures:
- 1) RMSE, 2) R²
- [10 points] Explain the following concepts related to classification measures:
- confusion matrix,
- precision,
- recall,
- F1 score,
- ROC curve.
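As a reference point for the measures named in the questions above, here is a minimal pure-Python sketch of their standard textbook definitions. The function names are illustrative, not part of the assignment, and `roc_points` assumes that both classes are present in `y_true`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN), the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_points(y_true, scores, positive=1):
    """(FPR, TPR) pairs obtained by thresholding the scores at each distinct value."""
    pos = sum(1 for t in y_true if t == positive)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == positive)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t != positive)
        points.append((fp / neg, tp / pos))
    return points
```

Plotting the pairs returned by `roc_points` with FPR on the x-axis and TPR on the y-axis gives the ROC curve.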
- Please answer the following questions related to Concept Learning:
- [10 points] Consider the following training examples (similar to EnjoySport, but with slightly different attributes) and the hypothesis space H described in lecture 3-2 (i.e., each hypothesis is a conjunction of attribute constraints).
Example | Temp | Humidity | Water | Sky | EnjoySport |
1 | Warm | Normal | Warm | Sunny | Yes |
2 | Warm | Normal | Cold | Sunny | Yes |
3 | Cold | Normal | Warm | Rainy | No |
4 | Warm | High | Warm | Sunny | Yes |
Trace the Candidate-Elimination algorithm to show the sequence of S and G boundary sets.
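A hand trace can be cross-checked with a small program. The following is a minimal sketch of Candidate-Elimination for conjunctive hypotheses, not a reference solution. It assumes that each attribute's domain consists only of the values that appear in the table, and it uses "?" for "any value" and "0" for "no value"; all function names are illustrative.

```python
ANY, NONE = "?", "0"  # "?" accepts any value; "0" accepts no value

# Training examples copied from the table: (Temp, Humidity, Water, Sky), label
EXAMPLES = [
    (("Warm", "Normal", "Warm", "Sunny"), True),
    (("Warm", "Normal", "Cold", "Sunny"), True),
    (("Cold", "Normal", "Warm", "Rainy"), False),
    (("Warm", "High", "Warm", "Sunny"), True),
]

def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == ANY or hv == xv for hv, xv in zip(h, x))

def at_least_as_general(h1, h2):
    """True if h1 covers every instance that h2 covers."""
    return all(a == ANY or a == b or b == NONE for a, b in zip(h1, h2))

def generalize(s, x):
    """Minimal generalization of s that covers x."""
    return tuple(xv if sv == NONE else (sv if sv == xv else ANY)
                 for sv, xv in zip(s, x))

def specializations(g, x, domains):
    """Minimal specializations of g that exclude x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, gv in enumerate(g) if gv == ANY
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples):
    n = len(examples[0][0])
    # Assumption: each attribute's domain is the set of values seen in the data.
    domains = [list(dict.fromkeys(x[i] for x, _ in examples)) for i in range(n)]
    S, G = [(NONE,) * n], [(ANY,) * n]
    trace = [(S, G)]
    for x, positive in examples:
        if positive:
            # Drop general hypotheses that miss x, then generalize S to cover x.
            G = [g for g in G if covers(g, x)]
            S = [s if covers(s, x) else generalize(s, x) for s in S]
            S = [s for s in S if any(at_least_as_general(g, s) for g in G)]
        else:
            # Drop specific hypotheses that cover x, then specialize G to exclude x.
            S = [s for s in S if not covers(s, x)]
            candidates = []
            for g in G:
                if not covers(g, x):
                    candidates.append(g)
                else:
                    candidates += [h for h in specializations(g, x, domains)
                                   if any(at_least_as_general(h, s) for s in S)]
            candidates = list(dict.fromkeys(candidates))  # remove duplicates
            # Keep only maximally general members.
            G = [g for g in candidates
                 if not any(h != g and at_least_as_general(h, g) for h in candidates)]
        trace.append((S, G))
    return trace

trace = candidate_elimination(EXAMPLES)
for step, (S, G) in enumerate(trace):
    print(f"S{step} = {S}   G{step} = {G}")
```

Each printed line shows the S and G boundary sets after processing that many examples, starting from the initial boundaries at step 0.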
- Programming problem (40 points):
- In this programming problem, you will get familiar with building a decision tree, using cross validation to prune a tree, evaluating the tree performance, and interpreting the result.
Potential packages to use and short tutorials:
(1)http://scikit-learn.org/stable/modules/tree.html
(2)http://chrisstrelioff.ws/sandbox/2015/06/25/decision_trees_in_python_again_cross_validation.html
classification tree
Use the titanic.csv dataset included in the assignment.
Step 1: Read in titanic.csv and inspect a few samples; some features are categorical and others are numerical. Take a random 70% of the samples for training and the remaining 30% for testing.
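One possible way to carry out Step 1 is sketched below with pandas and scikit-learn. The helper names `load_titanic` and `split_70_30` are hypothetical, and the fixed `seed` is an assumption made only so that the split is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def load_titanic(path="titanic.csv"):
    """Read the CSV and print a few rows plus the column types."""
    df = pd.read_csv(path)
    print(df.head())   # a few samples
    print(df.dtypes)   # object columns are categorical, numeric columns are numerical
    return df

def split_70_30(df, seed=0):
    """Random 70% / 30% train/test split; the seed is fixed only for reproducibility."""
    train_df, test_df = train_test_split(df, test_size=0.3, random_state=seed)
    return train_df, test_df
```

A typical call would be `train_df, test_df = split_70_30(load_titanic())`.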
Step 2: Fit a decision tree model using the independent variables pclass + sex + age + sibsp and the dependent variable survived. Plot the full tree. Make sure survived is a qualitative variable taking the value 1 (yes) or 0 (no) in your code. You may see a tree similar to this one:
Step 3: Check the performance of the full model: in-sample and out-of-sample accuracy, defined as:
- in-sample percent of survivors correctly predicted (on the training set)
- in-sample percent of fatalities correctly predicted (on the training set)
- out-of-sample percent of survivors correctly predicted (on the test set)
- out-of-sample percent of fatalities correctly predicted (on the test set)
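Steps 2 and 3 might be sketched as follows with scikit-learn. This is one possible approach under stated assumptions: the column names match the variables listed in Step 2, sex takes the values "male" and "female", and missing ages are filled with the training-set median (the assignment does not say how to handle missing values). The helper names are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ["pclass", "sex", "age", "sibsp"]

def prepare(df, age_fill):
    """Return (X, y) with sex encoded as 0/1 and missing ages replaced by age_fill."""
    X = df[FEATURES].copy()
    X["sex"] = (X["sex"] == "female").astype(int)  # assumes values "male"/"female"
    X["age"] = X["age"].fillna(age_fill)
    y = df["survived"].astype(int)                 # 1 = survived, 0 = did not
    return X, y

def class_accuracy(y_true, y_pred):
    """Percent of survivors (1) and of fatalities (0) that were predicted correctly."""
    pairs = list(zip(y_true, y_pred))
    result = {}
    for label, name in ((1, "survivors"), (0, "fatalities")):
        preds = [p for t, p in pairs if t == label]
        hits = sum(1 for p in preds if p == label)
        result[name] = 100.0 * hits / len(preds) if preds else float("nan")
    return result

def fit_and_report(train_df, test_df, seed=0):
    """Fit an unrestricted tree and return it with in-sample and out-of-sample accuracy."""
    age_fill = train_df["age"].median()            # assumption: median imputation
    X_train, y_train = prepare(train_df, age_fill)
    X_test, y_test = prepare(test_df, age_fill)
    tree = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    print(export_text(tree, feature_names=FEATURES))  # text rendering of the full tree
    in_sample = class_accuracy(y_train, tree.predict(X_train))
    out_of_sample = class_accuracy(y_test, tree.predict(X_test))
    return tree, in_sample, out_of_sample
```

For a graphical plot instead of the text rendering, `sklearn.tree.plot_tree` can be used together with matplotlib.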
Step 4: Use cross-validation to find the best parameter for pruning the tree. You should be able to plot a graph with the tree size on the x-axis and the number of misclassifications on the y-axis. Find the minimum number of misclassifications and choose the corresponding tree size to prune the tree. You may have a plot similar to:
Step 5: Prune the tree to the optimal tree size. Plot the pruned tree. You may see a tree similar to this one:
Step 6: Report as many details as you can on the final pruned tree.
Required: report the in-sample and out-of-sample accuracy, defined as:
- in-sample percent of survivors correctly predicted (on the training set)
- in-sample percent of fatalities correctly predicted (on the training set)
- out-of-sample percent of survivors correctly predicted (on the test set)
- out-of-sample percent of fatalities correctly predicted (on the test set)
Check whether out-of-sample accuracy improves from the full tree (bigger model) to the pruned tree (smaller model).
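The size selection in Step 4 and the pruning in Step 5 could be approximated in scikit-learn as sketched below. Note the substitution: rather than pruning a grown tree, this sketch regrows the tree with a cap on the number of leaves (`max_leaf_nodes`), treating the leaf count as the "tree size". That interpretation, the helper names, and the default of 10 folds are assumptions; scikit-learn's cost-complexity pruning (`ccp_alpha`) is an alternative.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def cv_misclassifications(X, y, sizes, folds=10, seed=0):
    """Map each candidate tree size (leaf count, >= 2) to its cross-validated error count."""
    y = np.asarray(y)
    errors = {}
    for size in sizes:
        tree = DecisionTreeClassifier(max_leaf_nodes=size, random_state=seed)
        pred = cross_val_predict(tree, X, y, cv=folds)  # held-out prediction per sample
        errors[size] = int((pred != y).sum())
    return errors

def prune_to_best_size(X, y, sizes, folds=10, seed=0):
    """Pick the size with the fewest cross-validated errors and refit at that size."""
    errors = cv_misclassifications(X, y, sizes, folds, seed)
    # min() returns the first minimum, so with ascending sizes ties go to the smaller tree.
    best = min(errors, key=errors.get)
    pruned = DecisionTreeClassifier(max_leaf_nodes=best, random_state=seed).fit(X, y)
    return pruned, best, errors
```

Plotting the keys of `errors` against its values gives the graph requested in Step 4, and evaluating `pruned` on the training and test sets with the same per-class accuracy measures as the full tree supports the comparison asked for above.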