1 Naive Bayes
The binary Naive Bayes classifier has interesting connections to the logistic regression classifier. You will show that, under certain assumptions, the Naive Bayes posterior P(Y = 1|X) is identical in form to the posterior of a logistic regression classifier. You will then derive the MLE parameter estimates under these assumptions.
- Suppose X = (X_1, ..., X_D) is a continuous random vector in R^D representing the features, and Y is a binary random variable with values in {0,1} representing the class labels. Let the following assumptions hold:
- The label variable Y follows a Bernoulli distribution with parameter π = P(Y = 1).
- For each feature X_j, P(X_j | Y = y_k) follows a Gaussian distribution of the form N(μ_{jk}, σ_j).
Using the Naive Bayes assumption, which states that for all j' ≠ j, X_j and X_{j'} are conditionally independent given Y, compute P(Y = 1|X) and show that it can be written in the following form:

P(Y = 1|X) = 1 / (1 + exp(-(w_0 + w^T X))).
Specifically, you need to find the explicit form of w_0 and w in terms of π, μ_{jk}, and σ_j, for j = 1, ..., D and k ∈ {0,1}. (A setup sketch appears at the end of this section.)
- Suppose a training set with N examples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) is given, where x_i = (x_{i1}, ..., x_{iD})^T is a D-dimensional feature vector and y_i ∈ {0,1} is its corresponding label. Using the assumptions in 1.1 (not the result), provide the maximum likelihood estimates for the parameters of Naive Bayes under the Gaussian assumption. In other words, you need to provide the estimates for π, μ_{jk}, and σ_j, for j = 1, ..., D and k ∈ {0,1}.
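As a non-authoritative starting point for the two questions above (a setup sketch in the notation already defined, not the derivation itself), Bayes' rule combined with the conditional-independence assumption gives

\[
P(Y = 1 \mid X)
  = \frac{P(Y = 1) \prod_{j=1}^{D} P(X_j \mid Y = 1)}
         {P(Y = 1) \prod_{j=1}^{D} P(X_j \mid Y = 1) + P(Y = 0) \prod_{j=1}^{D} P(X_j \mid Y = 0)}.
\]

Dividing the numerator and denominator by the numerator, substituting the Gaussian densities N(μ_{jk}, σ_j) for the factors P(X_j | Y = y_k), and collecting the terms in the exponent is one route to the requested sigmoid form. For the second question, the estimates follow from maximizing the joint log-likelihood of the N training pairs under the same assumptions.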
2 Logistic Regression
Consider a binary logistic regression model, where the training samples are linearly separable.
- Given n training examples (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where y_i ∈ {0,1}, write down the negative log likelihood, L(w), in terms of the sigmoid function σ(z) = 1/(1 + exp(-z)), the inputs x_i, and the labels y_i.
- Is this loss function convex? Provide your reasoning.
- Show that the magnitude of the optimal w can go to infinity when the training samples are linearly separable.
- A convenient way to prevent such numerical instability issues is to add a penalty term to the negative log likelihood as follows:

L_reg(w) = L(w) + λ ||w||^2,    (1)

where ||w||^2 = Σ_i w_i^2 and λ > 0. Compute the gradient with respect to w_i, i.e., ∂L_reg(w)/∂w_i. (A numerical gradient-check sketch appears at the end of this section.)
- Show that the problem in Eq. (1) has a unique solution.
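The following is a minimal, hedged MATLAB/Octave sketch for numerically sanity-checking a hand-derived gradient of the penalized objective in Eq. (1). It assumes the specific form L_reg(w) = -Σ_i [y_i log σ(x_i^T w) + (1 - y_i) log(1 - σ(x_i^T w))] + λ ||w||^2 with placeholder random data; adjust it to match your own derivation.

% Finite-difference check of dL_reg/dw_i at a random point.
sigmoid = @(z) 1 ./ (1 + exp(-z));
loss = @(w, X, y, lambda) -sum(y .* log(sigmoid(X * w)) ...
                               + (1 - y) .* log(1 - sigmoid(X * w))) ...
                          + lambda * (w' * w);

n = 20; d = 3; lambda = 0.1;
X = randn(n, d); y = double(rand(n, 1) > 0.5); w = randn(d, 1);

eps_fd = 1e-6;
g_num = zeros(d, 1);
for i = 1:d
    e = zeros(d, 1); e(i) = eps_fd;
    g_num(i) = (loss(w + e, X, y, lambda) - loss(w - e, X, y, lambda)) / (2 * eps_fd);
end
disp(g_num');   % compare entry i against your analytic gradient dL_reg/dw_i at the same w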
3 Decision Trees
- Suppose you want to grow a decision tree to predict the accident rate based on the following accident data, which summarizes the rate of accidents across 100 observations. Which predictor variable (weather or traffic) would you choose to split on in the first step in order to maximize the information gain? (A computational sketch appears at the end of this section.)
Weather | Traffic | Accident Rate | Number of observations |
Sunny | Heavy | High | 23 |
Sunny | Light | Low | 5 |
Rainy | Heavy | High | 50 |
Rainy | Light | Low | 22 |
- Suppose that in another dataset, two students experiment with decision trees. The first student runs the decision tree learning algorithm on the raw data and obtains a tree T1. The second student normalizes the data by subtracting the mean and dividing by the variance of each feature, and then runs the same decision tree algorithm with the same parameters, obtaining a tree T2. How do T1 and T2 compare? Explain your reasoning.
- Consider the splitting criteria used by the decision tree learning algorithm. As discussed in ESL, Section 9.2.3, the most common splitting criteria are the Gini index and cross-entropy. Both of these can be viewed as smooth, concave surrogates for the misclassification error. Prove that, for any discrete probability distribution p with K classes, the value of the Gini index is less than or equal to the corresponding value of the cross-entropy. This implies that the Gini index more closely approximates the misclassification error.
Definitions: For a K-valued discrete random variable with probability mass function p_i, i = 1, ..., K, the Gini index is defined as G(p) = Σ_{i=1}^{K} p_i (1 - p_i), and the cross-entropy is defined as H(p) = -Σ_{i=1}^{K} p_i log(p_i).
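For the proof question above, one elementary inequality that may be useful (assuming the natural logarithm in the cross-entropy; for base 2 the bound only becomes looser) is

\[
\log p_i \le p_i - 1, \quad \text{equivalently} \quad -\log p_i \ge 1 - p_i, \qquad \text{for all } p_i \in (0, 1].
\]

Multiplying both sides by p_i ≥ 0 and summing over i compares the two impurity measures term by term.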
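For the accident-rate question at the start of this section, the following is a hedged MATLAB/Octave sketch (assuming the entropy-based definition of information gain and the counts from the table; the variable names are placeholders) of how the two candidate first splits could be compared numerically:

% Row order matches the table: Sunny/Heavy, Sunny/Light, Rainy/Heavy, Rainy/Light.
n_high = [23 0 50 0];        % observations with a High accident rate in each row
n_low  = [0 5 0 22];         % observations with a Low accident rate in each row
entropy = @(p) -sum(p(p > 0) .* log2(p(p > 0)));   % convention: 0 * log2(0) = 0

N = sum(n_high) + sum(n_low);
H_parent = entropy([sum(n_high), sum(n_low)] / N);

splits = {{[1 2], [3 4]}, ...   % Weather: Sunny rows vs. Rainy rows
          {[1 3], [2 4]}};      % Traffic: Heavy rows vs. Light rows
names = {'weather', 'traffic'};
for s = 1:numel(splits)
    H_children = 0;
    for b = 1:numel(splits{s})
        idx = splits{s}{b};
        nb = sum(n_high(idx)) + sum(n_low(idx));
        H_children = H_children + (nb / N) * ...
            entropy([sum(n_high(idx)), sum(n_low(idx))] / nb);
    end
    fprintf('IG(%s) = %.4f\n', names{s}, H_parent - H_children);
end

The same loop can be reused with the Gini index by swapping out the entropy function.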
4 Comparing Classifiers in MATLAB/Octave
In this problem, you will work with the same dataset as in HW1, and compare the performance of various classification algorithms. Starting with the one-hot-encoded version of the data that you generated in Question 4a in HW1, perform the following steps:
- Fill in the function naive_bayes in the naive.m file. In particular, implement the Bernoulli Naive Bayes model from scratch (this will first require you to compute the MLE estimates). The inputs of this function are the training data and the new data (either validation or testing data). The function needs to output the accuracy on both the training data and the new data. Note that some feature values might exist in the validation/testing data that do not exist in the training data; in that case, set the probability of that feature value to a small value, for example, 0.1. Note: you should NOT use any related MATLAB toolbox functions, e.g., NaiveBayes.fit, to implement Naive Bayes. (A minimal sketch of a Bernoulli Naive Bayes classifier appears at the end of this section.)
- Compare the four algorithms (kNN, Naive Bayes, Decision Tree, and Logistic Regression) on the provided dataset. For each algorithm, report accuracies as detailed below, and describe the relative performance of these algorithms in a few sentences.
kNN: Report results from HW1.
Decision Tree: Train decision trees using the function ClassificationTree.fit or fitctree in MATLAB. Report the training, validation, and test accuracy for the two split criteria (Gini index and cross-entropy, selected via the SplitCriterion attribute) and for minimum leaf sizes of 1, 2, ..., 10 (via the MinLeaf attribute). Thus, in total you will report results for 2 × 10 = 20 different cases. When training the decision trees, turn off pruning using the Prune attribute. (A sketch of this training grid appears at the end of this section.)
Naive Bayes: Report the training, validation and test accuracy.
Logistic Regression: Train multi-class logistic regression using the function mnrfit in MATLAB. Report the training, validation, and test accuracy.
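Below is a minimal, hedged MATLAB/Octave sketch of a Bernoulli Naive Bayes classifier, included only for orientation. It assumes binary (one-hot) features with one example per row, labels encoded as integers 1..K, and a hypothetical function name; the required naive.m interface (which must also report the training accuracy) may differ.

% Hypothetical helper, not the required naive.m signature.
% Usage sketch: acc_val = nb_sketch_accuracy(Xtrain, ytrain, Xval, yval);
function acc = nb_sketch_accuracy(Xtrain, ytrain, Xnew, ynew)
    K = max(ytrain);                  % assumes labels are 1..K
    D = size(Xtrain, 2);
    prior = zeros(K, 1);              % MLE of P(y = k)
    theta = zeros(K, D);              % MLE of P(x_j = 1 | y = k)
    for k = 1:K
        idx = (ytrain == k);
        prior(k) = mean(idx);
        theta(k, :) = mean(Xtrain(idx, :), 1);
    end
    % Clamp estimates away from 0 and 1 so feature values unseen in training
    % receive a small probability (the handout suggests e.g. 0.1) instead of zero.
    eps_p = 0.1;
    theta = min(max(theta, eps_p), 1 - eps_p);

    % Log-posterior (up to an additive constant) of each class for each example.
    logprior = log(prior');                                     % 1 x K
    logpost = Xnew * log(theta') + (1 - Xnew) * log(1 - theta') ...
              + repmat(logprior, size(Xnew, 1), 1);
    [~, yhat] = max(logpost, [], 2);
    acc = mean(yhat == ynew(:));
end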
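The decision-tree grid described above can be produced with a loop like the hedged sketch below, assuming numeric labels and placeholder variable names (Xtrain, ytrain, Xval, yval, Xtest, ytest). Newer MATLAB releases use fitctree with the MinLeafSize name-value pair, while older releases use ClassificationTree.fit with MinLeaf.

criteria = {'gdi', 'deviance'};      % Gini index and cross-entropy, respectively
for c = 1:numel(criteria)
    for minleaf = 1:10
        tree = fitctree(Xtrain, ytrain, ...
                        'SplitCriterion', criteria{c}, ...
                        'MinLeafSize', minleaf, ...
                        'Prune', 'off');
        acc_tr  = mean(predict(tree, Xtrain) == ytrain);
        acc_val = mean(predict(tree, Xval)   == yval);
        acc_te  = mean(predict(tree, Xtest)  == ytest);
        fprintf('%s, MinLeaf = %2d: train %.3f, val %.3f, test %.3f\n', ...
                criteria{c}, minleaf, acc_tr, acc_val, acc_te);
    end
end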
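Finally, a hedged usage sketch of multi-class logistic regression with mnrfit/mnrval, assuming the labels are positive integers 1..K and the same placeholder variable names; the training and test accuracies follow the same pattern.

B = mnrfit(Xtrain, ytrain);          % fit a nominal multinomial logistic model
probs = mnrval(B, Xval);             % class probabilities for the validation data
[~, yhat_val] = max(probs, [], 2);   % predicted class = most probable class
acc_val = mean(yhat_val == yval(:));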