- [15 points] Consider a set of $n$ i.i.d. samples (one-dimensional training patterns), $D = \{x_1, x_2, \ldots, x_n\}$, that are drawn from the following distribution (Rayleigh distribution):
$$p(x|\theta) = 2\theta x e^{-\theta x^2}, \quad x \geq 0, \; \theta > 0.$$
- Derive the maximum likelihood estimate (MLE) of $\theta$, i.e., $\hat{\theta}_{mle}$.
- Consider a set of 1000 training patterns that can be accessed here. Plot the normalized histogram of the training patterns. In the same graph, plot the distribution $p(x|\hat{\theta}_{mle})$ after estimating $\hat{\theta}_{mle}$ from these training patterns.
- Using the same training patterns, determine the maximum likelihood estimates of the mean and variance of a Gaussian distribution. In this case, you can use the MLE formulae directly. Plot the resulting Gaussian distribution on the same graph as above (a plotting sketch for both fits follows this problem).
- Comment on which of the two distributions better fits the training data.
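A minimal Python sketch of the plotting steps, assuming the 1000 patterns sit in a local file `rayleigh_data.txt` (a hypothetical filename; the assignment links the actual data) and assuming the derivation in the first part yields $\hat{\theta}_{mle} = n / \sum_k x_k^2$:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.loadtxt("rayleigh_data.txt")  # hypothetical filename for the linked data
n = len(x)

# Rayleigh MLE (assumed result of the derivation in the first part).
theta_hat = n / np.sum(x**2)

# Gaussian MLE: sample mean and 1/n-normalized sample variance.
mu_hat, var_hat = x.mean(), x.var()

t = np.linspace(0, x.max(), 500)
rayleigh = 2 * theta_hat * t * np.exp(-theta_hat * t**2)
gaussian = np.exp(-(t - mu_hat)**2 / (2 * var_hat)) / np.sqrt(2 * np.pi * var_hat)

plt.hist(x, bins=50, density=True, alpha=0.4, label="normalized histogram")
plt.plot(t, rayleigh, label="Rayleigh fit")
plt.plot(t, gaussian, label="Gaussian fit")
plt.legend()
plt.show()
```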
- [10 points] Let $x$ have a uniform density
$$p(x) \sim U(0, \theta) = \begin{cases} 1/\theta, & 0 \leq x \leq \theta \\ 0, & \text{otherwise.} \end{cases}$$
- Suppose that $n$ samples $D = \{x_1, \ldots, x_n\}$ are drawn independently according to $p(x|\theta)$. Show that the MLE for $\theta$ is $\max[D]$, i.e., the value of the maximum element in $D$.
- Suppose that $n = 5$ points are drawn from the distribution, and the maximum value happens to be 0.6. Plot the likelihood $p(D|\theta)$ in the range $0 \leq \theta \leq 1$ (a plotting sketch follows). Explain in words why you do not need to know the values of the other 4 points.
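A minimal sketch of the likelihood plot, assuming the result of the previous part, i.e., $p(D|\theta) = \theta^{-n}$ for $\theta \geq \max[D]$ and 0 otherwise:

```python
import numpy as np
import matplotlib.pyplot as plt

n, x_max = 5, 0.6
theta = np.linspace(0.01, 1.0, 1000)

# The likelihood is zero until theta reaches max[D], then decays as theta^(-n);
# only the largest sample matters, which is why the other 4 values are irrelevant.
likelihood = np.where(theta >= x_max, theta**(-n), 0.0)

plt.plot(theta, likelihood)
plt.xlabel("theta")
plt.ylabel("p(D | theta)")
plt.show()
```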
- [15 points] Let $x = (x_1, \ldots, x_d)^t$ be a $d$-dimensional binary (0 or 1) vector with a multivariate Bernoulli distribution
$$P(x|\theta) = \prod_{i=1}^{d} \theta_i^{x_i} (1 - \theta_i)^{1 - x_i},$$
where $\theta = (\theta_1, \ldots, \theta_d)^t$ is an unknown parameter vector, $\theta_i$ being the probability that $x_i = 1$. Let $D = \{x_1, \ldots, x_n\}$ be a set of $n$ i.i.d. training samples. Show that the maximum likelihood estimate for $\theta$ is
$$\hat{\theta} = \frac{1}{n} \sum_{k=1}^{n} x_k.$$
(Hint: Consider deriving the MLE for a specific component, $\theta_i$, of vector $\theta$; a worked sketch of this one-component derivation follows.)
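As a hedged sketch of the hinted per-component derivation (writing $x_{ki}$ for the $i$-th component of the $k$-th sample):

```latex
% Log-likelihood of component i over the n i.i.d. samples, its stationary
% point, and the resulting per-component MLE.
\begin{align*}
\ln P(D \mid \theta_i)
  &= \sum_{k=1}^{n} \bigl[ x_{ki} \ln \theta_i + (1 - x_{ki}) \ln (1 - \theta_i) \bigr], \\
\frac{\partial \ln P(D \mid \theta_i)}{\partial \theta_i}
  &= \frac{\sum_{k} x_{ki}}{\theta_i} - \frac{n - \sum_{k} x_{ki}}{1 - \theta_i} = 0
  \;\Longrightarrow\;
  \hat{\theta}_i = \frac{1}{n} \sum_{k=1}^{n} x_{ki}.
\end{align*}
```

Stacking the $d$ components gives the vector form $\hat{\theta} = \frac{1}{n} \sum_k x_k$ stated above.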
- [30 points] Consider a two-category ($\omega_1$ and $\omega_2$) classification problem with equal priors. Each pattern is a two-dimensional feature vector $x = (x_1, x_2)^t$. The true class-conditional densities are:
$$p(x|\omega_1) \sim N(\mu_1 = [0, 0]^t, \Sigma_1 = I), \qquad p(x|\omega_2) \sim N(\mu_2 = [5, 5]^t, \Sigma_2 = I).$$
Generate n=50 bivariate random training samples from each of the two densities.
- Write a program to find the values of the maximum likelihood estimates of $\mu_1$, $\mu_2$, $\Sigma_1$, and $\Sigma_2$ using these training samples (see page 89; use equations (18) and (19)).
- Compute the Bayes decision boundary using the estimated parameters and plot it along with the training samples. What is the empirical error rate on the training samples?
- Compute the Bayes decision boundary using the true parameters and plot it on the same graph. What is the empirical error rate on the training samples?
- Repeat (a)-(c) after generating n=500 and n=50,000 random training samples from each of the two densities. How do the estimated parameters and the empirical error rates in (a) and (b) change as the number of training samples increases? (A simulation sketch follows this problem.)
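A minimal simulation sketch, assuming the $\frac{1}{n}$-normalized MLE covariance (equations (18)-(19) of the text) and a quadratic Gaussian discriminant built from the estimates; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def run(n):
    # Draw n training samples per class from the true densities.
    X1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), n)
    X2 = rng.multivariate_normal([5.0, 5.0], np.eye(2), n)

    # MLE: sample means and 1/n-normalized sample covariances.
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / n
    S2 = (X2 - mu2).T @ (X2 - mu2) / n

    def g(x, mu, S):
        # Gaussian log-density; equal priors cancel in the comparison.
        d = x - mu
        return -0.5 * d @ np.linalg.solve(S, d) - 0.5 * np.log(np.linalg.det(S))

    # Empirical error rate of the estimated-parameter classifier on the training set.
    errors = sum(g(x, mu1, S1) < g(x, mu2, S2) for x in X1)
    errors += sum(g(x, mu2, S2) < g(x, mu1, S1) for x in X2)
    return mu1, mu2, S1, S2, errors / (2 * n)

for n in (50, 500, 50_000):
    *_, err = run(n)
    print(f"n={n}: training error with estimated parameters = {err:.5f}")
```

With the true parameters ($\Sigma_1 = \Sigma_2 = I$, equal priors), the discriminant comparison reduces to the linear boundary $x_1 + x_2 = 5$, which can be overlaid on a scatter plot of the training samples.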
- [20 points] The iris (flower) dataset consists of 150 4-dimensional patterns (i.e., feature vectors) belonging to three classes (setosa=1, versicolor=2, and virginica=3). There are 50 patterns per class. The 4 features correspond to sepal length in cm (x1), sepal width in cm (x2), petal length in cm (x3), and petal width in cm (x4). Note that the class labels are indicated at the end of every pattern.
Assume that each class can be modeled by a multivariate Gaussian density, i.e., $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$, $i = 1, 2, 3$. Write a program to design a Bayes classifier and test it by following the steps below:
- Train the classifier: Using the first 25 patterns of each class (training data), compute $\hat{\mu}_i$ and $\hat{\Sigma}_i$, $i = 1, 2, 3$. Report these values.
- Design the Bayes classifier: Assuming that the three classes are equally probable and a 0-1 loss function, write a program that inputs a 4-dimensional pattern $x$ and assigns it to one of the three classes based on the maximum posterior rule, i.e., assign $x$ to $\omega_j$ if
$$j = \arg\max_{i=1,2,3} \{ P(\omega_i|x) \}.$$
- Test the classifier: Classify the remaining 25 patterns of each class (test data) using the Bayes classifier constructed above and report the confusion matrix for this three-class problem. What is the empirical error rate on the test set? (A minimal code sketch follows.)
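A minimal sketch of the split, training, and testing, assuming scikit-learn's bundled copy of the iris data (which is ordered 50 patterns per class, matching the description above); the course's own data file could be loaded instead:

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)  # 150 patterns, ordered 50 per class

train = np.concatenate([np.arange(c * 50, c * 50 + 25) for c in range(3)])
test = np.concatenate([np.arange(c * 50 + 25, (c + 1) * 50) for c in range(3)])

# Per-class MLE: sample mean and 1/n-normalized covariance.
params = []
for c in range(3):
    Xc = X[train][y[train] == c]
    mu = Xc.mean(axis=0)
    S = (Xc - mu).T @ (Xc - mu) / len(Xc)
    params.append((mu, S))

def classify(x):
    # Equal priors + 0-1 loss: maximizing P(w_i|x) is maximizing the class log-density.
    scores = [
        -0.5 * (x - mu) @ np.linalg.solve(S, x - mu) - 0.5 * np.log(np.linalg.det(S))
        for mu, S in params
    ]
    return int(np.argmax(scores))

pred = np.array([classify(x) for x in X[test]])
conf = np.zeros((3, 3), dtype=int)
for t, p in zip(y[test], pred):
    conf[t, p] += 1
print(conf)
print("empirical error rate:", (pred != y[test]).mean())
```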
- [20 points] The IMOX dataset consists of 192 8-dimensional patterns pertaining to four classes (digital characters I, M, O and X). There are 48 patterns per class. The 8 features correspond to the distance of a character to the (a) upper left boundary, (b) lower right boundary, (c) upper right boundary, (d) lower left boundary, (e) middle left boundary, (f) middle right boundary, (g) middle upper boundary, and (h) middle lower boundary. Note that the class labels (1, 2, 3 or 4) are indicated at the end of every pattern.
- Write a program to project these 8-dimensional points onto a two-dimensional plane using PCA (the top 2 eigenvectors). Report the two projection vectors estimated by the technique. Plot the entire dataset in two dimensions using these projection vectors. Use different markers to distinguish the patterns belonging to different classes.
- Write a program to project these 8-dimensional points onto a two-dimensional plane using MDA (the top 2 eigenvectors). Report the two projection vectors estimated by the technique. Plot the entire dataset in two dimensions using these projection vectors. Use different markers to distinguish the patterns belonging to different classes.
- Discuss the differences between the PCA and MDA projection vectors (a code sketch for both projections follows).
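A minimal sketch of both projections, assuming the IMOX data sits in a local file `imox.dat` laid out as 8 feature columns followed by a class label (a hypothetical filename and layout); PCA takes the top eigenvectors of the total covariance, MDA the top eigenvectors of $S_W^{-1} S_B$:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical filename/layout: 8 feature columns followed by a class label.
data = np.loadtxt("imox.dat")
X, y = data[:, :8], data[:, 8].astype(int)

# --- PCA: top 2 eigenvectors of the total covariance matrix ---
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc.T @ Xc / len(X))  # ascending eigenvalues
W_pca = evecs[:, ::-1][:, :2]                      # top 2 eigenvectors

# --- MDA: top 2 eigenvectors of inv(Sw) @ Sb ---
mu = X.mean(axis=0)
Sw, Sb = np.zeros((8, 8)), np.zeros((8, 8))
for c in np.unique(y):
    Xk = X[y == c]
    mk = Xk.mean(axis=0)
    Sw += (Xk - mk).T @ (Xk - mk)
    Sb += len(Xk) * np.outer(mk - mu, mk - mu)
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(evals.real)[::-1]
W_mda = evecs.real[:, order[:2]]

for W, title in [(W_pca, "PCA"), (W_mda, "MDA")]:
    print(title, "projection vectors:\n", W)
    Z = X @ W
    plt.figure()
    for c, marker in zip(np.unique(y), "o^sx"):
        Zc = Z[y == c]
        plt.scatter(Zc[:, 0], Zc[:, 1], marker=marker, label=f"class {c}")
    plt.title(title)
    plt.legend()
plt.show()
```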
- [20 points] Assume that the features in the 4-class 8-dimensional IMOX dataset described above are statistically independent. Further, assume that each feature for each of the four classes is normally distributed, i.e., $p(x_i|\omega_j) \sim N(\mu_{ij}, \sigma_{ij}^2)$, where $i = 1, \ldots, 8$ and $j = 1, \ldots, 4$.
- Report the MLE estimates of the mean and variance of each feature for each class, i.e., compute $\hat{\mu}_{ij}$ and $\hat{\sigma}_{ij}^2$, for $i = 1, \ldots, 8$ and $j = 1, \ldots, 4$.
- Assuming a 0-1 loss function and equal priors (and statistically independent features having a Gaussian form), design a Bayesian classifier that inputs an 8-dimensional pattern and assigns it to one of the four classes.
- Train this classifier using the first 24 patterns of each class (a total of 96 training patterns). Report the confusion matrix and the empirical error rate of this classifier on the remaining 24 patterns of each class (a total of 96 test patterns). (A code sketch follows.)
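A minimal sketch of this diagonal-Gaussian (independent-feature) classifier, reusing the hypothetical `imox.dat` layout from the previous problem:

```python
import numpy as np

# Hypothetical filename/layout, as in the PCA/MDA sketch above.
data = np.loadtxt("imox.dat")
X, y = data[:, :8], data[:, 8].astype(int)

classes = np.unique(y)
train = np.concatenate([np.where(y == c)[0][:24] for c in classes])
test = np.concatenate([np.where(y == c)[0][24:] for c in classes])

# Per-class, per-feature MLE: mean and 1/n-normalized variance.
mu = np.array([X[train][y[train] == c].mean(axis=0) for c in classes])
var = np.array([X[train][y[train] == c].var(axis=0) for c in classes])

def classify(x):
    # Features assumed independent: sum the per-feature Gaussian log-densities.
    log_like = -0.5 * np.sum((x - mu)**2 / var + np.log(2 * np.pi * var), axis=1)
    return int(np.argmax(log_like))  # equal priors, 0-1 loss

pred = np.array([classify(x) for x in X[test]])
true_lbl = np.searchsorted(classes, y[test])  # map labels 1..4 to indices 0..3

conf = np.zeros((4, 4), dtype=int)
for t, p in zip(true_lbl, pred):
    conf[t, p] += 1
print(conf)
print("empirical error rate:", (pred != true_lbl).mean())
```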
- [20 points] Consider a dataset in which every pattern is represented by a set of 15 features. The goal is to identify a subset of 5 or fewer features that gives the best performance on this dataset. How many feature subsets would each of the following feature selection algorithms consider before identifying a solution (i.e., how many times would the criterion function $J(\cdot)$ be invoked)?
- SFS;
- Plus-l-take-away-r with (l, r)=(5,3);
- SBS;
- Exhaustive search (a counting sketch for this case follows).
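As a sanity check for the exhaustive case only (a sketch; the counts for SFS, SBS, and plus-l-take-away-r depend on the exact stopping convention used in class), the number of non-empty subsets of at most 5 of the 15 features is:

```python
from math import comb

# Exhaustive search evaluates J(.) once per candidate subset of size 1..5.
total = sum(comb(15, k) for k in range(1, 6))
print(total)  # 15 + 105 + 455 + 1365 + 3003 = 4943
```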