
[SOLVED] CSE/ISYE 6740 Homeworks 1 to 4 Solution

$25

File Name: Cse_isye_6740_homeworks_1_to_4_solution.zip
File Size: 367.38 KB

5/5 - (1 vote)

1 Probability [15 pts]
(a) Stores A, B, and C have 50, 75, and 100 employees and, respectively, 50, 60, and 70 percent
of these are women. Resignations are equally likely among all employees, regardless of stores
and sex. Suppose an employee resigned, and this was a woman. What is the probability that
she has worked in store C? [5 pts]
(b) A laboratory blood test is 95 percent effective in detecting a certain disease when it is, in
fact, present. The test also yields a false positive result for 1 percent of the healthy persons
tested. That is, if a healthy person is tested then with probability 0.01 the test result will
imply he has the disease. If 0.5 percent of the population actually has the disease, what is the
probability a person has the disease given that his test result is positive? [5 pts]
[c-d] On the morning of September 30, 1982, the won-lost records of the three leading baseball teams in the
western division of the National League of the United States were as follows:
Team Won Lost
Atlanta Braves 87 72
San Francisco Giants 86 73
Los Angeles Dodgers 86 73
Each team had 3 games remaining to be played. All 3 of the Giants games were with the Dodgers, and
the 3 remaining games of the Braves were against the San Diego Padres. Suppose that the outcomes of all
remaining games are independent and each game is equally likely to be won by either participant. If two
teams tie for first place, they have a playoff game, which each team has an equal chance of winning.
(c) What is the probability that the Atlanta Braves win the division? [2 pts]
(d) What is the probability that an additional playoff game is needed? [3 pts]
Reference: Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
2 Maximum Likelihood [15 pts]
Suppose we have n i.i.d. (independent and identically distributed) data samples from the following probability
distribution. This problem asks you to build the log-likelihood function and find the maximum likelihood
estimator of the parameter(s).
(a) Poisson distribution [5 pts]
The Poisson distribution is defined as
$$P(x_i = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots$$
What is the maximum likelihood estimator of λ?
(b) Multinomial distribution [5 pts]
The probability density function of Multinomial distribution is given by
$$f(x_1, x_2, \ldots, x_k; n, \theta_1, \theta_2, \ldots, \theta_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!} \prod_{j=1}^{k} \theta_j^{x_j},$$
where $\sum_{j=1}^{k} \theta_j = 1$ and $\sum_{j=1}^{k} x_j = n$. What is the maximum likelihood estimator of $\theta_j$, $j = 1, \ldots, k$?
(c) Gaussian normal distribution [5 pts]
Suppose we have n i.i.d. (independent and identically distributed) data samples from a univariate Gaussian (normal) distribution $\mathcal{N}(\mu, \sigma^2)$, which is given by
$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$
What are the maximum likelihood estimators of $\mu$ and $\sigma^2$?
3 Principal Component Analysis [20 pts]
In class, we learned that Principal Component Analysis (PCA) preserves variance as much as possible. We
are going to explore another way of deriving it: minimizing reconstruction error.
Consider data points $x^n$ ($n = 1, \ldots, N$) in a D-dimensional space. We are going to represent them in an orthonormal basis $\{u_1, \ldots, u_D\}$. That is,
$$x^n = \sum_{i=1}^{D} \alpha_i^n u_i = \sum_{i=1}^{D} (x^{n\top} u_i)\, u_i.$$
Here, $\alpha_i^n$ is the length of the projection of $x^n$ onto $u_i$.
Suppose we want to reduce the dimension from D to M < D. Then the data point $x^n$ is approximated by
$$\tilde{x}^n = \sum_{i=1}^{M} z_i^n u_i + \sum_{i=M+1}^{D} b_i u_i.$$
In this representation, the first M directions $u_i$ are allowed to have a different coefficient $z_i^n$ for each data point, while the rest have a constant coefficient $b_i$. As long as it is the same value for all data points, it does not need to be 0.
Our goal is to set $u_i$, $z_i^n$, and $b_i$ for $n = 1, \ldots, N$ and $i = 1, \ldots, D$ so as to minimize the reconstruction error, that is, to minimize the difference between $x^n$ and $\tilde{x}^n$ over $\{u_i, z_i^n, b_i\}$:
$$J = \frac{1}{N} \sum_{n=1}^{N} \|x^n - \tilde{x}^n\|^2.$$
(a) What is the assignment of $z_j^n$ for $j = 1, \ldots, M$ minimizing J? [5 pts]
(b) What is the assignment of $b_j$ for $j = M+1, \ldots, D$ minimizing J? [5 pts]
(c) Express the optimal $\tilde{x}^n$ and $x^n - \tilde{x}^n$ using your answers for (a) and (b). [2 pts]
(d) What should the $u_i$ for $i = 1, \ldots, D$ be to minimize J? [8 pts]
Hint: Use the sample covariance matrix $S = \frac{1}{N} \sum_{n=1}^{N} (x^n - \bar{x})(x^n - \bar{x})^\top$.
4 Clustering [20 pts]
[a-b] Given N data points $x^n$ ($n = 1, \ldots, N$), the K-means clustering algorithm groups them into K clusters by minimizing the distortion function over $\{r^{nk}, \mu^k\}$:
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r^{nk} \|x^n - \mu^k\|^2,$$
where $r^{nk} = 1$ if $x^n$ belongs to the k-th cluster and $r^{nk} = 0$ otherwise.
(a) Prove that, using the squared Euclidean distance $\|x^n - \mu^k\|^2$ as the dissimilarity function and minimizing the distortion function, we will have
$$\mu^k = \frac{\sum_n r^{nk} x^n}{\sum_n r^{nk}}.$$
That is, $\mu^k$ is the center of the k-th cluster. [5 pts]
(b) Prove that the K-means algorithm converges to a local optimum in a finite number of steps. [5 pts]
[c-d] In class, we discussed bottom-up hierarchical clustering. For each iteration, we need to find the two clusters $\{x_1, x_2, \ldots, x_m\}$ and $\{y_1, y_2, \ldots, y_p\}$ with the minimum distance to merge. Some of the most commonly used distance metrics between two clusters are:
• Single linkage: the minimum distance between any pair of points from the two clusters, i.e.,
$$\min_{i=1,\ldots,m;\ j=1,\ldots,p} \|x_i - y_j\|$$
• Complete linkage: the maximum distance between any pair of points from the two clusters, i.e.,
$$\max_{i=1,\ldots,m;\ j=1,\ldots,p} \|x_i - y_j\|$$
• Average linkage: the average distance between all pairs of points from the two clusters, i.e.,
$$\frac{1}{mp} \sum_{i=1}^{m} \sum_{j=1}^{p} \|x_i - y_j\|$$
(c) When we use bottom-up hierarchical clustering to partition the data, which of the three cluster distance metrics described above would most likely result in clusters most similar to those given by K-means? (Suppose K is a power of 2 in this case.) [5 pts]
(d) For the following data (two moons), which of these three distance metrics (if any) would
successfully separate the two moons? [5 pts]
[Figure: scatter plot of the two-moons dataset]
5 Programming: Image compression [30 pts]
In this programming assignment, you are going to apply clustering algorithms to image compression. Before starting this assignment, we strongly recommend reading PRML Section 9.1.1, pages 428-430.
To ease your implementation, we provide skeleton code containing the image-processing part. homework1.m is designed to read an RGB bitmap image file and then cluster its pixels into the given number of clusters K. It shows the converted image using only K colors, each of them being the representative color of a centroid. To see what it looks like, you are encouraged to run, for example, homework1('beach.bmp', 3) or homework1('football.bmp', 2).
Your task is to implement the clustering part with two algorithms: K-means and K-medoids. We learned and demonstrated K-means in class, so you may start from the sample code we distributed.
The files you need to edit are mykmeans.m and mykmedoids.m, provided with this homework. In these files, you can see that they initially call the Matlab function kmeans. Comment this line out and implement your own version in the files. You should expect to see a similar result with your implementation of K-means as with the kmeans function in Matlab.
K-medoids
In class, we learned that the basic K-means works in Euclidean space, both for computing distances between data points and for updating centroids by the arithmetic mean. Sometimes, however, a dataset may work better with other distance measures. It is sometimes even impossible to compute an arithmetic mean if a feature is categorical, e.g., the gender or nationality of a person. With K-medoids, you choose a representative data point for each cluster instead of computing their average.
Given N data points $x^n$ ($n = 1, \ldots, N$), the K-medoids clustering algorithm groups them into K clusters by minimizing the distortion function $J = \sum_{n=1}^{N} \sum_{k=1}^{K} r^{nk} D(x^n, \mu^k)$, where $D(x, y)$ is a distance measure between two vectors $x$ and $y$ of the same size (in the case of K-means, $D(x, y) = \|x - y\|^2$), $\mu^k$ is the center of the k-th cluster, and $r^{nk} = 1$ if $x^n$ belongs to the k-th cluster and $r^{nk} = 0$ otherwise. In this exercise, we will use the following iterative procedure:
• Initialize the cluster centers $\mu^k$, $k = 1, \ldots, K$.
• Iterate until convergence:
– Update the cluster assignment for every data point $x^n$: $r^{nk} = 1$ if $k = \arg\min_j D(x^n, \mu^j)$, and $r^{nk} = 0$ otherwise.
– Update the center of each cluster k, choosing another representative if necessary.
There are many options for implementing this procedure; for example, you can try distance measures other than the Euclidean distance, and you can also be creative in deciding a better representative for each cluster. We do not restrict these choices in this assignment. You are encouraged to try several distance measures as well as several ways of choosing representatives.
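The procedure above maps directly onto a short MATLAB loop. Below is a minimal sketch, not the provided skeleton: the function name, the random initialization, the squared Euclidean distance, and the choice of the member closest to the cluster mean as the new medoid are illustrative assumptions, and implicit expansion requires MATLAB R2016b or later.

```matlab
function [class, centroid] = kmedoidsSketch(pixels, K)
% Minimal K-medoids sketch: pixels is N-by-3, K is the number of clusters.
    pixels = double(pixels);
    N = size(pixels, 1);
    centroid = pixels(randperm(N, K), :);        % start from K random pixels
    class = zeros(N, 1);
    for iter = 1:100
        % Assignment step: nearest medoid under squared Euclidean distance
        d = zeros(N, K);
        for k = 1:K
            d(:, k) = sum((pixels - centroid(k, :)).^2, 2);
        end
        [~, newClass] = min(d, [], 2);
        if isequal(newClass, class), break; end  % assignments stopped changing
        class = newClass;
        % Update step: pick the member closest to the cluster mean as medoid
        for k = 1:K
            members = pixels(class == k, :);
            if isempty(members), continue; end   % keep the old medoid if empty
            [~, idx] = min(sum((members - mean(members, 1)).^2, 2));
            centroid(k, :) = members(idx, :);
        end
    end
end
```

Swapping in a different D(x, y) only changes the two distance computations, which is where you would experiment with other distance measures.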
Formatting instruction
Both mykmeans.m and mykmedoids.m use the input and output format below. You should not alter this definition; otherwise your submission will print an error, which leads to zero credit.
Input
• pixels: the input image representation. Each row contains one data point (pixel). For the image dataset, it contains 3 columns, corresponding to the red, green, and blue components. Each component has an integer value between 0 and 255.
• K: the number of desired clusters. Too high a value of K may result in an empty-cluster error; in that case, reduce it.
Output
• class: cluster assignment of each data point in pixels. The assignment should be 1, 2, 3, etc. For
K = 5, for example, each cell of class should be either 1, 2, 3, 4, or 5. The output should be a column
vector with size(pixels, 1) elements.
• centroid: location of K centroids (or representatives) in your result. With images, each centroid
corresponds to the representative color of each cluster. The output should be a matrix with K rows
and 3 columns. The range of values should be [0, 255], possibly floating point numbers.
Hand-in
Both your code and your report will be evaluated. Upload the mykmeans.m and mykmedoids.m files with your implementation. In your report, answer the following questions:
1. Within the K-medoids framework, you have several choices for the detailed implementation. Explain how you designed and implemented the details of your K-medoids algorithm, including (but not limited to) how you chose the representative of each cluster, which distance measures you tried and which one you chose, and when you stopped iterating.
2. Attach a picture of your own. We recommend a size of 320 × 240 or smaller.
3. Run your K-medoids implementation with the picture you chose above, with several different values of K (e.g., small values like 2 or 3, and large values like 16 or 32). What did you observe with different K? How long does it take to converge for each K?
4. Run your K-medoids implementation with different initial centroids/representatives. Does this affect the final result? Do you see the same or different results across trials with different initial assignments? (We usually randomize the initial locations of the centroids. To answer this question, an intentionally poor assignment may be useful.)
5. Repeat questions 2 and 3 with K-means. Do you see a significant difference between K-medoids and K-means in terms of output quality, robustness, or running time?
Note
• You may see an error message about empty clusters, even with the Matlab implementation, when you use too large a K. Your implementation should handle this exception as well: do not terminate when a cluster becomes empty; instead, use a smaller number of clusters in that case.
• We will grade using test pictures that are not provided. We recommend testing your code with several different pictures so that you can catch problems that might happen occasionally.
• If we detect copying from any other student's code or from the web, you will not be eligible for any credit for the entire homework, not just the programming part. Also, directly calling the Matlab function kmeans or other clustering functions is not allowed.

1 EM for Mixture of Gaussians
A mixture of K Gaussians is represented as
$$p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad (1)$$
where $\pi_k$ represents the probability that a data point belongs to the k-th component. As it is a probability, it satisfies $0 \le \pi_k \le 1$ and $\sum_k \pi_k = 1$. In this problem, we are going to represent this in a slightly different manner with explicit latent variables. Specifically, we introduce a 1-of-K coding representation for latent variables $z^{(k)} \in \mathbb{R}^K$ for $k = 1, \ldots, K$. Each $z^{(k)}$ is a binary vector of size K, with 1 only in the k-th element and 0 in all others. That is,
$$z^{(1)} = [1; 0; \ldots; 0], \quad z^{(2)} = [0; 1; \ldots; 0], \quad \ldots, \quad z^{(K)} = [0; 0; \ldots; 1].$$
For example, if the second component generated data point $x^n$, its latent variable $z^n$ is given by $[0; 1; \ldots; 0] = z^{(2)}$. With this representation, we can express $p(z)$ as
$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k},$$
where $z_k$ indicates the k-th element of the vector $z$. Also, $p(x|z)$ can be represented similarly as
$$p(x|z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}.$$
By the sum rule of probability, (1) can be represented by
$$p(x) = \sum_{z \in Z} p(z)\, p(x|z), \qquad (2)$$
where $Z = \{z^{(1)}, z^{(2)}, \ldots, z^{(K)}\}$.
(a) Show that (2) is equivalent to (1). [5 pts]
(b) In reality, we do not know which component each data point came from. Thus, we estimate the responsibility (the expectation of $z_k^n$) in the E-step of EM. Since $z_k^n$ is either 1 or 0, its expectation is the probability that the point $x^n$ belongs to the component $z_k$. In other words, we estimate $p(z_k^n \mid x^n)$. Derive the formula for this estimation using Bayes' rule. Note that, in the E-step, we assume all other parameters, i.e. $\pi_k$, $\mu_k$, and $\Sigma_k$, are fixed, and we want to express $p(z_k^n \mid x^n)$ as a function of these fixed parameters. [10 pts]
(c) In the M-step, we re-estimate the parameters $\pi_k$, $\mu_k$, and $\Sigma_k$ by maximizing the log-likelihood. Given N i.i.d. (independent and identically distributed) data samples, derive the update formula for each parameter. Note that in order to obtain an update rule for the M-step, we fix the responsibilities, i.e. $p(z_k^n \mid x^n)$, which we have already calculated in the E-step. [15 pts]
Hint: Use a Lagrange multiplier for $\pi_k$ to apply the constraint on it.
(d) EM and K-Means [10 pts]
K-means can be viewed as a particular limit of EM for Gaussian mixtures. Consider a mixture model in which all components have covariance $\epsilon I$; show that, in the limit $\epsilon \to 0$, maximizing the expected complete-data log-likelihood for this model is equivalent to minimizing the K-means objective function
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \|x^n - \mu_k\|^2,$$
where $\gamma_{nk} = 1$ if $x^n$ belongs to the k-th cluster and $\gamma_{nk} = 0$ otherwise.
2 Density Estimation
Consider a histogram-like density model in which the space of x is divided into fixed regions, the density p(x) takes a constant value $h_i$ over the i-th region, and the volume of region i is denoted by $\Delta_i$. Suppose we have a set of N observations of x such that $n_i$ of these observations fall in region i.
(a) What is the log-likelihood function? [8 pts]
(b) Derive an expression for the maximum likelihood estimator for hi. [10 pts]
Hint: This is a constrained optimization problem. Remember that p(x) must integrate to unity. Since p(x) has the constant value $h_i$ over region i, which has volume $\Delta_i$, the normalization constraint is $\sum_i h_i \Delta_i = 1$. Use a Lagrange multiplier by adding $\lambda \left( \sum_i h_i \Delta_i - 1 \right)$ to your objective function.
(c) Mark T if it is always true, and F otherwise. Briefly explain why. [12 pts]
• Non-parametric density estimation usually does not have parameters.
• The Epanechnikov kernel is the optimal kernel function for all data.
• Histogram is an efficient way to estimate density for high-dimensional data.
• Parametric density estimation assumes the shape of probability density.
3 Information Theory
In the lecture you became familiar with the concept of entropy for one random variable and mutual information. For a pair of discrete random variables X and Y with the joint distribution p(x, y), the joint entropy
H(X, Y ) is defined as
$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y), \qquad (3)$$
which can also be expressed as
$$H(X, Y) = -\mathbb{E}[\log p(X, Y)]. \qquad (4)$$
Let X and Y take on values x1, x2, …, xr and y1, y2, …, ys respectively. Let Z also be a discrete random
variable and Z = X + Y .
(a) Prove that H(X, Y ) ≤ H(X) + H(Y ) [4 pts]
(b) Show that I(X; Y ) = H(X) + H(Y ) − H(X, Y ). [2 pts]
(c) Under what conditions does H(Z) = H(X) + H(Y)? [4 pts]
4 Programming: Text Clustering
In this problem, we will explore the use of the EM algorithm for text clustering. Text clustering is a technique for unsupervised document organization and information retrieval. We want to find out how to group a set of different text documents based on their topics. First, we will analyze a model to represent the data.
Bag of Words
The simplest model for text documents is to understand them as a collection of words. To keep the model simple, we keep the collection unordered, disregarding grammar and word order. What we do is count how often each word appears in each document and store the word counts in a matrix, where each row of the matrix represents one document and each column represents a specific word from the document dictionary. Suppose we represent the set of $n_d$ documents using a matrix of word counts like this:
$$D_{1:n_d} = \begin{pmatrix} 2 & 6 & \cdots & 4 \\ 2 & 4 & \cdots & 0 \\ \vdots & & & \vdots \end{pmatrix} = T$$
This means that word $W_1$ occurs twice in document $D_1$, while word $W_{n_w}$ occurs 4 times in document $D_1$ and not at all in document $D_2$.
Multinomial Distribution
The simplest distribution for representing a text document is the multinomial distribution (Bishop, Chapter 2.2). The probability of a document $D_i$ is
$$p(D_i) = \prod_{j=1}^{n_w} \mu_j^{T_{ij}}.$$
Here, $\mu_j$ denotes the probability of a particular word in the text being equal to $w_j$, and $T_{ij}$ is the count of that word in document i. So the probability of document $D_1$ would be $p(D_1) = \mu_1^2 \cdot \mu_2^6 \cdots \mu_{n_w}^4$.
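As a quick numerical illustration (a toy example with made-up numbers, not part of the assignment data), this probability is just a product of per-word probabilities raised to the observed counts, which is easiest to evaluate in log space:

```matlab
% Toy example: one document, four vocabulary words (hypothetical values).
T1 = [2 6 0 4];               % word counts for document D1
mu = [0.3 0.4 0.1 0.2];       % word probabilities, must sum to 1
pD1 = prod(mu .^ T1);         % p(D1) = mu1^2 * mu2^6 * mu3^0 * mu4^4
logpD1 = sum(T1 .* log(mu));  % log p(D1), numerically safer for long documents
```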
Mixture of Multinomial Distributions
In order to do text clustering, we want to use a mixture of multinomial distributions, so that each topic has
a particular multinomial distribution associated with it, and each document is a mixture of different topics.
We define $p(c) = \pi_c$ as the mixture coefficient of a document containing topic c, and we model each topic by a multinomial distribution $p(D_i \mid c)$ with parameters $\mu_{jc}$. Then we can write each document as a mixture over topics as
$$p(D_i) = \sum_{c=1}^{n_c} p(D_i \mid c)\, p(c) = \sum_{c=1}^{n_c} \pi_c \prod_{j=1}^{n_w} \mu_{jc}^{T_{ij}}.$$
EM for Mixture of Multinomials
In order to cluster a set of documents, we need to fit this mixture model to the data. In this problem, the EM algorithm can be used to fit the mixture model, which gives a simple topic model for documents: each topic is a multinomial distribution over words (a mixture component). The EM algorithm for such a topic model consists of iterating the following steps:
1. Expectation
Compute the expectation of document $D_i$ belonging to cluster c:
$$\gamma_{ic} = \frac{\pi_c \prod_{j=1}^{n_w} \mu_{jc}^{T_{ij}}}{\sum_{c'=1}^{n_c} \pi_{c'} \prod_{j=1}^{n_w} \mu_{jc'}^{T_{ij}}}$$
2. Maximization
Update the mixture parameters, i.e. the probability of a word being $W_j$ in cluster (topic) c, as well as the prior probability of each cluster:
$$\mu_{jc} = \frac{\sum_{i=1}^{n_d} \gamma_{ic} T_{ij}}{\sum_{i=1}^{n_d} \sum_{l=1}^{n_w} \gamma_{ic} T_{il}}, \qquad \pi_c = \frac{1}{n_d} \sum_{i=1}^{n_d} \gamma_{ic}$$
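Translated into MATLAB, one EM iteration is only a few matrix operations. The sketch below is not the provided skeleton; the names T (the document-by-word count matrix) and nc (the number of clusters) are assumed inputs, a small smoothing constant guards against log 0, and implicit expansion requires MATLAB R2016b or later.

```matlab
% T: nd-by-nw count matrix, nc: number of clusters (assumed inputs)
[nd, nw] = size(T);
mu  = rand(nw, nc); mu = mu ./ sum(mu, 1);   % random init, columns sum to 1
pic = ones(1, nc) / nc;                      % uniform cluster priors

for iter = 1:200
    % E-step: log(pi_c) + sum_j T_ij log(mu_jc), per document and cluster
    logp = T * log(mu) + log(pic);           % nd-by-nc
    logp = logp - max(logp, [], 2);          % stabilize before exponentiating
    gamma = exp(logp);
    gamma = gamma ./ sum(gamma, 2);          % responsibilities gamma_ic

    % M-step: update word probabilities and cluster priors
    mu = T' * gamma + 1e-10;                 % numerator sum_i gamma_ic T_ij (smoothed)
    mu = mu ./ sum(mu, 1);                   % normalize over words
    pic = mean(gamma, 1);
end
[~, cluster] = max(gamma, [], 2);            % hard assignment of each document
```

A fixed iteration count is used here for brevity; in practice you would also stop when the log-likelihood or the responsibilities stop changing.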
Task [20 pts]
Implement the algorithm and run it on the toy dataset data.mat. You can find a detailed description of the data in the homework2.m file. Observe the results and compare them with the provided true cluster of each document. Report an evaluation (e.g., accuracy) of your implementation.
Hint: We already did the word counting for you, so the data file only contains a count matrix like the one shown above. For the toy dataset, set the number of clusters to $n_c = 4$. You will need to initialize the parameters. Try several different random initial values for the probability of a word being $W_j$ in topic c, $\mu_{jc}$, and make sure you normalize it. Also make sure that you do not use the true cluster information during your learning phase.
Extra Credit: Realistic Topic Models [20pts]
The above model assumes that all the words in a document belong to a single topic at the same time. However, in real-world datasets, it is more likely that some words in a document belong to one topic while other words belong to other topics. For example, in a news report, some words may talk about "Ebola" and "health", while others may mention "administration" and "congress". In order to model this phenomenon, we should model each word as a mixture of possible topics.
Specifically, consider the log-likelihood of the joint distribution of documents and words
$$L = \sum_{d \in D} \sum_{w \in W} T_{dw} \log P(d, w), \qquad (5)$$
where $T_{dw}$ is the count of word w in document d. This count matrix is provided as input.
The joint distribution of a specific document and a specific word is modeled as a mixture
$$P(d, w) = \sum_{z \in Z} P(z) P(w \mid z) P(d \mid z), \qquad (6)$$
where P(z) is the mixture proportion, P(w|z) is the distribution over the vocabulary for the z-th topic, and P(d|z) is the probability of the document under the z-th topic. These are the parameters of the model.
The E-step calculates the posterior distribution of the latent variable conditioned on all other variables:
$$P(z \mid d, w) = \frac{P(z) P(w \mid z) P(d \mid z)}{\sum_{z'} P(z') P(w \mid z') P(d \mid z')}. \qquad (7)$$
In the M-step, we maximize the expected complete log-likelihood with respect to the parameters and get the following update rules:
$$P(w \mid z) = \frac{\sum_{d} T_{dw} P(z \mid d, w)}{\sum_{w'} \sum_{d} T_{dw'} P(z \mid d, w')} \qquad (8)$$
$$P(d \mid z) = \frac{\sum_{w} T_{dw} P(z \mid d, w)}{\sum_{d'} \sum_{w} T_{d'w} P(z \mid d', w)} \qquad (9)$$
$$P(z) = \frac{\sum_{d} \sum_{w} T_{dw} P(z \mid d, w)}{\sum_{z'} \sum_{d'} \sum_{w'} T_{d'w'} P(z' \mid d', w')} \qquad (10)$$
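Updates (7)-(10) can be implemented without ever forming the full three-way posterior by looping over topics. The sketch below is an illustration, not the required implementation; Tdw (the document-by-word count matrix) and nz (the number of topics) are assumed inputs, and implicit expansion requires MATLAB R2016b or later.

```matlab
% Tdw: nd-by-nw count matrix, nz: number of topics (assumed inputs)
[nd, nw] = size(Tdw);
Pw_z = rand(nw, nz); Pw_z = Pw_z ./ sum(Pw_z, 1);   % P(w|z), columns sum to 1
Pd_z = rand(nd, nz); Pd_z = Pd_z ./ sum(Pd_z, 1);   % P(d|z), columns sum to 1
Pz   = ones(1, nz) / nz;                            % P(z)

for iter = 1:100
    Pw_z_new = zeros(nw, nz); Pd_z_new = zeros(nd, nz); Pz_new = zeros(1, nz);
    denom = (Pd_z .* Pz) * Pw_z' + eps;             % sum_z P(z)P(d|z)P(w|z), nd-by-nw
    for z = 1:nz
        Pz_dw = Pz(z) * (Pd_z(:, z) * Pw_z(:, z)') ./ denom;  % E-step (7), nd-by-nw
        W = Tdw .* Pz_dw;                           % weighted counts T_dw P(z|d,w)
        Pw_z_new(:, z) = sum(W, 1)';                % numerator of (8)
        Pd_z_new(:, z) = sum(W, 2);                 % numerator of (9)
        Pz_new(z) = sum(W(:));                      % numerator of (10)
    end
    Pw_z = Pw_z_new ./ sum(Pw_z_new, 1);            % normalize (8)
    Pd_z = Pd_z_new ./ sum(Pd_z_new, 1);            % normalize (9)
    Pz   = Pz_new / sum(Pz_new);                    % normalize (10)
end
```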
Task
Implement EM for maximum likelihood estimation and cluster the text data provided in the nips.mat file you downloaded. You can print out the top keywords for the topics/clusters by using the show_topics.m utility. It takes two parameters: 1) your learned conditional distribution matrix, i.e., P(w|z), and 2) a cell array of words that corresponds to the vocabulary. You can find the cell array wl in the nips.mat file. Try different values of k and see which values produce sensible topics. In assessing your code, we will use another dataset and observe the produced topics.

1 Linear Regression [30 pts]
In class, we derived a closed-form solution (the normal equation) for the linear regression problem: $\hat{\theta} = (X^T X)^{-1} X^T Y$. A probabilistic interpretation of linear regression tells us that we are relying on an assumption that each data point is actually sampled from a linear hyperplane, with some noise. The noise follows a zero-mean Gaussian distribution with constant variance. Specifically,
$$Y^i = \theta^T X^i + \epsilon^i, \qquad (1)$$
where $\epsilon^i \sim \mathcal{N}(0, \sigma^2 I)$, $\theta \in \mathbb{R}^d$, and $\{X^i, Y^i\}$ is the i-th data point. In other words, we are assuming that every data point is independent of the others and that every data point has the same variance.
(a) Using the normal equation and the model (Eqn. 1), derive the expectation $\mathbb{E}[\hat{\theta}]$. Note that here X is fixed and only Y is random, i.e. a "fixed design" as in statistics. [6 pts]
(b) Similarly, derive the variance $\mathrm{Var}[\hat{\theta}]$. [6 pts]
(c) Under the white noise assumption above, someone claims that $\hat{\theta}$ follows a Gaussian distribution with the mean and variance derived in (a) and (b), respectively. Do you agree with this claim? Why or why not? [8 pts]
(d) Weighted linear regression
Suppose we keep the independence assumption but remove the equal-variance assumption. In other words, data points are still sampled independently, but now they may have different variances $\sigma_i^2$. Thus, the covariance matrix of Y would still be diagonal, but with different values:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}. \qquad (2)$$
Derive the estimator $\hat{\theta}$ (similar to the normal equation) for this problem using matrix-vector notation with $\Sigma$. [10 pts]
2 Ridge Regression [15 pts]
For linear regression, it is often assumed that $y = \theta^\top x + \epsilon$, where $\theta, x \in \mathbb{R}^m$ (by absorbing the constant term) and $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is a Gaussian random variable. Given n i.i.d. samples $(x^1, y^1), \ldots, (x^n, y^n)$, we define $y = (y^1, \ldots, y^n)^\top$ and $X = (x^1, \ldots, x^n)^\top$. Thus, we have $y \sim \mathcal{N}(X\theta, \sigma^2 I)$. Show that the ridge regression estimate is the mean of the posterior distribution under a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$. Find the explicit relation between the regularization parameter $\lambda$ in the ridge regression estimate of the parameter $\theta$ and the variances $\sigma^2$, $\tau^2$.
3 Bayes Classifier
3.1 Bayes Classifier With General Loss Function
In class, we talked about the popular 0-1 loss function, in which L(a, b) = 1 for a ≠ b and 0 otherwise, which means all wrong predictions cause equal loss. Yet in many other cases, including cancer detection, an asymmetric loss is often preferred (misdiagnosing cancer as no-cancer is much worse). In this problem, we assume we have such an asymmetric loss function, where L(a, a) = L(b, b) = 0 and L(a, b) = p, L(b, a) = q, p ≠ q. Write down the Bayes classifier f : X → Y for binary classification Y ∈ {−1, +1}. Simplify the classification rule as much as you can. [20 pts]
3.2 Gaussian Class Conditional distribution
(a) Suppose the class conditional distribution is a Gaussian. Based on the general loss function in problem
3.1, write the Bayes classifier as f(X) = sign(h(X)) and simplify h as much as possible. What is the
geometric shape of the decision boundary? [10 pts]
(b) Repeat (a) but assume the two Gaussians have identical covariance matrices. What is the geometric
shape of the decision boundary? [10 pts]
(c) Repeat (a), but now assume that the two Gaussians have covariance matrices equal to the identity matrix. What is the geometric shape of the decision boundary? [10 pts]
4 Logistic Regression
Logistic regression is named after the log-odds of success (the logit of the probability), defined as
$$\ln\!\left(\frac{P[Y = 1 \mid X = x]}{P[Y = 0 \mid X = x]}\right),$$
where
$$P[Y = 1 \mid X = x] = \frac{\exp(w_0 + w^T x)}{1 + \exp(w_0 + w^T x)}.$$
(a) Show that log-odds of success is a linear function of X. [6 pts]
(b) Show that the logistic loss L(z) = log (1 + exp(−z)) is a convex function. [9 pts]
5 Programming: Recommendation System [40 pts]
Personalized recommendation systems are used in a wide variety of applications such as electronic commerce,
social networks, web search, and more. Machine learning techniques play a key role in extracting individual preferences over items. In this assignment, we explore this popular business application of machine learning by implementing a simple matrix-factorization-based recommender using gradient descent.
Suppose you are an employee at Netflix. You are given a set of ratings (from one star to five stars) from users on many movies they have seen. Using this information, your job is to implement a personalized rating predictor for a given user on unseen movies. That is, the rating predictor can be seen as a function f : U × I → R, where U and I are the sets of users and items, respectively. Typically the range of this function is restricted to between 1 and 5 (stars), which is the allowed range of the input.
Now, let's think about the data representation. Suppose we have m users and n items, and ratings given by users on movies. We can represent this information in the form of a matrix, namely the rating matrix M. Suppose the rows of M represent users, while the columns represent movies. Then the size of the matrix will be m × n. Each cell of the matrix may contain a rating of a movie by a user. The cell M15,47, for example, may contain a rating of item 47 by user 15; if that user gave 4 stars, M15,47 = 4. However, as it is almost impossible for everyone to watch a large portion of the movies on the market, this rating matrix is very sparse in nature. Typically, only about 1% of the cells in the rating matrix are observed on average. All other 99% are missing values, which means the corresponding user did not see (or just did not provide a rating for) the corresponding movie. Our goal with the rating predictor is to estimate those missing values, reflecting the user's preference learned from the available ratings.
Our approach to this problem is matrix factorization. Specifically, we assume that the rating matrix M is a low-rank matrix. Intuitively, this reflects our assumption that only a small number of factors (e.g., genre, director, main actor/actress, release year, etc.) determine like or dislike. Let's define r as the number of factors. Then, we learn a user profile $U \in \mathbb{R}^{m \times r}$ and an item profile $V \in \mathbb{R}^{n \times r}$. (Recall that m and n are the numbers of users and items, respectively.) We want to approximate a rating by the inner product of two length-r vectors, one representing the user profile and the other the item profile. Mathematically, the rating by user u on movie i is approximated by
$$M_{u,i} \approx \sum_{k=1}^{r} U_{u,k} V_{i,k}. \qquad (3)$$
We want to fit each element of U and V by minimizing the squared reconstruction error over all training data points. That is, the objective function we minimize is given by
$$E(U, V) = \sum_{(u,i) \in M} (M_{u,i} - U_u^T V_i)^2 = \sum_{(u,i) \in M} \Big(M_{u,i} - \sum_{k=1}^{r} U_{u,k} V_{i,k}\Big)^2, \qquad (4)$$
where $U_u$ is the u-th row of U and $V_i$ is the i-th row of V. We observe that this looks very similar to linear regression. Recall what we minimize in linear regression:
$$E(\theta) = \sum_{i=1}^{m} (Y^i - \theta^T x^i)^2 = \sum_{i=1}^{m} \Big(Y^i - \sum_{k=1}^{r} \theta_k x_k^i\Big)^2, \qquad (5)$$
where m is the number of training data points. Let's compare (4) and (5). $M_{u,i}$ in (4) corresponds to $Y^i$ in (5), in that both are the observed labels. $U_u^T V_i$ in (4) corresponds to $\theta^T x^i$ in (5), in that both are our model's estimate. The only difference is that both U and V are parameters to be learned in (4), while only θ is learned in (5). This is where we personalize our estimation: with linear regression, we apply the same θ to any input $x^i$, but with matrix factorization, a different profile $U_u$ is applied depending on who the user u is.
As U and V are interrelated in (4), there is no closed-form solution, unlike in the linear regression case. Thus, we need to use gradient descent:
$$U_{v,k} \leftarrow U_{v,k} - \mu \frac{\partial E(U, V)}{\partial U_{v,k}}, \qquad V_{j,k} \leftarrow V_{j,k} - \mu \frac{\partial E(U, V)}{\partial V_{j,k}}, \qquad (6)$$
where µ is a hyper-parameter controlling the update rate. It is straightforward to take the partial derivatives of E(U, V) in (4) with respect to each element $U_{v,k}$ and $V_{j,k}$, and then to update each element of U and V using the gradient descent formula in (6).
(a) Derive the update formula in (6) by solving the partial derivatives. [10 pts]
(b) To avoid overfitting, we usually add regularization terms, which penalize large values in U and V. Redo part (a) using the regularized objective function below. [5 pts]
$$E(U, V) = \sum_{(u,i) \in M} \Big(M_{u,i} - \sum_{k=1}^{r} U_{u,k} V_{i,k}\Big)^2 + \lambda \sum_{u,k} U_{u,k}^2 + \lambda \sum_{i,k} V_{i,k}^2$$
(λ is a hyper-parameter controlling the degree of penalization.)
(c) Implement myRecommender.m by filling in the gradient descent part.
You are given skeleton code myRecommender.m. Using the training data rateMatrix, you will implement your own recommendation system of rank lowRank. The only file you need to edit and submit is myRecommender.m. In the gradient descent part, repeatedly apply your update formula from (b), observing the average reconstruction error between your estimate and the ground truth on the training set. You need to set a stopping criterion based on this reconstruction error as well as on a maximum number of iterations. You should play with several different values of µ and λ to make sure that your final prediction is accurate.
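For orientation, here is a minimal sketch of such a training loop. This is not the provided skeleton; the function name, the random initialization, and the values of the learning rate, λ, iteration limit, and tolerance are placeholder assumptions to be tuned.

```matlab
function [U, V] = myRecommenderSketch(rateMatrix, lowRank)
% Minimal matrix-factorization sketch with L2 regularization.
    mu = 5e-4; lambda = 0.01; maxIter = 300; tol = 1e-4;   % placeholder values
    [m, n] = size(rateMatrix);
    U = rand(m, lowRank) / lowRank;
    V = rand(n, lowRank) / lowRank;
    mask = (rateMatrix > 0);                  % only observed ratings contribute
    prevErr = Inf;
    for iter = 1:maxIter
        R = (U * V' - rateMatrix) .* mask;    % residuals on observed cells
        gradU = 2 * (R * V)  + 2 * lambda * U;
        gradV = 2 * (R' * U) + 2 * lambda * V;
        U = U - mu * gradU;
        V = V - mu * gradV;
        err = sqrt(sum(R(:).^2) / nnz(mask)); % training RMSE at this iteration
        if abs(prevErr - err) < tol, break; end
        prevErr = err;
    end
end
```

The two gradient lines correspond to the partial derivatives of the regularized objective, written in matrix form so that all observed entries are updated at once.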
Formatting information is here:
Input
• rateMatrix: training data set. Each row represents a user, while each column an item. Observed
values are one of {1, 2, 3, 4, 5}, and missing values are 0.
• lowRank: the number of factors (dimension) of your model. With higher values, you would expect
more accurate prediction.
Output
• U: the user profile matrix of dimension user count × low rank.
• V: the item profile matrix of dimension item count × low rank.
Evaluation [15 pts]
Upload your myRecommender.m implementation file. (Do not copy and paste your code in your report. Be
sure to upload your myRecommender.m file.)
To test your code, try running homework3.m. You may have noticed that the code prints both the training and test error in RMSE (Root Mean Squared Error), defined as follows:
$$\sqrt{\frac{1}{|M|} \sum_{(u,i) \in M} (M_{u,i} - f(u, i))^2},$$
where f(u, i) is your estimate and the summation is over the training set or the test set, respectively. For the grading, we will use another set-aside test set, which is not released to you. If you observe that your test error is less than 1.00 without cheating (that is, without training on the test set), you may expect to see similar performance on the unseen test set as well.
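For reference, a sketch of computing this metric over the observed entries of a rating matrix (variable names are illustrative):

```matlab
mask = (rateMatrix > 0);                     % observed entries only
pred = U * V';                               % f(u, i) for every cell
sqErr = ((pred - rateMatrix) .* mask).^2;
rmse = sqrt(sum(sqErr(:)) / nnz(mask));      % root mean squared error
```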
Note that we provide homework3.m just to help you evaluate your code easily. You are not expected to alter or submit it to us. In other words, we will not use this file when we grade your submission. The only file we care about is myRecommender.m.
Grading criteria:
• Your code should output U and V as specified. The dimensions should match the specification.
• We will test your output on another test dataset, which was not provided to you. The test RMSE on this dataset should be at most 1.05 to get at least partial credit.
• We will measure the elapsed time for learning. If your implementation takes longer than 3 minutes for rank 5, you should definitely try to make your code faster or adjust the parameters. Any code running for more than 5 minutes is not eligible for credit.
• Your code should not crash. Any crashing code will not be credited.
Report [10 pts]
In your report, show the performance (RMSE) both on your training set and test set, with varied lowRank.
(The default is set to 1, 3, and 5, but you may want to vary it further.) Discuss what you observe with
varied low rank. Also, briefly discuss how you decided your hyper-parameters (µ, λ).
Note
• Do not print anything in your code (e.g., iteration 1 : err=2.4382) in your final submission.
• Do not alter the input and output format of the skeleton file (e.g., adding a new parameter without specifying its default value). Please make sure that you return all necessary outputs according to the given skeleton.
• Please do not use additional files. This task is simple enough that you can fit it in just one file.
• Submit your code with the best parameters you found. We will grade without modifying your code. (Applying cross-validation to find the best parameters is fine, though you are not required to do so.)
• Please be sure that your program finishes within a fixed number of iterations. Always think of the case where your stopping criterion is never satisfied. This can happen at any time depending on the data, not because your code is incorrect. For this, we recommend setting a maximum number of iterations in addition to other stopping criteria.
Grand Prize
Similar to the Netflix competition held from 2006 to 2009, the student who achieves the lowest RMSE on the secret test set will earn the Grand Prize. We will give an extra 10 bonus points to the winner and share the student's code with everyone. (Note that the winner should satisfy all other grading criteria perfectly, including the answer sanity check and the timing requirement. Otherwise, the next student will be considered the winner.)
Typing Bonus
As usual, we will give 5 bonus points for typed submissions. Note that all questions should be typed to get
this credit.

1 Kernels [20 points]
(a) Identify which of the following are valid kernels. If it is a kernel, please write your answer explicitly as 'True' and give a mathematical proof. If it is not a kernel, please write your answer explicitly as 'False' and give an explanation. [8 pts]
Suppose K1 and K2 are valid kernels (symmetric and positive definite) defined on Rm × Rm.
1. K(u, v) = αK1(u, v) + βK2(u, v), α, β ∈ R.
2. K(u, v) = K1(f(u), f(v)), where f : Rm → Rm.
3.
$$K(u, v) = \begin{cases} 1 & \text{if } \|u - v\|_2 \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
4. Suppose $K_0$ is a valid kernel.
$$K(u, v) = \frac{K_0(u, v)}{\sqrt{K_0(u, u)\, K_0(v, v)}}. \qquad (2)$$
(b) Write down the kernelized version of Fisher's Linear Discriminant Analysis using the kernel trick. Please provide the full steps and all details of the method. [Hint: Use kernels to replace inner products.] [12 pts]
2 Markov Random Field, Conditional Random Field [20 pts]
[a-b] A probability distribution on 3 discrete variables a, b, c is defined by $P(a, b, c) = \frac{1}{Z}\psi(a, b, c) = \frac{1}{Z}\phi_1(a, b)\,\phi_2(b, c)$, where the tables for the two factors are given below.
a b φ1(a, b)
0 0 4
0 1 3
1 0 3
1 1 1
b c φ2(b, c)
0 0 3
0 1 2
0 2 1
1 0 4
1 1 1
1 2 3
(a) Compute the slice of the joint factor ψ(a, b, c) corresponding to b = 1. This is the table
ψ(a, b = 1, c). [5 pts]
(b) Compute P(a = 1, b = 1). [5 pts]
(c) Explain the difference between Conditional Random Fields and Hidden Markov Models
with respect to the following factors. Please give only a one-line explanation. [10 pts]
• Type of model – generative/discriminative
• Objective function optimized
• Require a normalization constant
3 Hidden Markov Model [50 pts]
This problem will let you get familiar with HMM algorithms by doing the calculations by hand.
[a-c] There are three coins (1, 2, 3); we throw them randomly and record the result. The states are S = {1, 2, 3} and the observations are V = {H, T} (head or tail); A, B, and π are given as
A:
1 2 3
1 0.9 0.05 0.05
2 0.45 0.1 0.45
3 0.45 0.45 0.1
B:
1 2 3
H 0.5 0.75 0.25
T 0.5 0.25 0.75
π: 1/3 1/3 1/3
(a) Given the model above, what is the probability of the observation sequence O = H, T, H? [10 pts]
(b) Describe how to estimate A, B, and π when they are unknown. [10 pts]
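For part (a), a minimal MATLAB sketch of the forward algorithm that evaluates P(O) under this model is shown below (it assumes that the rows of A index the current state and the columns the next state, and that H and T are encoded as 1 and 2):

```matlab
% Forward algorithm for P(O = H, T, H) under the model above.
A   = [0.9 0.05 0.05; 0.45 0.1 0.45; 0.45 0.45 0.1];  % A(i,j) = P(s_{t+1}=j | s_t=i)
B   = [0.5 0.75 0.25; 0.5 0.25 0.75];                 % B(v,i) = P(o=v | s=i); rows: H, T
pi0 = [1/3 1/3 1/3];
obs = [1 2 1];                                        % H, T, H encoded as 1, 2, 1

alpha = pi0 .* B(obs(1), :);                          % initialization
for t = 2:numel(obs)
    alpha = (alpha * A) .* B(obs(t), :);              % forward recursion
end
probO = sum(alpha);                                   % P(O)
```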
(c) In class, we studied discrete HMMs with discrete hidden states and observations. The following problem considers a continuous-density HMM, which has discrete hidden states but continuous observations. Let $S_t \in \{1, 2, \ldots, n\}$ denote the hidden state of the HMM at time t, and let $X_t \in \mathbb{R}$ denote the real-valued scalar observation of the HMM at time t. In a continuous-density HMM, the emission probability must be parameterized, since the random variable $X_t$ is no longer discrete. It is defined as $P(X_t = x \mid S_t = i) = \mathcal{N}(\mu_i, \sigma_i^2)$. Given m sequences of observations (each of length T), derive the EM algorithm for an HMM with a Gaussian observation model. [14 pts]
(d) For each of the following sentences, say whether it is true or false and provide a short
explanation (one sentence or so). [16 pts]
• The weights of all incoming edges to a state of an HMM must sum to 1.
• An edge from state s to state t in an HMM denotes the conditional probability of going to state s given
that we are currently at state t.
• The "Markov" property of an HMM implies that we cannot use an HMM to model a process that depends on several time-steps in the past.
• The Baum-Welch algorithm is a type of an Expectation Maximization algorithm and as such it is
guaranteed to converge to the (globally) optimal solution.
4 Programming [30 pts]
In this problem, you will implement an algorithm to analyze the behavior of the SP500 index over a period of time. For each week, we measure the price movement relative to the previous week and denote it using a binary variable (+1 indicates up and −1 indicates down). The price movements from week 1 (the week of January 5) to week 39 (the week of September 28) are plotted below.
Consider a Hidden Markov Model in which $x_t$ denotes the economic state (good or bad) of week t and $y_t$ denotes the price movement (up or down) of the SP500 index. We assume that $x_{t+1} = x_t$ with probability 0.8, and that $P(y_t = +1 \mid x_t = \text{good}) = P(y_t = -1 \mid x_t = \text{bad}) = q$. In addition, assume that $P(x_1 = \text{bad}) = 0.8$. Load sp500.mat, implement the algorithm, briefly describe how you implemented it, and report the following:
(a) Assuming q = 0.7, plot $P(x_t = \text{good} \mid y)$ for t = 1, 2, …, 39. What is the probability that the economy is in a good state in week 39? [15 pts]
(b) Repeat (a) for q = 0.9, and compare the result to that of (a). Explain your comparison in
one or two sentences. [15 pts]
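A minimal MATLAB sketch of the forward-backward (smoothing) computation for this two-state chain is shown below. It assumes the weekly ±1 movements are loaded into a vector named price (the actual variable name inside sp500.mat may differ), rescales at every step to avoid underflow, and relies on implicit expansion (MATLAB R2016b or later):

```matlab
% price: 39 weekly +1/-1 movements (assumed contents of sp500.mat)
price = price(:);                        % ensure a column vector
q  = 0.7;                                % emission parameter; also try 0.9
A  = [0.8 0.2; 0.2 0.8];                 % states: 1 = good, 2 = bad
p1 = [0.2 0.8];                          % P(x1 = good) = 0.2, P(x1 = bad) = 0.8
T  = numel(price);

% Emission likelihoods P(y_t | x_t) for both states at every t
E = zeros(T, 2);
E(:, 1) = q * (price == 1) + (1 - q) * (price == -1);   % good state
E(:, 2) = q * (price == -1) + (1 - q) * (price == 1);   % bad state

alpha = zeros(T, 2); beta = ones(T, 2);
alpha(1, :) = p1 .* E(1, :); alpha(1, :) = alpha(1, :) / sum(alpha(1, :));
for t = 2:T                              % scaled forward pass
    a = (alpha(t-1, :) * A) .* E(t, :);
    alpha(t, :) = a / sum(a);
end
for t = T-1:-1:1                         % scaled backward pass
    b = (beta(t+1, :) .* E(t+1, :)) * A';
    beta(t, :) = b / sum(b);
end
post = alpha .* beta;
post = post ./ sum(post, 2);             % P(x_t | y_1..T); rows sum to 1
plot(1:T, post(:, 1));                   % probability of the good state per week
```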
