Recall that we say that a function $f$ is convex if the domain of $f$ is a convex set and
$$f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y), \quad \text{for all } x, y \text{ in the domain of } f,\ 0 \leq \lambda \leq 1,$$
and strictly convex if
$$f(\lambda x + (1-\lambda) y) < \lambda f(x) + (1-\lambda) f(y), \quad \text{for all } x \neq y \text{ in the domain of } f,\ 0 < \lambda < 1.$$
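For instance, with $f(x) = x^2$ the defining inequality can be checked directly:
$$\lambda f(x) + (1-\lambda) f(y) - f(\lambda x + (1-\lambda) y) = \lambda x^2 + (1-\lambda) y^2 - (\lambda x + (1-\lambda) y)^2 = \lambda (1-\lambda) (x-y)^2 \geq 0,$$
with strict inequality whenever $x \neq y$ and $0 < \lambda < 1$, so $f(x) = x^2$ is strictly convex.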
Prove the following assertions.
- The affine function $f(x) = ax + b$ is convex, where $a$, $b$, and $x$ are scalars.
- If multiple functions $f_n(x)$ are convex over a fixed domain, then their sum $g(x) = \sum_n f_n(x)$ is convex over the same domain.
- Take $f, g : \mathbb{R} \to \mathbb{R}$ to be convex functions and $g$ to be increasing. Then $g \circ f$ is also convex (a numerical sanity-check sketch follows this list).
Note: A function $g$ is increasing if $a \leq b \implies g(a) \leq g(b)$. An example of a convex and increasing function is $\exp(x),\ x \in \mathbb{R}$.
- If $f : \mathbb{R} \to \mathbb{R}$ is convex, then $g : \mathbb{R}^D \to \mathbb{R}$, where $g(\mathbf{x}) := f(\mathbf{w}^\top \mathbf{x} + b)$, is also convex. Here, $\mathbf{w}$ is a constant vector in $\mathbb{R}^D$, $b$ is a constant in $\mathbb{R}$, and $\mathbf{x} \in \mathbb{R}^D$.
- Let $f : \mathbb{R}^D \to \mathbb{R}$ be strictly convex, and let $\mathbf{x}^\star \in \mathbb{R}^D$ be a global minimizer of $f$. Show that this global minimizer is unique. Hint: use a proof by contradiction.
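The assertions above call for analytical proofs, but it can help to probe them numerically first. Below is a minimal NumPy sketch (the function name `check_convexity`, the sampling scheme, and the tolerance are our own choices, not part of the exercise) that randomly samples the convexity inequality; it can expose a violation but can never certify convexity:

```python
import numpy as np

def check_convexity(f, trials=10_000, seed=0, tol=1e-9):
    """Randomly probe f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y).

    A sanity check, not a proof: it can only ever find a counterexample,
    never certify convexity.
    """
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.normal(scale=3.0), rng.normal(scale=3.0)
        lam = rng.uniform()
        if f(lam * x + (1 - lam) * y) > lam * f(x) + (1 - lam) * f(y) + tol:
            return False  # convexity inequality violated
    return True

# Composition rule from the list: exp (convex, increasing) after an affine map.
f = lambda x: 2.0 * x + 1.0       # affine, hence convex
g = lambda x: np.exp(f(x))        # g = exp o f should be convex
print(check_convexity(g))         # expected: True
```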
1 Extension of Logistic Regression to Multi-Class Classification
Suppose we have a classification dataset with $N$ data example pairs $\{\mathbf{x}_n, y_n\}$, $n \in [1, N]$, where $y_n$ is a categorical variable over $K$ categories, $y_n \in \{1, 2, \ldots, K\}$. We wish to fit a linear model in a similar spirit to logistic regression, using the softmax function, instead of the logistic function, to link the linear inputs to the categorical output.
We will have $K$ sets of parameters $\mathbf{w}_k$, and define the probability distribution of the output as follows,
$$p(y_n = k \mid \mathbf{x}_n, \mathbf{w}) = \frac{\exp(\mathbf{w}_k^\top \mathbf{x}_n)}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^\top \mathbf{x}_n)}.$$
Note that we indeed have a probability distribution, as $\sum_{k=1}^{K} p(y_n = k \mid \mathbf{x}_n, \mathbf{w}) = 1$. To make the model identifiable, we will fix $\mathbf{w}_K$ to $\mathbf{0}$, which means we have $K-1$ sets of parameters to learn. As in logistic regression, we will assume that each $y_n$ is i.i.d., i.e.,
$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} p(y_n \mid \mathbf{x}_n, \mathbf{w}).$$
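A minimal NumPy sketch of this link function, with $\mathbf{w}_K$ fixed to $\mathbf{0}$ as stated above (the function name `softmax_probs` and the array layout are our own assumptions):

```python
import numpy as np

def softmax_probs(W, X):
    """Class probabilities p(y_n = k | x_n, w) for the model above.

    W : (K-1, D) array stacking the free parameters w_1, ..., w_{K-1};
        w_K is fixed to 0 for identifiability.
    X : (N, D) array of inputs x_n.
    Returns an (N, K) array whose rows sum to 1.
    """
    scores = np.hstack([X @ W.T, np.zeros((X.shape[0], 1))])  # w_K^T x_n = 0
    scores -= scores.max(axis=1, keepdims=True)               # stabilize exp
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)
```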
- Derive the log-likelihood for this model.
- Derive the gradient with respect to each $\mathbf{w}_k$ (a finite-difference sanity-check sketch follows this list).
- Show that the negative of the log-likelihood is convex with respect to each $\mathbf{w}_k$.
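Once you have derived the gradient analytically, a finite-difference check is a quick way to validate it. This sketch reuses the hypothetical `softmax_probs` above and assumes labels are encoded as integers $y_n \in \{0, \ldots, K-1\}$:

```python
import numpy as np

def neg_log_lik(W, X, y):
    """Negative log-likelihood of the softmax model (w_K fixed to 0)."""
    P = softmax_probs(W, X)                       # (N, K) class probabilities
    return -np.sum(np.log(P[np.arange(len(y)), y]))

def numerical_grad(W, X, y, eps=1e-6):
    """Central finite differences; compare against your derived gradient."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        G[idx] = (neg_log_lik(Wp, X, y) - neg_log_lik(Wm, X, y)) / (2 * eps)
    return G
```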
2 Mixture of Linear Regression
In Project-I, you worked on a regression dataset with two or more distinct clusters. For such datasets, a mixture of linear regression models is preferable to a single linear regression model.
Consider a regression dataset with $N$ pairs $\{y_n, \mathbf{x}_n\}$. Similar to the Gaussian mixture model (GMM), let $r_n \in \{1, 2, \ldots, K\}$ index the mixture component. The distribution of the output $y_n$ under the $k$th linear model is defined as follows:
$$p(y_n \mid \mathbf{x}_n, \mathbf{w}, r_n = k) = \mathcal{N}(y_n \mid \mathbf{w}_k^\top \mathbf{x}_n, \sigma^2).$$
Here, $\mathbf{w}_k$ is the regression parameter vector for the $k$th model, with $\mathbf{w}$ being a vector containing all $\mathbf{w}_k$, and $\mathbf{x}_n \in \mathbb{R}^D$.
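To make the generative story concrete, here is a small sampling sketch under the Gaussian form given above (the function name, the fixed noise level `sigma`, and the array layout are our own assumptions):

```python
import numpy as np

def sample_mixture(X, W, pi, sigma, seed=0):
    """Sample y_n by first drawing r_n ~ Categorical(pi), then
    y_n ~ N(w_{r_n}^T x_n, sigma^2).

    X : (N, D) inputs, W : (K, D) stacked w_k, pi : (K,) mixing weights.
    Returns (y, r) with y : (N,) outputs and r : (N,) latent assignments.
    """
    rng = np.random.default_rng(seed)
    r = rng.choice(len(pi), size=X.shape[0], p=pi)   # latent component per point
    means = np.einsum('nd,nd->n', X, W[r])           # w_{r_n}^T x_n for each n
    return means + sigma * rng.normal(size=X.shape[0]), r
```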
- Define $\mathbf{r}_n$ to be a binary vector of length $K$ such that all the entries are 0 except the $k$th entry, i.e., $r_{nk} = 1$, implying that $\mathbf{x}_n$ is assigned to the $k$th mixture component. Rewrite the likelihood $p(y_n \mid \mathbf{x}_n, \mathbf{w}, \mathbf{r}_n)$ in terms of $r_{nk}$.
- Write the expression for the joint distribution $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \mathbf{r})$, where $\mathbf{r}$ is the set of all $\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N$.
- Assume that $r_n$ follows a multinomial distribution $p(r_n = k \mid \boldsymbol{\pi}) = \pi_k$, with $\boldsymbol{\pi} = [\pi_1, \pi_2, \ldots, \pi_K]$. Derive the marginal distribution $p(y_n \mid \mathbf{x}_n, \mathbf{w}, \boldsymbol{\pi})$ obtained after marginalizing out $r_n$.
- Write the expression for the maximum-likelihood objective $\mathcal{L}(\mathbf{w}, \boldsymbol{\pi}) := \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \boldsymbol{\pi})$ in terms of the data $\mathbf{y}$ and $\mathbf{X}$ and the parameters $\mathbf{w}$ and $\boldsymbol{\pi}$ (an evaluation sketch follows this list).
- Is $\mathcal{L}$ jointly convex with respect to $\mathbf{w}$ and $\boldsymbol{\pi}$? Is the model identifiable? Prove your answers.
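For reference when checking your derivation, here is a numerically stable evaluation sketch. It presupposes the standard mixture form $p(y_n \mid \mathbf{x}_n, \mathbf{w}, \boldsymbol{\pi}) = \sum_k \pi_k \, \mathcal{N}(y_n \mid \mathbf{w}_k^\top \mathbf{x}_n, \sigma^2)$, which is what the marginalization in the third bullet should yield; the function name and fixed `sigma` are our own assumptions:

```python
import numpy as np
from scipy.special import logsumexp

def neg_marginal_log_lik(W, pi, sigma, X, y):
    """-log p(y | X, w, pi) with the latent r_n summed out:
    log p(y_n | x_n) = logsumexp_k [log pi_k + log N(y_n | w_k^T x_n, sigma^2)].
    """
    means = X @ W.T                                       # (N, K): w_k^T x_n
    log_gauss = (-0.5 * ((y[:, None] - means) / sigma) ** 2
                 - 0.5 * np.log(2 * np.pi * sigma ** 2))  # log N(y_n | ., sigma^2)
    return -np.sum(logsumexp(np.log(pi) + log_gauss, axis=1))
```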