Problem 1
In this problem, we will implement the EM algorithm for clustering. Start by importing the required packages and preparing the dataset.
import numpy as np
import matplotlib.pyplot as plt
from numpy import linalg as LA
from matplotlib.patches import Ellipse
from sklearn.datasets import make_blobs
from scipy.stats import multivariate_normal
K = 3
NUMDATAPTS = 150
X, y = make_blobs(n_samples=NUMDATAPTS, centers=K, shuffle=False, random_state=0, cluster_std=0.6)
g1 = np.asarray([[2.0, 0], [0.9, 1]])
g2 = np.asarray([[1.4, 0], [0.5, 0.7]])
mean1 = np.mean(X[:int(NUMDATAPTS/K)])
mean2 = np.mean(X[int(NUMDATAPTS/K):2*int(NUMDATAPTS/K)])
X[:int(NUMDATAPTS/K)] = np.einsum('nj,ij->ni',
    X[:int(NUMDATAPTS/K)] - mean1, g1) + mean1
X[int(NUMDATAPTS/K):2*int(NUMDATAPTS/K)] = np.einsum('nj,ij->ni',
    X[int(NUMDATAPTS/K):2*int(NUMDATAPTS/K)] - mean2, g2) + mean2
X[:, 1] -= 4
- Randomly initialize a numpy array mu of shape (K, 2) to represent the means of the clusters, and initialize an array cov of shape (K, 2, 2) such that cov[k] is the identity matrix for each k; cov will be used to represent the covariance matrices of the clusters. Finally, set the mixing coefficients pi to the uniform distribution at the start of the program (a minimal initialization sketch follows).
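One possible initialization; the names pi, mu, and cov follow the problem statement, and the random seed is an arbitrary choice:

rng = np.random.default_rng(0)
mu = rng.standard_normal((K, 2))        # random initial cluster means
cov = np.tile(np.eye(2), (K, 1, 1))     # cov[k] starts as the 2x2 identity
pi = np.full(K, 1.0 / K)                # uniform mixing coefficients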
- Write a function to perform the E-step:
def E_step():
    gamma = np.zeros((NUMDATAPTS, K))
    ...
    return gamma
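One possible way to fill in the E-step, assuming the globals pi, mu, and cov initialized above; gamma[n, k] is the responsibility of cluster k for data point n:

def E_step():
    gamma = np.zeros((NUMDATAPTS, K))
    for k in range(K):
        # weighted Gaussian density of each point under cluster k
        gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=cov[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize responsibilities per point
    return gamma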
- Write a function to perform the M-step:
def M_step(gamma):
    ...
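A sketch of one possible M-step under the same assumptions; it re-estimates pi, mu, and cov from the responsibilities:

def M_step(gamma):
    global pi, mu, cov
    Nk = gamma.sum(axis=0)                  # effective number of points per cluster
    pi = Nk / NUMDATAPTS                    # updated mixing coefficients
    mu = (gamma.T @ X) / Nk[:, None]        # responsibility-weighted means
    for k in range(K):
        diff = X - mu[k]
        cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]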
- Now write a loop that iterates through the E and M steps, and terminates after the change in log-likelihood is below some threshold. At each iteration, print out the log-likelihood, and use the following function to plot the progress of the algorithm (a possible main loop is sketched after the plotting function):
def plot_result(gamma=None):
    ax = plt.subplot(111, aspect='equal')
    ax.set_xlim([-5, 5])
    ax.set_ylim([-5, 5])
    ax.scatter(X[:, 0], X[:, 1], c=gamma, s=50, cmap=None)
    for k in range(K):
        l, v = LA.eig(cov[k])
        theta = np.arctan(v[1, 0] / v[0, 0])
        e = Ellipse((mu[k, 0], mu[k, 1]), 6 * l[0], 6 * l[1], theta * 180 / np.pi)
        e.set_alpha(0.5)
        ax.add_artist(e)
    plt.show()
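A minimal main loop under the assumptions above; the log_likelihood helper and the stopping threshold 1e-4 are illustrative choices, not part of the handout:

def log_likelihood():
    dens = np.zeros((NUMDATAPTS, K))
    for k in range(K):
        dens[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=cov[k])
    return np.sum(np.log(dens.sum(axis=1)))

prev_ll = -np.inf
while True:
    gamma = E_step()
    M_step(gamma)
    ll = log_likelihood()
    print(ll)
    plot_result(gamma)
    if ll - prev_ll < 1e-4:   # stop once the log-likelihood has converged
        break
    prev_ll = ll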
- Use sklearn's KMeans module
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
to perform K-means clustering on the dataset, and compare both clustering results.
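One way to run the comparison; the KMeans settings here are illustrative choices:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=K, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # compare with the EM means mu
print(kmeans.labels_)            # hard assignments vs. the soft responsibilities gamma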
Problem 2
Let p and q be distributions on {1, 2, 3, 4, 5} such that p = (…) and q = (…).
- Compute the cross-entropy H(p,q) in bits. Is H(q,p) = H(p,q)?
- Compute the entropies H(p) and H(q) in bits.
- Compute the KL-divergence D_KL(p ‖ q) in bits.
Show all working and leave your answers as fractions.
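For reference, the quantities above are defined as follows (base-2 logarithms give answers in bits):

\[
H(p, q) = -\sum_{i=1}^{5} p_i \log_2 q_i, \qquad
H(p) = -\sum_{i=1}^{5} p_i \log_2 p_i, \qquad
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i=1}^{5} p_i \log_2 \frac{p_i}{q_i} = H(p, q) - H(p).
\]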
Problem 3
- Perform singular value decomposition (SVD) on the following matrix: …
- For a general design matrix X, why are the columns of the transformed matrix T = XV orthogonal?
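As a starting point for the second part, writing the SVD as X = U\Sigma V^\top gives

\[
T = XV = U\Sigma V^\top V = U\Sigma,
\qquad
T^\top T = \Sigma^\top U^\top U \Sigma = \Sigma^\top \Sigma,
\]

which is diagonal, so distinct columns of T have zero inner product.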
Problem 4
In this problem, we will perform principal component analysis (PCA) on sklearn's diabetes dataset. Start by importing the required packages and loading the dataset.
import numpy as np
from sklearn import decomposition
from sklearn import datasets

X = datasets.load_diabetes().data
You can find out more on how to use sklearn's PCA module from:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
For this problem, make sure the design matrix is first normalized to have zero mean and unit standard deviation for each column.
- Write code to print the matrix V that will be used to transform the dataset, and print all the singular values.
- Now perform PCA on the dataset and print out the 3 most important components for the first 10 data points (one possible approach is sketched below).
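A sketch covering both steps above; the normalization follows the problem statement, and taking pca.components_.T for V reflects sklearn storing components as rows:

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit std per column

pca = decomposition.PCA()
pca.fit(X_norm)
print(pca.components_.T)      # the matrix V used to transform the dataset
print(pca.singular_values_)   # all singular values

T = decomposition.PCA(n_components=3).fit_transform(X_norm)
print(T[:10])                 # 3 most important components, first 10 data points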
Problem 5
An AR(2) model assumes the form
\[
r_t = \phi_0 + \phi_1 r_{t-1} + \phi_2 r_{t-2} + a_t,
\]
where a_t is a white noise sequence. Show that if the model is stationary, then
(a) E[r_t] = \phi_0 / (1 - \phi_1 - \phi_2) (assume \phi_1 + \phi_2 \neq 1);
(b) the ACF is given by \rho_1 = \phi_1 / (1 - \phi_2) and \rho_\ell = \phi_1 \rho_{\ell-1} + \phi_2 \rho_{\ell-2} for \ell \ge 2.
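For part (a), the key step is taking expectations on both sides of the model and using stationarity, so that E[r_t] = E[r_{t-1}] = E[r_{t-2}] = \mu:

\[
\mu = \phi_0 + \phi_1 \mu + \phi_2 \mu
\quad\Longrightarrow\quad
\mu = \frac{\phi_0}{1 - \phi_1 - \phi_2}.
\]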