[Solved] CS7641 Homework 2 - KMeans Clustering


This notebook is tested under Python 3.*.*, and the corresponding packages can be downloaded from miniconda. You may also want to familiarize yourself with the packages imported in the first cell below.

Please implement the functions that raise NotImplementedError, and once you have finished coding, delete or comment out the raise NotImplementedError lines.

In [1]:

################################## DO NOT CHANGE THIS CELL ##################################
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

%matplotlib inline

import sys
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import image
from mpl_toolkits.mplot3d import axes3d
from tqdm import tqdm

print('Version information')
print('python: {}'.format(sys.version))
print('matplotlib: {}'.format(matplotlib.__version__))
print('numpy: {}'.format(np.__version__))

# Set random seed so output is all same
np.random.seed(1)

# Load image
import imageio
Version information
python: 3.7.3 (default, Mar 27 2019, 16:54:48) [Clang 4.0.1 (tags/RELEASE_401/final)]
matplotlib: 3.1.0
numpy: 1.16.4

1. KMeans Clustering

KMeans is trying to solve the following optimization problem:

$$\underset{S}{\arg\min} \sum_{i=1}^{K} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where one needs to partition the N observations into K clusters $S = \{S_1, S_2, \dots, S_K\}$, and each cluster $S_i$ has $\mu_i$ as its center.

1.1 pairwise distance

In this section, you are asked to implement pairwise_dist function.

Given $X \in \mathbb{R}^{N \times D}$ and $Y \in \mathbb{R}^{M \times D}$, obtain the pairwise distance matrix $dist \in \mathbb{R}^{N \times M}$ using the Euclidean distance metric, where $dist_{i,j} = \lVert X_i - Y_j \rVert_2$.

DO NOT USE A FOR LOOP in your implementation: loops are slow and will make your code too slow to pass our grader. Use array broadcasting instead.
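As a minimal sketch of the broadcasting idea (the array shapes below are arbitrary and this is not the required function signature): expanding x to shape (N, 1, D) lets NumPy subtract every row of y from every row of x in a single vectorized step.

import numpy as np

x = np.random.randn(4, 3)   # N x D
y = np.random.randn(5, 3)   # M x D

# (N, 1, D) - (M, D) broadcasts to (N, M, D): every pair of rows is subtracted
diff = x[:, np.newaxis, :] - y
# Euclidean distance along the last axis gives an (N, M) distance matrix
dist = np.sqrt(np.sum(diff ** 2, axis=2))
print(dist.shape)  # (4, 5)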

1.2 KMeans Implementation

In this section, you are asked to implement _init_centers [5pts], _update_assignment [10pts], _update_centers [10pts] and _get_loss function [5pts].

For the function signatures, please see the corresponding docstrings.

1.3 Find the optimal number of clusters

In this section, you are asked to implement find_optimal_num_clusters function.

You will now use the elbow method to find the optimal number of clusters.
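As a rough illustration of how the elbow is usually read (this plotting helper is not part of the required implementation; losses is assumed to be the list returned by find_optimal_num_clusters): plot the loss against K and look for the point where the curve stops dropping sharply.

import matplotlib.pyplot as plt

def plot_elbow(losses):
    # losses[i] is the final KMeans objective for K = i + 1 clusters
    ks = range(1, len(losses) + 1)
    plt.plot(ks, losses, marker='o')
    plt.xlabel('number of clusters K')
    plt.ylabel('KMeans loss')
    plt.title('Elbow method')
    plt.show()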

1.4 Autograder test to find centers for data points

To obtain these 5 points, you need to pass the tests set up in the autograder, which will check the centers produced by your implementation. Be sure to upload the correct files to obtain these points.

In [2]:

class KMeans(object):

    def __init__(self):  # No need to implement
        pass

    def pairwise_dist(self, x, y):  # [5 pts]
        """
        Args:
            x: N x D numpy array
            y: M x D numpy array
        Return:
            dist: N x M array, where dist[i, j] is the euclidean distance between
            x[i, :] and y[j, :]
        """
        # l2 norm, ord=2
        dist = np.linalg.norm(x[:, np.newaxis, :] - y, ord=2, axis=2)
        return dist

    def _init_centers(self, points, K, **kwargs):  # [5 pts]
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            kwargs: any additional arguments you want
        Return:
            centers: K x D numpy array, the centers.
        """
        # K random indices from points
        indices = np.random.choice(points.shape[0], size=K, replace=False)
        # form the numpy array with these indices
        centers = points[indices, :]
        return centers

    def _update_assignment(self, centers, points):  # [10 pts]
        """
        Args:
            centers: KxD numpy array, where K is the number of clusters, and D is the dimension
            points: NxD numpy array, the observations
        Return:
            cluster_idx: numpy array of length N, the cluster assignment for each point

        Hint: You could call pairwise_dist() function.
        """
        # distances between points and all cluster centers
        distances = self.pairwise_dist(points, centers)
        # index of minimum distance for each row
        cluster_idx = np.argmin(distances, axis=1)
        return cluster_idx

    def _update_centers(self, old_centers, cluster_idx, points):  # [10 pts]
        """
        Args:
            old_centers: old centers KxD numpy array, where K is the number of clusters, and D is the dimension
            cluster_idx: numpy array of length N, the cluster assignment for each point
            points: NxD numpy array, the observations
        Return:
            centers: new centers, K x D numpy array, where K is the number of clusters, and D is the dimension.
        """
        K, D = old_centers.shape[0], old_centers.shape[1]
        # initialize centers as zero array
        centers = np.zeros((K, D))
        for i in range(K):
            # find mean of all points having i as cluster idx
            centers[i] = np.mean(points[cluster_idx == i, :], axis=0)
        return centers

    def _get_loss(self, centers, cluster_idx, points):  # [5 pts]
        """
        Args:
            centers: KxD numpy array, where K is the number of clusters, and D is the dimension
            cluster_idx: numpy array of length N, the cluster assignment for each point
            points: NxD numpy array, the observations
        Return:
            loss: a single float number, which is the objective function of KMeans.
        """
        # find squared distances between all points and cluster centers
        distances = np.linalg.norm(points[:, np.newaxis, :] - centers, ord=2, axis=2) ** 2
        # select the distance from each point's assigned cluster center
        distance_from_cluster_center = distances[np.arange(len(distances)), cluster_idx]
        # loss is the sum of all these distances
        loss = np.sum(distance_from_cluster_center)
        return loss

    def __call__(self, points, K, max_iters=100, abs_tol=1e-16, rel_tol=1e-16, verbose=False, **kwargs):
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            max_iters: maximum number of iterations (Hint: You could change it when debugging)
            abs_tol: convergence criteria w.r.t absolute change of loss
            rel_tol: convergence criteria w.r.t relative change of loss
            verbose: boolean to set whether method should print loss (Hint: helpful for debugging)
            kwargs: any additional arguments you want
        Return:
            cluster assignments: Nx1 int numpy array
            cluster centers: K x D numpy array, the centers
            loss: final loss value of the objective function of KMeans
        """
        centers = self._init_centers(points, K, **kwargs)
        for it in range(max_iters):
            cluster_idx = self._update_assignment(centers, points)
            centers = self._update_centers(centers, cluster_idx, points)
            loss = self._get_loss(centers, cluster_idx, points)
            K = centers.shape[0]
            if it:
                diff = np.abs(prev_loss - loss)
                if diff < abs_tol and diff / prev_loss < rel_tol:
                    break
            prev_loss = loss
            if verbose:
                print('iter %d, loss: %.4f' % (it, loss))
        return cluster_idx, centers, loss

    def find_optimal_num_clusters(self, data, max_K=15):  # [10 pts]
        """Plots loss values for different number of clusters in K-Means

        Args:
            data: NxD numpy array, the observations
            max_K: maximum number of clusters to try
        Return:
            losses: an array of loss denoting the loss of each number of clusters
        """
        losses = []
        # for each value of K, find the loss and append it to the list
        for i in range(1, max_K + 1):
            cluster_idx, centers, loss = self.__call__(data, i)
            losses.append(loss)
        return losses

In [3]:

# Helper function for checking the implementation of the pairwise_dist function. Please DO NOT change this cell.
# TEST CASE
x = np.random.randn(2, 2)
y = np.random.randn(3, 2)

print("*** Expected Answer ***")
print("""==x==
[[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]]
==y==
[[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]
 [ 0.3190391  -0.24937038]]
==dist==
[[1.85239052 0.19195729 1.35467638]
 [1.85780729 2.29426447 1.18155842]]""")

print("\n*** My Answer ***")
print("==x==")
print(x)
print("==y==")
print(y)
print("==dist==")
print(KMeans().pairwise_dist(x, y))
*** Expected Answer ***
==x==
[[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]]
==y==
[[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]
 [ 0.3190391  -0.24937038]]
==dist==
[[1.85239052 0.19195729 1.35467638]
 [1.85780729 2.29426447 1.18155842]]

*** My Answer ***
==x==
[[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]]
==y==
[[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]
 [ 0.3190391  -0.24937038]]
==dist==
[[1.85239052 0.19195729 1.35467638]
 [1.85780729 2.29426447 1.18155842]]

In [4]:

def image_to_matrix(image_file, grays=False):
    """
    Convert .png image to matrix of values.

    params:
    image_file = str
    grays = Boolean

    returns:
    img = (color) np.ndarray[np.ndarray[np.ndarray[float]]]
    or (grayscale) np.ndarray[np.ndarray[float]]
    """
    img = plt.imread(image_file)
    # in case of transparency values
    if len(img.shape) == 3 and img.shape[2] > 3:
        height, width, depth = img.shape
        new_img = np.zeros([height, width, 3])
        for r in range(height):
            for c in range(width):
                new_img[r, c, :] = img[r, c, 0:3]
        img = np.copy(new_img)
    if grays and len(img.shape) == 3:
        height, width = img.shape[0:2]
        new_img = np.zeros([height, width])
        for r in range(height):
            for c in range(width):
                new_img[r, c] = img[r, c, 0]
        img = new_img
    return img

In [5]:

image_values = image_to_matrix('./images/bird_color_24.png')

r = image_values.shape[0]
c = image_values.shape[1]
ch = image_values.shape[2]
# flatten the image_values
image_values = image_values.reshape(r * c, ch)

k = 6  # feel free to change this value
cluster_idx, centers, loss = KMeans()(image_values, k)
updated_image_values = np.copy(image_values)

# assign each pixel to its cluster mean
for i in range(0, k):
    indices_current_cluster = np.where(cluster_idx == i)[0]
    updated_image_values[indices_current_cluster] = centers[i]

updated_image_values = updated_image_values.reshape(r, c, ch)

plt.figure(None, figsize=(9, 12))
plt.imshow(updated_image_values)
plt.show()

In [6]:

KMeans().find_optimal_num_clusters(image_values)

Out[6]:

[25575.970952916756, 14136.693323245287, 10225.60985715669, 6993.510281152104, 6388.54308345205, 4749.894457724676, 3383.379286682485, 3418.0366633949584, 2738.2662250749277, 2554.129469081243, 2065.325028335666, 1929.4098314621665, 1965.2534688501296, 1706.074071535149, 1580.6903867544215]

Silhouette Coefficient Evaluation

The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a data instance is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighbouring cluster.

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
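Concretely, the standard per-point definition (where $a(i)$ is the mean distance from point $i$ to the other points in its own cluster and $b(i)$ is the mean distance from point $i$ to the points of the nearest other cluster) is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad \text{Silhouette Coefficient} = \frac{1}{N}\sum_{i=1}^{N} s(i)$$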

In [7]:

def intra_cluster_dist(cluster_idx, data, labels):  # [4 pts]
    """
    Calculates the average distance from a point to other points within the same cluster

    Args:
        cluster_idx: the cluster index (label) for which we want to find the intra cluster distance
        data: NxD numpy array, where N is # points and D is the dimensionality
        labels: 1D array of length N where each number indicates the cluster assignment for that point
    Return:
        intra_dist_cluster: 1D array where the i_th entry denotes the average distance from point i
                            in cluster denoted by cluster_idx to other points within the same cluster
    """
    # indices of rows with cluster == cluster_idx
    indices = [i for i in range(len(labels)) if labels[i] == cluster_idx]
    # find rows
    rows_of_interest = data[indices]
    # initialize intra_dist_cluster
    intra_dist_cluster = np.zeros((rows_of_interest.shape[0]))

    for i in range(intra_dist_cluster.shape[0]):
        # all points in the cluster except point i
        x = np.concatenate((rows_of_interest[:i, :], rows_of_interest[i + 1:, :]))
        y = rows_of_interest[i, :]
        intra_dist_cluster[i] = np.mean(np.sqrt(np.sum((x - y[None, :]) ** 2, axis=1)))

    return intra_dist_cluster


def inter_cluster_dist(cluster_idx, data, labels):  # [4 pts]
    """
    Calculates the average distance from one cluster to the nearest cluster

    Args:
        cluster_idx: the cluster index (label) for which we want to find the inter cluster distance
        data: NxD numpy array, where N is # points and D is the dimensionality
        labels: 1D array of length N where each number indicates the cluster assignment for that point
    Return:
        inter_dist_cluster: 1D array where the i-th entry denotes the average distance from point i in cluster
                            denoted by cluster_idx to the nearest neighboring cluster
    """
    # find number of distinct clusters
    num_clusters = len(np.unique(labels))
    # separate data points by cluster
    separate_data = []
    for j in range(num_clusters):
        indices = [i for i in range(len(labels)) if labels[i] == j]
        separate_data.append(data[indices])
    # indices of rows with cluster == cluster_idx
    indices = [i for i in range(len(labels)) if labels[i] == cluster_idx]
    # find rows
    rows_of_interest = data[indices]
    # initialize inter_dist_cluster
    inter_dist_cluster = np.zeros((rows_of_interest.shape[0]))
    # find min. average distance from each point to the nearest cluster
    for i in range(rows_of_interest.shape[0]):
        y = rows_of_interest[i, :]
        distances = []
        for j in range(len(separate_data)):
            if j != cluster_idx:
                x = separate_data[j]
                distances.append(np.mean(np.sqrt(np.sum((x - y[None, :]) ** 2, axis=1))))
        inter_dist_cluster[i] = min(distances)

    return inter_dist_cluster


def silhouette_coefficient(data, labels):  # [2 pts]
    """
    Finds the silhouette coefficient of the current cluster assignment

    Args:
        data: NxD numpy array, where N is # points and D is the dimensionality
        labels: 1D array of length N where each number indicates the cluster assignment for that point
    Return:
        silhouette_coefficient: Silhouette coefficient of the current cluster assignment
    """
    # unique clusters
    clusters = np.unique(labels)
    # initialize silhouette coefficient
    SC = 0

    # for each value of cluster_idx, find inter and intra cluster distance, add SC(i) to SC
    for cluster_idx in clusters:
        intra_dist_cluster = intra_cluster_dist(cluster_idx, data, labels)
        inter_dist_cluster = inter_cluster_dist(cluster_idx, data, labels)
        # max of inter and intra cluster distance for each point
        max_of_both = np.max(np.array([intra_dist_cluster, inter_dist_cluster]), axis=0)
        # calculate SC of each point
        S = [x / y for x, y in zip(inter_dist_cluster - intra_dist_cluster, max_of_both)]
        # add the sum of all SCs for that cluster to SC
        SC += np.sum(S)

    # SC is the average of all the individual SCs
    SC /= data.shape[0]

    return SC

In [8]:

def plot_silhouette_coefficient(data, max_K=15):
    """
    Plot silhouette coefficient for different number of clusters, no need to implement
    """
    clusters = np.arange(2, max_K + 1)

    silhouette_coefficients = []
    for k in range(2, max_K + 1):
        labels, _, _ = KMeans()(data, k)
        silhouette_coefficients.append(silhouette_coefficient(data, labels))
    plt.plot(clusters, silhouette_coefficients)
    return silhouette_coefficients


data = np.random.rand(200, 3) * 100
plot_silhouette_coefficient(data)

Out[8]:

[0.27434894586234465, 0.23974317309219537, 0.25170126787277375, 0.29110188891166133, 0.32392134092333413, 0.2978439345698108, 0.24118691265277048, 0.2861673492358512, 0.27509945216139964, 0.2710429481302329, 0.27049817823622974, 0.28742447415210476, 0.28737886551571834, 0.2721522663631721]

Limitation of K-Means

One of the limitations of K-Means clustering is that it depends largely on the shape of the dataset. A common example is trying to cluster one circle within another (concentric circles): K-means fails to separate them and ends up effectively drawing a line that crosses both circles. You can visualize this limitation in the cell below.

In [9]:

# visualize limitation of kmeans, do not have to implement
from sklearn.datasets.samples_generator import (make_circles, make_moons)

X1, y1 = make_circles(factor=0.5, noise=0.05, n_samples=1500)
X2, y2 = make_moons(noise=0.05, n_samples=1500)

def visualise(X, C, K):
    # Visualization of clustering. You don't need to change this function
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=C, cmap='rainbow')
    plt.title('Visualization of K = ' + str(K), fontsize=15)
    plt.show()
    pass

cluster_idx1, centers1, loss1 = KMeans()(X1, 2)
visualise(X1, cluster_idx1, 2)

cluster_idx2, centers2, loss2 = KMeans()(X2, 2)
visualise(X2, cluster_idx2, 2)

2. EM algorithm [20 pts]

2.1 Performing EM Update [10 pts]

A univariate Gaussian Mixture Model (GMM) has two components, both of which have their own mean and standard deviation. The model is defined by the following parameters:

$$z \sim \text{Bernoulli}(\pi)$$

$$p(x \mid z = 0) \sim \mathcal{N}(\mu, \sigma)$$

$$p(x \mid z = 1) \sim \mathcal{N}(2\mu, 3\sigma)$$

For a dataset of N datapoints, find the following:

2.1.1. Write the marginal probability of x, i.e. $p(x)$ [2pts]

2.1.2. E-Step: Compute the posterior probability, i.e., $p(z_i = k \mid x_i)$, where k = {0, 1} [2pts]

2.1.3. M-Step: Compute the updated value of $\mu$ (you can keep $\sigma$ fixed for this) [3pts]

2.1.4. M-Step: Compute the updated value of $\sigma$ (you can keep $\mu$ fixed for this) [3pts]

2.1.1 Marginal probability of x, i.e. $p(x)$

Given: a univariate Gaussian mixture model with two components, defined by their parameters as:

$$z \sim \text{Bernoulli}(\pi)$$

$$p(x \mid z = 0) \sim \mathcal{N}(\mu, \sigma)$$

$$p(x \mid z = 1) \sim \mathcal{N}(2\mu, 3\sigma)$$

Note: Considering $p(z = 0) = \pi$, and $\sigma$ and $3\sigma$ as standard deviations, as suggested by the TAs on Piazza.

The marginal probability of x can be calculated as:

$$p(x) = \sum_{z} p(z)\,p(x \mid z) = p(z=0)\,p(x \mid z=0) + p(z=1)\,p(x \mid z=1) = \pi\,\mathcal{N}(\mu, \sigma) + (1 - \pi)\,\mathcal{N}(2\mu, 3\sigma)$$
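As a small numerical sketch of this marginal (assuming SciPy is available; the parameter values below are arbitrary illustrative choices, and norm.pdf takes the standard deviation as its scale argument):

from scipy.stats import norm

pi, mu, sigma = 0.4, 1.0, 2.0   # arbitrary illustrative values

def p_x(x):
    # p(x) = pi * N(x; mu, sigma) + (1 - pi) * N(x; 2*mu, 3*sigma)
    return pi * norm.pdf(x, loc=mu, scale=sigma) + (1 - pi) * norm.pdf(x, loc=2 * mu, scale=3 * sigma)

print(p_x(1.5))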

2.1.2 E-Step: Posterior probability, i.e., $p(z_i = k \mid x_i)$, where k = {0, 1}

For k = 0, we get:

$$p(z_i = 0 \mid x_i) = \frac{p(z_i = 0)\,p(x_i \mid z_i = 0)}{p(x_i)} = \frac{\pi\,\mathcal{N}(\mu, \sigma)}{\pi\,\mathcal{N}(\mu, \sigma) + (1 - \pi)\,\mathcal{N}(2\mu, 3\sigma)}$$

Similarly, for k = 1, we get:

$$p(z_i = 1 \mid x_i) = \frac{p(z_i = 1)\,p(x_i \mid z_i = 1)}{p(x_i)} = \frac{(1 - \pi)\,\mathcal{N}(2\mu, 3\sigma)}{\pi\,\mathcal{N}(\mu, \sigma) + (1 - \pi)\,\mathcal{N}(2\mu, 3\sigma)}$$

2.1.3 M-Step: Computing the updated value of $\mu$

To find the updated value of $\mu$, we need to solve the optimization problem:

$$\mu_{new} = \underset{\mu}{\arg\max} \sum_{z} p(Z \mid X, \theta_{old}) \ln p(X, Z \mid \theta)$$

To solve this optimization problem, we find its derivative with respect to $\mu$ and equate it to 0.

Representing $p(Z = 1 \mid X, \theta_{old})$ as $\tau(z=1)$ and $p(Z = 0 \mid X, \theta_{old})$ as $\tau(z=0)$, and using the derivative of the Gaussian pdf w.r.t. $\mu$, we get:

$$\frac{\partial}{\partial \mu} \sum_{z} p(Z \mid X, \theta_{old}) \ln p(X, Z \mid \theta) = \frac{\partial}{\partial \mu}\Big[\tau(z=1)\ln\big((1-\pi)\,\mathcal{N}(2\mu, 3\sigma)\big) + \tau(z=0)\ln\big(\pi\,\mathcal{N}(\mu, \sigma)\big)\Big] = \tau(z=1)\,\frac{2(x - 2\mu)}{9\sigma^2} + \tau(z=0)\,\frac{x - \mu}{\sigma^2}$$

Equating the above derivative to 0, we get:

$$\tau(z=1)\,\frac{2(x - 2\mu)}{9\sigma^2} + \tau(z=0)\,\frac{x - \mu}{\sigma^2} = 0$$

$$2x\,\tau(z=1) - 4\mu\,\tau(z=1) + 9x\,\tau(z=0) - 9\mu\,\tau(z=0) = 0$$

$$x\big(2\tau(z=1) + 9\tau(z=0)\big) = \mu\big(9\tau(z=0) + 4\tau(z=1)\big)$$

This gives us the updated value of $\mu$ as:

$$\mu = \frac{x\big(2\tau(z=1) + 9\tau(z=0)\big)}{9\tau(z=0) + 4\tau(z=1)}$$

2.1.4 M-Step: Computing the updated value of $\sigma$

For this part, we solve the same optimization problem, but this time we take the partial derivative w.r.t. $\sigma$.

Representing $p(Z = 1 \mid X, \theta_{old})$ as $\tau(z=1)$ and $p(Z = 0 \mid X, \theta_{old})$ as $\tau(z=0)$, and using the derivative of the Gaussian pdf w.r.t. $\sigma$, we get:

$$\frac{\partial}{\partial \sigma} \sum_{z} p(Z \mid X, \theta_{old}) \ln p(X, Z \mid \theta) = \frac{\partial}{\partial \sigma}\Big[\tau(z=1)\ln\big((1-\pi)\,\mathcal{N}(2\mu, 3\sigma)\big) + \tau(z=0)\ln\big(\pi\,\mathcal{N}(\mu, \sigma)\big)\Big] = \tau(z=1)\left(\frac{(x - 2\mu)^2}{9\sigma^3} - \frac{1}{\sigma}\right) + \tau(z=0)\left(\frac{(x - \mu)^2}{\sigma^3} - \frac{1}{\sigma}\right)$$

Equating the above derivative to 0, we get:

$$\tau(z=1)\left(\frac{(x - 2\mu)^2}{9\sigma^3} - \frac{1}{\sigma}\right) + \tau(z=0)\left(\frac{(x - \mu)^2}{\sigma^3} - \frac{1}{\sigma}\right) = 0$$

$$\tau(z=1)\big((x - 2\mu)^2 - 9\sigma^2\big) + 9\tau(z=0)\big((x - \mu)^2 - \sigma^2\big) = 0$$

$$\tau(z=1)(x - 2\mu)^2 - 9\tau(z=1)\sigma^2 + 9\tau(z=0)(x - \mu)^2 - 9\tau(z=0)\sigma^2 = 0$$

$$\sigma^2 = \frac{\tau(z=1)(x - 2\mu)^2 + 9\tau(z=0)(x - \mu)^2}{9\big(\tau(z=1) + \tau(z=0)\big)}$$

This gives us the updated value of $\sigma$ as:

$$\sigma = \frac{1}{3}\sqrt{\frac{\tau(z=1)(x - 2\mu)^2 + 9\tau(z=0)(x - \mu)^2}{\tau(z=1) + \tau(z=0)}}$$
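These closed-form updates can be sanity-checked numerically. Below is a minimal sketch (assuming SciPy is available; the synthetic data, the fixed $\pi$, and the starting values are arbitrary, and the per-point expressions above are summed over the dataset):

import numpy as np
from scipy.stats import norm

np.random.seed(0)
# synthetic data from the assumed model: z=0 -> N(mu, sigma), z=1 -> N(2*mu, 3*sigma)
x = np.concatenate([np.random.normal(1.0, 2.0, 300),    # z = 0 component (mu = 1, sigma = 2)
                    np.random.normal(2.0, 6.0, 700)])   # z = 1 component (2*mu = 2, 3*sigma = 6)
pi = 0.3                 # kept fixed, as in the derivation
mu, sigma = 0.5, 1.0     # arbitrary initial guesses

for _ in range(100):
    # E-step: responsibilities tau(z=0), tau(z=1) for each point
    p0 = pi * norm.pdf(x, loc=mu, scale=sigma)
    p1 = (1 - pi) * norm.pdf(x, loc=2 * mu, scale=3 * sigma)
    tau0 = p0 / (p0 + p1)
    tau1 = 1 - tau0
    # M-step: the updates derived above, summed over all datapoints
    mu = np.sum(x * (2 * tau1 + 9 * tau0)) / np.sum(9 * tau0 + 4 * tau1)
    sigma = (1 / 3) * np.sqrt(np.sum(tau1 * (x - 2 * mu) ** 2 + 9 * tau0 * (x - mu) ** 2)
                              / np.sum(tau0 + tau1))

print(mu, sigma)   # should approach mu ≈ 1, sigma ≈ 2 for this synthetic data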

2.2 EM Algorithm in ABO Blood Groups [10 pts]

In the ABO blood group system, each individual has a phenotype and a genotype as shown below. The genotype is made of underlying alleles (A, B, O).

In a research experiment, scientists wanted to model the distribution of the genotypes of the population. They collected the phenotype information from the participants, as this could be directly observed from an individual's blood group. The scientists, however, want to use this data to model the underlying genotype information. To help them obtain an understanding, you suggest using the EM algorithm to find the genotype distribution.

You know that the probability that an allele is present in an individual is independent of the probability of any other allele, i.e., $P(AO) = P(OA) = P(A)P(O)$ and so on. Also note that the genotype pairs (AO, OA) and (BO, OB) are identical and can be treated as AO and BO respectively. You also know that the alleles follow a multinomial distribution.

Let $n_A, n_B, n_O, n_{AB}$ be the number of individuals with the phenotypes A, B, O, and AB respectively. Let $n_{AA}, n_{AO}, n_{BB}, n_{BO}, n_{AB}$ be the number of individuals with genotypes AA, AO, BB, BO, and AB respectively. They satisfy the following conditions:

Phenotype | Genotype
A         | AA, AO, OA
B         | BB, BO, OB
O         | OO
AB        | AB

$$p(O) = 1 - p(A) - p(B)$$
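The text above is only the setup, but as a hedged sketch of how the EM (gene-counting) iteration could look for this problem, assuming the genotype probabilities implied by allele independence ($p(AA) = p(A)^2$, $p(AO \text{ or } OA) = 2\,p(A)p(O)$, and so on) and purely hypothetical phenotype counts:

# hypothetical phenotype counts (not from the assignment)
nA, nB, nO, nAB = 186, 38, 284, 13
N = nA + nB + nO + nAB

pA, pB = 0.3, 0.1          # arbitrary initial guesses for allele frequencies
for _ in range(100):
    pO = 1 - pA - pB
    # E-step: expected genotype counts given the current allele frequencies
    nAA = nA * pA ** 2 / (pA ** 2 + 2 * pA * pO)
    nAO = nA - nAA
    nBB = nB * pB ** 2 / (pB ** 2 + 2 * pB * pO)
    nBO = nB - nBB
    # M-step: allele frequencies = expected allele counts / total alleles (2N)
    pA = (2 * nAA + nAO + nAB) / (2 * N)
    pB = (2 * nBB + nBO + nAB) / (2 * N)

print(pA, pB, 1 - pA - pB)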
