5CCS2INT
Introduction to Artificial Intelligence
Week 6 – Unsupervised Learning
Written by David Kohan Marzagão
Last updated on March 16, 2024 (found any typos? Please let me know!)
Contents

Introduction
K-Means Clustering Algorithm
    Introduction and Motivation
    The Algorithm
    An Example
    A Few Words on Initialisation
    The property behind the squared error metric in K-means
Hierarchical Clustering Algorithm
    Introduction and Motivation
    An Example
    Cluster Distance Criteria
    The Algorithm
    Dendrograms
Want Some More Details?
Version Control
Bibliography

Introduction
In the unsupervised learning setting, we look for patterns in the data without having labels for the data points. For example, if we have pictures of cats and dogs, an unsupervised algorithm may be able to identify two groups of similar pictures in the dataset, without, of course, being able to ‘name’ them.
A very common technique to do that is clustering. In this chapter, we will look at two common clustering techniques: K-Means Clustering, and Hierarchical Clustering.
-
K-Means Clustering Algorithm
-
Introduction and Motivation
We start off this section with a motivational example.
Example 6.1. Consider the points in Figure 6.1. There are 6 points, and imagine we want to create 2 clusters (that is, K = 2), but it is unclear which cluster point p4 should belong to. In more precise terms, should we have
C˜1 = {p1, p2, p3, p4} and C˜2 = {p5, p6} (6.1)
or, should we have
Cˆ1 = {p1, p2, p3} and Cˆ2 = {p4, p5, p6} (6.2)
instead?
In order to answer the question in Example 6.1, there is a key point to address: what does it mean for one cluster assignment to be better than another?
To answer this question, we need a metric for within-cluster variance, which we will call V(C_k); this is the value we want to minimise. Here we will use the squared Euclidean distance as a measure of within-cluster variation. That is, for a given cluster C_k made of points p_i, the within-cluster variation is calculated by:
V(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} d(p_i, p_{i'})^2    (6.3)

where

d(p_i, p_{i'})^2 = \sum_{j=1}^{D} (p_{ij} - p_{i'j})^2    (6.4)
Note that we do not count each pair i, i′ twice.

Figure 6.1: An example of observations p1 = (1, 2), p2 = (2, 3), p3 = (3, 0), p4 = (4, 4), p5 = (6, 7), p6 = (7, 6) to be clustered into 2 clusters. Where should p4 go?

For example, for the problem in Example 6.1, let us calculate V(Ĉ_2):
V(\hat{C}_2) = \frac{1}{|\hat{C}_2|} \sum_{i,i' \in \hat{C}_2} d(p_i, p_{i'})^2
            = \frac{1}{3} \left( d(p_4, p_5)^2 + d(p_4, p_6)^2 + d(p_5, p_6)^2 \right)
            = \frac{1}{3} (13 + 13 + 2)
            \approx 9.33
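To make this concrete, here is a minimal sketch in Python (assuming NumPy is available; the function name within_cluster_variance is just an illustrative choice) that computes the within-cluster variation of Equation (6.3) and reproduces V(Ĉ_2) ≈ 9.33.

```python
from itertools import combinations
import numpy as np

def within_cluster_variance(points):
    """Within-cluster variation V(C_k) as in Equation (6.3):
    the sum of squared Euclidean distances over each (unordered)
    pair of points, divided by the cluster size."""
    points = np.asarray(points, dtype=float)
    pair_sum = sum(np.sum((p - q) ** 2) for p, q in combinations(points, 2))
    return pair_sum / len(points)

# Cluster C^_2 = {p4, p5, p6} from Example 6.1
c_hat_2 = [(4, 4), (6, 7), (7, 6)]
print(within_cluster_variance(c_hat_2))  # (13 + 13 + 2) / 3 ≈ 9.33
```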
Using these calculations, one can finally solve the problem posed in Example 6.1 and decide which clustering is better. The next exercise will allow you to calculate the clustering with the lower within-cluster variance.
Exercise 1. Recall the cluster options from Example 6.1: either C̃_1 = {p1, p2, p3, p4} and C̃_2 = {p5, p6}, or Ĉ_1 = {p1, p2, p3} and Ĉ_2 = {p4, p5, p6}. Calculate V(C̃_1), V(C̃_2), and V(Ĉ_1). Then, together with the value of V(Ĉ_2) computed above, determine which of the two cluster assignments has the lower within-cluster variance.
Solution to Exercise 1
-
The Algorithm
The whole point is that there are too many cluster assignments to check! More precisely, even only considering K = 2, the number of cluster assignments for, say, n = 20 points (counting every way of labelling each point as belonging to cluster 1 or cluster 2) is

\sum_{j=0}^{20} \binom{20}{j} = 2^{20} = 1,048,576.

This may not look like much, but the danger is that this number grows exponentially in n, therefore becoming intractable really quickly in more realistic data sets.
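As a rough sanity check on how quickly this count blows up, here is a tiny plain-Python sketch printing the number of two-cluster labellings for a few values of n:

```python
# Number of ways to label each of n points as belonging to cluster 1 or cluster 2 (K = 2).
for n in (10, 20, 30, 40):
    print(n, 2 ** n)
# 10 1024
# 20 1048576
# 30 1073741824
# 40 1099511627776
```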
We therefore need another way to cluster a set of points. In a nutshell, the K-means implementations in most libraries work roughly according to the following definition. Note that libraries in general implement some optimisations on top of what we describe here.
Definition 6.2 (K-Means Clustering Algorithm) Let p1, . . . , pn ∈ R^D be a set of points in D dimensions. Let K be the desired number of clusters to be found. Then the K-means clustering algorithm does the following:

1. Select K different data points to be the initial cluster centres at time t = 0, µ_1^(0), . . . , µ_K^(0). Clusters at time t are called C_1^(t), . . . , C_K^(t). There are no clusters (yet) at time t = 0. Move time to t = 1.

2. At time t, assign each data point p_i to the cluster C_k^(t) with the closest cluster centre µ_k^(t−1). Break ties by assigning the point to the cluster centre with the lowest index among the closest ones.

3. For each cluster, recalculate its cluster centre as the average of all its points, i.e., for each k ∈ {1, . . . , K},

\mu_k^{(t)} = \frac{1}{|C_k^{(t)}|} \sum_{i \in C_k^{(t)}} p_i    (6.5)

4. If the clusters changed, go back to step 2, moving time forward to t + 1. If the clusters have not changed, stop and return the final clusters.
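The following is a minimal sketch of Definition 6.2 in Python with NumPy. It is not how library implementations work internally (they are heavily optimised), and the function name kmeans and its arguments are illustrative choices, but it follows the steps above, including the lowest-index tie-break.

```python
import numpy as np

def kmeans(points, initial_centres, max_iter=100):
    """K-means roughly following Definition 6.2.

    points:          (n, D) array of observations
    initial_centres: (K, D) array of initial cluster centres (mu at t = 0)
    Returns the final centres and, for each point, the index of its cluster.
    """
    points = np.asarray(points, dtype=float)
    centres = np.asarray(initial_centres, dtype=float).copy()
    labels = None
    for _ in range(max_iter):  # max_iter is just a safety guard
        # Step 2: assign each point to the closest centre.
        # np.argmin returns the *lowest* index among ties, as required.
        dists = np.sum((points[:, None, :] - centres[None, :, :]) ** 2, axis=2)
        new_labels = np.argmin(dists, axis=1)
        # Step 4: stop if the clusters did not change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centre as the mean of its points.
        for k in range(len(centres)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[k] = members.mean(axis=0)
    return centres, labels
```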
-
-
An Example
Consider the set of points as below:
P = \begin{pmatrix} 2 & 5 \\ 2 & 6 \\ 4 & 5 \\ 1 & 8 \\ 6 & 4 \\ 4 & 4 \\ 1 & 1 \\ 4 & 7 \end{pmatrix}    (6.6)
In other words, p1 = (2, 5), p2 = (2, 6), and so on. Let’s say we want to define 3 clusters (so K = 3), and that the initial cluster centres (or centroids) are the following:

µ_1^(0) = p1 = (2, 5)
µ_2^(0) = p2 = (2, 6)
µ_3^(0) = p5 = (6, 4)
Figure 6.2 depicts the entire process.
Figure 6.2: The entire k-means clustering of these 8 data points into K = 3 clusters. Crosses represent cluster centres at that time, whereas colours represent clusters. The panels show the initial cluster centres, (2.00, 5.00), (2.00, 6.00), and (6.00, 4.00), and then the centres at the end of each iteration: (2.33, 3.67), (2.33, 7.00), (5.00, 4.00) after iteration 1; (1.50, 3.00), (2.33, 7.00), (4.67, 4.33) after iteration 2; and (1.00, 1.00), (2.25, 6.50), (4.67, 4.33) after iterations 3, 4, and 5. Note that, in a real implementation, iteration number 5 would not even start. That is because at the end of iteration 4 we would identify that clusters have not changed from the end of iteration 3, and therefore terminate the algorithm.
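If you want to check Figure 6.2 yourself, the kmeans sketch from the previous section can be run on the example data (this is just an illustrative check, reusing that hypothetical helper):

```python
import numpy as np

# kmeans as defined in the sketch in the previous section
P = np.array([[2, 5], [2, 6], [4, 5], [1, 8],
              [6, 4], [4, 4], [1, 1], [4, 7]], dtype=float)
initial = P[[0, 1, 4]]       # p1 = (2, 5), p2 = (2, 6), p5 = (6, 4)
centres, labels = kmeans(P, initial)
print(np.round(centres, 2))  # approx. (1, 1), (2.25, 6.5), (4.67, 4.33), matching Figure 6.2
print(labels)                # cluster index of each point
```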
Exercise 2. Consider the data points (or observations) as in Figure 6.1. Define K = 2, µ_1^(0) = (1, 2), and µ_2^(0) = (6, 7). What are C_1^(1), C_2^(1), µ_1^(1), and µ_2^(1)? What about C_1^(2), C_2^(2), µ_1^(2), and µ_2^(2)?
Solution to Exercise 2
Exercise 3. This exercise is for you to note that a different cluster initialisation may change the result. In particular, it may give us a local minimum that is not the overall best clustering we can do. Consider the same data points as in Figure 6.1. Define K = 2 and now invert the cluster indexes, that is, µ_1^(0) = (6, 7) and µ_2^(0) = (1, 2). What are C_1^(1), C_2^(1), µ_1^(1), and µ_2^(1)? What about C_1^(2), C_2^(2), µ_1^(2), and µ_2^(2)?
Solution to Exercise 3
-
A Few Words on Initialisation
You may be wondering ‘but how should I choose my initial cluster centres?’ Definition 6.2 does not specify how exactly this should be done. In particular, you may be worried about choosing a cluster centre initialisation that leads to a local minimum (that is not a global minimum).^i We have seen an example in which this can happen (Exercise 3). There are two main ways to try to solve this problem (which can be combined!):

^i We try to be careful about this terminology because a global minimum is, in particular, a local minimum. So our worry is to land on *another* local minimum, that is, one that does not happen to also be a global minimum. Recall that there may be more than one global minimum (several points that are equally good, and better than anything else).
- Run the clustering several times, with several random initialisations for your cluster centres, and pick the best one.

- Select your initial centres in a smart way (for example, do not take points that are too close to each other, as this may lead to bad local minima or make the process take too long to finish).
Both strategies above are used by one of the main Python implementations of K-means, the one in scikit-learn. The number of initialisations is set by a parameter called n_init, and setting the init parameter to 'k-means++' selects the initial cluster centres based on an empirical probability distribution over the points. The intricacies of how this latter strategy works are of course beyond the scope of this course, but I wanted to let you know that there are techniques for choosing how to initialise your cluster centres.
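As a rough illustration (assuming scikit-learn is installed; the data array here is just the example from the previous section), this is how those two options appear in code:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 5], [2, 6], [4, 5], [1, 8],
              [6, 4], [4, 4], [1, 1], [4, 7]], dtype=float)

# n_init=10: run the whole procedure 10 times with different initial centres
#            and keep the best run (lowest total squared distance).
# init="k-means++": choose initial centres with the k-means++ seeding strategy.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.labels_)
```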
-
-
The property behind the squared error metric in K-means.
At this point, you may be wondering why it is that we can suddenly use the cluster centres to look for better cluster assignments (i.e., a lower sum of within-cluster variances). Didn’t we have to calculate the distance from one data point to all the other ones in the cluster? How come just looking at a point’s closest cluster centre is enough? The proposition that follows proves that this indeed works: by reducing the sum of squared distances to cluster centres, one also reduces the sum of within-cluster variances! Note that this proposition is non-examinable.
Proposition 6.3 Let C_k be a cluster and µ_k its cluster centre. Then, its within-cluster variance is equal to the sum of squared distances from each point to the cluster centre. In other words:

\frac{1}{|C_k|} \sum_{i,i' \in C_k} d(p_i, p_{i'})^2 = \sum_{i \in C_k} d(p_i, \mu_k)^2    (6.7)
Remark 6.4. Note that Proposition 6.3 in some ways justifies the use of Euclidean distances (or their squares) as what we want to minimise. If we had used the Manhattan distance instead, it is not clear that we would be minimising within-cluster variation by assigning each point to the cluster whose centre is closest to it.
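As a quick numerical check of this identity (a sketch in Python with NumPy; the variable names are illustrative), using cluster Ĉ_2 = {p4, p5, p6} from Example 6.1 and the pair-counted-once convention of Equation (6.3):

```python
from itertools import combinations
import numpy as np

cluster = np.array([(4, 4), (6, 7), (7, 6)], dtype=float)  # C^_2 from Example 6.1

# Left-hand side: within-cluster variance from pairwise distances (each pair once).
lhs = sum(np.sum((p - q) ** 2) for p, q in combinations(cluster, 2)) / len(cluster)
# Right-hand side: sum of squared distances to the cluster centre.
mu = cluster.mean(axis=0)
rhs = sum(np.sum((p - mu) ** 2) for p in cluster)
print(lhs, rhs)  # both are approximately 9.33
```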
-
-
Hierarchical Clustering Algorithm
-
Introduction and Motivation
Consider, again, a set of points on a 2D grid to be clustered. Imagine that, instead of finding, say, 2 clusters, we are interested in creating a process in which the number of clusters progressively changes as steps go by. We can then stop when we reach the desired number of clusters. Hierarchical clustering offers us a way to do that.
The process is simple: at step 1, there are K = n clusters, where n is the total number of data points. That is, each point is its own cluster. At each time-step, we select the closest pair of clusters and merge them into one.
At this point, there are a few things to explore:
- Linkage Criterion: How do we define distance between two clusters?

- Metric: Which metric do we use to define distance between two points?
These questions are, of course, related, as they both concern distances. However, as we will see, the criterion for cluster distance is independent of the choice of metric used to calculate the distance between points.
We will introduce and motivate some new definitions via an example. Therefore, we will slightly change the usual order of things and first show an example, then introduce the formal definitions. Informally, the hierarchical clustering algorithm does the following:
1. At the start, assign each point to its own cluster.

2. Merge the two closest clusters according to a given criterion.

3. If all points are in one cluster, stop. Otherwise, go back to step 2.
Let us move on to our example.
-
-
An Example
Consider Figure 6.3. It shows the initial step of a hierarchical clustering algorithm. The set of points used in this example is:
P = \begin{pmatrix} -10 & 8 \\ 7 & -6 \\ 8 & -10 \\ -6 & -4 \\ -8 & 6 \\ 2 & -4 \end{pmatrix}

Figure 6.3: Iteration t = 0 of Hierarchical Clustering. On the left there is a plot with each data point as a cluster. On the right, there is a table showing the distances between clusters, reproduced below. We are using the Manhattan distance metric.

        {1}   {2}   {3}   {4}   {5}
{2}      31
{3}      36     5
{4}      16    15    20
{5}       4    27    32    12
{6}      24     7    12     8    20
There is also a table showing distances between clusters. In this example, we use the single-linkage criterion and Manhattan distance. We have not yet defined them, so let us start by defining Manhattan distance:
Definition 6.5 (Manhattan Distance Metric – L1 Norm) Let p1, p2 ∈ RD be a pair of points in D dimensions. We use a second index to refer to their coordinates, for example p1 = (p11, p12, . . . , p1D). We then define the Manhattan distance (also called L1 norm, thus the subscript 1 after the norm symbol)
\|p_1 - p_2\|_1 := \sum_{i=1}^{D} |p_{1i} - p_{2i}|
where |x| is the usual modulus of real numbers.
Exercise 4. Compute the Manhattan distance between p1 and p2, where
p1 = (3, 6, 1, −1)
p2 = (−3, 6, 2, 5)
Figure 6.4: Iteration t = 1 of Hierarchical Clustering. Clusters are merged if they are 4 or less apart (note the threshold value at the top of the plot). Cluster distances, using the single-linkage criterion, are shown in the table on the right, reproduced below.

        {1,5}   {2}   {3}   {4}
{2}        27
{3}        32     5
{4}        12    15    20
{6}        20     7    12     8
Solution to Exercise 4
Since we are here, let us define the other metric we will use, the Euclidean metric (or
L2 norm).
Definition 6.6 (Euclidean Distance Metric – L2 Norm) Let p1, p2 ∈ RD be a pair of points in D dimensions. We use a second index to refer to their coordinates, for example p1 = (p11, p12, . . . , p1D). We then define the Euclidean distance (also called L2 norm, thus the subscript 2 after the norm symbol)
\|p_1 - p_2\|_2 := \left( \sum_{i=1}^{D} (p_{1i} - p_{2i})^2 \right)^{\frac{1}{2}}
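Both metrics are one-liners in code. The sketch below (Python with NumPy; the function names are just illustrative choices) also checks the Manhattan distance between p1 and p5 from Figure 6.3, which is 4, the smallest entry in the table and the reason {1} and {5} are merged first:

```python
import numpy as np

def manhattan(p, q):   # L1 norm of p - q
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def euclidean(p, q):   # L2 norm of p - q
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

p1, p5 = (-10, 8), (-8, 6)
print(manhattan(p1, p5))   # 4, matching the table in Figure 6.3
print(euclidean(p1, p5))   # approximately 2.83
```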
Now consider the table on the right of Figure 6.3. It gives us the Manhattan distance between all clusters, which happens to coincide with the distance between points, as each cluster is a singleton set (a set with only one element). From the table, we can see that the closest clusters are {1} and {5}; we therefore merge them into {1, 5}. The left-hand side of Figure 6.4 shows the two clusters merged into one. Note that the colour chosen for the merged cluster coincides with the colour of the cluster containing the point with the lower index.^ii

On the right-hand side of Figure 6.4, there is a table computing the new distances between clusters. It is now that single-linkage comes in. Consider the distance between {1, 5} and {4}. Should that be the distance between point p1 and point p4, or between point p5 and point p4? The former is larger, the latter is smaller, so which is the correct one? Well, this is where the criterion comes in. For this example, we are using single-linkage, which means we take the shortest distance between any two points, one in each cluster. A formal definition will come later.
The algorithm will progress by selecting the closest pair of clusters at each iteration. Figures 6.5 and 6.6 show the remaining iterations.
-
Cluster Distance Criteria
In this section, we present three different cluster distance criteria: the definitions first, with examples to follow.
Definition 6.7 (Single-Linkage Criterion) Let C1 and C2 be two clusters and ‖·‖ a norm (that can be, for example, L1 or L2). Then, the single-linkage criterion for the distance between C1 and C2 is given by:

dist_{SL}(C_1, C_2) := \min_{p \in C_1, q \in C_2} \|p - q\|    (6.8)
In other words, it is the shortest distance between points from one cluster to the other.
Definition 6.8 (Complete-Linkage Criterion) Let C1 and C2 be two clusters and ‖·‖ a norm (that can be, for example, L1 or L2). Then, the complete-linkage criterion for the distance between C1 and C2 is given by:

dist_{CL}(C_1, C_2) := \max_{p \in C_1, q \in C_2} \|p - q\|    (6.9)
In other words, it is the longest distance between points from one cluster to the other.
Definition 6.9 (Average-Linkage Criterion) Let C1 = {p1, . . . , pn} and C2 = {q1, . . . , qm} be two clusters and ‖·‖ a norm (that can be, for example, L1 or L2). Then, the average-linkage criterion for the distance between C1 and C2 is given by:

dist_{AL}(C_1, C_2) := \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \|p_i - q_j\|    (6.10)
^ii There is an abuse of notation in both plots and tables. Instead of p1, p2, . . ., we refer to points simply as 1, 2, . . . Also, there are no curly brackets around the clusters shown in the plots’ legends. This was done to simplify notation and unclutter the plots and tables.
Figure 6.5: Iteration t = 2 (top) and t = 3 (bottom) of Hierarchical Clustering. Clusters are merged if they are 5 or less apart (top), and 7 or less apart (bottom). Cluster distances, using the single-linkage criterion, are shown in the tables on the right, reproduced below.

Iteration 2:
          {1,5}   {2,3}   {4}
{2,3}        27
{4}          12      15
{6}          20       7     8

Iteration 3:
            {1,5}   {2,3,6}
{2,3,6}        20
{4}            12         8
Figure 6.6: Iteration t = 4 (top) and t = 5 (bottom) of Hierarchical Clustering. Clusters are merged if they are 8 or less apart (top), and 12 or less apart (bottom). Cluster distances, using the single-linkage criterion, are shown in the table on the top right: the single remaining distance, between {1, 5} and {2, 3, 4, 6}, is 12. Note that a threshold of 12 is enough to amalgamate all points into a single cluster.
In other words, it is the average distance between points from one cluster to the other. Note that mn is the number of pairs of points between clusters.
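These three criteria are easy to express in code. Here is a sketch (Python; the function names are illustrative) that reproduces the distance between {1, 5} and {4} from the table in Figure 6.4 under each criterion, using the Manhattan distance:

```python
from itertools import product
import numpy as np

def manhattan(p, q):
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def single_linkage(c1, c2, dist=manhattan):
    return min(dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2, dist=manhattan):
    return max(dist(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2, dist=manhattan):
    return sum(dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

c_15 = [(-10, 8), (-8, 6)]   # cluster {1, 5}
c_4 = [(-6, -4)]             # cluster {4}
print(single_linkage(c_15, c_4))    # 12, as in the table in Figure 6.4
print(complete_linkage(c_15, c_4))  # 16
print(average_linkage(c_15, c_4))   # 14.0
```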
-
The Algorithm
Up to now, you understand how hierarchical clustering works. You are also able to perform the algorithm if I give you some set of points. We have ignored, however, a few details. For example, what happens when we have two pairs of clusters with the same (shortest) distance? Do we merge one at random, both, or the one with, say, the smallest index? And finally, what if cluster C2 is equidistant from C1 and C3 and this is the shortest distance among all pairs of clusters: do we merge C2 with C1, with C3, or create a big cluster joining the three of them?
We now present a more formal definition of Hierarchical Clustering addressing these and other points.
Definition 6.10 (Hierarchical Clustering Algorithm) Let p1, . . . , pn ∈ R^D be a set of points in D dimensions. Consider a metric for the distance between points (such as Manhattan or Euclidean). Clusters at time t are called C_1^(t), C_2^(t), . . .. Since the number of clusters will reduce throughout the algorithm via merges, we consider that the lower index at time t is used when naming the merged cluster at time t + 1. Then, the hierarchical clustering algorithm does the following:

1. Create n clusters, one for each data point p_i. At time t = 0, the clusters are therefore C_1^(0) = {p_1}, . . . , C_n^(0) = {p_n}. Move time to t = 1.

2. At time t, merge the clusters that have the smallest distance between each other, according to a given criterion of distance. We evaluate, for every distinct pair of clusters C_i^(t), C_j^(t),

threshold^{(t)} := \min_{C_i^{(t)}, C_j^{(t)}} dist\left(C_i^{(t)}, C_j^{(t)}\right)    (6.11)

where dist(C_i^(t), C_j^(t)) can be given by, for example, the single-, complete-, or average-linkage criteria. If only one pair attains this distance, take the pair of clusters at distance threshold^(t) and merge them into one (assume i < j):

C_i^{(t+1)} = C_i^{(t)} \cup C_j^{(t)}    (6.12)

In case of ties, merge all clusters that are a distance of threshold^(t) away from each other.

3. Stop if there is only one cluster containing all points. Otherwise, go back to step 2, moving time forward to t + 1.
Remark 6.11. Let us elaborate on the point “In case of ties, merge all clusters that are a distance of threshold^(t) away from each other.” There are two main scenarios.

- For example, we can have C1 and C2 that are 10 away from each other, and C3 and C4 that are also 10 away from each other. If 10 is the smallest distance among all pairs of clusters, we need to perform two merges: C1 and C2 become one big cluster, and C3 and C4 become another big cluster.

- Another example is to have a ‘chain’ of clusters: e.g., we have C1 and C2 that are 10 away from each other, and also C2 and C3 that are 10 away. In this case, we merge C1, C2, and C3 into one big cluster.
Note that there are no single correct answers to the points we raised at the beginning of this section; there are often arguments both ways. For example, merging all pairs that are exactly the threshold value apart makes it impossible to obtain every value of K (we may ‘jump’ from x clusters to x − 2 or fewer). On the other hand, defining it like this ties really nicely into the use of dendrograms, which we will see in Section 6.3.5.
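Here is a minimal sketch of the procedure in Python (the function names are illustrative, and, for brevity, it merges one closest pair per step, breaking ties by lowest index, rather than merging all tied pairs as in Remark 6.11). Run on the example points with k_stop = 3, it reproduces the clusters at the end of iteration 3 in Figure 6.5.

```python
from itertools import product
import numpy as np

def manhattan(p, q):  # as in the earlier sketch
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def single_linkage(c1, c2, dist=manhattan):  # as in the earlier sketch
    return min(dist(p, q) for p, q in product(c1, c2))

def hierarchical(points, cluster_dist=single_linkage, k_stop=1):
    """Agglomerative clustering: start with singleton clusters and repeatedly
    merge the closest pair of clusters until only k_stop clusters remain."""
    clusters = [[tuple(p)] for p in points]
    while len(clusters) > k_stop:
        # Find the pair of clusters with the smallest distance (the threshold).
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # keep the lower index for the merged cluster
        del clusters[j]
    return clusters

P = [(-10, 8), (7, -6), (8, -10), (-6, -4), (-8, 6), (2, -4)]
print(hierarchical(P, k_stop=3))
# [[(-10, 8), (-8, 6)], [(7, -6), (8, -10), (2, -4)], [(-6, -4)]]
```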
Exercise 5. If n > 1 is the number of initial points in a hierarchical clustering algorithm, what is the maximum time t until the algorithm ends?
Solution to Exercise 5
Exercise 6. The example given in this section, with starting configuration as in Figure 6.3, used single-linkage and the L1 norm. It turns out that the clustering process differs if we use complete-linkage instead (while keeping the Manhattan distance). What is the highest value of K for which the cluster assignments differ? For example, for K = 6, the clustering is the same. It turns out that for K = 5, although the distances change, the actual cluster assignment is the same. What is the highest K for which it isn’t?
Solution to Exercise 6
-
-
-
Dendrograms
How can we quickly visualise the entire hierarchical clustering process? It turns out that dendrograms do just that. We best explain with an example. Consider again the clustering of the points in Figure 6.3. Figures 6.7 to 6.12 show all 6 dendrograms, one for each combination of the three linkage criteria and the two distance metrics. Note that the clustering changes significantly.
Dendrograms may also help us to decide on a suitable value of K for the number of clusters. If we need to increase the threshold a lot until the next pair of clusters is merged, this indicates that the clusters were ‘far away’. Look at Figure 6.9, which uses complete linkage and the L1 norm. In this dendrogram, it looks like K = 2 may be a good choice, since one needs to increase the threshold from around 20 to around 35 to go from 2 clusters to 1 cluster. Long vertical parts of the dendrogram indicate (under whichever metric or rule you are using) that it takes a lot of threshold increasing to merge clusters.
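In practice, dendrograms are usually produced with a library. Below is a sketch using SciPy and matplotlib (assuming both are installed; 'cityblock' is SciPy's name for the Manhattan/L1 metric) on the running example, using complete linkage as in Figure 6.9. The fcluster call shows how a distance threshold cuts the tree into flat clusters.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

P = np.array([(-10, 8), (7, -6), (8, -10), (-6, -4), (-8, 6), (2, -4)])

# Build the merge tree with complete linkage and the L1 (Manhattan) metric.
Z = linkage(P, method="complete", metric="cityblock")

dendrogram(Z, labels=["1", "2", "3", "4", "5", "6"])  # leaf labels as in the figures
plt.show()

# Cut the tree at a distance threshold to obtain flat clusters;
# a threshold of 25 here should give K = 2 clusters.
print(fcluster(Z, t=25, criterion="distance"))
```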
Exercise 7. This is a question about Hierarchical Clustering. Consider cities of a given country as points, with distances given by Euclidean distance. Imagine you run a delivery company that delivers goods by helicopter. The problem is that your helicopters have limited fuel and can travel, without stopping, between cities that are at most 100 miles apart on a straight line (same as Euclidean distance, as you can ignore the curvature of the earth in this case). Therefore you want to cluster cities in such a way that the helicopter can go from any city to any other city within the same cluster without stopping for fuel. You decided to determine the number of clusters by setting the threshold to 100. Which criterion should you use in this case?
Solution to Exercise 7
Figure 6.7: Dendrogram with single linkage and L1 norm.

Figure 6.8: Dendrogram with single linkage and L2 norm.

Figure 6.9: Dendrogram with complete linkage and L1 norm.

Figure 6.10: Dendrogram with complete linkage and L2 norm.

Figure 6.11: Dendrogram with average linkage and L1 norm.

Figure 6.12: Dendrogram with average linkage and L2 norm.
-
-
Want Some More Details?
The book by Russell and Norvig, ‘Artificial Intelligence: A Modern Approach’ (fourth edition) [1], is a good source, although not very complete in some machine learning parts. Sometimes I think they lack precision in definitions and details in descriptions. In any case, look into Chapter 19 of the book.
-
Version Control
-
Version 2 (v2)
-
In exercise 3 solution:

C_1^(1) = {p4, p5, p6}
C_2^(1) = {p1, p2, p3}
µ_1^(1) = (5.67, 5.67)
µ_2^(1) = (2.00, 1.67)
-
Added to Solution of Exercises 2 and 3.
-
Added a paragraph on deciding on K in Section 6.3.5 (Dendrograms).
-
Added Section 6.2.4 (‘A few words on initialisation’)
-
Added Section 6.2.5 (‘The property behind the squared error metric in K-means’)
-
Added Remark 6.11
-
Changed the last sentence in step two of Hierarchical Clustering from “If two or more pairs with no cluster overlap are also distant threshold^(t), merge them too according to the same principle. If three or more clusters are on a ‘chain’ of threshold^(t) distance, merge all three or more into the same cluster, like: C_i^(t+1) = C_i^(t) ∪ C_j^(t) ∪ C_k^(t).” to “In case of ties, merge all clusters that are a distance of threshold^(t) away from each other.” (See Remark 6.11 for a clarification.)
Bibliography
[1] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Fourth edition. Pearson, 2020.