Chapter 10
Clustering
Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster
have high similarity
are very dissimilar to objects in other clusters
Clustering is called data segmentation because it partitions large data sets into groups according to their similarity
Clustering can be used for outlier detection
Clustering is known as unsupervised learning: class label information is not present
a form of learning by observation, rather than learning by examples
Requirements of clustering
Scalability
Ability to deal with different types of attributes (not only numerical)
Discovery of clusters with arbitrary shape (not only spherical)
Minimal requirements for domain knowledge to determine input parameters (the quality of the result depends on these parameters)
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
Capability of clustering high-dimensionality data
Constraint-based clustering
Interpretability and usability of clustering results
Comparison criteria
Partitioning criteria
Single level partitioning
Hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region)
Non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector)
Connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional)
Subspaces (often in high-dimensional clustering)
Overview of major clustering methods (1)
Partitioning methods
Find mutually exclusive clusters of spherical shape
Distance-based
May use mean or medoid (etc.) to represent the cluster center
Effective for small to medium-size datasets
Typical methods: k-means, k-medoids
Hierarchical methods
Clustering is a hierarchical decomposition (i.e., multiple levels)
Cannot correct erroneous merges or splits
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Overview of major clustering methods (2)
Density-based methods
Can find arbitrarily shaped clusters
Clusters are dense regions of objects in space that are separated by low-density regions
Cluster density: each point must have a minimum number of points within its neighborhood
May filter out outliers
Typical methods: DBSCAN, OPTICS, DENCLUE
Grid-based methods
Use a multiresolution grid data structure
Fast processing time (typically independent of the number of data objects, yet dependent on grid size)
Typical methods: STING, WaveCluster, CLIQUE
Partitioning methods
a data set D of n objects
the number k (k <= n) of clusters, known in advance
A partitioning algorithm organizes the objects into k partitions, where each partition represents a cluster
The clusters are formed to optimize an objective partitioning criterion
Objects in a cluster are similar to one another
Objects in a cluster are dissimilar to objects in other clusters
Centroid-based technique (1)
Distribute the objects in D into k clusters C1, …, Ck with Ci ⊂ D and Ci ∩ Cj = ∅ for i ≠ j
Each cluster Ci is represented by its centroid ci (e.g., computed as the mean or medoid of the objects)
The quality of cluster Ci is measured as the within-cluster variation, that is, the sum of squared errors between objects and the centroid:
E = Σ_{i=1..k} Σ_{p∈Ci} dist(p, ci)²
Centroid-based technique (2)
dist(p, ci) is the Euclidean distance
Objective: make the resulting k clusters as compact and as separate as possible
Finding the globally optimal partitioning is an NP-hard problem (computationally challenging)
Greedy approaches are used in practice
k-means
1. arbitrarily choose k objects from D as the initial cluster centers
2. repeat
3. (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
4. update the cluster means, that is, calculate the mean value of the objects for each cluster
5. until no change
(see the code sketch after the PAM algorithm below)
k-means: pros and cons
Usually identifies a local optimum
Advantages
The algorithm is efficient: O(tkn), with n objects, k clusters, and t iterations
Normally k, t << n
Disadvantages
Can be applied only when the mean of a cluster is defined
Need to specify k, the number of clusters, in advance
Sensitive to noisy data and outliers (they can influence the mean)
Not suitable to discover clusters with non-convex shapes
k-medoid (1)
k-means is sensitive to outliers
Example: cluster 1, 2, 3, 8, 9, 10, 25 with k=2
Ideal solution: {1,2,3} and {8,9,10}, with 25 discarded as an outlier
Solution {1,2,3}, {8,9,10,25} has E = 196
Solution {1,2,3,8}, {9,10,25} has E = 189.67 and is the one selected by k-means!
k-medoids enhances k-means so as not to be sensitive to outliers
k-medoid (2)
Each cluster Ci is represented by one of its objects (the representative object, or medoid, oi)
The quality of cluster Ci is measured using the absolute-error criterion:
E = Σ_{i=1..k} Σ_{p∈Ci} dist(p, oi)
The k-medoid problem is NP-hard
Partitioning Around Medoids (PAM)
1. arbitrarily choose k objects in D as the initial representative objects
2. repeat
3. assign each remaining object to the cluster with the nearest representative object
4. randomly select a non-representative object, orandom
5. compute the total cost S of swapping representative object oj with orandom
6. if S < 0, then swap oj with orandom to form the new set of k representative objects
7. until no change
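To make the k-means loop above concrete, here is a minimal sketch in Python; the function and variable names are illustrative and this is a naive version of the greedy procedure and of the within-cluster variation E, not a reference implementation.

    import math
    import random

    def kmeans(points, k, max_iter=100, seed=0):
        """Naive k-means on a list of numeric tuples; returns (centers, assignment)."""
        rng = random.Random(seed)
        centers = list(rng.sample(points, k))      # 1. arbitrarily choose k initial centers
        assignment = [None] * len(points)
        for _ in range(max_iter):                  # 2. repeat
            # 3. (re)assign each object to the cluster with the closest mean
            new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                              for p in points]
            if new_assignment == assignment:       # 5. until no change
                break
            assignment = new_assignment
            # 4. update the cluster means
            for c in range(k):
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:
                    centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        return centers, assignment

    def sse(points, centers, assignment):
        """Within-cluster variation E = sum of squared distances to the assigned center."""
        return sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment))

The one-dimensional data 1, 2, 3, 8, 9, 10, 25 from the k-medoid discussion can be passed as 1-tuples, and sse can be used to check the comparison of the two candidate partitions (E = 196 vs E = 189.67) by plugging in their cluster means as centers.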
PAM: swap evaluation
Current representative objects: {o1, …, ok}
Replace oj with orandom and consider {o1, …, oj-1, orandom, oj+1, …, ok}
For each p in D, find the closest representative
Compute the error E with the new set of representative objects
If E is lower, swap oj with orandom
If E is higher, keep the original set of representatives
The complexity of each iteration is O(k(n-k)²)
k-means vs k-medoid
k-medoid is more robust than k-means in the presence of noise and/or outliers
k-medoid is more costly than k-means
k-medoid and k-means both require selecting k in advance
PAM does not scale well with the dataset size
Alternative, more scalable solutions select the medoids among a random sample of the data items (e.g., CLARA, CLARANS)
Hierarchical clustering
Group data objects into a tree of clusters
Hierarchical methods can be
Agglomerative: bottom-up approach
Divisive: top-down approach
Important: if a merge or split turns out to be a poor choice, the methods cannot correct it
Agglomerative and divisive clustering (1)
Agglomerative: use a bottom-up strategy
Each object initially forms a cluster
Clusters are iteratively merged into larger and larger clusters
Clusters are merged according to a similarity measure
Merging terminates when all objects are in a single cluster
Divisive: use a top-down approach
All the objects initially belong to the same (root) cluster
Clusters are recursively partitioned into smaller clusters
Partitioning continues until each cluster is coherent enough
Agglomerative and divisive clustering (2)
(figure: agglomerative clustering (AGNES) vs divisive clustering (DIANA))
Distance measure: minimum distance
Agglomerative and divisive methods need to measure the distance between clusters
dmin(Ci, Cj) = min { |p - p'| : p ∈ Ci, p' ∈ Cj }
Algorithms that use the minimum distance are called nearest-neighbor clustering algorithms
If the clustering process terminates when the minimum distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm
An agglomerative algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm
Distance measure: maximum distance
dmax(Ci, Cj) = max { |p - p'| : p ∈ Ci, p' ∈ Cj }
Algorithms that use the maximum distance are called farthest-neighbor clustering algorithms
If the clustering process terminates when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm
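As an illustration of the minimum and maximum distance measures, here is a hedged sketch (assuming points are numeric tuples; names are illustrative) of dmin, dmax, and a naive bottom-up merging loop with the threshold-based stopping rule described above: single-linkage with d_min, complete-linkage with d_max.

    import math

    def d_min(ci, cj):
        """Minimum (nearest-neighbor) distance between two clusters of points."""
        return min(math.dist(p, q) for p in ci for q in cj)

    def d_max(ci, cj):
        """Maximum (farthest-neighbor) distance between two clusters of points."""
        return max(math.dist(p, q) for p in ci for q in cj)

    def agglomerative(points, threshold, cluster_dist=d_min):
        """Bottom-up merging: start from singleton clusters and repeatedly merge the
        closest pair until the chosen inter-cluster distance exceeds the threshold
        (single-linkage with d_min, complete-linkage with d_max)."""
        clusters = [[p] for p in points]
        while len(clusters) > 1:
            pairs = [(cluster_dist(clusters[i], clusters[j]), i, j)
                     for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
            d, i, j = min(pairs)              # closest pair of clusters
            if d > threshold:                 # stopping rule from the slides
                break
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return clusters

For instance, agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], threshold=2.0) merges each nearby pair and then stops, returning two clusters.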
Distance measure: mean and average distance
Minimum and maximum distance tend to be overly sensitive to outliers or noisy data
Mean distance: dmean(Ci, Cj) = |mi - mj|, with mi the mean of cluster Ci
Average distance: davg(Ci, Cj) = (1 / (ni nj)) Σ_{p∈Ci} Σ_{p'∈Cj} |p - p'|, with ni the number of objects in cluster Ci
Multiphase clustering
It is difficult to select merge/split points
No backtracking (decisions are definitive)
Hierarchical clustering does not scale well
Solution: combine hierarchical clustering with other clustering techniques (multi-phase clustering)
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
Phase 1: BIRCH scans the database to build an initial in-memory CF-tree (a multilevel compression of the data that tries to preserve the data's inherent clustering structure)
Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF-tree
Scales linearly
finds a good clustering with a single scan
improves the quality with a few additional scans
Weaknesses
handles only numeric data
is sensitive to the order of the data records
Clustering feature
Clustering Feature (CF): CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points
SS: square sum of the N points
3-D vector summarizing information about a cluster of objects
summary of the statistics for the given cluster
avoids storing the detailed information about individual objects or points
CFs are additive (see the code sketch after the BIRCH algorithm below)
Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) have CF = (5, (16,30), (54,190))
CF-tree
A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
Non-leaf nodes store the sums of the CFs of their children (they summarize the clustering information about their children)
A CF-tree has two parameters
Branching factor: maximum number of children
Threshold: maximum diameter of the sub-clusters stored at the leaf nodes
CF-tree: example
(figure: an example CF-tree with a root, non-leaf nodes, and leaf nodes)
BIRCH algorithm
Diameter of a sub-cluster: D = sqrt( Σ_{i=1..n} Σ_{j=1..n} (xi - xj)² / (n(n-1)) )
For each point in the input
Find the closest leaf entry
Add the point to the leaf entry and update its CF
If the entry diameter > max diameter, then split the leaf, and possibly its parents
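A small sketch of the clustering-feature bookkeeping just described (illustrative names, not BIRCH's actual data structures): it reproduces the CF of the five example points, shows that CFs are additive, and derives the centroid and the diameter from (N, LS, SS) alone, using the identity Σ_i Σ_j |xi - xj|² = 2·N·ΣSS - 2·|LS|².

    import math

    def cf(points):
        """Clustering Feature of a set of d-dimensional points: (N, LS, SS)."""
        n = len(points)
        ls = tuple(sum(dim) for dim in zip(*points))                  # linear sum per dimension
        ss = tuple(sum(x * x for x in dim) for dim in zip(*points))   # square sum per dimension
        return n, ls, ss

    def cf_add(cf1, cf2):
        """CF additivity: merging two sub-clusters just adds their CF vectors."""
        (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
        return (n1 + n2,
                tuple(a + b for a, b in zip(ls1, ls2)),
                tuple(a + b for a, b in zip(ss1, ss2)))

    def cf_centroid(c):
        n, ls, _ = c
        return tuple(x / n for x in ls)

    def cf_diameter(c):
        """Diameter sqrt(sum_i sum_j |xi - xj|^2 / (n(n-1))), computed from the CF alone."""
        n, ls, ss = c
        if n < 2:
            return 0.0
        pairwise_sq = 2 * n * sum(ss) - 2 * sum(x * x for x in ls)
        return math.sqrt(pairwise_sq / (n * (n - 1)))

    # The five points from the slides reproduce the CF shown there:
    pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
    assert cf(pts) == (5, (16, 30), (54, 190))
    print(cf_centroid(cf(pts)))   # (3.2, 6.0)
    print(cf_diameter(cf(pts)))   # ~2.53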
Algorithm has complexity O(n)
Concerns
Sensitive to insertion order of data points
Since we fix the size of leaf nodes, clusters may be not so natural
Clusters tend to be spherical given the radius and diameter measures on which the algorithm is based
Density-based clustering
Clusters are modeled as dense regions in the data space, separated by sparse regions
Features
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Basic concepts (1)
ε > 0: user-defined parameter
ε-neighborhood of object p: the space with radius ε centered at p
Density of a neighborhood: number of objects in the neighborhood
MinPts: density threshold of dense region
Core object: an object whose ε-neighborhood contains at least MinPts objects
An object q is directly density-reachable from a core object p if q is within the ε-neighborhood of p
A core object can bring all the objects in its ε-neighborhood into a dense region
Basic concepts (2)
p is density-reachable from q if there is a chain of objects p1, …, pn such that p1 = q, pn = p, and pi+1 is directly density-reachable from pi for 1 <= i < n
Not an equivalence relationship! It is not symmetric
Example: q is directly density-reachable from a core object m; q is (indirectly) density-reachable from p; p is not density-reachable from q, because q is not a core object
Basic concepts (3)
p1, p2 are density-connected if there is an object q such that both p1 and p2 are density-reachable from q
It is an equivalence relationship!
Example: p and q are density-connected, thanks to the presence of an object from which both are density-reachable
DBSCAN (1)
A cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
DBSCAN (2)
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. ε and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
Continue the process until all the points have been processed
Complexity: O(n log n)
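A minimal, unoptimized DBSCAN sketch following the procedure above (naive quadratic neighborhood search rather than the indexed version behind the O(n log n) bound; names and the toy parameters are illustrative):

    import math

    def region_query(points, i, eps):
        """Indices of all points within distance eps of points[i] (including i itself)."""
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    def dbscan(points, eps, min_pts):
        """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
        UNVISITED, NOISE = None, -1
        labels = [UNVISITED] * len(points)
        cluster_id = 0
        for i in range(len(points)):
            if labels[i] is not UNVISITED:
                continue
            neighbors = region_query(points, i, eps)
            if len(neighbors) < min_pts:            # p is not a core point
                labels[i] = NOISE                   # may later be absorbed as a border point
                continue
            labels[i] = cluster_id                  # p is a core point: a cluster is formed
            seeds = list(neighbors)
            k = 0
            while k < len(seeds):                   # collect everything density-reachable from p
                j = seeds[k]
                k += 1
                if labels[j] == NOISE:              # border point: add it to the cluster
                    labels[j] = cluster_id
                if labels[j] is not UNVISITED:
                    continue
                labels[j] = cluster_id
                j_neighbors = region_query(points, j, eps)
                if len(j_neighbors) >= min_pts:     # j is also a core point: keep expanding
                    seeds.extend(j_neighbors)
            cluster_id += 1
        return labels

    # Illustrative run (eps and min_pts chosen for this toy data, not taken from the slides):
    print(dbscan([(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (25, 25)], eps=1.5, min_pts=2))
    # -> [0, 0, 0, 1, 1, -1]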
Evaluation of clustering
Assessing clustering tendency
Clustering analysis on a data set is meaningful only when there is a non-random structure in the data
Determining the number of clusters in a data set
the number of clusters is an interesting and important summary statistic of a data set
it should be estimated in advance
Measuring clustering quality
How well do the clusters fit the data set? How good are the resulting clusters?