IE 332 in-class session
Nov 3rd: A3 Unsupervised Learning
Unsupervised learning: Cluster Analysis
Use Fall 2020 as an example
Steps of a Cluster Analysis:
Step 1: Select variables for clustering.
Step 2: Scale the data (sometimes optional, because many clustering packages now do the
scaling for you).
Step 3: Define a similarity (distance) measure. (How is the distance calculated?)
Step 4: Choose the clustering method. (k-means or hierarchical?)
Step 5: Decide on the number of clusters.
Step 6: Evaluate the clustering result.
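The steps above can be sketched in base R as follows. This is only an illustration on made-up data: the data frame `cust`, its columns, and the choice of 2 clusters are assumptions, not the Fall 2020 data set. Note that base R's `kmeans` does not scale the data for you, so Step 2 is done explicitly with `scale`.

```r
# Synthetic stand-in data (assumption): x-y coordinates plus a column on a larger scale.
set.seed(332)
cust <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  demand = runif(40, min = 100, max = 1000)
)

# Step 1: select the variables to cluster on.
vars <- cust[, c("x", "y")]

# Step 2: scale each column to mean 0 and standard deviation 1.
vars_scaled <- scale(vars)

# Step 3: Euclidean distance between every pair of observations.
d <- dist(vars_scaled, method = "euclidean")

# Step 4: two candidate methods.
km <- kmeans(vars_scaled, centers = 2, nstart = 25)  # k-means
hc <- hclust(d, method = "complete")                 # hierarchical

# Step 5: the number of clusters is fixed at 2 here purely for illustration.
hc_labels <- cutree(hc, k = 2)

# Step 6: a first look at the result: do the two methods agree?
print(table(kmeans = km$cluster, hierarchical = hc_labels))
```

The cross-table only shows whether the two methods group the points the same way; it does not say which number of clusters is best.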
Find the optimal number of clusters: WSS
WSS: within-cluster sum of squares
https://uc-r.github.io/kmeans_clustering
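Written out, the within-cluster sum of squares takes the following form. The symbols are introduced here for illustration and do not appear in the slides: $C_j$ is the set of observations assigned to cluster $j$, $\mu_j$ is the centroid of that cluster, and $k$ is the number of clusters.

```latex
\mathrm{WSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each observation contributes its squared Euclidean distance to its own cluster centroid, so tighter clusters give a smaller WSS.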
kmeans function in R
Observe the parameters of the kmeans function: x, centers, nstart
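A minimal sketch of the three parameters on made-up data. The matrix `cust` and the value `k <- 3` are assumptions chosen for illustration. It also prints some of the fields returned by `kmeans` and recomputes the within-cluster sum of squares by hand as a check.

```r
set.seed(332)
# Hypothetical stand-in for the customer data: 60 points around three centers.
cust <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 0.5), ncol = 2)
)
colnames(cust) <- c("x", "y")
k <- 3

# x: the data, centers: number of clusters, nstart: number of random starts.
km <- kmeans(x = cust, centers = k, nstart = 25)

print(km$centers)       # one row of coordinates per cluster centroid
print(table(km$cluster)) # how many observations fall in each cluster
print(km$totss)         # total sum of squares of the data
print(km$tot.withinss)  # within-cluster sum of squares, summed over clusters

# Recompute the within-cluster sum of squares by hand:
# squared distance of each row to the centroid of its assigned cluster.
wss_manual <- sum((cust - km$centers[km$cluster, ])^2)
print(all.equal(wss_manual, km$tot.withinss))
```

With `nstart = 25`, `kmeans` runs from 25 random initial configurations and keeps the one with the lowest `tot.withinss`.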
km <- kmeans(x = cust, centers = k, nstart = 25)
nstart: attempts multiple initial configurations and reports the best one.
Observe the results from the kmeans function: cluster, centers, totss, tot.withinss
Calculate WSS of kmeans in R
wss <- km$tot.withinss
fviz_nbclust function from the factoextra package
Why not choose k = 10?
Find the optimal number of clusters: Silhouette (si.lu.wet)
Silhouette (S)
Mean intra-cluster distance (I): mean distance between the observation and all other data points in the same cluster.
Mean nearest-cluster distance (N): mean distance between the observation and all data points of the next nearest cluster.
S = (N - I) / max(N, I) (calculate this for each data sample)
https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam
The higher, the better.
Calculate Silhouette of kmeans in R
library(cluster): silhouette(km$cluster, dist(df))
library(factoextra)
Find the optimal number of clusters: Problem based
WSS says 2, Silhouette says 5; which one to pick?
What is the objective of our problem? Minimize the distance between each customer and its assigned facility plus the cost of opening this facility.
If we ignore the fixed_cost, what is the function findObj doing? Does it look familiar? (Hint: we talked about this a few slides ago.)
Answer the question in the WSS slide (Why not choose k = 10?)
If k is too large, the clustering suffers from overfitting.
It is also subject to practical considerations. (The cost to open each facility is high.)
Select variables for clustering
Examine your best result from part (a) and identify any significant issues with the result with respect to how well it actually solves the problem.
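To see how the criteria discussed above can disagree, here is a minimal sketch that tabulates WSS, average silhouette, and a problem-based cost for a range of k. Everything specific in it is an assumption: the random `cust` coordinates, the value of `fixed_cost`, and the `total_cost` column, which is a hypothetical stand-in (WSS plus a fixed cost per cluster) and not the course's `findObj` function.

```r
library(cluster)  # provides silhouette()

set.seed(332)
# Hypothetical customer coordinates (not the course data).
cust <- cbind(x = runif(100, 0, 10), y = runif(100, 0, 10))
fixed_cost <- 50  # assumed cost of opening one facility
ks <- 2:10        # silhouette needs at least 2 clusters

d <- dist(cust)
criteria <- t(sapply(ks, function(k) {
  km <- kmeans(cust, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  c(k = k,
    wss = km$tot.withinss,                          # keeps shrinking as k grows
    avg_sil = mean(sil[, "sil_width"]),             # higher is better
    total_cost = km$tot.withinss + fixed_cost * k)  # hypothetical objective
}))
criteria <- as.data.frame(criteria)
print(criteria)

# Each criterion may point to a different k.
best_by_sil  <- criteria$k[which.max(criteria$avg_sil)]
best_by_cost <- criteria$k[which.min(criteria$total_cost)]
print(c(silhouette = best_by_sil, cost = best_by_cost))
```

Because WSS alone keeps decreasing as k grows, it would always favor the largest k; the fixed cost term is what penalizes opening too many facilities.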
Instead of finding clusters based on 3 dimensions, we have to find them based on the x-y coordinates only.
How to deal with the priority column?
Moreover, you want to locate the facility closer to those customers who have a higher priority.
Adding dummy customers
Priorities could be included by adding dummy customers to the existing data set. For example, if $p_j$ is 3, then we will create 2 more dummy customers at this customer's location.
What if the priority is not an integer?
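The dummy-customer idea can be sketched with `rep`. The data frame `cust` and its values are made up for illustration, and the sketch only covers integer priorities; the non-integer case raised above is left open.

```r
# Hypothetical customers with integer priorities p_j.
cust <- data.frame(
  x = c(1, 2, 9),
  y = c(1, 2, 9),
  priority = c(1, 1, 3)
)

# Repeat row j exactly p_j times: the original plus (p_j - 1) dummy copies.
# Only the x-y coordinates are kept for clustering.
expanded <- cust[rep(seq_len(nrow(cust)), times = cust$priority), c("x", "y")]
rownames(expanded) <- NULL

print(nrow(expanded))  # 1 + 1 + 3 rows

# Effect on a single centroid: the mean moves toward the high-priority customer.
print(colMeans(cust[, c("x", "y")]))  # without dummy customers
print(colMeans(expanded))             # with dummy customers
```

Running `kmeans` on `expanded` instead of `cust` then pulls the centroids toward the customers that were duplicated.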