
k-Means Clustering

As we learned in Chaps. 7, 8, and 9, classification can help us make predictions on new observations. However, classification requires (human-supervised) predefined label classes. What if we are in the early phases of a study and/or don't have the resources required to manually define, derive, or generate these class labels? Clustering can help us explore the dataset and separate cases into groups representing similar traits or characteristics. Each group could be a potential candidate for a class. Clustering is used for exploratory data analytics, i.e., as unsupervised learning, rather than for confirmatory analytics or for predicting specific outcomes. In this chapter, we will present (1) clustering as a machine learning task, (2) silhouette plots for evaluating cluster quality, (3) the k-means clustering algorithm and how to tune it, (4) several interesting case studies, including Divorce and Consequences on Young Adults, Pediatric Trauma, and Youth Development, (5) hierarchical clustering, and (6) Gaussian mixture modeling.

13.1 Clustering as a Machine Learning Task

As we mentioned before, clustering represents classification of unlabeled cases. Scatter plots provide a simple illustration of the clustering process. Assume we don't know much about the ingredients of frankfurter hot dogs and we have the following graph (Fig. 13.1).


Fig. 13.1 Hotdogs dataset – scatterplot of calories and sodium content blocked by type of meat

# See Chapter 12 code for complete details
# install.packages("rvest")
library(rvest)
wiki_url   # the remainder of this data-import snippet is truncated in the extraction
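Since the data-import code above is truncated, here is a minimal, self-contained R sketch that recreates a Fig. 13.1-style scatterplot. The simulated hotdogs data frame and its column names (calories, sodium, type) are hypothetical stand-ins for the actual SOCR hotdogs data, not the real measurements.

# Simulated stand-in for the hotdogs data (values are illustrative only)
set.seed(13)
hotdogs <- data.frame(
  calories = c(rnorm(20, mean = 160, sd = 25), rnorm(17, mean = 160, sd = 25),
               rnorm(17, mean = 120, sd = 30)),
  sodium   = c(rnorm(20, mean = 400, sd = 70), rnorm(17, mean = 460, sd = 80),
               rnorm(17, mean = 430, sd = 90)),
  type     = factor(rep(c("Beef", "Meat", "Poultry"), times = c(20, 17, 17)))
)
# Scatterplot of calories vs. sodium, with color/symbol blocked by meat type
plot(hotdogs$calories, hotdogs$sodium,
     col = as.integer(hotdogs$type), pch = as.integer(hotdogs$type),
     xlab = "Calories", ylab = "Sodium")
legend("topleft", legend = levels(hotdogs$type), col = 1:3, pch = 1:3)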

13.2 Silhouette Plots

For each observation $X_i$, let $l_i$ denote the average dissimilarity of $X_i$ to the other points in its own cluster $C$, and let $d_i$ denote the lowest average dissimilarity of $X_i$ to the points of any other cluster (its neighboring cluster). The silhouette value of $X_i$ is defined by:

$$s_i = \frac{d_i - l_i}{\max\{l_i, d_i\}} = \begin{cases} \dfrac{d_i}{l_i} - 1, & \text{if } d_i < l_i,\\[4pt] 0, & \text{if } d_i = l_i,\\[4pt] 1 - \dfrac{l_i}{d_i}, & \text{if } d_i > l_i. \end{cases}$$

Note that:

• $-1 \le s_i \le 1$,
• $s_i \to 1$ when $l_i \ll d_i$, i.e., the dissimilarity of $X_i$ to its own cluster $C$ is much lower than its dissimilarity to the other clusters, indicating a good (cluster assignment) match. Thus, high silhouette values imply the data is appropriately clustered.
• Conversely, $s_i \to -1$ when $l_i \gg d_i$, i.e., $l_i$ is large relative to $d_i$, implying a poor match of $X_i$ with its current cluster $C$ relative to the neighboring clusters; $X_i$ may be more appropriately assigned to its neighboring cluster.
• $s_i \approx 0$ means that $X_i$ may lie on the border between two natural clusters.
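As a concrete illustration of these properties, the silhouette values $s_i$ can be computed with the silhouette() function from the cluster package. This is a minimal sketch, assuming the simulated hotdogs data frame from the earlier sketch; the choice of k = 3 here is arbitrary.

# install.packages("cluster")
library(cluster)

feats <- scale(hotdogs[, c("calories", "sodium")])   # standardize the two features
set.seed(13)
km <- kmeans(feats, centers = 3)                      # k-means with k = 3

# silhouette() takes the cluster labels and the pairwise dissimilarities;
# each row of 'sil' stores (cluster, neighbor, sil_width), where sil_width = s_i
sil <- silhouette(km$cluster, dist(feats))
summary(sil)   # average s_i within each cluster and overall
plot(sil)      # silhouette plot: one horizontal bar of width s_i per observation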

13.3 The k-Means Clustering Algorithm

The k-means algorithm is one of the most commonly used clustering methods.

13.3.1 Using Distance to Assign and Update Clusters

This algorithm is similar to k-nearest neighbors (KNN), presented in Chap. 7. In clustering, we don't have a priori predetermined labels, and the algorithm tries to deduce intrinsic groupings in the data. Similar to KNN, k-means most often uses the Euclidean distance ($\ell_2$ norm), although the Manhattan distance ($\ell_1$ norm) or the more general Minkowski distance

$$d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^c \right)^{\frac{1}{c}}$$

may also be used. For $c = 2$, the Minkowski distance reduces to the classical Euclidean distance, and for $c = 1$ it yields the Manhattan distance.
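To make the assign and update steps concrete, below is a minimal from-scratch sketch of the k-means iteration using the squared Euclidean distance. The function simple_kmeans() and its arguments are hypothetical illustrations (in practice, use the built-in kmeans() function), and the hotdogs data frame is the simulated stand-in from the earlier sketch.

# Minkowski distance of order c between the rows of a matrix X:
#   dist(X, method = "minkowski", p = c)   # p = 2 gives Euclidean, p = 1 Manhattan

# Hypothetical from-scratch sketch of the two alternating k-means steps
simple_kmeans <- function(X, k, iters = 25) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # initialize at k random points
  cl <- rep(1L, nrow(X))
  for (it in seq_len(iters)) {
    # Assign step: each point joins the cluster with the nearest centroid
    d2 <- sapply(seq_len(k), function(j)
      rowSums((X - matrix(centers[j, ], nrow(X), ncol(X), byrow = TRUE))^2))
    cl <- max.col(-d2)  # column index of the smallest squared distance per row
    # Update step: move each centroid to the mean of its assigned points
    for (j in seq_len(k))
      if (any(cl == j)) centers[j, ] <- colMeans(X[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centers)
}

set.seed(13)
fit <- simple_kmeans(scale(hotdogs[, c("calories", "sodium")]), k = 3)
table(fit$cluster)   # cluster sizes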