
k-Means Clustering

As we learned in Chaps. 7, 8, and 9, classification can help us make predictions on new observations. However, classification requires (human-supervised) predefined label classes. What if we are in the early phases of a study and/or don't have the resources required to manually define, derive, or generate these class labels? Clustering can help us explore the dataset and separate cases into groups representing similar traits or characteristics. Each group could be a potential candidate for a class. Clustering is used for exploratory data analytics, i.e., as unsupervised learning, rather than for confirmatory analytics or for predicting specific outcomes. In this chapter, we will present (1) clustering as a machine learning task, (2) silhouette plots for evaluating cluster quality, (3) the k-means clustering algorithm and how to tune it, (4) several interesting case studies, including Divorce and Consequences on Young Adults, Pediatric Trauma, and Youth Development, (5) hierarchical clustering, and (6) Gaussian mixture modeling.

13.1 Clustering as a Machine Learning Task

As we mentioned before, clustering represents classification of unlabeled cases. Scatter plots provide a simple illustration of the clustering process. Assume we don't know much about the ingredients of frankfurter hot dogs and we have the following graph (Fig. 13.1).


Fig. 13.1 Hotdogs dataset – scatterplot of calories and sodium content blocked by type of meat

# See Chapter 12 code for complete details
# install.packages("rvest")
library(rvest)
wiki_url   # the remainder of this data-import snippet is truncated in the extraction
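Since the data-import code above is truncated, here is a minimal, self-contained R sketch that recreates a Fig. 13.1-style scatterplot. The simulated hotdogs data frame and its column names (calories, sodium, type) are hypothetical stand-ins for the actual SOCR hotdogs data, not the real measurements.

# Simulated stand-in for the hotdogs data (values are illustrative only)
set.seed(13)
hotdogs <- data.frame(
  calories = c(rnorm(20, mean = 160, sd = 25), rnorm(17, mean = 160, sd = 25),
               rnorm(17, mean = 120, sd = 30)),
  sodium   = c(rnorm(20, mean = 400, sd = 70), rnorm(17, mean = 460, sd = 80),
               rnorm(17, mean = 430, sd = 90)),
  type     = factor(rep(c("Beef", "Meat", "Poultry"), times = c(20, 17, 17)))
)
# Scatterplot of calories vs. sodium, with color/symbol blocked by meat type
plot(hotdogs$calories, hotdogs$sodium,
     col = as.integer(hotdogs$type), pch = as.integer(hotdogs$type),
     xlab = "Calories", ylab = "Sodium")
legend("topleft", legend = levels(hotdogs$type), col = 1:3, pch = 1:3)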

13.2 Silhouette Plots

For each observation $X_i$, let $l_i$ denote the average dissimilarity of $X_i$ to the other points in its own cluster $C$, and let $d_i$ denote the lowest average dissimilarity of $X_i$ to the points of any other cluster (its neighboring cluster). The silhouette value of $X_i$ is defined by:

$$s_i = \frac{d_i - l_i}{\max\{l_i, d_i\}} = \begin{cases} \dfrac{d_i}{l_i} - 1, & \text{if } d_i < l_i,\\[4pt] 0, & \text{if } d_i = l_i,\\[4pt] 1 - \dfrac{l_i}{d_i}, & \text{if } d_i > l_i. \end{cases}$$

Note that:

• $-1 \le s_i \le 1$,
• $s_i \to 1$ when $l_i \ll d_i$, i.e., the dissimilarity of $X_i$ to its own cluster $C$ is much lower than its dissimilarity to the other clusters, indicating a good (cluster assignment) match. Thus, high silhouette values imply the data is appropriately clustered.
• Conversely, $s_i \to -1$ when $l_i \gg d_i$, i.e., $l_i$ is large relative to $d_i$, implying a poor match of $X_i$ with its current cluster $C$ relative to the neighboring clusters; $X_i$ may be more appropriately assigned to its neighboring cluster.
• $s_i \approx 0$ means that $X_i$ may lie on the border between two natural clusters.
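As a concrete illustration of these properties, the silhouette values $s_i$ can be computed with the silhouette() function from the cluster package. This is a minimal sketch, assuming the simulated hotdogs data frame from the earlier sketch; the choice of k = 3 here is arbitrary.

# install.packages("cluster")
library(cluster)

feats <- scale(hotdogs[, c("calories", "sodium")])   # standardize the two features
set.seed(13)
km <- kmeans(feats, centers = 3)                      # k-means with k = 3

# silhouette() takes the cluster labels and the pairwise dissimilarities;
# each row of 'sil' stores (cluster, neighbor, sil_width), where sil_width = s_i
sil <- silhouette(km$cluster, dist(feats))
summary(sil)   # average s_i within each cluster and overall
plot(sil)      # silhouette plot: one horizontal bar of width s_i per observation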

13.3 The k-Means Clustering Algorithm

The k-means algorithm is one of the most commonly used clustering methods.

13.3.1 Using Distance to Assign and Update Clusters

This algorithm is similar to k-nearest neighbors (KNN), presented in Chap. 7. In clustering, we don't have a priori predetermined labels, and the algorithm tries to deduce intrinsic groupings in the data. Similar to KNN, k-means most often uses the Euclidean distance ($\ell_2$ norm), although the Manhattan distance ($\ell_1$ norm) or the more general Minkowski distance

$$d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^c \right)^{\frac{1}{c}}$$

may also be used. For $c = 2$, the Minkowski distance reduces to the classical Euclidean distance, and for $c = 1$ it yields the Manhattan distance.
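To make the assign and update steps concrete, below is a minimal from-scratch sketch of the k-means iteration using the squared Euclidean distance. The function simple_kmeans() and its arguments are hypothetical illustrations (in practice, use the built-in kmeans() function), and the hotdogs data frame is the simulated stand-in from the earlier sketch.

# Minkowski distance of order c between the rows of a matrix X:
#   dist(X, method = "minkowski", p = c)   # p = 2 gives Euclidean, p = 1 Manhattan

# Hypothetical from-scratch sketch of the two alternating k-means steps
simple_kmeans <- function(X, k, iters = 25) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # initialize at k random points
  cl <- rep(1L, nrow(X))
  for (it in seq_len(iters)) {
    # Assign step: each point joins the cluster with the nearest centroid
    d2 <- sapply(seq_len(k), function(j)
      rowSums((X - matrix(centers[j, ], nrow(X), ncol(X), byrow = TRUE))^2))
    cl <- max.col(-d2)  # column index of the smallest squared distance per row
    # Update step: move each centroid to the mean of its assigned points
    for (j in seq_len(k))
      if (any(cl == j)) centers[j, ] <- colMeans(X[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centers)
}

set.seed(13)
fit <- simple_kmeans(scale(hotdogs[, c("calories", "sodium")]), k = 3)
table(fit$cluster)   # cluster sizes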