Tuesday, July 28, 2015

Beyond the k-Means – the Right k

http://www.edupristine.com/blog/beyond-k-means

k-Means clustering (aka segmentation) is one of the most common Machine Learning methods out there, dwarfed perhaps only by Linear Regression in its popularity. While the basic k-Means algorithm is very simple to understand and implement, it carries many nuances, and missing them can be dangerous. A good analyst doesn't just know his/her techniques perfunctorily but knows their ins and outs, understands the implications, and can interpret the results in a variety of scenarios. In a three-part series starting with this post, we will explore the finer aspects of k-Means clustering.
This post assumes prior knowledge of the k-Means algorithm. If you aren't familiar with it, go through the Wikipedia article or any standard textbook first, then come back here for the deep dive.

The Right k

One primary and significant drawback of k-Means is that it requires prior knowledge of, or an assumption about, the number of clusters. How can we know beforehand what the right number of clusters is for the given data and business problem? This post presents a variety of useful approaches. For this discussion it will be helpful to work through an example.
Fig. 1 presents a scatter plot of dummy data with 1,000 observations. We can see, visually, that there are 3 clusters here. In real life, of course, observations have more than two dimensions and cannot be visualized as easily. Further, the cluster separation isn't likely to be so obvious.
Figure 1: Example Cluster Data (with True Clustering)
The implicit objective function in k-Means measures the sum of squared distances of observations from their cluster centroids, called the Within-Cluster Sum of Squares (WCSS). It is computed as

WCSS = \sum_i \| X_i - Y_i \|^2

where Y_i is the centroid of the cluster to which observation X_i is assigned. By definition, this objective favors a larger number of clusters, and in the limiting case each data point becomes its own cluster centroid. This is, naturally, neither practical nor desirable. Fig. 2 plots WCSS for k = 1 to 20, and we can see that it drops continuously, indicating that more clusters are always better!
Figure 2: WCSS decreasing with k
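For those who want to reproduce a plot like Fig. 2, a minimal R sketch follows. It assumes the observations sit in a numeric data frame called df (an illustrative name) and uses the same kmeans() call shown in the R tip at the end of this post:

# Compute WCSS for k = 1 to 20 and plot it, as in Fig. 2
set.seed(42)                       # k-Means starts from random centroids, so fix the seed
wcss <- sapply(1:20, function(k)
  kmeans(df, centers = k, nstart = 10)$tot.withinss)   # nstart > 1 guards against poor local optima
plot(1:20, wcss, type = "b",
     xlab = "Number of clusters k", ylab = "WCSS")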
Given that k-Means has no built-in preference for the right number of clusters, the following are some of the common ways k can be selected:
  1. Domain Knowledge – This is the most common method in practice, because segmentation often doesn't exist in a vacuum and is aimed at solving a business problem. So if a business requirement or the client prefers a certain number or range of clusters, that can be used to select k. For instance, the business may prefer three customer segments: High/Medium/Low.
  2. Rule of Thumb – A very rough rule of thumb is k ≈ √(n/2), where n is the number of data points, but in reality this is rarely useful. For our example data of 1,000 observations this rule gives about 22 clusters, which is very far from the true number of 3.
  3. Elbow Method using WCSS – This is one of the most common and technically robust methods. It is based on the principle that while clustering performance as measured by WCSS improves (i.e. WCSS decreases) as k increases, the rate of improvement usually diminishes: the gain from increasing the number of clusters from, say, 3 to 4 is larger than the gain from increasing it from 4 to 5. Plotting WCSS against increasing k can show an 'elbow' that marks a significant drop in the rate of improvement. Selecting the number of clusters corresponding to the elbow point achieves reasonable performance without using too many clusters. The method is still judgmental, since what constitutes the elbow is determined visually; further, in practice there may be no elbow but a smooth curve, or there may be more than one elbow. Looking at Fig. 2 (which the R sketch above reproduces), we note a sharp drop in the rate of decrease of WCSS at 3 clusters; remember that a decrease in WCSS means an increase in clustering performance. In this case the method is able to point us to the true number of clusters.
  4. Cluster Quality using Silhouette Coefficient – The silhouette coefficient is another quality measure of clustering, and it applies to any clustering, not just k-Means. The silhouette coefficient of observation i is calculated as s_i = (b_i - a_i) / max(a_i, b_i), where a_i is the average distance from observation i to all other observations in its own cluster, and b_i is the minimum, taken over all other clusters, of the average distance from observation i to the observations of that cluster (a small R sketch after this list makes a_i and b_i concrete). The silhouette coefficient of a clustering result is the average of s_i over all observations i. The metric lies between +1, representing the best clustering, and -1, representing the worst. While WCSS is comparable across different k on the same data, it is not comparable across clustering solutions on different data and hence has no absolute threshold. The silhouette coefficient, on the other hand, has a fixed range and can therefore be used as an overall metric for comparing the quality of clusterings irrespective of the data or the number of clusters.
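To make a_i and b_i concrete, here is a from-scratch sketch of the per-observation silhouette in R. The names X (data matrix) and cl (vector of cluster labels) are illustrative; in practice the cluster package's silhouette() function, shown in the R tip below, computes this for you:

# Silhouette of observation i, computed directly from the definition (illustrative, not efficient)
silhouette_obs <- function(i, X, cl) {
  d    <- as.matrix(dist(X))[i, ]                  # distances from observation i to every observation
  same <- setdiff(which(cl == cl[i]), i)           # other members of i's own cluster
  a    <- mean(d[same])                            # average within-cluster distance
  b    <- min(sapply(setdiff(unique(cl), cl[i]),   # smallest average distance to any other cluster
                     function(k) mean(d[cl == k])))
  (b - a) / max(a, b)
}
# Overall silhouette coefficient of a clustering = average over all observations, e.g.
# mean(sapply(seq_len(nrow(X)), silhouette_obs, X = X, cl = result$cluster))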
Figure 3: Silhouette Coeff changing with k
Plotting SC against k, we see the highest coefficient of 0.63 with 3 clusters and the second highest of 0.60 with 2 clusters. For larger numbers of clusters the SC drops sharply and stays low. This is because further fragmenting a given cluster brings a_i and b_i closer to each other.
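A minimal sketch of this sweep, again assuming the observations are in a data frame df (illustrative name) and using the cluster package functions from the R tip below:

library(cluster)
# Average silhouette width for k = 2 to 10 (the silhouette is undefined for k = 1)
d  <- daisy(df)                                       # pairwise dissimilarities between observations
sc <- sapply(2:10, function(k) {
  km <- kmeans(df, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])      # overall silhouette coefficient for this k
})
plot(2:10, sc, type = "b",
     xlab = "Number of clusters k", ylab = "Silhouette coefficient")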
In the next post, we will cover the data pre-processing required for a correct implementation of k-Means.

R tip

For those interested and familiar with R, the above results can be replicated using the following functions:
result <- kmeans(dataframe, centers=k) – this runs k-Means clustering on a data frame of observations
result$tot.withinss – this returns the WCSS
sc <- silhouette(result$cluster, daisy(dataframe)) – this computes the silhouette for the k-Means result; silhouette() and daisy() are part of the cluster package
sum(summary(sc)$clus.avg.widths*summary(sc)$clus.sizes)/sum(summary(sc)$clus.sizes) – this computes the overall SC as a size-weighted average of the per-cluster average widths; summary(sc)$avg.width returns the same value
