k-Means clustering (a.k.a. segmentation) is one of the most common Machine Learning methods out there, dwarfed in popularity perhaps only by Linear Regression. While the basic k-Means algorithm is very simple to understand and implement, it hides many nuances which, if missed, can be dangerous. A good analyst doesn't just know his/her techniques perfunctorily but knows their ins and outs well enough to understand the implications and interpret the results in a variety of scenarios. In a three-part series starting with this post, we will explore the finer aspects of k-Means clustering.
This post assumes prior knowledge of the k-Means algorithm. If you aren't familiar with it, go through the Wikipedia article or any standard textbook first, then come back here for the deep-dive.
The Right k
One primary and significant drawback of k-Means is that it requires the number of clusters, k, as an input. How can we know beforehand the right number of clusters for a given data set and business problem? This post will present a variety of useful approaches. In this discussion it will be useful to consider a working example. Fig. 1 presents a scatter plot of 1,000 observations of dummy data. We can see, visually, that there are 3 clusters here. In real life, of course, observations have more than two dimensions and cannot be visualized so easily. Further, cluster separation isn't likely to be so obvious.
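The original data is not provided, but a minimal R sketch of comparable dummy data, assuming three well-separated bivariate Gaussian clusters (the seed, cluster sizes, and centers below are arbitrary choices), could look like this:

```r
# Hypothetical reconstruction of the example data: 1,000 observations
# drawn from three well-separated bivariate Gaussian clusters.
set.seed(42)
sizes   <- c(400, 350, 250)                         # assumed cluster sizes
centers <- matrix(c(0, 0, 5, 5, 10, 0), ncol = 2, byrow = TRUE)
dataframe <- do.call(rbind, lapply(1:3, function(i) {
  data.frame(x = rnorm(sizes[i], mean = centers[i, 1]),
             y = rnorm(sizes[i], mean = centers[i, 2]))
}))
plot(dataframe$x, dataframe$y)                      # roughly reproduces Fig. 1
```

With this example in hand, consider the following approaches for choosing k: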
- Domain Knowledge – This is the most common method in practice, because segmentation often doesn't exist in a vacuum but is aimed at solving a business problem. So if the business requirement or the client prefers a certain number or range of clusters, that can drive the choice of k. For instance, the business may prefer three customer segments: H/M/L.
- Rule of Thumb – A very rough rule of thumb is k ≈ √(n/2), where n is the number of data points, but in reality this is never really useful. This rule gives k ≈ √(1000/2) ≈ 22 clusters for our example data, which is very far from the true number of 3.
- Elbow Method using WCSS – This is one of the most common and technically robust methods. It is based on the principle that while clustering performance, as measured by the within-cluster sum of squares (WCSS), improves (i.e. WCSS decreases) as k increases, the rate of improvement usually falls. So the performance gain from increasing the number of clusters from, say, 3 to 4 is greater than that from 4 to 5. Plotting WCSS against increasing k can show an 'elbow', which marks a significant drop in the rate of improvement. Selecting the number of clusters corresponding to the elbow point achieves reasonable performance without having too many clusters. This is still judgmental, since what constitutes an elbow is determined visually. Further, in practice there may be no elbow but a smooth curve, or there may be more than one elbow. Looking at Fig. 2, we note a sharp drop in the rate of decrease of WCSS after 3 clusters; remember that a decrease in WCSS means an improvement in clustering performance. In this case, the method helps us arrive at the true number of clusters. (A short R sketch of this procedure follows the list.)
- Cluster Quality using Silhouette Coefficient – The silhouette coefficient is another quality measure of clustering, and it applies to any clustering, not just k-Means. The silhouette coefficient of observation i is calculated as s(i) = (b(i) − a(i)) / max{a(i), b(i)}, where a(i) is the average distance of i to the other observations in its own cluster and b(i) is the lowest average distance of i to the observations of any other cluster. s(i) ranges from −1 to +1, with higher values indicating a better assignment. Averaging s(i) over all observations gives an overall SC for the clustering, and k can be chosen to maximize this overall SC. (A sketch of this selection, too, follows the list.)
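To make the elbow method concrete, here is a minimal R sketch, assuming `dataframe` holds the example data (as in the R tip below); the range of k and `nstart = 25` are arbitrary choices:

```r
# Elbow method: run k-Means for k = 1..10 and record WCSS for each k.
wcss <- sapply(1:10, function(k) {
  kmeans(dataframe, centers = k, nstart = 25)$tot.withinss
})
# Plot WCSS against k; the 'elbow' (expected at k = 3 here) suggests the right k.
plot(1:10, wcss, type = "b",
     xlab = "Number of clusters k", ylab = "WCSS")
```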
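Similarly, a hedged sketch of choosing k by maximizing the overall silhouette coefficient; note that k starts at 2 because the silhouette is undefined for a single cluster:

```r
library(cluster)  # provides silhouette() and daisy()

# Average silhouette width for each candidate k.
avg_sc <- sapply(2:10, function(k) {
  km <- kmeans(dataframe, centers = k, nstart = 25)
  mean(silhouette(km$cluster, daisy(dataframe))[, "sil_width"])
})
# Candidate best k is the one with the highest overall SC.
(2:10)[which.max(avg_sc)]
```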
In the next post, we will cover the data pre-processing required for a correct implementation of k-Means.
R tip
For those interested and familiar with R, the above results can be replicated using the following functions:

```r
library(cluster)  # provides silhouette() and daisy()

# Run k-Means clustering on a data frame of observations
result <- kmeans(dataframe, centers = k)

# WCSS of the resulting clustering
result$tot.withinss

# Silhouette coefficients for the k-Means result
sc <- silhouette(result$cluster, daisy(dataframe))

# Overall SC: cluster-size-weighted average of per-cluster average widths
sum(summary(sc)$clus.avg.widths * summary(sc)$clus.sizes) /
  sum(summary(sc)$clus.sizes)
```