Clustering is one of the most fascinating areas in the data mining field for me. For example, it can be shown that under some simple and fairly straightforward it is impossible to find a clustering function (Kleinberg’s impossibility theorem for clustering).That means that some trade-offs and relaxations have to be made in all the existing clustering algorithms. Every clustering algorithm will have a weakness and fail to cluster some data sets in a sensible manner.
The Fundamental Clustering Problem Suite (Ultsch, 2005) contains several synthetic data sets and a variety of clustering problems that algorithms should be able to solve. The data set contains the data and ground-truth labels as well as labels for k-Means, Single-Linkage, Wards-Linkage and a Self-Organizing map. I decided to play around with it a bit, converted the SOM activation matrix to labels using k-Means, added Spectral Clustering and Self-Tuning Spectral Clustering, and an EM Gaussian Mixture model. I was particularly curious how well Spectral Clustering would do. Determining the true number of clusters from the underlying data is an entirely different problem. Hence in all cases the number of clusters was specified unless otherwise noted. The regular Spectral Clustering used the Ng-Jordan-Weiss algorithm with a kernel sigma of 0.04 after linear scaling of the inputs. The Self-Tuning Spectral Clustering used k=15 nearest neighbors. I also used Random Forest’s “clustering”, an extension of the classification algorithm that will generate a distance matrix from the classification tree ensemble. The distance matrix was then used in single linkage to obtain cluster labels. Random Forest is a special case as it derives a distance matrix out of classification using gini – a metric built for classification, not clustering.
The Atom Data set contains two clusters, essentially two balls contained within each other. The difficulty is the difference in variance and that the classes are not linearly separable. Not surprisingly, k-Means can not separate these clusters out as it would somehow have to choose the same mean for both clusters (the center of both balls). Single linkage just works. I’m a bit puzzled why Wards Linkage fails. SOM can not separate the clusters (could this be an artifact of using k-Means to cluster the activation matrix?). Spectral Clustering was constructed specifically to deal with these kind of cases. The failure to separate the balls could stem from me using a constant kernel sigma on all data sets. Using the localized scaling of self-tuning spectral clustering just works. EM can’t model this as well due to the mean (center) of the distributions being the same. Random Forests just fail, because the distance matrix that is generated from the random permutation of the data to obtain a second class for “classification” and distance matrix generation will generate two overlapping classes.
The Chainlink Data set contains two clusters, essentially two interlocked rings that are not linearly separable. K-Means has no way of separating this as there is no point the mean could be placed resulting into a correct label assignment. Single Linkage just works and connects all the nearest neighbors to clusters. I’m not sure why Ward’s Linkage fails. SOM can not separate the clusters (could this be an artifact of using k-Means to cluster the activation matrix? In the original paper the algorithm solved this. Maybe I did something wrong). Spectral Clustering was made for these kind of manifold cases. I’m a bit surprised how well EM works in this case. I would have expected a similar result to k-Means.
The Engytime data set contains two Gaussian mixtures that are fairly close to each other. This data is primed for EM style algorithms and indeed EM performs best on this data set. K-Means performs reasonably well and picks up on the different variance. Single Linkage simply connects all the points to one big cluster and leaves two outliers to form the second clusters. I’m a bit surprised by the SOM and have no explanation why the clusters we separated the way they were. I would have thought that Gaussian Mixtures would have performed better on this one. Given that the data was generated using two Gaussians this should have been a home run for the algorithm.
Golfball contains no clusters and is one giant blob. We choose k=6 in order to see what the algorithms would do. Single linkage is the only algorithm that produces a somewhat sensible result. It connects all points to one cluster and then leaves 5 points to form the remaining clusters. It would be evident from a dendrogram that there are no clusters in this data set. All other algorithms assign labels one way or the other, most produce evenly sized areas on the ball.
Hepta contains clearly defined clusters with different variances. The clusters are clearly separated and hence all algorithms have no problems separating these clusters. SOM mislabels a few cases and I’m not sure why this is.
Lsun contains different variances and inter cluster distances. The hard part is to separate out the cluster on the bottom.
Target contains two clusters in the middle and a few outliers. The data is not linearly separable, but it’s also interesting to see how the algorithms deal with the outliers.
Tetra contains four dense, almost touching clusters.
The two diamonds data set contains two touching clusters. The cluster borders defined by density. As one would expect, single linkage fails on this data set and simply lumps everything together. Wards Linkage was made to prevent exactly this kind of problem and not surprisingly performs better. The other algorithms have no problems picking up on the two dense blobs and separate them out perfectly or close to perfect.
The Wingnut data set contains two blocks and examines the density vs. distance trade-off of the algorithms. Every method that uses distance (or something that could be interpreted as such) will have the clusters “bleed” into each other.
Note that my results for SOM differ a bit from Ultsch’s original paper; quite possibly I did something wrong. I haven’t figured out yet what went wrong; still… fun weekend 🙂
Ultsch, A.: Clustering with SOM: U*C, In Proc. Workshop on Self-Organizing Maps, Paris, France, (2005) , pp. 75-82