In a survey-paper various methods for finding the number of clusters were compared (Dimitriadou et.al, An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets, Psychometrika, 2002) – and there are plenty of methods. None of them work all the time. Finding the right number of clusters has been an open problem for quite a while and also depends on the application, e.g. if more fine or course grained clusters are of interest.
A similar problem occurs in psychopathology. Imagine some measurements taken from several people – some with and some without a mental illness. The question then becomes: are there two clusters or just one? Is the data simply continuous or generated by a latent Bernoulli distribution? There is a whole bunch of literature out there dealing with the same problem from the psychology standpoint (for example: Schmidt, Kotov, Joiner, “Taxometrics – Towards a New Diagnostic Scheme for Psychopathology“, American Psychological Association) One of the more famous researchers is Paul Meehl, who developed a couple of methods to detect a taxon in data. The MAXCOV-HITMAX invented by Paul Meehl is for the detection of latent taxonic structures (i.e., structures in which the latent variable is not continuously, but rather Bernoulli, distributed).
My problem with Meehl’s methods (MAMBAC, MAXCOV, MAXEIG etc.) is that in all the articles only an intuitive explanation is given. Despite being a mathematical method there were no clear definitions of what the method will consider to be a taxon, or any necessary/sufficient conditions on when the algorithm will detect a taxon. Zoologists for example have entire conferences on how to classify species and go through a lot of painful details on how to properly classify species. They have, it seems, endless debates on what constitutes a new species in the taxonomy. However, I still wasn’t able to find a mathematical definition of what constitutes a taxon.
In addition to that, there seems to be some problems when using MAXCOV with dichotomous indicators (Maruan et.al, An Analysis of Meehl’s MAXCOV-HITMAX Procedure for the Case of Dichotomous Indicators, Multivariate Behavioral Research, Vol. 38, Issue 1 Jan. 2003); in this article they pretty much take the entire procedure apart and show that it often fails to indicate taxons when they are there or indicates taxons when there is nothing.
I think the question of finding a taxon is strongly related to clustering, because it simply tries to answer if clusters exist in the data. However, from all the clustering literature I’ve read so far, clusters are generally defined as dense areas in a space and are found in various ways by maximizing or minimizing some criterion (mutual information etc.). What constitutes a cluster is often conveniently defined so it fits the algorithm at hand. And then you still have to deal with or at least acknowledge the fact that the current notion of clustering has been proven to be impossible (An Impossibility Theorem for Clustering; Kleinberg; NIPS 15).
In a new paper in Machine Learning called Generalization from Observed to Unobserved Features by Clustering (Krupka&Tishby; JMLR 9(Mar):339–370, 2008) the authors describe an idea that might change the way we view clustering. In the paper they show that (under certain conditions) given a clustering based on some features the items will be implicitly clustered by yet unobserved features as well. As an intuitive example, imagine apples, oranges and bananas clustered by shape, color, size, weight etc. Once you have them clustered, you will be able to draw conclusions about a yet unobserved feature, e.g. the vitamin content. The work, because it is oriented on the features, might even be a way around the impossibility-theorem.
This is half-way there for a nicer definition of a taxon or what should constitute a cluster for that matter: can we draw conclusions about features not used for the clustering process? If you are clustering documents by topic (using bag-of-words), can you predict which other words will appear in the article? If you cluster genes, can you predict other genes from the cluster-membership?
Re-clustering on only a subset of the features should also be a sanity check for clustering solutions (I had written about the McIntyre-Blashfield procedure and validating clustering solutions before). I think strong patterns should replicate with less features; at least they did in a clustering-study I did recently .
I’ll be pondering this idea and try formalizing it. Maybe I can come up with something usable for taxometrics and a means to get the number of clusters…