I just read in the news about a study showing, via data mining, that Libras live longer (a study from the Institute for Clinical Evaluative Sciences in Toronto). The authors did the study for fun, using data from 10 million Ontario residents and looking for associations between various health problems and astrological signs. And they actually found associations: each of the twelve astrological signs had at least two medical disorders associated with it. However, they were not able to replicate these associations when they looked for the same patterns in a hold-out set.
This is a nice example of why it is important to have a hold-out set, or ideally to try to redo a study with a different method on a different set of similar data. A couple of other validation methods have been proposed that are harder to carry out. In my data mining work with Tim we have used a cross-replication and cluster validation design from the book “Classification” by Gordon (1999). The original method was (I think) proposed by McIntyre and Blashfield (“A Nearest-Centroid Technique for Evaluating the Minimum Variance Clustering Procedure”, Multivariate Behavioral Research 15, 1980; see also Milligan 1996, “Clustering Validation: Results and Implications for Applied Analyses”, in “Clustering and Classification”, eds. Arabie, Hubert and De Soete, World Scientific, Singapore, p. 341). What I particularly like about the method is that it puts a number on “how much” replication you got. The method works like this:
- Divide the data randomly into two subsets
- Do your exploratory clustering work on set A, partitioning it into k clusters
- Use a classifier (nearest neighbor, for example) to assign each object in B to one of the clusters discovered in A
- Run the same clustering procedure on B, partitioning it into k clusters
- Compare the two labelings obtained for B (the classification and the clustering) in a cross-tabulation and compute a kappa coefficient (a code sketch of the whole procedure follows below)
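Here is a minimal sketch of the procedure in Python, assuming numeric data in an array X and scikit-learn's KMeans as the clustering step; I use NearestCentroid (the variant in the McIntyre–Blashfield paper), though a nearest-neighbor classifier works the same way. The Hungarian-matching step that aligns the arbitrary cluster labels of the two solutions before computing kappa is my own addition and not spelled out in the references:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

def cross_replication_kappa(X, k, random_state=0):
    # 1. Divide the data randomly into two subsets A and B
    A, B = train_test_split(X, test_size=0.5, random_state=random_state)

    # 2. Exploratory clustering on A, into k clusters
    labels_A = KMeans(n_clusters=k, random_state=random_state).fit_predict(A)

    # 3. Classify each object in B into one of the clusters discovered in A
    clf = NearestCentroid().fit(A, labels_A)
    classified_B = clf.predict(B)

    # 4. Cluster B independently with the same procedure
    clustered_B = KMeans(n_clusters=k, random_state=random_state).fit_predict(B)

    # Align B's arbitrary cluster labels with the classification labels
    # (Hungarian matching on the cross-tabulation, maximizing agreement)
    contingency = np.zeros((k, k), dtype=int)
    for c, l in zip(classified_B, clustered_B):
        contingency[c, l] += 1
    row, col = linear_sum_assignment(-contingency)
    mapping = {l: c for c, l in zip(row, col)}
    aligned_B = np.array([mapping[l] for l in clustered_B])

    # 5. Compare the two labelings of B with a kappa coefficient
    return cohen_kappa_score(classified_B, aligned_B)
```

A kappa close to 1 means the cluster structure found in A reappears in B; a value near 0 means the “clusters” are about as reproducible as the astrological associations above.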
Given our tendency to find patterns in data (or to see shapes in clouds), I think it is important to use a procedure like the one above to double-check discovered patterns before any important decisions are based on them.
Check out this paper: Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124
http://dx.doi.org/10.1371/journal.pmed.0020124
To quote from the Abstract: “There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. […] Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. […]”
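The core of the argument is the paper's positive predictive value formula, PPV = (1 − β)R / (R − βR + α), where R is the prior odds that a probed relationship is true, α the significance level, and 1 − β the power. A quick back-of-the-envelope check (the R values below are my own illustrative choices, not from the paper):

```python
# Ioannidis' formula for the probability that a "significant" finding is true.
def ppv(R, alpha=0.05, beta=0.20):
    """PPV given prior odds R, significance level alpha, and power 1 - beta."""
    return (1 - beta) * R / (R - beta * R + alpha)

print(round(ppv(R=1.00), 2))  # 1:1 prior odds              -> 0.94
print(round(ppv(R=0.10), 2))  # 1:10 prior odds             -> 0.62
print(round(ppv(R=0.01), 2))  # 1:100, a fishing expedition -> 0.14
```

With low prior odds (exactly the situation when you trawl 10 million records for sign-by-disease associations), a “significant” finding is more likely false than true.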
Also interesting: Djulbegovic B, Hozo I (2007) When Should Potentially False Research Findings Be Considered Acceptable? PLoS Med 4(2): e26 doi:10.1371/journal.pmed.0040026
Also: Moonesinghe R, Khoury MJ, Janssens ACJW (2007) Most Published Research Findings Are False-But a Little Replication Goes a Long Way. PLoS Med 4(2): e28 doi:10.1371/journal.pmed.0040028
http://arstechnica.com/science/news/2010/03/were-so-good-at-medical-studies-that-most-of-them-are-wrong.ars