- Markus Breitenbach - http://blog.markus-breitenbach.com -
Validating patterns found by Data Mining techniques
Posted by Markus on February 22, 2007 @ 2:50 pm in Statistics, Data Mining | 1 Comment
I just read in the news about a data mining study which showed that [1] libras live longer (the study comes from the Institute for Clinical Evaluative Sciences in Toronto). The researchers did the study for fun, using data from 10 million Ontario residents to look for associations between various health problems and astrological signs. And they actually found associations! Each of the twelve astrological signs had at least two medical disorders associated with it. However, they were not able to replicate the associations when they looked for the same patterns in the hold-out set.
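To see why a search like this turns up associations even when none exist, here is a minimal simulation sketch in Python. Everything in it is an assumption for illustration and not the study's actual data or method: the population size, the number of disorders, the 10% base rate, and the two-proportion z-test with a cutoff of |z| > 1.96. Signs and disorders are generated independently, so every "finding" in the exploration half is a false positive by construction.

```python
import random
from math import sqrt

# All constants are made-up values for illustration only.
N_SIGNS, N_DISORDERS, N_PEOPLE, BASE_RATE, Z_CRIT = 12, 40, 20000, 0.1, 1.96

def simulate(rng):
    """Generate people whose disorders are independent of their sign.

    Returns cases[half][sign][disorder] and people[half][sign], where
    half 0 is the exploration set and half 1 is the hold-out set.
    """
    cases = [[[0] * N_DISORDERS for _ in range(N_SIGNS)] for _ in range(2)]
    people = [[0] * N_SIGNS for _ in range(2)]
    for i in range(N_PEOPLE):
        half = i % 2
        sign = rng.randrange(N_SIGNS)
        people[half][sign] += 1
        for d in range(N_DISORDERS):
            if rng.random() < BASE_RATE:
                cases[half][sign][d] += 1
    return cases, people

def z_score(cases, people, sign, disorder):
    """Two-proportion z-test: rate within one sign vs. everyone else."""
    x1, n1 = cases[sign][disorder], people[sign]
    x2 = sum(cases[s][disorder] for s in range(N_SIGNS)) - x1
    n2 = sum(people) - n1
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se if se > 0 else 0.0

rng = random.Random(0)
cases, people = simulate(rng)

# Step 1: dredge the exploration half for "significant" sign/disorder pairs.
findings = []
for s in range(N_SIGNS):
    for d in range(N_DISORDERS):
        z = z_score(cases[0], people[0], s, d)
        if abs(z) > Z_CRIT:
            findings.append((s, d, z))

# Step 2: keep only pairs that are significant in the same direction
# in the hold-out half.
replicated = []
for s, d, z in findings:
    z_holdout = z_score(cases[1], people[1], s, d)
    if abs(z_holdout) > Z_CRIT and z * z_holdout > 0:
        replicated.append((s, d))

print(len(findings), "associations found;", len(replicated), "replicated")
```

With 12 x 40 = 480 tests at a 5% threshold, roughly two dozen chance "associations" are expected in the exploration half, and only a small fraction of those should survive the hold-out check.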
The astrology study is a nice example of why it is important to have a hold-out set, or ideally to redo a study with a different method on a different set of similar data. A couple of other validation methods have been proposed, but they are harder to carry out. In my data mining work with Tim we have used a cross-replication and cluster validation design from the book “Classification” by Gordon (1999). The original method (I think) was proposed by McIntyre and Blashfield (“A nearest-centroid technique for evaluating the minimum-variance clustering procedure”, Multivariate Behavioral Research 15, 1980; see also Milligan 1996, “Clustering validation: results and implications for applied analyses”, in “Clustering and Classification”, eds. Arabie, Hubert and De Soete, World Scientific, Singapore, p. 341). What I particularly like about the method is that it quantifies “how much” of a replication you got. The method works like this:
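In outline, a cross-replication of this kind splits the data at random into two samples A and B, clusters A and computes its cluster centroids, assigns every object in B to the nearest A centroid, clusters B directly with the same method, and finally measures the agreement between the two partitions of B. Below is a minimal Python sketch of that outline. Note the assumptions: it swaps in a plain k-means with farthest-first seeding for the minimum-variance clustering of the original paper, it uses the adjusted Rand index as the agreement statistic, and all function names and the toy data are made up for illustration.

```python
import random
from collections import Counter

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: dist2(point, centroids[j]))

def kmeans(points, k, iters=100):
    """Plain k-means (stand-in for the original clustering method)."""
    # Farthest-first seeding: deterministic, spreads the seeds out.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                updated.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's seed
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]

def adjusted_rand_index(u, v):
    """Agreement between two labelings, corrected for chance."""
    pairs = lambda c: c * (c - 1) // 2
    both = sum(pairs(c) for c in Counter(zip(u, v)).values())
    in_u = sum(pairs(c) for c in Counter(u).values())
    in_v = sum(pairs(c) for c in Counter(v).values())
    expected = in_u * in_v / pairs(len(u))
    max_index = (in_u + in_v) / 2
    if max_index == expected:
        return 1.0
    return (both - expected) / (max_index - expected)

def cross_replication(points, k, rng):
    """Score how well clusters found in sample A replicate in sample B."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    sample_a, sample_b = shuffled[:half], shuffled[half:]
    centroids_a, _ = kmeans(sample_a, k)
    labels_from_a = [nearest(p, centroids_a) for p in sample_b]
    _, labels_direct = kmeans(sample_b, k)
    return adjusted_rand_index(labels_from_a, labels_direct)

# Toy data: three well-separated groups (hypothetical values).
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
        for cx, cy in centers for _ in range(50)]
score = cross_replication(data, k=3, rng=rng)
print("replication score:", score)
```

A score near 1 means the clusters found in A reappear in B; a score near 0 means the two partitions of B agree no better than chance.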
Given our tendency to find patterns in data (or see shapes in clouds), I think it is important to use a procedure like the above to double-check any discovered patterns before important decisions are made.
URL to article: http://blog.markus-breitenbach.com/2007/02/22/validating-patterns-found-by-data-mining-techniques/
URLs in this post:
[1] libras live longer: http://abcnews.go.com/Technology/print?id=2890150