Validating patterns found by Data Mining techniques
I just read in the news about a data mining study which showed that Libras live longer (a study from the Institute for Clinical Evaluative Sciences in Toronto). The researchers did the study for fun, using data from 10 million Ontario residents to look for associations between various health problems and astrological signs. And they actually found associations! Each of the twelve astrological signs had at least two medical disorders associated with it. However, they were not able to replicate the associations when they looked for the same patterns in the hold-out set.
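To get a feel for how this can happen, here is a small simulation sketch. It is not the method or the data of the Toronto study: the people, the disorders, the prevalence, and the simple two-proportion z-test are all assumptions made up for illustration. Disorders are generated independently of the sign, so every "association" found is noise by construction.

```python
import math
import random

N_SIGNS = 12

def simulate(n, n_disorders, prevalence, rng):
    """Hypothetical people: (sign, [has_disorder, ...]), disorders independent of sign."""
    return [
        (rng.randrange(N_SIGNS),
         [rng.random() < prevalence for _ in range(n_disorders)])
        for _ in range(n)
    ]

def z_scores(people, n_disorders):
    """Two-proportion z-score per (sign, disorder): that sign vs. everyone else."""
    size = [0] * N_SIGNS
    cases = [[0] * n_disorders for _ in range(N_SIGNS)]
    for sign, flags in people:
        size[sign] += 1
        for d, has in enumerate(flags):
            cases[sign][d] += has
    total = len(people)
    scores = {}
    for d in range(n_disorders):
        all_cases = sum(cases[s][d] for s in range(N_SIGNS))
        pooled = all_cases / total
        for s in range(N_SIGNS):
            n_in, n_out = size[s], total - size[s]
            if n_in == 0 or n_out == 0 or pooled in (0.0, 1.0):
                scores[(s, d)] = 0.0  # no test possible for this pair
                continue
            p_in = cases[s][d] / n_in
            p_out = (all_cases - cases[s][d]) / n_out
            se = math.sqrt(pooled * (1 - pooled) * (1 / n_in + 1 / n_out))
            scores[(s, d)] = (p_in - p_out) / se
    return scores

def holdout_check(people, n_disorders, z_crit=1.96):
    """Find 'significant' pairs in the first half, then re-test them in the second."""
    half = len(people) // 2
    explore = z_scores(people[:half], n_disorders)
    holdout = z_scores(people[half:], n_disorders)
    found = [pair for pair, z in explore.items() if abs(z) > z_crit]
    replicated = [
        pair for pair in found
        if abs(holdout[pair]) > z_crit and holdout[pair] * explore[pair] > 0
    ]
    return found, replicated

if __name__ == "__main__":
    rng = random.Random(42)
    people = simulate(6000, 40, 0.1, rng)
    found, replicated = holdout_check(people, 40)
    print(len(found), "associations found,", len(replicated), "replicated")
```

With 12 signs and 40 disorders there are 480 tests, so at a 5% significance level a couple of dozen chance "findings" are to be expected in the exploration half, and only a small fraction of them should show up again, in the same direction, in the hold-out half.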
The astrology study is a nice example of why it is important to have a hold-out set, or ideally to redo a study with a different method on a different set of similar data. A few other validation methods have been proposed that take more effort. In my data mining work with Tim we have used a cross-replication and cluster validation design from the book “Classification” by Gordon (1999). The original method (I think) was proposed by McIntyre and Blashfield (“A nearest-centroid technique for evaluating the minimum-variance clustering procedure”, Multivariate Behavioral Research 15, 1980; see also Milligan 1996, “Clustering validation: results and implications for applied analyses”, in “Clustering and Classification”, eds. Arabie, Hubert and De Soete, World Scientific, Singapore, p. 341). What I particularly like about the method is that it puts a number on “how much” of a replication you got. The method works like this:
- Divide the data randomly into two subsets
- Do your exploratory clustering work on set A, partitioning it into k many clusters
- Use a classifier (Nearest Neighbor for example) to assign each object in B to one of the clusters discovered in A
- Use the same clustering procedure on B, partitioning it into k many clusters
- Compare the two labelings obtained for B (the classification and the clustering) in a cross-tabulation and compute a kappa coefficient as the measure of agreement
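The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the exact procedure from Gordon or McIntyre and Blashfield: plain k-means stands in for the clustering procedure, a nearest-centroid rule is used as the classifier, and all function names are made up for this example. Since cluster numbers are arbitrary, the sketch tries every relabeling of B's clusters and keeps the best kappa, which is only practical for small k.

```python
import itertools
import random

def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to point (nearest-centroid classifier)."""
    return min(range(len(centroids)), key=lambda j: dist2(point, centroids[j]))

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means with random starting points; returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

def cohen_kappa(a, b, k):
    """Cohen's kappa for two labelings that use the labels 0..k-1."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(j) / n) * (b.count(j) / n) for j in range(k))
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def replication_kappa(data, k, seed=0):
    """Cross-replication check: returns the kappa between the two labelings of B."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)                      # step 1: random split into A and B
    half = len(shuffled) // 2
    set_a, set_b = shuffled[:half], shuffled[half:]
    centroids_a, _ = kmeans(set_a, k, rng)     # step 2: cluster A
    classified = [nearest(p, centroids_a) for p in set_b]  # step 3: classify B
    _, clustered = kmeans(set_b, k, rng)       # step 4: cluster B on its own
    # step 5: compare the labelings; try all relabelings of B's clusters
    return max(
        cohen_kappa(classified, [perm[lab] for lab in clustered], k)
        for perm in itertools.permutations(range(k))
    )
```

Calling `replication_kappa(points, k)` on a list of coordinate tuples returns a single number. A kappa near 1 means the clusters found in A show up again in B; a kappa near 0 means the two labelings agree about as often as chance would predict.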
Given our tendency to find patterns in data (or see shapes in clouds) I think it is important to use a procedure like the above to double-check the patterns discovered before any important decisions are made.
February 27, 2007 at 2:19 pm
Check out this paper: Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124
http://dx.doi.org/10.1371/journal.pmed.0020124
To quote from the Abstract: “There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. […] Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. […]”
Also interesting: Djulbegovic B, Hozo I (2007) When Should Potentially False Research Findings Be Considered Acceptable? PLoS Med 4(2): e26 doi:10.1371/journal.pmed.0040026
Also: Moonesinghe R, Khoury MJ, Janssens ACJW (2007) Most Published Research Findings Are False-But a Little Replication Goes a Long Way. PLoS Med 4(2): e28 doi:10.1371/journal.pmed.0040028
March 3, 2010 at 1:24 pm
http://arstechnica.com/science/news/2010/03/were-so-good-at-medical-studies-that-most-of-them-are-wrong.ars