You are currently browsing the archives for the Statistics category.
| M | T | W | T | F | S | S |
|---|---|---|---|---|---|---|
| « Jan | ||||||
| 1 | 2 | 3 | 4 | 5 | 6 | |
| 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| 21 | 22 | 23 | 24 | 25 | 26 | 27 |
| 28 | 29 | 30 | 31 | |||
- Advertising (1)
- Artificial Intelligence (AI) (13)
- Classification (3)
- Clustering (1)
- Coding / Programming (8)
- Cryptography (1)
- Data Mining (22)
- Economy / Investing (1)
- ewrt linux (2)
- Fixing Stuff (8)
- Machine Learning (32)
- Math (2)
- Politics (3)
- Predictive Modeling (5)
- Psychology (3)
- Ramblings (26)
- Random (9)
- Security (16)
- Society (13)
- Sociology (4)
- spam (3)
- Statistics (20)
- January 28, 2012 4:56 pm: Will 2012 be the year of Big Data?
- August 14, 2011 10:41 pm: UK plans to exempt data mining from copyright laws
- June 21, 2011 3:26 am: Risk Assessment of Rare Events in adversarial Scenarios
- March 26, 2011 7:57 pm: How Kinect body tracking works and how Machine Learning helped
- March 1, 2011 11:58 am: European Court of Justice ruling (indirectly) on what cannot be used in Insurance Risk Models
- December 11, 2010 8:35 pm: Mining of Massive Datasets
- December 4, 2010 2:28 pm: Ideas on communicating risks and probabilities to the general public
- October 17, 2010 5:48 pm: Birthday Paradox
- August 5, 2010 1:06 am: Elo Scores and Rating Contestants
- July 11, 2010 8:56 pm: GraphLab & Parallel Machine Learning
Blogroll
Uncategorized
Useful Links
- January 2012
- August 2011
- June 2011
- March 2011
- December 2010
- October 2010
- August 2010
- July 2010
- June 2010
- February 2010
- January 2010
- November 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009
- January 2009
- December 2008
- November 2008
- October 2008
- September 2008
- August 2008
- July 2008
- June 2008
- May 2008
- April 2008
- March 2008
- February 2008
- January 2008
- December 2007
- November 2007
- October 2007
- September 2007
- August 2007
- July 2007
- June 2007
- May 2007
- April 2007
- March 2007
- February 2007
- January 2007
- December 2006
- November 2006
- October 2006
- September 2006
- August 2006
Archive for the Statistics Category
Taxons, Taxometrics and the Number of Clusters
August 21, 2008 9:17 pm by Markus.
In a survey-paper various methods for finding the number of clusters were compared (Dimitriadou et.al, An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets, Psychometrika, 2002) - and there are plenty of methods. None of them work all the time. Finding the right number of clusters has been an open problem for quite a while and also depends on the application, e.g. if more fine or course grained clusters are of interest.
A similar problem occurs in psychopathology. Imagine some measurements taken from several people - some with and some without a mental illness. The question then becomes: are there two clusters or just one? Is the data simply continuous or generated by a latent Bernoulli distribution? There is a whole bunch of literature out there dealing with the same problem from the psychology standpoint (for example: Schmidt, Kotov, Joiner, “Taxometrics - Towards a New Diagnostic Scheme for Psychopathology“, American Psychological Association) One of the more famous researchers is Paul Meehl, who developed a couple of methods to detect a taxon in data. The MAXCOV-HITMAX invented by Paul Meehl is for the detection of latent taxonic structures (i.e., structures in which the latent variable is not continuously, but rather Bernoulli, distributed).
My problem with Meehl’s methods (MAMBAC, MAXCOV, MAXEIG etc.) is that in all the articles only an intuitive explanation is given. Despite being a mathematical method there were no clear definitions of what the method will consider to be a taxon, or any necessary/sufficient conditions on when the algorithm will detect a taxon. Zoologists for example have entire conferences on how to classify species and go through a lot of painful details on how to properly classify species. They have, it seems, endless debates on what constitutes a new species in the taxonomy. However, I still wasn’t able to find a mathematical definition of what constitutes a taxon.
In addition to that, there seems to be some problems when using MAXCOV with dichotomous indicators (Maruan et.al, An Analysis of Meehl’s MAXCOV-HITMAX Procedure for the Case of Dichotomous Indicators, Multivariate Behavioral Research, Vol. 38, Issue 1 Jan. 2003); in this article they pretty much take the entire procedure apart and show that it often fails to indicate taxons when they are there or indicates taxons when there is nothing.
I think the question of finding a taxon is strongly related to clustering, because it simply tries to answer if clusters exist in the data. However, from all the clustering literature I’ve read so far, clusters are generally defined as dense areas in a space and are found in various ways by maximizing or minimizing some criterion (mutual information etc.). What constitutes a cluster is often conveniently defined so it fits the algorithm at hand. And then you still have to deal with or at least acknowledge the fact that the current notion of clustering has been proven to be impossible (An Impossibility Theorem for Clustering; Kleinberg; NIPS 15).
In a new paper in Machine Learning called Generalization from Observed to Unobserved Features by Clustering (Krupka&Tishby; JMLR 9(Mar):339–370, 2008) the authors describe an idea that might change the way we view clustering. In the paper they show that (under certain conditions) given a clustering based on some features the items will be implicitly clustered by yet unobserved features as well. As an intuitive example, imagine apples, oranges and bananas clustered by shape, color, size, weight etc. Once you have them clustered, you will be able to draw conclusions about a yet unobserved feature, e.g. the vitamin content. The work, because it is oriented on the features, might even be a way around the impossibility-theorem.
This is half-way there for a nicer definition of a taxon or what should constitute a cluster for that matter: can we draw conclusions about features not used for the clustering process? If you are clustering documents by topic (using bag-of-words), can you predict which other words will appear in the article? If you cluster genes, can you predict other genes from the cluster-membership?
Re-clustering on only a subset of the features should also be a sanity check for clustering solutions (I had written about the McIntyre-Blashfield procedure and validating clustering solutions before). I think strong patterns should replicate with less features; at least they did in a clustering-study I did recently
.
I’ll be pondering this idea and try formalizing it. Maybe I can come up with something usable for taxometrics and a means to get the number of clusters…
Posted in Statistics, Clustering, Data Mining, Machine Learning | Print | No Comments »
Using Psychological Domain Knowledge for the Netflix Challenge
February 28, 2008 10:46 pm by Markus.
I read an interesting article today about using psychological domain knowledge for improving recommender-system predictions. A very interesting idea…
Posted in Statistics, Machine Learning, Psychology | Print | No Comments »
What Machine Learning Papers to read …
July 13, 2007 1:08 pm by Markus.
Laura just pointed me to this system, best described as:
I have a routine problem that sometimes paper titles are not enough to tell me what papers to read in recent conferences, and I often do not have time to read abstracts fully. This collection of scripts is designed to help alleviate the problem. Essentially, what it will do is compare what papers you like to cite with what new papers are citing. High overlap means the paper is probably relevant to you. Sure there are counter-examples, but overall I have found it useful (eg., it has suggested papers to me that are interesting that I would otherwise have missed). Of course, you should also read through titles since that is a somewhat orthogonal source of information.
http://www.cs.utah.edu/~hal/WhatToSee/
I have the same problem. And wow… I will have a lot to read this weekend.
Posted in Statistics, Machine Learning, Artificial Intelligence (AI) | Print | No Comments »
Back from AISTATS 2007
March 27, 2007 12:38 am by Markus.
Just got back home from AISTATS (Artificial Intelligence and Statistics). The conference was really interesting (more so than NIPS) and it’s unfortunate that it is only every two years. Some of the invited talks were way over my head, but I learned a lot from other people’s work and got new ideas …
Some of the coolest papers were (incomplete list and in no particular order; I need to organize my notes
But there were way more papers of interest to me than at NIPS):
- Nonlinear Dimensionality Reduction as Information Retrieval
- Venna Jarkko and Samuel Kaski
- Fast Low-Rank Semidefinite Programming for Embedding and Clustering
- Brian Kulis, Arun Surendran, and John C. Platt
- Local and global sparse Gaussian process approximations
- Edward Snelson, Zoubin Ghahramani
- A fast algorithm for learning large scale preference relations
- Vikas Raykar, Ramani Duraiswami, and Balaji Krishnapuram
- Deep Belief Networks
- Ruslan Salakhutdinov and Geoff Hinton
- Large-Margin Classification in Banach Spaces
- Ricky Der and Daniel Lee
One thing that couldn’t help but notice was how much research is now focusing on Semi-Definite Programs, either for dimensionality reduction or other purposes. Yet, there are not many efficient ways to compute SDPs. One paper presented a method based on quasi-Newton gradient descent, but it’s probably not good enough yet for large problems.
Other interesting papers I saw was about the unsupervised deep belief nets that learns a structure of the data which results in an interesting performance boost. The authors train a deep belief net (unsupervised) on the data and then train classifiers on the output; although all the results were compared to only linear techniques, they showed some impressive results. This reminded me of a similar idea I had a while ago that I never got to work; I tried to use label propagation methods to approximate a kernel matrix usable for SVMs and the like. It never worked, because my algorithm caused the SVMs to always overfit (despite being unsupervised - it took me a while to realize that doing something unsupervised is no guarantee that you won’t overfit your data). I’ll investigate some day what made all the difference in this case…
Another interesting bit was that approximating the Matrix Inverse by low-rank approximations leads to significant loss of accuracy for Gaussian Processes Error bars. This should be interesting for further research in the speedups for these and other algorithms that require a matrix inversion (e.g. semi-supervised label propagation algorithms).
Posted in Statistics, Machine Learning, Artificial Intelligence (AI) | Print | No Comments »
Validating patterns found by Data Mining techniques
February 22, 2007 2:50 pm by Markus.
I just read in the news about a study that showed with data mining that libras live longer (study from the Institute for Clinical Evaluative Sciences in Toronto). These guys did a study for fun using data from 10 million Ontario residents looking for associations between various health problems and their astrological signs. And they actually found associations! Each of the twelve astrological signs had at least two medical disorders associated with them. However, they were not able to replicate it when looking for the same pattern in the hold-out set.
This is a nice example of why it is important to have a hold-out set, or ideally to try to redo a study with a different method on a different set of similar data. A couple of other methods for validating have been proposed which are harder to do. In my data mining work with Tim we have used a cross-replication and cluster validation design from the book “Classification” by Gordon (1999). The original method (I think) was proposed by McIntyre and Blashfield (”A nearest-Centroid technique for evaluating the minimum variance clustering procedure”, Multivariate Behavioral Research 15, 1980; see also Milligan 1996 “Clustering Validation: results and implications for applied analysis, in “Clustering and Classification”; eds. Arabie, Hubert and De Soete, World Scientific, Singapore, p. 341). What I particularly like about the method is that it can be used to quantify in numbers “how much” of a replication you got. The method works like this:
- Divide the data randomly into two subsets
- Do your exploratory clustering work on set A, partitioning it into k many clusters
- Use a classifier (Nearest Neighbor for example) to assign each object in B to each discovered cluster
- Use the same clustering procedure on B, partitioning it into k many clusters
- Compare the two labelings obtained for B (the classification and clustering) in a cross-tabulation, compute a kappa-coefficient
Given our tendency to find patterns in data (or see shapes in clouds) I think it is important to use a procedure like the above to double-check the patterns discovered before any important decisions are made.
Posted in Statistics, Data Mining | Print | 2 Comments »
Artificial Intelligence and Sports
February 8, 2007 3:59 pm by Markus.
A couple of days ago Indianapolis won the Superbowl - just as predicted by an Electronic Arts Simulation. The simulation software had been fed with the latest data about all the players involved and they had the game AI fight it out. In the past some of the simulated outcomes were not that close to the final scores, but they still did a fairly decent job in 2005 and 2007.
There is more and more statistical decision making in baseball as well, the most famous example being the Miami-Orlando series in the 1997 playoffs.
Interesting…
Posted in Statistics, Data Mining, Machine Learning, Artificial Intelligence (AI) | Print | No Comments »
Sequential Sampling and Machine Learning
January 8, 2007 7:58 am by Markus.
In order to estimate an unknown quantity mu a common approach is to design an experiment that results in a random variable Z distributed within the interval [0,1]. The expectation E[Z]=μ can then be estimated by running this experiment independently, averaging the outcomes, and using Monte-Carlo techniques for the estimate. In (Dagum, Karp, Luby and Ross, SIAM Computing,1995) the AA algorithm (”Approximation Algorithm”) is introduced which, given epsilon and delta and independent experiments for the random variable Z, produces an estimate of the mean (or the true expectation) that is within the factor of 1+ε of μ with probability of success of at least 1-δ. Note that there are no distributional assumptions by the algorithm. This has a couple of applications in machine learning, for example in Bayesian Inference, Bayesian Networks and Boosting (Domingo and Watanabe, PAC-KDD, 2000).
The AA algorithm works in three steps. First, the stopping rule computes an initial estimate of the mean. Then, the variance is determined and, in the third step, additional samples are taken to approximate the expectation even further. A small improvement for the stopping rule in step one can be made as follows. The algorithm assumes a non-zero expectation and keeps sampling until the sum of the elements is larger than a constant determined by epsilon and delta (read the paper to see why that works). The problem is that the closer the elements are to zero, the more elements are needed.
Observe that the following holds for the mean:

With that one can improve the stopping rule as follows:

P.S.: To type Greek letters into Wordpress use the html named entities such as ε for ε. That took me forever …
Posted in Coding / Programming, Statistics, Machine Learning | Print | No Comments »
Online Dating
January 3, 2007 8:20 am by Markus.
As I’m currently visiting Germany over the winter break, I couldn’t help but notice the advertising for an online dating website here. They spend a lot of money to get this stuff into peoples heads here. I’ve seen some of that stuff advertised in the US (such as match.com and TRUE) so for the hell of it I went and checked out the website. First thing I noticed is that they require you to create an account to see peoples pictures or browse more than a couple of pages in the search results. That, of course, leads to many many stale profiles from people that just want to window-shop and are not really interested in giving it a serious try. To interested parties (i.e. people that pay) this of course might look like there are so many members on the website that it might be worth paying for.
It just helps add to my impression after reading about Bad Experiences with canceling accounts, which gives a not-so-honorable mention to certain US based dating websites. Apparently you can’t just cancel your membership using the website, but have to take a phone-exit interview. Otherwise, your profile will be kept and your credit card will be continue to be charged. It seems that dating websites are forced to keep people active as long as possible (or at least keep up the illusion). The reason for this might be less mean-spirited than one would at first assume. For example, just to have a couple of thousand people in each major city of the US a dating website would have to have roughly 100.000 active members. That is tough to accomplish, esp. given that without the illusion of activity nobody else would join.
With all that said, a friend of mine found his girlfriend through the Denver Personals on Craigslist. It can work.
Posted in Society, Statistics, Ramblings, Random | Print | 4 Comments »
Random Number Books
October 15, 2006 3:22 pm by Markus.
A friend of mine just pointed me to the book of A Million Random Digits with 100,000 Normal Deviates. It’s a book full of random numbers from back in the days when random number generation was hard. However, check out the reviews at Amazon of the book. They are hilarious.
Posted in Statistics, Ramblings, Random | Print | No Comments »
Las Vegas and Roulette
September 17, 2006 6:23 pm by Markus.
I just got back from a brief stay in Las Vegas. I didn’t gamble, but got thinking about the game of Roulette. Obviously all the games that are played rely on unpredictability in one way or another and require real randomness, just like many cryptography applications such as SSL. If the state of, say the deck of cards in a black jack game would be known, the outcome becomes predictable (see also: cracking the Netscape SSL random number generator). The black jack card counting thing is getting old (and I noticed that they cut the deck of cards and throw away a number of cards - which leaves the gambles with no information what cards are now in the deck), but recently I read about people using a laser scanner to predict the roulette wheel outcome. An interesting bit I noticed was how Roulette is played differently here in the US. Besides that there are two zeros on the table (giving the bank an even greater chance to win) the croupier changes the spin of the wheel (he/she stops the wheel and starts spinning it again) and puts the little ball in after all the bets have been made. I recall that in Europe they spin the wheel and put the ball in and then allow for bets being made for a certain amount of time. That certainly makes it impossible to predict anything at all at the time of betting (unless the wheel would be biased somehow, say by a slight tilt). Also the displays that show the last couple of numbers seem to malfunction occasionally and display the wrong numbers. I guess the only way to win in a casino would be if any of their “random number generators” were unbeknownst to them were not truly random - but then again they won’t like it if people win.
Funny sidenote… you can find amazing, “infallable” Roulette Systems on the web for only $27
Then there others that are available for free. For example, the pivot system. The inventor claims that :”It is a fact that numbers on a roulette wheel tend to repeat often.” Some of them might even work (play simple chances, double your bet everytime you loose) assuming that there would be no limit at the table (and ignoring that the house still has an edge).
Posted in Statistics, Ramblings, Random | Print | No Comments »