Archive for the ‘Machine Learning’ Category

Choosing the right features for Data Mining

Friday, June 1st, 2007

I’m fascinated by a common problem in data mining: how do you pick variables that are indicative of what you are trying to predict? It is simple for many prediction tasks; if you are predicting the strength of a material, you sample it at certain points using your understanding of physics and materials science. It is more fascinating with problems that we believe we understand, but don’t. For example, what makes a restaurant successful and guarantees repeat business? The food? The pricing? A friendly wait-staff? It turns out a very big indicator of restaurant success is the lighting. I admit that lighting didn’t make it into my top-ten list… If you now consider what is asked on your average “are you satisfied with our service” questionnaire found in various restaurant chains, I don’t recall ever seeing anything about the ambience on it. We are asking the wrong questions.

There are many other problems in real life just like this. I read a book called Blink, and the point the author is trying to make is that subconscious decisions are easy to make – once you know what to look for. More information is not always better. This holds for difficult problems such as judging the effectiveness of teachers (IIRC, seeing the first 30 seconds of a videotape of a teacher entering a classroom is as indicative as watching hours of recorded lectures). The same holds true for prediction problems about relationships – how can you predict whether a couple will still be together 15 years later? It turns out there are four simple indicators to look for, and you can spot them in maybe two minutes of watching a couple… The book is full of examples like that, but it does not provide a way to “extract the right features”. I have similar problems with the criminology work I’m doing; while we get pretty good results using features suggested by the criminology literature, I wonder whether we have the right features. I’m still thinking that we could improve our results if we had more data – or the “right” data, I should say (it should be obvious by now that more is not better). How do you pick the features for a problem? Tricky question…

There is only one data mining approach that does not have this problem: recommender systems. They avoid the problem because they do not rely on particular features to predict, but instead exploit correlations in “liking”. A classic example is that people who like classical music often like jazz as well – something you wouldn’t easily be able to predict from features extracted from the music itself. I wonder if we could reframe some prediction problems in ways more similar to recommender systems, or maybe make better use of meta-information in certain problems. What I mean by “meta-information” is easily explained with an example: PageRank. It is so successful in web-scale information retrieval because it does not bother trying to figure out whether a page is relevant from keyword ratios and whatnot, but simply measures popularity by how many important pages link to it (before link spam became a problem, that is). I wish something that simple were possible for every problem 🙂
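
To make the PageRank example concrete, here is a minimal sketch of the idea as a power iteration over a tiny, made-up link graph (the graph, the damping factor and the variable names are my own illustration, not anything Google publishes):

```python
import numpy as np

# Toy web graph: adjacency[i, j] = 1 if page i links to page j (made-up example).
adjacency = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

n = adjacency.shape[0]
damping = 0.85  # the usual choice in the PageRank literature

# Row-normalize to get the probability of following each outgoing link.
out_degree = adjacency.sum(axis=1, keepdims=True)
transition = adjacency / out_degree

# Power iteration: repeatedly push the rank mass along the links.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - damping) / n + damping * rank @ transition

print("PageRank scores:", rank / rank.sum())
```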

Back from AISTATS 2007

Tuesday, March 27th, 2007

Just got back home from AISTATS (Artificial Intelligence and Statistics). The conference was really interesting (more so than NIPS) and it’s unfortunate that it is held only every two years. Some of the invited talks were way over my head, but I learned a lot from other people’s work and got new ideas…

Some of the coolest papers were (an incomplete list, in no particular order; I need to organize my notes 🙂 – but there were way more papers of interest to me than at NIPS):

  1. Nonlinear Dimensionality Reduction as Information Retrieval
    • Jarkko Venna and Samuel Kaski
  2. Fast Low-Rank Semidefinite Programming for Embedding and Clustering
    • Brian Kulis, Arun Surendran, and John C. Platt
  3. Local and global sparse Gaussian process approximations
    • Edward Snelson, Zoubin Ghahramani
  4. A fast algorithm for learning large scale preference relations
    • Vikas Raykar, Ramani Duraiswami, and Balaji Krishnapuram
  5. Deep Belief Networks
    • Ruslan Salakhutdinov and Geoff Hinton
  6. Large-Margin Classification in Banach Spaces
    • Ricky Der and Daniel Lee

One thing I couldn’t help but notice was how much research is now focusing on semi-definite programs (SDPs), either for dimensionality reduction or for other purposes. Yet there are not many efficient ways to solve SDPs. One paper presented a method based on quasi-Newton gradient descent, but it’s probably not good enough yet for large problems.
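
For readers who haven’t run into SDPs before, here is a small, generic example – the Goemans-Williamson max-cut relaxation – sketched with the cvxpy modeling library. The toy graph and variable names are my own, and this off-the-shelf approach is exactly the kind of thing that does not scale to large instances:

```python
import cvxpy as cp
import numpy as np

# Toy weighted graph (symmetric adjacency matrix), purely illustrative.
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
], dtype=float)
L = np.diag(W.sum(axis=1)) - W          # graph Laplacian

n = W.shape[0]
X = cp.Variable((n, n), PSD=True)       # the semidefinite matrix variable
constraints = [cp.diag(X) == 1]         # unit diagonal
objective = cp.Maximize(cp.trace(L @ X) / 4)

prob = cp.Problem(objective, constraints)
prob.solve()                            # hands the SDP to a generic solver

print("SDP upper bound on the max cut:", prob.value)
```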

Another interesting paper was about unsupervised deep belief nets that learn a structure of the data, which results in an interesting performance boost. The authors train a deep belief net (unsupervised) on the data and then train classifiers on its output; although all the results were compared only against linear techniques, they showed some impressive numbers. This reminded me of a similar idea I had a while ago that I never got to work: I tried to use label propagation methods to approximate a kernel matrix usable for SVMs and the like. It never worked, because my algorithm caused the SVMs to always overfit (despite being unsupervised – it took me a while to realize that doing something unsupervised is no guarantee that you won’t overfit your data). I’ll investigate some day what made all the difference in this case…
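
The general recipe of “learn unsupervised features first, then put a simple classifier on top” can be sketched with off-the-shelf parts. Below is a minimal stand-in that uses a single restricted Boltzmann machine from scikit-learn instead of a full deep belief net; the dataset, layer size and hyperparameters are arbitrary placeholders, not the settings from the paper:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unsupervised feature learner (one RBM layer) followed by a linear classifier.
model = Pipeline([
    ("scale", MinMaxScaler()),                 # RBMs expect inputs in [0, 1]
    ("rbm", BernoulliRBM(n_components=128, learning_rate=0.05,
                         n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)
print("accuracy with learned features:", model.score(X_test, y_test))
```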

Another interesting bit was that approximating the matrix inverse by low-rank approximations leads to a significant loss of accuracy for Gaussian process error bars. This should be interesting for further research into speedups for these and other algorithms that require a matrix inversion (e.g. semi-supervised label propagation algorithms).
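
A back-of-the-envelope way to see the effect (not the construction from the paper): compute the exact Gaussian process predictive variance at a test point and compare it to the variance obtained when the kernel matrix is replaced by a Nyström low-rank approximation. All data, kernel settings and the choice of inducing points below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between the row vectors of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

# Toy 1-D training inputs and a single test point.
X = rng.uniform(-3, 3, size=(200, 1))
x_star = np.array([[0.5]])
noise = 0.1

K = rbf(X, X)
k_star = rbf(X, x_star)

# Exact GP predictive variance at x_star.
A = K + noise**2 * np.eye(len(X))
var_exact = rbf(x_star, x_star) - k_star.T @ np.linalg.solve(A, k_star)

# Nyström low-rank approximation of K from m randomly chosen inducing points.
m = 10
idx = rng.choice(len(X), size=m, replace=False)
K_nm = rbf(X, X[idx])
K_mm = rbf(X[idx], X[idx]) + 1e-8 * np.eye(m)   # jitter for numerical stability
K_approx = K_nm @ np.linalg.solve(K_mm, K_nm.T)

A_approx = K_approx + noise**2 * np.eye(len(X))
var_approx = rbf(x_star, x_star) - k_star.T @ np.linalg.solve(A_approx, k_star)

print("exact predictive variance:   ", var_exact.item())
print("low-rank predictive variance:", var_approx.item())
```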

EM-Clustering when double isn’t enough…

Tuesday, February 13th, 2007

One of the more interesting developments in clustering (in my opinion) is clustering of data that lies on a unit hypersphere. It sounds like a rare special case at first, but it appears quite frequently in real-life applications such as spectral embeddings in spectral clustering, some subdomains of bioinformatics, or text data in TF-IDF representation. The data can be analyzed as unit vectors on a d-dimensional hypersphere, or equivalently, it is directional in nature. Spectral clustering techniques generate such embeddings in their normalization step, which constitutes an example of directional data and can result in different shapes on the hypersphere.
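
The “directional” view amounts to nothing more than projecting every data vector onto the unit sphere, so that only its direction matters. A tiny sketch with TF-IDF text features (toy documents, not a real corpus):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "gaussian process regression with error bars",
    "spectral clustering of text documents",
    "clustering directional data on the hypersphere",
]

# TfidfVectorizer L2-normalizes each row by default, so every document
# already sits on the unit hypersphere; only its direction matters.
X = TfidfVectorizer(norm="l2").fit_transform(docs).toarray()
print(np.linalg.norm(X, axis=1))   # -> all (approximately) 1.0
```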

The first paper to suggest a good clustering algorithm for this kind of data presented an Expectation Maximization (EM) algorithm for mixtures of von Mises-Fisher distributions (Banerjee et al., JMLR 6, 2005). Avleen and I started to work on an extension that utilizes the Watson distribution, a distribution for directional data with more modeling capability than the simpler von Mises-Fisher distribution. We just published our results for the Watson EM clustering algorithm at the AISTATS 2007 conference to be held in March (Matlab code will be available soon).
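
For readers who want to play with the basic idea, here is a minimal spherical k-means sketch, which is the hard-assignment limit of EM for a von Mises-Fisher mixture (it drops the concentration parameters entirely, so it is far simpler than either the Banerjee et al. algorithm or our Watson version; the data and cluster count are made up):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Cluster the unit-norm rows of X by cosine similarity.

    Hard-assignment limit of EM for a von Mises-Fisher mixture with a
    shared concentration; kappa is ignored entirely in this sketch.
    """
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project onto the sphere
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # "E-step" (hard): assign each point to the closest mean direction.
        labels = np.argmax(X @ centers.T, axis=1)
        # "M-step": the new mean direction is the renormalized cluster sum.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                s = members.sum(axis=0)
                centers[j] = s / np.linalg.norm(s)
    return labels, centers

# Toy directional data: two noisy bundles of vectors, normalized inside the function.
rng = np.random.default_rng(1)
A = rng.normal([3, 0, 0], 0.3, size=(100, 3))
B = rng.normal([0, 3, 0], 0.3, size=(100, 3))
labels, centers = spherical_kmeans(np.vstack([A, B]), k=2)
print(np.bincount(labels))
```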

One problem with both algorithms is that they require a high-precision number representation in order to work well for high-dimensional problems such as bioinformatics data and text. Most prior work with directional data was limited to maybe three-dimensional cases, and most Kummer-function approximations (another problem we had to address) work well only in the lower-dimensional cases. In our AISTATS paper we only presented results for lower-dimensional embeddings, as we had some problems getting the method to work for higher-dimensional data (also, the root solver involved is simply incapable of handling larger problems). We have been working on a speedup with some success, but I have to say that it was mostly the numerical problems that gave us a hard time.

More and more machine learning techniques require careful consideration of numerical problems (support vector machines, my manifold clustering algorithm, etc.), and I run into numerical problems every other day. While trying to improve our Watson EM algorithm I found out that continued fractions have many desirable properties, such as a unique, finite representation of rational numbers: rationals can be represented exactly, with no numerical error. In the EM algorithm we use them to approximate the Kummer function. Maybe a more exact number representation for fractions can be made out of this?
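
To see why continued fractions are appealing, here is a tiny sketch (independent of our Kummer-function code, which is not shown here) that expands a rational number into its continued-fraction terms and reconstructs it exactly, using Python’s fractions module:

```python
from fractions import Fraction

def to_continued_fraction(x: Fraction):
    """Return the finite continued-fraction terms [a0; a1, a2, ...] of a rational x."""
    terms = []
    while True:
        a, rest = divmod(x.numerator, x.denominator)
        terms.append(a)
        if rest == 0:
            return terms
        x = Fraction(x.denominator, rest)   # reciprocal of the fractional part

def from_continued_fraction(terms):
    """Rebuild the exact rational from its continued-fraction terms."""
    value = Fraction(terms[-1])
    for a in reversed(terms[:-1]):
        value = a + 1 / value
    return value

x = Fraction(355, 113)                     # a classic rational approximation of pi
cf = to_continued_fraction(x)
print(cf)                                  # [3, 7, 16]
print(from_continued_fraction(cf) == x)    # True: no rounding error anywhere
```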

So I started looking and found a tech report showing that the number representation of continued fractions can be implemented nicely in Prolog. It also explains how to add numbers in a continued-fraction representation, and so on, with arbitrary precision.

I haven’t found any papers yet that suggest a suitable hardware implementation of Continued Fractions to replace the IEEE floating point numbers we use nowadays, but it can probably be done.

Artificial Intelligence and Sports

Thursday, February 8th, 2007

A couple of days ago Indianapolis won the Super Bowl – just as predicted by an Electronic Arts simulation. The simulation software had been fed the latest data about all the players involved, and the game AI fought it out. In the past some of the simulated outcomes were not that close to the final scores, but they still did a fairly decent job in 2005 and 2007.

There is more and more statistical decision-making in basketball as well, the most famous example being the Miami-Orlando series in the 1997 NBA playoffs.

Interesting…

Sequential Sampling and Machine Learning

Monday, January 8th, 2007

In order to estimate an unknown quantity μ, a common approach is to design an experiment that results in a random variable Z distributed within the interval [0,1] with expectation E[Z]=μ. The mean can then be estimated by running this experiment independently many times and averaging the outcomes – the standard Monte-Carlo estimate. In (Dagum, Karp, Luby and Ross, SIAM Computing, 1995) the AA algorithm (“Approximation Algorithm”) is introduced which, given ε and δ and independent runs of the experiment for Z, produces an estimate of the true expectation that is within a factor of 1+ε of μ with probability of at least 1-δ. Note that the algorithm makes no distributional assumptions. This has a couple of applications in machine learning, for example in Bayesian inference, Bayesian networks and boosting (Domingo and Watanabe, PAKDD, 2000).

The AA algorithm works in three steps. First, a stopping rule computes an initial estimate of the mean. Then the variance is estimated and, in the third step, additional samples are taken to refine the estimate of the expectation further. A small improvement to the stopping rule in step one can be made as follows. The algorithm assumes a non-zero expectation and keeps sampling until the running sum of the outcomes exceeds a constant determined by ε and δ (read the paper to see why that works). The problem is that the closer the mean is to zero, the more samples are needed.
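
A minimal sketch of the stopping-rule step in Python; the threshold constant is my reading of the stopping-rule theorem in Dagum et al. and should be double-checked against the paper rather than taken as given:

```python
import math
import random

def stopping_rule_estimate(sample, eps, delta, max_samples=10**7):
    """Estimate the mean of a [0,1]-valued experiment `sample()` to within a
    factor of (1+eps) with probability at least 1-delta, via the stopping rule.

    The threshold constant below is my reading of Dagum, Karp, Luby & Ross --
    treat it as an assumption, not gospel.
    """
    threshold = 1 + (1 + eps) * 4 * (math.e - 2) * math.log(2 / delta) / eps**2
    total = 0.0
    for n in range(1, max_samples + 1):
        total += sample()                 # one independent run of the experiment
        if total >= threshold:            # stop once the running sum is large enough
            return threshold / n          # the estimate of the mean
    raise RuntimeError("mean too close to zero: did not stop within max_samples")

# Toy experiment: a biased coin; the smaller the true mean, the longer this runs.
mu_true = 0.3
estimate = stopping_rule_estimate(lambda: float(random.random() < mu_true),
                                  eps=0.1, delta=0.05)
print("true mean:", mu_true, "estimate:", estimate)
```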

Observe that the following holds for the mean:

With that one can improve the stopping rule as follows:

P.S.: To type Greek letters into WordPress, use the HTML named entities, such as &epsilon; for ε. That took me forever…

Just got back from NIPS 2006

Tuesday, December 12th, 2006

Just got back in town from the NIPS conference. I’ve been to a couple of Machine Learning conferences before, but this was my first time at NIPS. A couple of papers were very interesting (you can download them at books.nips.cc):

  • Manifold Denoising
    Matthias Hein, Markus Maier
  • Fundamental Limitations of Spectral Clustering Methods
    Boaz Nadler, Meirav Galun
  • Learning with Hypergraphs: Clustering, Classification, and Embedding
    Dengyong Zhou, Jiayuan Huang, Bernhard Schoelkopf
  • Recursive Attribute Factoring
    David Cohn, Deepak Verma, Karl Pfleger

However, I found the single-track style of the conference boring at times. My interest in the latest fMRI results and the like is low right now, so at times there was nothing to do but mingle or just do nothing. At ICML there is always at least one conference track that is interesting to me. The poster sessions at NIPS were very interesting, though.

The workshops were more interesting than the conference; only the room sizes were misallocated. Some workshops (the ones with the big rooms) were rather empty, while the ones I attended were overcrowded. And, of course, the traditional workshop summaries at the end were funny. The ones that stuck out in my mind the most were Man vs. Bird from the acoustic processing workshop and the novel applications of non-linear dimensionality reduction with their Swiss-roll video. I got a few new ideas from the workshops that may work out.

Also, there were no T-shirts. At this year’s ICML plenty of free T-shirts were given out – unfortunately during the reception, which forced everybody to carry their T-shirts around for the entire reception (it looked very amusing, though) – yet at NIPS all we got was a mug… 🙂

Last but not least, I had heard about the legendary NIPS parties from my friends and had some high expectations :-). Friday night I attended the GURU party from Garry’s Unbelievable Research Unit, and Saturday was the legendary Gatsby party. Both parties were rather disappointing, so I went to check out the nightlife in Whistler instead. The bars and clubs I found were pretty quiet as well. Ah well… I’ve heard from people that Whistler had fewer people than last year around that time.

Ensemble Predictors and Democracy

Wednesday, November 15th, 2006

I just read an interesting article about how society is usually described in science fiction. It turns out that in practically all cases it is a very hierarchical, military-like structure. There are no suggestions as to how a civilian society might work in the future. Consider something like Star Trek, where a bunch of officers command a starship around and the rest of the people just function. The captain is smart and benevolent, and there is rarely an abuse of power. No democracy, no voting, little about how the civilian society of the future might function. There are ideas like Futarchy, but that’s pretty much all I could find in a quick search (and it wasn’t proposed in a sci-fi novel, so it can’t be any good 🙂 ). One of the problems with democracy might be that people don’t always make the right decision, since they don’t have access to all the information or are easily swayed by bad arguments (e.g. negative ads – some of them are just factually wrong). My point is that there haven’t been many viable alternatives proposed, not even crazy, outlandish suggestions (think teleportation as a means of transport) to give people new ways to think about this.

There is an interesting book out there called The Wisdom of Crowds. It proposes that large crowds of people can be capable of making better decisions than individuals. Long story short, according to the book four key qualities are necessary to make a crowd smart. The crowd needs to be diverse, so that people bring different pieces of information to the table. It needs to be decentralized, so that nobody at the top dictates the crowd’s answer, and there has to be a way to summarize people’s opinions into one collective verdict. Finally, the people in the crowd need to be independent, so that they pay attention mostly to their own information and don’t worry about what everyone around them thinks (i.e. they are immune to persuasion concepts like social proof).

Random Forests in machine learning are an ensemble method with very good classification performance. The way it works is that hundreds of decision trees are built, each on a different bootstrap sample of the training data and with a different random choice of features. If the individual trees are reasonably accurate (not able to make perfect predictions, but tending to do the right thing with the information they have) and make their decisions independently, then the majority vote of all the trees in the ensemble will tend to reduce the misclassification error (i.e. bad decisions). Breiman gave a mathematical argument for why the generalization error goes down as long as the individual trees are accurate enough and not too correlated.
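
As a toy version of “many independent, reasonably informed voters beat a single one”, here is a minimal sketch comparing one decision tree with a forest of bootstrapped, feature-subsampled trees; the synthetic dataset and all parameters are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One tree versus a vote over hundreds of trees, each trained on a different
# bootstrap sample and restricted to a random subset of features per split.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                                random_state=0).fit(X_train, y_train)

print("single tree accuracy:", single_tree.score(X_test, y_test))
print("forest accuracy:     ", forest.score(X_test, y_test))
```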

I wonder if something like this might work for political decision-making. Leaving problems like corruption and other human fallacies (e.g. looking at what others are doing) aside for a moment, and assuming that for everything there are good arguments to be made for and against a bill, a senator’s vote would depend on how he or she weighs the particular arguments for and against it. If we assume that senators tend to vote for what they perceive to be the right thing, would giving each senator a random subset of the information make the overall Senate vote for the “right thing”? Another idea would be to make political decisions, similar to jury duty, by picking a large number of people from the general population at random and having them decide on a particular issue.
Edit: I found some criticism of the Wisdom-of-Crowds theory, such as Wikipedia not being accurate enough, or a democracy electing people like Hitler. A good question in both cases is whether people made their decisions independently. I think that independent decisions are difficult to achieve in practice. One also has to wonder how robust this system is, given the assumption that everybody makes the best decision they can.

Google Co-Op

Thursday, October 26th, 2006

Google recently started their new Co-Op service (see here), which allows users to create their own personalized search engines for whatever topic they desire. More importantly, it allows the search to be included in the website of whoever created the search engine. Especially the latter is nice: before, it was only possible to send the user away to Google’s homepage, open it in an iframe, or open a new window – all things webmasters wouldn’t want for one reason or another. I just created a search engine for machine-learning related materials. I’m still adding sites, but I’m already getting better results for my personal searches. I also wrote a module for the PHP-Nuke content management system to include whichever personal Co-Op search engine you created into the system. You can download the Search-The-Web module for PHP-Nuke on my website. It even comes with English, German and French language files (although the French translation might be a bit funny).

Data mining used to find new materials

Sunday, August 27th, 2006

I just read a EurekAlert (see also ZDNet’s blog) mentioning that a couple of researchers at MIT found new, potentially useful crystal structures with AI and data mining techniques. You can find the abstract of their paper here. I’ve seen randomness and genetic algorithms around a lot lately (such as the self-reconfigurable modular robot), as well as a robot that can do bioinformatics experiments (DNA sequencing) all by itself (link?). I think that this is a very useful application of AI. However, it is only an application of existing scientific knowledge: fast testing based on current physical models and insights. It automates science to an extent, but does not come up with new insights. It’s more data without more people to add an interpretation. For example, it took a few years before somebody found an application for Teflon.

I haven’t seen this around (I will search again), but what would be really interesting is an algorithm that can form a new hypothesis (e.g. a differential equation) based on the outcomes of physics experiments – an algorithm that explains the data and forms a theory. It’s probably harder to build than regression algorithms…

 
