Archive for June, 2007

Back from ICML …

Monday, June 25th, 2007

Just got back in town from ICML 2007; I had a blast and met lots of old friends. This year it felt a bit more like a camping trip, with no hot water and filthy bathrooms. Otherwise I learned a lot; in particular, the following papers were, in my opinion, the most interesting (in no particular order):

  • Pegasos: Primal Estimated sub-GrAdient SOlver for SVM (Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro): a very fast online SVM solver with good bounds, and it is kernelizable. Code is available. Most impressive results, and probably useful for the robot stuff we are working on.
  • A Kernel Path Algorithm for Support Vector Machines (Gang Wang, Dit-Yan Yeung, Frederick Lochovsky): speeds up SVM model selection by not having to re-train from scratch when the kernel sigma is changed. I hope they will make some code available 🙂
  • Restricted Boltzmann Machines for Collaborative Filtering (Ruslan Salakhutdinov, Andriy Mnih, Geoffrey Hinton). Now #4 at the Netflix Challenge. I already wrote in my AISTATS post that I think this technique has a lot of potential.
  • Multi-Armed Bandit Problems with Dependent Arms (Sandeep Pandey, Deepayan Chakrabarti, Deepak Agarwal): a clustering trick to distribute rewards and speed up reinforcement learning in settings such as banner advertising.
  • CarpeDiem: an Algorithm for the Fast Evaluation of SSL Classifiers (Roberto Esposito, Daniele Radicioni): a fast Viterbi variant.
  • Graph Clustering with Network Structure Indices (Matthew Rattigan, Marc Maier, David Jensen): fast, simple graph-mining algorithms; timely, since I’m currently reading “Mining Graph Data”…
  • A Dependence Maximization View of Clustering (Le Song, Alex Smola, Arthur Gretton, Karsten Borgwardt): an interesting, general framework for clustering based on the Hilbert-Schmidt Independence Criterion that turns many clustering algorithms (k-means, spectral clustering, hierarchical clustering) into mere special cases…
  • Neighbor Search with Global Geometry: A Minimax Message Passing Algorithm (Kye-Hyeon Kim, Seungjin Choi). Interesting idea…
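As a teaser for the Pegasos paper above: the core update is simple enough to sketch in a few lines. This is my own toy reimplementation from the paper's description, not the authors' released code, and all variable names are mine. On each step it takes one example, applies the hinge-loss sub-gradient with step size 1/(λt), and optionally projects back onto a ball of radius 1/√λ:

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pegasos(X, y, lam=0.01, epochs=200, seed=0):
    """Toy Pegasos: stochastic sub-gradient descent for a linear SVM.

    X: list of feature vectors, y: labels in {-1, +1}.
    """
    rnd = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rnd.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)           # step size 1/(lambda * t)
            scale = 1.0 - eta * lam         # shrink from the regularizer
            if y[i] * dot(w, X[i]) < 1.0:   # margin violated: hinge sub-gradient
                w = [scale * wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
            else:
                w = [scale * wj for wj in w]
            # optional projection onto the ball of radius 1/sqrt(lambda)
            norm = math.sqrt(dot(w, w))
            limit = 1.0 / math.sqrt(lam)
            if norm > limit:
                w = [wj * limit / norm for wj in w]
    return w
```

On a tiny linearly separable toy set (e.g. points around (1,1) labeled +1 and around (-1,-1) labeled -1) this converges to a separating weight vector in a handful of epochs; the real paper adds mini-batches and the kernelized variant.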


I just noticed that my paper list is exceptionally long this time, so I did get a lot of cool new things out of it; I’m glad I went. I will hopefully be able to try some of these ideas soon (mostly busy with thesis writing right now)…

Text Mining for Tax Evaders on eBay

Tuesday, June 12th, 2007

A (long) while ago there was a lot of talk about the IRS doing text mining to find people who did not report income from, e.g., eBay auctions. The German version of this effort, called XPIDER (see also here and there), was just shown to be totally ineffective. After years of trying and spending millions, they did not find conclusive evidence against a single tax evader. The German GAO (Bundesrechnungshof) is now trying to find out what went wrong, why, and who misspent all that money. I wonder if the more sophisticated web-crawling program used by the Canadian tax authorities (it’s called XENON, as reported by Wired) is similarly effective …

Interesting Experimental Captchas

Monday, June 11th, 2007

Captchas are those little word puzzles in images that websites use to keep spammers and bots out. They are everywhere, and even the New York Times had an article about captchas recently. It turns out that breaking them is a nice exercise in applied machine learning (with lots of image manipulation to clean up the images first). Since spam bots are becoming smarter, people are switching to new kinds of captchas. My favorites (image-based ones) so far are KittenAuth and a 3D-rendered word captcha.

Choosing the right features for Data Mining

Friday, June 1st, 2007

I’m fascinated by a common problem in data mining: how do you pick variables that are indicative of what you are trying to predict? For many prediction tasks it is simple; if you are predicting some strength of a material, you sample it at certain points using your understanding of physics and materials science. It gets more fascinating with problems that we believe we understand, but don’t. For example, what makes a restaurant successful and guarantees repeat business? The food? The pricing? A friendly wait-staff? It turns out a very big indicator of restaurant success is the lighting. I admit lighting wouldn’t have made my top-ten list… And if you consider what is asked on the average “are you satisfied with our service” questionnaire found in various restaurant chains, I don’t recall ever seeing anything about the ambience. We are asking the wrong questions.

There are many other real-life problems just like this. I read a book called Blink, and the author's point is that subconscious decisions are easy to make – once you know what to look for. More information is not always better. This holds for hard problems such as judging the effectiveness of teachers (IIRC, watching the first 30 seconds of a videotape of a teacher entering a classroom is as indicative as watching hours of recorded lectures). The same holds for predictions about relationships – how can you tell whether a couple will still be together 15 years later? It turns out there are four simple indicators to look for, and you can do it in maybe two minutes of watching a couple… The book is full of examples like that, but it does not provide a way to “extract the right features”. I have the same problem with the criminology work I’m doing: while we get pretty good results using features suggested by the criminology literature, I keep wondering whether we have the right features. I still think we could improve our results if we had more data – or rather the “right” data, I should say (it should be obvious by now that more is not always better). So how do you pick the features for a given problem? Tricky question…

There is one kind of data-mining system that does not have this problem: recommender systems. They sidestep the issue because they do not rely on particular features to predict; instead they exploit correlations in “liking”. A classic example: people who like classical music often like jazz as well – something you wouldn’t easily predict from features extracted from the music itself. I wonder if we could reframe some prediction problems to be more like recommender systems, or maybe make better use of meta-information in certain problems. What I mean by “meta-information” is easily explained with an example: PageRank. It is so successful in web-scale information retrieval because it does not bother trying to figure out whether a page is relevant from keyword ratios and whatnot, but simply measures popularity by how many important pages link to it (before link spam became a problem, that is). I wish something that simple were possible for every problem 🙂
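To make the PageRank example concrete, here is a minimal power-iteration sketch of the idea. This is a toy illustration of the published algorithm, not Google's implementation; the graph, function name, and parameters are made up for the example:

```python
def pagerank(links, damping=0.85, iters=50):
    """Toy power-iteration PageRank on an adjacency dict {page: [outlinks]}.

    A page's rank is (1 - d)/n plus d times the rank shares
    passed along by the pages linking to it.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:                  # split rank among outlinks
                    new[q] += share
            else:
                for q in pages:                 # dangling page: spread evenly
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

Note that nothing in there looks at a page's content; a page linked by other highly ranked pages ends up highly ranked itself, which is exactly the "meta-information" point above.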