Archive for the ‘Machine Learning’ Category

The cloud obscuring the scientific method

Saturday, July 12th, 2008

“All models are wrong, and increasingly you can succeed without them” — George Box
"Sometimes…" — Me

In a Wired article about the petabyte age of data processing, the author claimed that given the enormous amounts of data and the patterns found by data mining, we are less and less dependent on scientific theory. This has been strongly disputed (see Why the cloud cannot obscure the Scientific Method), as the author simply ignores the fact that the patterns that are found are not necessarily exploitable – finding a group of genes that interact is a first step, but won’t cure cancer. However, in machine translation or placing advertising online one can succeed with little to no domain knowledge – that is, once somebody comes up with the right features to use (see Choosing the right features for Data Mining).

What would be interesting to develop, however, is a “meta-learning” algorithm that can abstract from simpler models and learn, e.g., a differential equation. For example, let’s take data from several hundred physics experiments on heat distribution, conducted on different surfaces and so on. We can probably learn a regression model for one particular experiment that predicts how the heat will distribute given the parameters of the experiment (material, surface, etc.). The meta-learning algorithm would then look at these models and somehow come up with the heat equation. That would be something…
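Just to make that concrete, here is a toy sketch of the idea (everything below – the synthetic 1-D data, the function names, the restriction to a single known candidate operator – is my own made-up illustration, not an actual system): a stand-in for the per-experiment regressor produces predictions of u(x, t), and the “meta” step does a least-squares fit of the diffusion coefficient in u_t = α·u_xx from those predictions.

    # Toy sketch only: recover the 1-D heat equation u_t = alpha * u_xx from the
    # predictions of per-experiment models. A finite-difference simulation stands
    # in for a learned regression model; everything here is made up for illustration.
    import numpy as np

    def experiment_model(alpha, nx=50, nt=200, dx=0.02, dt=1e-4):
        """Stand-in for a learned per-experiment regressor: returns u(x, t) on a grid."""
        x = np.arange(nx) * dx
        u = np.exp(-((x - 0.5) ** 2) / 0.01)          # initial heat bump
        frames = [u.copy()]
        for _ in range(nt - 1):
            u_xx = np.zeros_like(u)
            u_xx[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx ** 2
            u = u + dt * alpha * u_xx                 # explicit forward step
            frames.append(u.copy())
        return np.array(frames)

    def meta_fit_alpha(frames, dx, dt):
        """'Meta-learning' step: least-squares fit of alpha in u_t = alpha * u_xx."""
        u_t = (frames[1:] - frames[:-1]) / dt                                   # time derivative
        u_xx = (frames[:-1, 2:] - 2 * frames[:-1, 1:-1] + frames[:-1, :-2]) / dx ** 2
        u_t = u_t[:, 1:-1]                                                      # match interior points
        return float((u_xx * u_t).sum() / (u_xx ** 2).sum())

    for true_alpha in (0.5, 1.0, 1.5):
        frames = experiment_model(true_alpha)
        print(true_alpha, "->", round(meta_fit_alpha(frames, dx=0.02, dt=1e-4), 3))

Of course the hard and interesting part, which this sketch conveniently skips, is having the meta-learner come up with the candidate differential operators itself rather than being handed u_xx.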

Using Psychological Domain Knowledge for the Netflix Challenge

Thursday, February 28th, 2008

I read an interesting article today about using psychological domain knowledge for improving recommender-system predictions. A very interesting idea…

Back from NIPS 2007

Tuesday, December 11th, 2007

Just got back home from NIPS. The following papers I found pretty interesting:

  • Random Features for Large-Scale Kernel Machines
  • On Ranking in Survival Analysis: Bounds on the Concordance Index
  • Efficient Inference for Distributions on Permutations

I spent the workshop time in the Learning Problem Design workshop and in the Security workshop. I also dropped by “Statistical Networks” briefly, but there’s room for improvement in my current understanding of Gibbs sampling and the like. The consensus in the problem design workshop seemed to be that machine learning must become more modular. There was also agreement that applying machine learning in the real world requires some magic for transforming the problem into “features” and some more magic for transforming the prediction into something useful. That was stating the obvious a bit, and not much progress has been made in this area of making ML more accessible. I wrote about choosing the right features before, but currently it’s more of an art than a science. One thing I took from the security workshop was that features must be easy to construct (most detection apps must run in real time). This means we are interested in features the attacker can hardly influence (think Received headers in spam emails that cannot be suppressed), yet they must also be easy to compute.
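To make that last point a bit more tangible, here is a tiny sketch of my own (not something shown at the workshop) of such a feature: the number of Received headers on an email. Every relay on the path adds one, the sender can pad the list but can hardly remove entries added downstream, and counting them is about as cheap as a feature gets.

    # Toy feature extractor: count Received: headers as a cheap, hard-to-suppress feature.
    # The sample message below is made up for illustration.
    from email import message_from_string

    def received_header_count(raw_email: str) -> int:
        """Each relay on the delivery path adds one Received: header."""
        msg = message_from_string(raw_email)
        return len(msg.get_all("Received") or [])

    sample = (
        "Received: from mx1.example.org by mail.example.com; Tue, 11 Dec 2007 10:00:00 -0800\r\n"
        "Received: from unknown (HELO spamhost) by mx1.example.org; Tue, 11 Dec 2007 09:59:58 -0800\r\n"
        "Subject: hello\r\n"
        "\r\n"
        "body text\r\n"
    )
    print(received_header_count(sample))  # -> 2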

Also really cool was the “NIPS Elevator Process” joke paper about hungry scientists on the way to lunch (don’t confuse it with the Chinese Restaurant Process) and the party crashers at the Gatsby party. Sophie and some friends of hers simply joined the fun. The fun part was people taking her random answers about her research topic seriously 🙂 I got mistaken for one of the party crashers at one point, because my clothing didn’t fit in. I was actually planning on hitting a club in downtown Whistler, but didn’t get around to it in the end…

The GPL and Machine Learning Software – Should the GPL cover training data?

Monday, October 1st, 2007

I’ve followed the discussion and introduction of the GPL v3 for a bit. One major change in the license is supposed to close the loophole commonly referred to as the “tivoization” of GPL software, i.e. mechanisms that prevent people from tinkering with the product they bought which includes GPL software. Tivo, in particular, accomplishes this by requiring a valid cryptographic signature for the code to run – the user has access to the code, but it’s of no use. One of the main ideas of the GPL was to allow people the freedom to tinker, improve and understand how something works. This got me thinking a bit about software that uses machine learning techniques.

For the sake of the argument, let’s assume that somebody releases a GPL version of a speech recognition system, or say an improved version of a GPL speech recognition system. While the algorithms would be in the open for everyone to see, two major components of speech recognition systems, the Acoustic Model and Language Model, do not have to be. The Acoustic Model is created by taking a very large number of audio recordings of speech and their transcriptions (Speech Corpus) and ‘compiling’ them into statistical representations of the sounds that make up each word. The Language Model is a very large file containing the probabilities of certain sequences of words in order to narrow down the search for what was said.
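Just for illustration (this is a toy of my own, orders of magnitude simpler than what a real recognizer ships): the trained Language Model essentially boils down to tables of estimated probabilities, e.g. a bigram model of P(word | previous word).

    # Minimal illustration of what a (bigram) language model stores: estimated
    # probabilities of word sequences. The tiny "corpus" is made up.
    from collections import Counter, defaultdict

    def train_bigram_lm(corpus_sentences):
        """Estimate P(word | previous word) from a list of tokenized sentences."""
        counts = defaultdict(Counter)
        for sentence in corpus_sentences:
            tokens = ["<s>"] + sentence + ["</s>"]
            for prev, cur in zip(tokens, tokens[1:]):
                counts[prev][cur] += 1
        return {
            prev: {cur: n / sum(following.values()) for cur, n in following.items()}
            for prev, following in counts.items()
        }

    lm = train_bigram_lm([["recognize", "speech"], ["wreck", "a", "nice", "beach"]])
    print(lm["<s>"])  # -> {'recognize': 0.5, 'wreck': 0.5}

The point being: those numbers are trivial to ship, but without the corpus they were estimated from they are practically impossible to reproduce or to modify in a meaningful way.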

A big part of how well the speech recognition system works comes down to this training. The author who improved upon the software should publish the training set as well; otherwise people won’t be able to tinker with the system or understand why the software works well.

The same would hold for something like a handwriting recognition system. One could publish it along with a model (a bunch of numbers) that makes the recognition work. It would be pretty hard for somebody to reverse-engineer what the training examples were and how training was conducted to make the system work. Getting the training data is the expensive part of building speech-recognition and handwriting-recognition systems.

Think Spam-Assassin – what if the authors suddenly decide to not make their training corpus available anymore? How would users be able to determine the weights for the different tests?

I don’t think this case is covered by the GPL (v3 or older) – (However, I’m not a lawyer). Somebody could include the model in C code (i.e. define those weights for a Neural Net as a bunch of consts) and then argue that all is included to compile the program from scratch as per the requirements of the license. However, the source by itself wouldn’t allow anybody to understand or change (in a meaningful way) what the program is doing. With the growing importance of machine learning methods just being able to recompile something won’t be enough. I think this should be taken into consideration by the open source community for GPL v3.01.

I passed my PhD defense

Wednesday, September 19th, 2007

Hurray…

Advertising and Data Mining

Thursday, August 30th, 2007

Lately I’ve become fascinated with the field of advertising and marketing. Honestly, I’ve never paid much attention to ads before, as most of them were just not interesting to me or were just plain horrible. I just finished reading a good book about the choice of headlines in direct marketing and how one actually tests which versions get the most responses (“Tested Advertising Methods”, John Caples).

There’s obviously a lot that could be done here with machine learning. For example, some kind of predictor that learns from previous ads (or possibly across ads from various companies) how successful they have been and then predicts return rates for new ads. Google and Yahoo probably do something like this already…
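As a rough sketch of what I mean (the features and numbers are entirely made up, and plain least squares stands in for whatever one would really use): describe each past ad by a few features, regress the observed return rate on them, and score a new ad with the fitted weights.

    # Hypothetical sketch: regress past ad return rates on a few hand-picked features,
    # then score a new ad. Features and numbers are invented for illustration.
    import numpy as np

    past_ads = np.array([
        # headline_words, mentions_price, has_question, return_rate
        [6,  1, 0, 0.031],
        [12, 0, 1, 0.012],
        [8,  1, 1, 0.027],
        [15, 0, 0, 0.008],
        [7,  1, 0, 0.029],
    ])
    X = np.column_stack([np.ones(len(past_ads)), past_ads[:, :3]])  # add intercept
    y = past_ads[:, 3]

    w, *_ = np.linalg.lstsq(X, y, rcond=None)     # ordinary least-squares fit
    new_ad = np.array([1, 9, 1, 0])               # intercept + features of a new ad
    print("predicted return rate:", float(new_ad @ w))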

Human Intuition vs. Statistical Models

Saturday, August 25th, 2007

I just came across a very interesting book announcement for “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres, a professor at Yale Law School and an econometrician. In the book (I haven’t read it yet, but I will) the author argues that intuition is losing ground to statistical methods and data mining. According to the Amazon abstract he gives examples from the airline industry, medical diagnostics and even online dating services showing that a statistical model will outperform human intuition.

That machines can outperform human judgement has been known for quite some time. For example, in the field of psychology the diagnosis of mental disorders is more or less standardized by the DSM. There was a very interesting meta-analysis showing that mechanical predictors on average outperform the human psychologist. To be specific: Grove, W.M., Zald, D.H., Hallberg, A.M., Lebow, B., Snitz, E., & Nelson, C. (2000). Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12, 19–30. To quote from the abstract: “On average, mechanical-prediction techniques were about 10% more accurate than clinical predictions. Depending on the specific analysis, mechanical prediction substantially outperformed clinical prediction in 33%-47% of studies examined. Although clinical predictions were often as accurate as mechanical predictions, in only a few studies (6%-16%) were they substantially more accurate. Superiority for mechanical-prediction techniques was consistent, regardless of the judgment task, type of judges, judges’ amounts of experience, or the types of data being combined.”

I’m a little bit skeptical about using data crunching to decide important questions (as in life-and-death questions). In general it seems like a good idea, but it always comes down to how you model the data and how you frame the question to be answered. In many cases this might be obvious, in others not so much. The art is in modeling the data, not in applying the algorithm or technique.

It reminds me a bit of a class about formal program verification I took back in Darmstadt. Stefan, the TA of the class, and I had an argument about the practicability of program verification. He gave the Unix find utility as an example: you can show more or less easily that the program will terminate while enumerating all the files in all the directories in the system, because find can be nicely modeled with a well-founded relation that proves the termination of the algorithm. I objected that I could set a symbolic link to an upper-level directory (which is why find does not follow them by default) and make find go in circles. Stefan conceded, “Oh well, I guess then the model was wrong…”. Similar things have happened in, e.g., cryptography, where a finite-state model (sorry, I lost the citation somewhere; I’m not quite sure if it was the Usenix paper from the Stanford guys or something else) showed that the SSL protocol (Secure Sockets Layer) is secure. Later the protocol was broken nonetheless (Schneier, Bruce; Wagner, David: Analysis of the SSL 3.0 Protocol).

I think that with the wrong model you can show a lot of nice things about anything. Once you abstract from the real world and build a model, you might just have ignored the one feature that matters most. Maybe it is time for a best-practices guide for data modeling and data mining (there are already some books out there for specific domains)…

What Machine Learning Papers to read …

Friday, July 13th, 2007

Laura just pointed me to this system, best described as:

I have a routine problem that sometimes paper titles are not enough to tell me what papers to read in recent conferences, and I often do not have time to read abstracts fully. This collection of scripts is designed to help alleviate the problem. Essentially, what it will do is compare what papers you like to cite with what new papers are citing. High overlap means the paper is probably relevant to you. Sure there are counter-examples, but overall I have found it useful (eg., it has suggested papers to me that are interesting that I would otherwise have missed). Of course, you should also read through titles since that is a somewhat orthogonal source of information.

http://www.cs.utah.edu/~hal/WhatToSee/

I have the same problem. And wow… I will have a lot to read this weekend.

Back from ICML …

Monday, June 25th, 2007

Just got back in town from ICML 2007, had a blast and met lots of old friends. This year it felt a bit more like a camping trip with no hot water and filthy bathrooms. Otherwise I learned a lot, specifically the following papers were in my opinion the most interesting (in no particular order):

  • Pegasos: Primal Estimated sub-GrAdient SOlver for SVM (Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro): a very fast online SVM with good bounds, kernelizable. Code available. Most impressive results and probably useful for the robot stuff we are working on (a rough sketch of the update rule follows after this list).
  • A Kernel Path Algorithm for Support Vector Machines (Gang Wang, Dit-Yan Yeung, Frederick Lochovsky): speeds up SVM learning by not having to re-train when the kernel sigma is changed. I hope they will make some code available 🙂
  • Restricted Boltzmann Machines for Collaborative Filtering (Ruslan Salakhutdinov, Andriy Mnih, Geoffrey Hinton): now #4 on the Netflix Challenge. I already wrote in my AISTATS post that I think this technique has a lot of potential.
  • Multi-Armed Bandit Problems with Dependent Arms (Sandeep Pandey, Deepayan Chakrabarti, Deepak Agarwal): a clustering trick to distribute rewards and speed up reinforcement learning in settings such as banner advertising.
  • CarpeDiem: an Algorithm for the Fast Evaluation of SSL Classifiers (Roberto Esposito, Daniele Radicioni): a faster Viterbi.
  • Graph Clustering with Network Structure Indices (Matthew Rattigan, Marc Maier, David Jensen): fast, simple graph-mining algorithms. Fits well, since I’m currently reading “Mining Graph Data”…
  • A Dependence Maximization View of Clustering (Le Song, Alex Smola, Arthur Gretton, Karsten Borgwardt): an interesting, general framework for clustering using the Hilbert-Schmidt Independence Criterion that makes many clustering algorithms (k-means, spectral clustering, hierarchical clustering) mere special cases…
  • Neighbor Search with Global Geometry: A Minimax Message Passing Algorithm (Kye-Hyeon Kim, Seungjin Choi): interesting idea…
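Since Pegasos is simple enough to write down from memory, here is a back-of-the-envelope sketch of the linear variant as I understood it (my own variable names, and I left out the optional projection step), just to show how lightweight the update is:

    # Back-of-the-envelope sketch of linear Pegasos (no projection step);
    # variable names are mine, not from the paper.
    import numpy as np

    def pegasos_train(X, y, lam=0.1, n_iters=1000, seed=0):
        """Stochastic sub-gradient descent on lam/2 * ||w||^2 + average hinge loss."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, n_iters + 1):
            i = rng.integers(n)                    # pick one random example
            eta = 1.0 / (lam * t)                  # step size 1/(lambda * t)
            if y[i] * X[i].dot(w) < 1:             # margin violated: hinge term active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w            # only the regularizer shrinks w
        return w

    # Tiny linearly separable toy problem
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])
    print(np.sign(X @ pegasos_train(X, y)))        # expected: [ 1.  1. -1. -1.]

The kernelized variant keeps per-example coefficients instead of an explicit weight vector, but the update logic stays essentially the same.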


I just noticed that my paper list is exceptionally long this time. So I did get a lot of cool new things out of it; I’m glad I went. I will hopefully be able to try some of my ideas soon (mostly busy with thesis writing right now)…

Interesting Experimental Captchas

Monday, June 11th, 2007

Captchas are these little word puzzles in images that websites use to keep spammers and bots out. They are everywhere, and even the New York Times had an article about captchas recently. It turns out it’s a nice exercise in applying some machine learning to break these things (with lots of image manipulation to clean up the images). Since spam bots are becoming smarter, people are switching to new kinds of captchas. My favorites (using images) so far are Kittenauth and a 3D-rendered word captcha.
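For the curious: the “image manipulation to clean up the images” part is often most of the work. A minimal sketch (my own toy example using PIL, with an arbitrary threshold) would look something like this, before any actual learning happens:

    # Illustrative preprocessing only: binarize and despeckle a captcha image
    # before handing it to whatever classifier one actually trains.
    from PIL import Image, ImageFilter

    def clean_captcha(path, threshold=140):
        img = Image.open(path).convert("L")                      # grayscale
        img = img.point(lambda p: 255 if p > threshold else 0)   # binarize
        return img.filter(ImageFilter.MedianFilter(3))           # remove speckle noise

    # clean_captcha("captcha.png").save("captcha_clean.png")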