Archive for the Machine Learning Category
Using Psychological Domain Knowledge for the Netflix Challenge
February 28, 2008 10:46 pm by Markus.
I read an interesting article today about using psychological domain knowledge for improving recommender-system predictions. A very interesting idea…
Posted in Statistics, Machine Learning, Psychology | Print | No Comments »
Back from NIPS 2007
December 11, 2007 1:37 am by Markus.
Just got back home from NIPS. The following papers I found pretty interesting:
- Random Features for Large-Scale Kernel Machines
- On Ranking in Survival Analysis: Bounds on the Concordance Index
- Efficient Inference for Distributions on Permutations
I spent the workshop time in the Learning Problem Design workshop and in the Security workshop. I also dropped by “statistical networks” briefly, but there’s room for improvement in my current understanding of Gibbs sampling and the like. The consensus in the problem design workshop seemed to be that machine learning must become more modular. There was also agreement that applying machine learning in the real world requires some magic for transforming the problem into “features” and some more magic for transforming the prediction into something useful. That was stating the obvious a bit, but not much progress has been made in making ML more accessible. I wrote about choosing the right features before, but currently it’s more of an art than a science. One thing I took from the security workshop was that features must be cheap to construct, since most detection apps must run in real time. This means we are interested in features that the attacker can hardly influence (think Received headers in spam emails, which cannot be suppressed), yet that are easy to compute.
Also really cool were the “NIPS Elevator Process” joke paper about hungry scientists on the way to lunch (don’t confuse it with the Chinese restaurant process) and the party crashers at the Gatsby party. Sophie and some friends of hers simply joined the fun. The funny part was people taking her random answers about her research topic seriously.
I got mistaken for one of the party crashers at one point, because my clothing didn’t fit in. I was actually planning on hitting a club in downtown Whistler, but didn’t get around to going in the end…
Posted in Data Mining, Machine Learning | Print | No Comments »
The GPL and Machine Learning Software - Should the GPL cover training data?
October 1, 2007 10:19 pm by Markus.
I’ve followed the discussion and introduction of the GPL v3 for a bit. One major change in the license is supposed to close the loophole commonly referred to as the “tivoization” of GPL software, i.e. mechanisms that prevent people from tinkering with the product they bought which includes GPL software. Tivo, in particular, accomplishes this by requiring a valid cryptographic signature for the code to run - the user has access to the code, but it’s of no use. One of the main ideas of the GPL was to allow people the freedom to tinker, improve and understand how something works. This got me thinking a bit about software that uses machine learning techniques.
For the sake of the argument, let’s assume that somebody releases a GPL version of a speech recognition system, or say an improved version of a GPL speech recognition system. While the algorithms would be in the open for everyone to see, two major components of speech recognition systems, the Acoustic Model and Language Model, do not have to be. The Acoustic Model is created by taking a very large number of audio recordings of speech and their transcriptions (Speech Corpus) and ‘compiling’ them into statistical representations of the sounds that make up each word. The Language Model is a very large file containing the probabilities of certain sequences of words in order to narrow down the search for what was said.
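To make the “probabilities of certain sequences of words” concrete, here is a minimal sketch of a bigram language model estimated from raw counts. It is only an illustration: real speech recognizers use far larger corpora, longer contexts and smoothing, and the function name, the sentence markers and the tiny corpus are all assumptions made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next_word | word) from raw counts (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> are made-up markers for sentence start and end.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1

    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model


# Hypothetical three-sentence training corpus.
corpus = ["recognize speech", "recognize speech today", "wreck a nice beach"]
model = train_bigram_model(corpus)
```

The point for the licensing argument is that `model` is just a table of numbers: without `corpus` nobody can tell why “speech” is the only word this model expects after “recognize”.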
How well a speech recognition system works depends to a large extent on its training. The author who improved upon the software should therefore publish the training set as well. Otherwise people won’t be able to tinker with the system or understand why the software works well.
The same would hold for things like a handwriting recognition system. One could publish it along with a model (a bunch of numbers) that makes the recognition work. It would be pretty hard for somebody to reverse engineer what the training examples were and how training was conducted to make the system work. Getting the training data is the expensive part of building speech-recognition and handwriting-recognition systems.
Think Spam-Assassin - what if the authors suddenly decide to not make their training corpus available anymore? How would users be able to determine the weights for the different tests?
I don’t think this case is covered by the GPL, v3 or older (however, I’m not a lawyer). Somebody could include the model in the C code (e.g. define the weights of a neural net as a bunch of consts) and then argue that everything needed to compile the program from scratch is included, as the license requires. However, the source by itself wouldn’t allow anybody to understand or change (in a meaningful way) what the program is doing. With the growing importance of machine learning methods, just being able to recompile something won’t be enough. I think this should be taken into consideration by the open source community for GPL v3.01.
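A small sketch of what such a “model as constants” release looks like, written in Python rather than C for brevity. It mimics a filter that sums per-test weights and compares the sum against a threshold; the test names, the weights and the threshold value are all invented for illustration and are not taken from any real filter.

```python
# Hypothetical learned weights, shipped as plain constants in the source.
# The corpus they were fitted on is NOT part of the release.
WEIGHTS = {
    "subject_all_caps": 1.6,
    "contains_unsubscribe_link": 0.4,
    "forged_received_header": 2.9,
    "html_only_body": 1.1,
}
THRESHOLD = 5.0  # assumed cut-off for this example


def spam_score(fired_tests):
    """Sum the weights of all tests that fired for one message."""
    return sum(WEIGHTS[name] for name in fired_tests)


def is_spam(fired_tests):
    """Classify a message from the list of tests that fired."""
    return spam_score(fired_tests) >= THRESHOLD
```

Everything here compiles and runs, so the letter of the license is satisfied, but nothing in the file explains why a forged header is worth 2.9 and not 1.9. Changing that number in a meaningful way requires the training corpus.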
Posted in Society, Machine Learning | Print | No Comments »
I passed my PhD defense
September 19, 2007 7:32 pm by Markus.
Hurray…
Posted in Machine Learning, Ramblings | Print | 2 Comments »
Advertising and Data Mining
August 30, 2007 1:44 pm by Markus.
Lately I’ve become fascinated with the field of Advertising and Marketing. Honestly, I’ve never paid much attention to ads before, as most of them were either not interesting to me or just plain horrible. I just finished reading a good book about the choice of headlines in direct marketing and about how advertisers actually test what gets the most responses (“Tested Advertising Methods”, John Caples).
There’s obviously a lot that could be done with machine learning. For example, some kind of predictor that learns from previous ads (or possibly over ads from various companies) how successful they have been and then predicts return rates for new ads. Google and Yahoo probably do something like this already …
Posted in Advertising, Machine Learning | Print | No Comments »
Human Intuition vs. Statistical Models
August 25, 2007 10:01 pm by Markus.
I just came across a very interesting book announcement for “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres, a professor at Yale Law School and econometrician. In the book (I haven’t read it yet, but I will) the author argues that intuition is losing ground to statistical methods and data mining. According to the Amazon abstract he gives examples from the airline industry, medical diagnostics and even online dating services showing that a statistical model will outperform human intuition.
That machines can outperform human judgement has been known for quite some time. For example, in the field of psychology the diagnosis of mental disorders is more or less standardized by the DSM. There was a very interesting meta-analysis showing that, on average, a mechanical predictor outperformed the human psychologist. To be specific: Grove, W.M., Zald, D.H., Hallberg, A.M., Lebow, B., Snitz, E., & Nelson, C. (2000). Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12, 19–30. To quote from the Abstract: “On average, mechanical-prediction techniques were about 10% more accurate than clinical predictions. Depending on the specific analysis, mechanical prediction substantially outperformed clinical prediction in 33%-47% of studies examined. Although clinical predictions were often as accurate as mechanical predictions, in only a few studies (6%-16%) were they substantially more accurate. Superiority for mechanical-prediction techniques was consistent, regardless of the judgment task, type of judges, judges’ amounts of experience, or the types of data being combined.”
I’m a little bit skeptical about using data crunching to decide important questions (as in life-and-death questions). In general it seems like a good idea, but it always comes down to how you model the data and how you model the question to be answered. In many cases this might be obvious, in others not so much. The art then lies in modeling the data, not in applying the algorithm or technique. It reminds me a bit of a class on formal program verification I took back in Darmstadt. Stefan, the TA of the class, and I had an argument about the practicability of program verification. He gave the Unix find utility as an example: you can show, more or less easily, that the program terminates while enumerating all the files in all the directories in the system, because find can be nicely modeled with a well-founded relation. I objected that I could set a symbolic link to an upper-level directory (which is why find does not follow symbolic links by default) and make find go in circles. Stefan conceded, “Oh well, I guess then the model was wrong…”. Similar things have happened in cryptography, for example, where a finite-state model (sorry, lost the citation somewhere; I’m not quite sure if that was the Usenix paper from the Stanford guys I read or something else) showed that the SSL protocol (Secure Socket Layer) is secure. Later the protocol was broken nonetheless (Schneier, Bruce; Wagner, David; Analysis of the SSL 3.0 protocol).
I think that with the wrong model you can show a lot of good things about anything. Once you abstract from the real world and build a model, you might have ignored the one small feature that matters most. Maybe it is time for a set of best practices in data modeling and data mining (there are already some books out there for some specific domains) …
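The find argument can be sketched with a toy model. Assume the file system is a plain dictionary from directory name to children (a made-up representation, not how find works internally). As long as that dictionary describes a tree, every step moves strictly downwards and the walk must end. Once a link points back up, the tree becomes a graph with a cycle, and termination has to be recovered some other way, here by remembering what was already visited.

```python
def walk(graph, root):
    """Enumerate every directory reachable from root, visiting each once.

    The `seen` set is what guarantees termination once links can point
    back up: each directory is expanded at most one time.
    """
    seen = set()
    stack = [root]
    order = []
    while stack:
        node = stack.pop()
        if node in seen:  # a link led back to an already visited directory
            continue
        seen.add(node)
        order.append(node)
        # Push children in reverse so they are visited in listed order.
        stack.extend(reversed(graph.get(node, [])))
    return order


# Hypothetical file system: /home/markus contains a link back to "/".
fs = {
    "/": ["/home", "/etc"],
    "/home": ["/home/markus"],
    "/home/markus": ["/"],
}
visited = walk(fs, "/")
```

Without the `seen` check the loop would keep pushing "/" forever. The tree model was not wrong about trees, it was wrong about file systems.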
Posted in Data Mining, Machine Learning, Ramblings | Print | No Comments »
What Machine Learning Papers to read …
July 13, 2007 1:08 pm by Markus.
Laura just pointed me to this system, best described as:
I have a routine problem that sometimes paper titles are not enough to tell me what papers to read in recent conferences, and I often do not have time to read abstracts fully. This collection of scripts is designed to help alleviate the problem. Essentially, what it will do is compare what papers you like to cite with what new papers are citing. High overlap means the paper is probably relevant to you. Sure there are counter-examples, but overall I have found it useful (eg., it has suggested papers to me that are interesting that I would otherwise have missed). Of course, you should also read through titles since that is a somewhat orthogonal source of information.
http://www.cs.utah.edu/~hal/WhatToSee/
I have the same problem. And wow… I will have a lot to read this weekend.
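The idea in the quote is simple enough to sketch in a few lines. This is not the code behind the linked scripts, just my reading of the description: score each new paper by the fraction of its references that I have cited myself, then sort. The function names and the sample reference keys are assumptions for illustration.

```python
def citation_overlap(my_citations, paper_citations):
    """Fraction of a new paper's references that I have cited myself."""
    refs = set(paper_citations)
    if not refs:
        return 0.0
    return len(set(my_citations) & refs) / len(refs)


def rank_papers(my_citations, new_papers):
    """Return paper titles, highest overlap first (ties broken by title)."""
    scored = [
        (citation_overlap(my_citations, refs), title)
        for title, refs in new_papers.items()
    ]
    return [title for score, title in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical reference keys, purely illustrative.
mine = ["vapnik95", "platt99", "joachims98"]
new_papers = {
    "Paper A": ["vapnik95", "platt99", "smith01", "jones02"],
    "Paper B": ["doe03", "roe04"],
    "Paper C": ["joachims98", "vapnik95"],
}
ranking = rank_papers(mine, new_papers)
```

Normalizing by the size of the new paper’s reference list is one design choice among several; a raw overlap count would favor papers with very long bibliographies.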
Posted in Statistics, Machine Learning, Artificial Intelligence (AI) | Print | No Comments »
Back from ICML …
June 25, 2007 7:02 pm by Markus.
Just got back in town from ICML 2007, had a blast and met lots of old friends. This year it felt a bit more like a camping trip with no hot water and filthy bathrooms. Otherwise I learned a lot. The following papers were, in my opinion, the most interesting (in no particular order):
- Pegasos: Primal Estimated sub-GrAdient SOlver for SVM (Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro): a very fast online SVM with good bounds, kernelizable. Code available. Most impressive results and probably useful for the robot stuff we are working on.
- A Kernel Path Algorithm for Support Vector Machines (Gang Wang, Dit-Yan Yeung, Frederick Lochovsky). Speeds up SVM learning by not having to re-train when the kernel sigma is changed. I hope they will make some code available.
- Restricted Boltzmann Machines for Collaborative Filtering (Ruslan Salakhutdinov, Andriy Mnih, Geoffrey Hinton). Now #4 at the Netflix Challenge. I already wrote in my AISTATS post that I think this technique has a lot of potential.
- Multi-Armed Bandit Problems with Dependent Arms (Sandeep Pandey, Deepayan Chakrabarti, Deepak Agarwal). A clustering trick to distribute rewards and speed up reinforcement learning in instances of banner advertising.
- CarpeDiem: an Algorithm for the Fast Evaluation of SSL Classifiers (Roberto Esposito, Daniele Radicioni). A fast Viterbi.
- Graph Clustering with Network Structure Indices (Matthew Rattigan, Marc Maier, David Jensen). Fast, simple graph-mining algorithms. Since I’m currently reading “Mining Graph Data”….
- A Dependence Maximization View of Clustering (Le Song, Alex Smola, Arthur Gretton, Karsten Borgwardt). An interesting, general framework for clustering using the Hilbert-Schmidt Independence Criterion that makes many clustering algorithms (K-Means, Spectral Clustering, Hierarchical Clustering) mere special cases…
- Neighbor Search with Global Geometry: A Minimax Message Passing Algorithm (Kye-Hyeon Kim, Seungjin Choi). Interesting idea…
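Since Pegasos is short enough to fit on a napkin, here is a minimal sketch of its core update for a linear SVM. It uses one random example per step and a step size of 1/(λt), and it leaves out the optional projection step, the bias term and the kernelized version. The parameter defaults and the toy data are my own assumptions, and this is not the authors’ released code.

```python
import random


def pegasos(examples, lam=0.1, iterations=2000, seed=0):
    """Train a linear SVM (no bias term) with stochastic sub-gradient steps.

    examples: list of (feature_list, label) pairs with label in {-1, +1}.
    """
    rng = random.Random(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    for t in range(1, iterations + 1):
        x, y = rng.choice(examples)       # one random training example
        eta = 1.0 / (lam * t)             # step size 1 / (lambda * t)
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Shrink step coming from the regularization term.
        w = [(1.0 - eta * lam) * wi for wi in w]
        # Hinge-loss step, only if the example violated the margin.
        if margin < 1.0:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Sign of the linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Hypothetical, linearly separable toy data.
data = [
    ([2.0, 1.0], 1),
    ([1.5, 2.0], 1),
    ([-2.0, -1.0], -1),
    ([-1.0, -2.5], -1),
]
w = pegasos(data)
```

Each step touches a single example, which is where the speed comes from: the cost per update does not grow with the size of the training set.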
I just noticed that my paper list is exceptionally long this time. So I did get a lot of cool new things out of it; I’m glad I went. I will hopefully be able to try some of my ideas soon (mostly busy with thesis writing right now)…
Posted in Machine Learning | Print | No Comments »
Interesting Experimental Captchas
June 11, 2007 3:11 pm by Markus.
Captchas are these little word-puzzles in images that web-sites use to keep spammers and bots out. They are everywhere and even the New York Times had an article about Captchas recently. It turns out it’s a nice exercise in applying some machine learning to break these things (with lots of image manipulation to clean up the images). Since spam-bots are becoming smarter, people are switching to new kinds of Captchas. My favorites (using images) so far are Kittenauth and a 3D-rendered word-captcha.
Posted in spam, Machine Learning, Artificial Intelligence (AI), Security | Print | No Comments »
Choosing the right features for Data Mining
June 1, 2007 12:02 pm by Markus.
I’m fascinated by a common problem in data mining: how do you pick variables that are indicative of what you are trying to predict? It is simple for many prediction tasks; if you are predicting the strength of a material, you sample it at certain points using your understanding of physics and materials science. It is more fascinating with problems that we believe we understand, but don’t. For example, what makes a restaurant successful and guarantees repeat business? The food? The pricing? A friendly wait-staff? Turns out a very big indicator for restaurant success is the lighting. I admit that lighting didn’t make it into my top ten list… If you now consider what is asked on the average “are you satisfied with our service” questionnaire you find in various restaurant chains, I don’t recall seeing anything about the ambience on it. We are asking the wrong questions.
There are many other problems in real life just like this. I read a book called Blink, and the point the author is trying to make is that subconscious decisions are easy to make - once you know what to look for. More information is not always better. This holds for difficult problems such as judging the effectiveness of teachers (IIRC seeing the first 30 seconds of a videotape of a teacher entering a classroom is as indicative as watching hours of recorded lectures). The same holds true for prediction problems about relationships - how can you predict if a couple will still be together 15 years later? Turns out there are four simple indicators to look for, and you can do it in maybe 2 minutes of watching a couple… The book is full of examples like that, but does not provide a way to “extract the right features”. I have similar problems with the criminology stuff I’m working on; while we get pretty good results using features suggested by the criminology literature, I’m wondering if we have the right features. I’m still thinking that we could improve our results if we had more data - or the “right” data, I should say (it should be obvious by now that more is not better). How do you pick the features for problems? Tricky question…
There is only one kind of data mining system that does not have this problem: recommender systems. Recommender systems avoid the problem because they do not rely on particular features to predict, but exploit correlations in “liking”. A classical example is that people who like classical music often like jazz as well - something you wouldn’t easily be able to predict from features you extract from the music. I wonder if we could reframe some prediction problems in ways more similar to recommender systems, or maybe make better use of meta-information in certain problems. What I mean by “meta-information” is easily explained with an example: PageRank. It is so successful in web-scale information retrieval because it does not bother with trying to figure out whether a page is relevant by keyword ratios and whatnot, but simply measures popularity by how many important pages link to it (before link-spam became a problem, that is). I wish something simple like that were possible for every problem.
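To show how little PageRank needs to know about page content, here is a minimal power-iteration sketch. It only sees who links to whom. The damping factor of 0.85 is the commonly quoted value, while the iteration count, the handling of pages without outgoing links and the four-page example web are assumptions made for this illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

Page "a" has only one incoming link, the same as nobody special, yet it ends up far ahead of "b" and "d" because that single link comes from the most important page. No keyword was inspected at any point.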
Posted in Data Mining, Machine Learning, Psychology | Print | 2 Comments »