You are currently browsing the Markus Breitenbach weblog archives for the day June 1, 2007 12:02 pm.
Choosing the right features for Data Mining
June 1, 2007 12:02 pm by Markus.
I’m fascinated by a common problem in data mining: how do you pick variables that are indicative of what you are trying to predict? It is simple for many prediction tasks; if you are predicting the strength of a material, you sample it at certain points using your understanding of physics and materials science. It is more fascinating with problems that we believe we understand, but don’t. For example, what makes a restaurant successful and guarantees repeat business? The food? The pricing? A friendly wait-staff? Turns out a very big indicator for restaurant success is the lighting. I admit that lighting didn’t make it into my top ten list… If you now consider what is asked on the average “are you satisfied with our service” questionnaire you find in various restaurant chains, I don’t recall seeing anything about the ambience on it. We are asking the wrong questions.
There are many other problems in real life just like this. I read a book called Blink, and the point the author is trying to make is that subconscious decisions are easy to make - once you know what to look for. More information is not always better. This holds for difficult problems such as judging the effectiveness of teachers (IIRC seeing the first 30 seconds of a videotape of a teacher entering a classroom is as indicative as watching hours of recorded lectures). The same holds true for prediction problems about relationships - how can you predict if a couple will still be together 15 years later? Turns out there are four simple indicators to look for, and you can do it in maybe 2 minutes of watching a couple… The book is full of examples like that, but does not provide a way to “extract the right features”. I have similar problems with the criminology stuff I’m working on; while we get pretty good results using features suggested by the criminology literature, I’m wondering if we have the right features. I still think that we could improve our results if we had more data - or the “right” data, I should say (it should be obvious by now that more is not better). How do you pick the features for a problem? Tricky question…
There is only one kind of data mining system that does not have this problem: recommender systems. Recommender systems avoid the problem because they do not rely on particular features to predict, but exploit correlations in “liking”. A classical example is that people who like classical music often like jazz as well - something you wouldn’t easily be able to predict from features you extract from the music. I wonder if we could reframe some prediction problems in ways more similar to recommender systems, or maybe make better use of meta-information in certain problems. What I mean by “meta-information” is easily explained with an example: PageRank. It is so successful in web-scale information retrieval because it does not bother with trying to figure out if a page is relevant by keyword ratios and whatnot, but simply measures popularity by how many important pages link to it (before link-spam became a problem, that is). I wish something simple like that were possible for every problem.
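To make the two ideas above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the listener data, the tiny link graph, and the function names `item_similarity` and `pagerank` are made up, and the damping factor of 0.85 is just the commonly quoted default. It is not how any real recommender or search engine is implemented; it only shows that neither calculation looks at features of the music or the text of a page.

```python
from math import sqrt

# Hypothetical toy data: which listeners "like" which genres.
likes = {
    "ann":   {"classical", "jazz"},
    "bob":   {"classical", "jazz", "blues"},
    "carol": {"classical", "jazz"},
    "dave":  {"metal", "punk"},
    "erin":  {"metal", "punk", "blues"},
}

def item_similarity(likes, a, b):
    """Cosine similarity of two items, computed only from who likes them."""
    fans_a = {user for user, items in likes.items() if a in items}
    fans_b = {user for user, items in likes.items() if b in items}
    if not fans_a or not fans_b:
        return 0.0
    return len(fans_a & fans_b) / sqrt(len(fans_a) * len(fans_b))

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of the total rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "hub", which links back to "a".
web = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(web)

print(item_similarity(likes, "classical", "jazz"))
print(item_similarity(likes, "classical", "metal"))
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy data classical and jazz come out as strongly related purely because the same people like both, and "a" outranks "b" and "c" only because the highly linked "hub" page points to it.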
Posted in Predictive Modeling, Data Mining, Machine Learning, Psychology