Choosing the right features for Data Mining

I’m fascinated by a common problem in data mining: how do you pick variables that are indicative of what you are trying to predict? It is simple for many prediction tasks; if you are predicting some strength of a material, you sample it at certain points using your understanding of physics and materials science. It is more fascinating with problems that we believe we understand, but don’t. For example, what makes a restaurant successful and guarantees repeat business? The food? The pricing? A friendly wait-staff? It turns out a very big indicator of restaurant success is the lighting. I admit that lighting didn’t make it into my top ten list… If you now consider what is asked on the average “are you satisfied with our service” questionnaire you find in various restaurant chains, I don’t recall seeing anything about the ambiance on it. We are asking the wrong questions.

There are many other problems in real life just like this. I read a book called Blink, and the point the author is trying to make is that subconscious decisions are easy to make – once you know what to look for. More information is not always better. This holds for difficult problems such as judging the effectiveness of teachers (IIRC, watching the first 30 seconds of a videotape of a teacher entering a classroom is as indicative as watching hours of recorded lectures). The same holds true for prediction problems about relationships – how can you predict whether a couple will still be together 15 years later? It turns out there are four simple indicators to look for, and you can spot them in maybe 2 minutes of watching a couple… The book is full of examples like that, but it does not provide a way to “extract the right features”. I have a similar problem with the criminology work I’m doing: while we get pretty good results using features suggested by the criminology literature, I wonder whether we have the right features. I’m still thinking that we could improve our results if we had more data – or the “right” data, I should say (it should be obvious by now that more is not better). How do you pick the features for a problem? Tricky question…

There is one kind of data mining system that does not have this problem: recommender systems. They avoid it because they do not rely on particular features to predict, but instead exploit correlations in “liking”. A classical example is that people who like classical music often like jazz as well – something you would not easily predict from features extracted from the music itself. I wonder if we could reframe some prediction problems in ways more similar to recommender systems, or perhaps make better use of meta-information in certain problems. What I mean by “meta-information” is easily explained with an example: PageRank. It is so successful in web-scale information retrieval because it does not bother trying to figure out whether a page is relevant by keyword ratios and whatnot, but simply measures popularity by how many important pages link to it (before link-spam became a problem, that is). I wish something that simple were possible for every problem 🙂
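
A toy illustration of that idea (a hypothetical three-page link graph, with the classic damping factor of 0.85): each page’s score is the damped sum of the scores flowing in from the pages that link to it, iterated until the scores stabilize.

```python
# Minimal PageRank power iteration on a toy link graph.
links = {
    "a": ["b", "c"],  # page "a" links to "b" and "c"
    "b": ["c"],
    "c": ["a"],
}
damping = 0.85
rank = {page: 1.0 / len(links) for page in links}

for _ in range(50):  # iterate until the scores stabilize
    rank = {
        page: (1 - damping) / len(links)
        + damping * sum(rank[src] / len(outs)
                        for src, outs in links.items() if page in outs)
        for page in links
    }

print(rank)  # pages linked to by important pages accumulate the most rank
```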

2 Responses to “Choosing the right features for Data Mining”

  1. Interesting post.
    However, I would tend to disagree with the last paragraph, for at least three reasons:
    1) you can see a recommender system as a prediction problem with features: indeed, what you are trying to predict is whether someone will like a certain kind of music, and your features are the id of the person and the style of music. So you can consider that your data consists of 2 features (which can take their values in large but finite sets). The specificity is that you use some sort of transitive relation to build your model (e.g., if you have (user=123, music=jazz, +) and (user=123, music=classical, +) and (user=245, music=jazz, +) in your database, you can predict a + for (user=245, music=classical)) – a toy version of this transitive prediction is sketched in code after this list
    2) you can probably make better recommendations if you use features. Indeed, in cases where you do not have an exact match, the recommender system may not do a good job, while with some additional features or information you might solve the problem. Let me give an example: suppose that instead of music styles you try to predict whether someone likes a specific artist/band. You can find out that some people who like Mozart also like Duke Ellington, and you have a user who likes Charlie Parker. The question is whether this person likes Mozart. With a purely link-based recommender system you cannot really answer, but if you have the additional feature “music style” you might recognize that Charlie Parker and Duke Ellington are both jazz artists and thus draw a connection (of course, with much more data you might also find this connection automatically, but this is just an example of how data can substitute for prior knowledge). So the additional features (which introduce a structure or distance on your initial features [user, artist]) may provide useful information that helps you make better decisions with little data.
    3) any prediction problem from features can be converted into a generalized recommender problem: indeed, if you have a problem with features X,Y,Z,T you can simply consider, for example, the pair (X,Y) as the ‘user’ and the pair (Z,T) as the ‘product’. You may as well consider any other split of your features into two groups. You can also consider splits into 3 groups or more (thus generalizing the concept), and it is relatively easy to generalize some of the algorithms used in recommender systems to more than 2 ‘matrix dimensions’. Yet another way would be to split into user X with feature Y and ‘product’ Z with feature T…
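
    Here is a toy sketch of the transitive prediction from point 1 (the data and names are purely illustrative):

    ```python
    # Predict a '+' for (user, item) when another user who shares a liked
    # item with this user also likes the item in question.
    likes = {
        123: {"jazz", "classical"},
        245: {"jazz"},
    }

    def predict_plus(user, item):
        for other, their_items in likes.items():
            if other != user and likes[user] & their_items and item in their_items:
                return True  # shared taste found, predict '+'
        return False

    print(predict_plus(245, "classical"))  # True, via the shared liking of jazz
    ```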

    So as a conclusion, I would really not distinguish recommender systems from vector-based prediction problems, as they can be converted into one another, and so I do not think there is any way to ‘escape’ the need to find the right features!

  2. This is indeed an interesting subject, both from intellectual and practical perspectives. I’ve been using two heuristics borrowed from Weiss and Indurkhya’s “Predictive Data Mining”.

    The first step excludes variables that seem obviously useless. Sometimes this step removes few variables, but sometimes it removes many, which helps reduce the computational workload downstream.
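
    One common reading of “obviously useless” is a near-constant column; here is a minimal sketch of such a filter (the threshold and names are illustrative, not from the book):

    ```python
    import numpy as np

    def drop_near_constant(X, threshold=1e-8):
        """Keep only columns whose variance exceeds the threshold."""
        keep = X.var(axis=0) > threshold
        return X[:, keep], np.flatnonzero(keep)  # filtered data + kept column indices
    ```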

    The second step uses a simple Mahalanobis-like distance between classes to gauge the predictive power of any given set of predictors. I hooked a genetic algorithm up to this heuristic to search for optimal sets of predictors.
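
    To make the scoring step concrete, here is a sketch of one way such a distance could score a candidate feature subset for a two-class problem (the pooled-covariance formulation and all names are my own assumptions, not necessarily the book’s); a genetic algorithm would then search over subsets to maximize this score:

    ```python
    import numpy as np

    def mahalanobis_score(X, y, subset):
        """Mahalanobis distance between the two class means on a feature
        subset; larger values suggest better class separation."""
        A = X[y == 0][:, subset]
        B = X[y == 1][:, subset]
        pooled = np.atleast_2d((np.cov(A, rowvar=False) + np.cov(B, rowvar=False)) / 2)
        diff = A.mean(axis=0) - B.mean(axis=0)
        return float(diff @ np.linalg.pinv(pooled) @ diff)

    # toy data: column 0 separates the classes, column 1 is pure noise
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal([3, 0], 1, (50, 2))])
    y = np.repeat([0, 1], 50)
    print(mahalanobis_score(X, y, [0]))  # large: informative feature
    print(mahalanobis_score(X, y, [1]))  # near zero: noise feature
    ```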
