Archive for the ‘Predictive Modeling’ Category

Detecting Click Fraud in Online Advertising

Sunday, March 2nd, 2014

JMLR has an interesting paper summarizing the results of a contest to build the best model for click-fraud detection. The second-place entry described some nice feature engineering that I found particularly interesting. The first-place entry did feature selection and then used gbm, a gradient boosting ensemble that works really well in practice.
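
This is not the winners' actual code, but a minimal sketch of that general recipe (feature selection followed by gradient boosting), using scikit-learn as a stand-in and a made-up input file and label column:

    # Sketch of "feature selection + gradient boosting"; NOT the contest code.
    # The file name and the "is_fraud" label column are assumptions for illustration.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    from sklearn.pipeline import Pipeline

    data = pd.read_csv("publisher_click_stats.csv")   # hypothetical per-publisher click statistics
    X = data.drop(columns=["is_fraud"])
    y = data["is_fraud"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    model = Pipeline([
        ("select", SelectKBest(mutual_info_classif, k=20)),   # keep the 20 most informative features
        ("gbm", GradientBoostingClassifier(n_estimators=500, learning_rate=0.05)),
    ])
    model.fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))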


Risk Assessment of Rare Events in Adversarial Scenarios

Tuesday, June 21st, 2011

The RAND Corporation just published an interesting paper exploring the benefits of using risk prediction to reduce the screening required at airports. You might have noticed various attempts to establish some kind of fast lane or trusted traveler program. Obviously this is a very sensitive topic and probably hard to get right. Screening certain groups of the population more than others (“profiling”) is generally frowned upon and also not a good idea in general (see “Strong profiling is not mathematically optimal for discovering rare malfeasors” on rare-event detection), but what hasn’t been examined much is identifying people who can be considered more “safe” than others. The paper explores that idea and shows that, even under the assumption that the bad guys will try to subvert the program, there can be benefits to implementing this solution. The paper is a bit sparse on mathematical details. Certainly an interesting idea, though.
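
To see why screening for very rare malfeasors is so hard in the first place, here is a quick back-of-the-envelope Bayes calculation of my own (the numbers are made up and are not from the RAND paper):

    # Illustration only: even a highly accurate screen produces mostly false alarms
    # when the base rate is tiny. All numbers are invented.
    base_rate = 1e-6              # assumed fraction of travellers who are actual attackers
    sensitivity = 0.99            # P(flag | attacker)
    false_positive_rate = 0.01    # P(flag | harmless traveller)

    p_flag = sensitivity * base_rate + false_positive_rate * (1 - base_rate)
    p_attacker_given_flag = sensitivity * base_rate / p_flag
    print(f"P(attacker | flagged) = {p_attacker_given_flag:.6%}")
    # roughly 0.01% -- almost every flag is a false alarm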

Paper: Assessing the Security Benefits of a Trusted Traveler Program in the Presence of Attempted Attacker Exploitation and Compromise

Adversarial Scenarios in Risk Mismanagement

Sunday, January 11th, 2009

I just read another article discussing whether risk management tools had an impact on the current financial crisis. One of the most commonly used risk management measures is Value-at-Risk (VaR), a single comparable number that specifies a worst-case loss at a given confidence level. One of the major criticisms (e.g. by Nassim Nicholas Taleb, the author of The Black Swan) is that the measure can be gamed. Risk can be hidden “in the rare-event part” of the prediction, and not surprisingly this seems to have happened.
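
VaR is easy to state and just as easy to compute in its plain historical-simulation form. A minimal sketch of my own (the P&L numbers are made up and not tied to the article) also shows why the tail is such a convenient place to hide risk:

    # Minimal historical-simulation VaR sketch (my own illustration).
    import numpy as np

    def value_at_risk(pnl: np.ndarray, confidence: float = 0.99) -> float:
        """Worst-case loss not exceeded with the given confidence,
        estimated from a sample of historical profit-and-loss values."""
        # The (1 - confidence) quantile of the P&L distribution, reported as a positive loss.
        return -np.quantile(pnl, 1.0 - confidence)

    rng = np.random.default_rng(0)
    pnl = rng.normal(loc=0.0, scale=1.0, size=10_000)   # made-up daily P&L in millions
    print(f"99% one-day VaR: {value_at_risk(pnl, 0.99):.2f}")
    # Everything beyond the 99th percentile is invisible to this single number --
    # exactly the "rare-event part" where risk can be hidden.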

Given that a common question during training with risk assessment software is “what do I enter to get outcome/prediction x?”, it should be explored how the software itself can safeguard against users gaming the system. Think of detecting multiple model evaluations in a row with slightly changed numbers…
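
A minimal sketch of what such a safeguard could look like (my own illustration; the class name, thresholds and similarity measure are all made up): log every evaluation per user and flag runs of near-identical inputs.

    # Sketch of a "gaming the score" detector. Names and thresholds are invented.
    import numpy as np

    class EvaluationAuditLog:
        def __init__(self, similarity_threshold: float = 0.02, window: int = 10, max_similar: int = 3):
            self.similarity_threshold = similarity_threshold  # relative change counted as "slightly tweaked"
            self.window = window                              # how many recent evaluations to compare against
            self.max_similar = max_similar                    # how many near-duplicates before raising a flag
            self.history: dict[str, list[np.ndarray]] = {}    # per-user list of past inputs

        def record(self, user: str, features: np.ndarray) -> bool:
            """Store an evaluation and return True if it looks like the user is
            probing the model with slightly changed numbers."""
            past = self.history.setdefault(user, [])
            similar = sum(
                np.max(np.abs(features - prev) / (np.abs(prev) + 1e-9)) < self.similarity_threshold
                for prev in past[-self.window:]
            )
            past.append(features)
            return similar >= self.max_similar

    # Hypothetical usage:
    # audit = EvaluationAuditLog()
    # if audit.record("caseworker_42", np.array([35.0, 2.0, 0.87])):
    #     ...alert a supervisor or require a written justification...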

Edit: I just found an instrument implemented as an Excel spreadsheet. Good for prototyping something, but using that in practice is just asking people to fiddle with the numbers until the desired result is obtained. You couldn’t make gaming the system more user-friendly if you tried…

Deploying SAS code in production

Saturday, November 1st, 2008

I had written a post about the issues of converting models into something that is usable in production environments, as most stats packages don’t have friendly interfaces for integrating them into the flow of processing data. I worked on a similar problem involving a script written in SAS recently. To be specific, some code for computing a risk score in SAS had to be converted into Java, and I was confronted with having to figure out the semantics of SAS code. I found a tool that converts SAS code into Java and I have to say I was quite impressed with how well it worked. Converting one language (especially one for which there is no published grammar or other specification) into another is quite a task – after a bit of back and forth with support we got our code converted and the Java code worked on the first try. I wish there were similar converters for STATA, R and SPSS 🙂
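
I can’t show the converted code itself, but as a toy illustration, a hand-translated scorecard plus a regression test against reference scores exported from the original SAS run might look roughly like this (the formula, field names and file are all made up):

    # Toy sketch: translated risk score + regression test against reference output.
    # Formula, column names and file are hypothetical.
    import csv
    import math

    def risk_score(age: float, prior_incidents: int, income: float) -> float:
        """Hypothetical logistic scorecard, as it might look after translation."""
        z = -2.5 + 0.03 * age + 0.8 * prior_incidents - 0.00001 * income
        return 1.0 / (1.0 + math.exp(-z))

    def test_against_reference(path: str = "sas_reference_scores.csv", tol: float = 1e-6) -> None:
        """Compare the translated code against scores computed by the original SAS program."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                expected = float(row["score"])
                actual = risk_score(float(row["age"]), int(row["prior_incidents"]), float(row["income"]))
                assert abs(actual - expected) < tol, f"mismatch for case {row['case_id']}"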

Debugging and Evaluating Predictive Models

Sunday, June 22nd, 2008

Given the recent revelation that Moody’s mis-rated CPDOs due to a computer glitch (see also here), it is clearly quite tricky to get models right all the way from construction to production. Interestingly, S&P, which gave identical ratings to many of them, stands by its independently developed model, which makes me wonder how thoroughly models are tested in practice. It also made me ask myself whether I would catch such a bug while moving a model into production.

Most of the data mining and modeling work is done in your favorite stats package (SPSS, SAS etc.), but for production use people have to re-implement whichever equation they came up with to produce their risk scores. When I’m doing data mining work I often use different tools from different authors or vendors to get the job done, as no single tool can do everything that I need. For example, I have had a lot of success with predictive modeling using Support Vector Machines. Some newer algorithms I have implemented in Matlab myself. That means I spend a lot of time converting data back and forth between different software (e.g. how a missing value is indicated). I usually do this with some Perl scripts I wrote, but the entire process is error-prone, especially given that the final model then has to be codified in some other language (Java, C, C#/.Net or whatever) so it can be incorporated into the project’s software. It takes a while to get it right, because more often than not an error is not obvious (read: I have had bad experiences with subtle errors during black-box testing). The following is my checklist for debugging the process (probably not complete enough to catch everything):

  • Do the results mean what you think they mean? Which classification values count as good or bad? (large/small scores, or +1/-1 …)
  • Features: when exporting/importing the data, is the order of the features the same? Is the classification label in the right place? This is a lot of fun when you export data from SPSS/R/STATA to Matlab (which does not support named columns in a matrix – better get all those indices right)
  • Were missing values treated the right way when building the model? There are many ways to deal with them, and you might be dealing with MAR (“missing at random”), MCAR (“missing completely at random”), NMAR (“not missing at random”), non-response, imputation (single or multiple), etc.
  • Did I deal with special values correctly? I’m not talking about the NULL value in the database, but “flag values” such as 999 that indicate a certain condition
  • When exporting/importing data, are special values (e.g. missing, flag and categorical values) handled correctly? Every program encodes them differently, especially when you use comma-separated text (CSV)
  • Is the scaling of the data the same? Are the scaled values of the new data larger/smaller than what the classifier was trained on? How are these cases handled?
  • Are the algorithm parameters the same, e.g. a kernel-sigma in a support vector model?
  • Is the new data from the same distribution? (When I hear “similar”, it usually does not work, in my experience 🙂) Check mean and variance for a start. Sometimes the difference can be subtle (e.g. a male-only model applied to females; this can work, depending on what was modeled). Was the data extracted the same way, with the same SQL query? Was the query translated correctly into SQL? Was the result prepared the same way (recodes, scaling)?
  • In my code I check for each attribute whether it is within the range of the training set. If not, it’s either a bug (scaling?) or the model can’t reliably predict for this case (see the sketch after this list)
  • Write some simple test cases and compute them with both your stats package and your production code. I have had a lot of success with white-box tests for recode tables etc.
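
A minimal sketch of what a few of these checks could look like at prediction time (my own illustration; the column names, flag values and ranges are made up):

    # Prediction-time sanity checks: feature order, flag values, training ranges.
    # All names and numbers are invented for illustration.
    import numpy as np
    import pandas as pd

    EXPECTED_COLUMNS = ["age", "income", "prior_incidents"]   # order used at training time
    FLAG_VALUES = {"income": {999, 9999}}                     # sentinel codes, not real data
    TRAINING_RANGES = {"age": (18, 90), "income": (0, 500_000), "prior_incidents": (0, 50)}

    def check_scoring_input(df: pd.DataFrame) -> pd.DataFrame:
        # 1. Feature order/presence: reorder explicitly instead of trusting the file layout.
        missing = set(EXPECTED_COLUMNS) - set(df.columns)
        if missing:
            raise ValueError(f"missing features: {missing}")
        df = df[EXPECTED_COLUMNS].copy()

        # 2. Flag values: turn sentinel codes into real missing values before imputation.
        for col, flags in FLAG_VALUES.items():
            df.loc[df[col].isin(flags), col] = np.nan

        # 3. Range check: values outside the training range are either a bug (scaling?)
        #    or cases the model cannot reliably predict for.
        for col, (lo, hi) in TRAINING_RANGES.items():
            out = ~df[col].between(lo, hi) & df[col].notna()
            if out.any():
                raise ValueError(f"{out.sum()} values of '{col}' outside training range [{lo}, {hi}]")
        return df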

As for model evaluation, I’ve read some reports in the past (not financial scorings, though) where testing was done on the training set. Obviously model quality should be assessed on a hold-out data set that has not been used for training or parameter tuning. Model quality in the machine learning community is still often evaluated using error rate, but lately the area under the receiver operating characteristic curve has become popular (often abbreviated as AuROC, AUROC, AROC or ROC), which I have found to be especially useful for imbalanced datasets. In meteorology a lot of thought has been put into the evaluation and comparison of the performance of different predictive models. Wilks’ cost curves and Brier skill scores look really interesting. In some models, although the predictor is trained on a dichotomous variable, it is really predicting some risk over time – and should be evaluated using survival analysis (e.g. higher risk scores should lead to sooner failure). In survival analysis a related version of the AuROC is used, called the concordance index. I’ll post some of my thoughts on all these evaluation scores some time in the future.
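
As a concrete (if toy) illustration of the hold-out point: train and tune on a development split only, then report both error rate and AUROC on data the model has never seen. This uses scikit-learn and synthetic imbalanced data purely as a stand-in for whatever stats package is actually in use:

    # Hold-out evaluation sketch: error rate vs. AUROC on unseen, imbalanced data.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, accuracy_score

    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)  # imbalanced toy data

    X_dev, X_holdout, y_dev, y_holdout = train_test_split(X, y, test_size=0.3,
                                                          stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)   # all tuning happens on X_dev only
    print("error rate:", 1 - accuracy_score(y_holdout, clf.predict(X_holdout)))
    print("AUROC     :", roc_auc_score(y_holdout, clf.predict_proba(X_holdout)[:, 1]))
    # On data this skewed, a low error rate says little; the AUROC is far more informative.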

Choosing the right features for Data Mining

Friday, June 1st, 2007

I’m fascinated by a common problem in data mining: how do you pick variables that are indicative of what you are trying to predict? It is simple for many prediction tasks; if you are predicting some strength of a material, you sample it at certain points using your understanding of physics and materials science. It is more fascinating with problems that we believe we understand, but don’t. For example, what makes a restaurant successful and guarantees repeat business? The food? The pricing? A friendly wait staff? It turns out a very big indicator of restaurant success is the lighting. I admit that lighting didn’t make it into my top-ten list… If you now consider what is asked on the average “are you satisfied with our service?” questionnaire you can find in various restaurant chains, I don’t recall seeing anything about the ambiance on it. We are asking the wrong questions.

There are many other problems in real life just like this. I read a book called Blink, and the point the author is trying to make is that subconscious decisions are easy to make – once you know what to look for. More information is not always better. This holds for difficult problems such as judging the effectiveness of teachers (IIRC, seeing the first 30 seconds of a videotape of a teacher entering a classroom is as indicative as watching hours of recorded lectures). The same holds true for prediction problems about relationships – how can you predict whether a couple will still be together 15 years later? It turns out there are four simple indicators to look for, and you can do it in maybe two minutes of watching a couple… The book is full of examples like that, but does not provide a way to “extract the right features”. I have similar problems with the criminology work I’m doing; while we get pretty good results using features suggested by the criminology literature, I’m wondering if we have the right features. I’m still thinking that we could improve our results if we had more data – or the “right” data, I should say (it should be obvious by now that more is not always better). How do you pick the features for a problem? Tricky question…

There is one kind of data mining system that does not have this problem: recommender systems. They avoid it because they do not rely on particular features to predict, but exploit correlations in “liking”. A classical example is that people who like classical music often like jazz as well – something you wouldn’t easily be able to predict from features you extract from the music. I wonder if we could reframe some prediction problems in ways more similar to recommender systems, or maybe make better use of meta-information in certain problems. What I mean by “meta-information” is easily explained with an example: PageRank. It is so successful in web-scale information retrieval because it does not bother with trying to figure out whether a page is relevant by keyword ratios and whatnot, but simply measures popularity by how many important pages link to it (before link spam became a problem, that is). I wish something that simple were possible for every problem 🙂
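
To make the “correlations in liking” point concrete, here is a tiny item-item similarity sketch of my own (the ratings matrix is invented): the classical/jazz connection falls out of who co-rated what, with no features of the music involved.

    # Minimal item-item recommender sketch; the ratings matrix is made up.
    import numpy as np

    items = ["classical", "jazz", "metal", "pop"]
    R = np.array([            # rows = users, columns = items, 0 = not rated
        [5, 4, 0, 1],
        [4, 5, 0, 2],
        [0, 0, 5, 4],
        [5, 5, 1, 0],
        [1, 0, 4, 5],
    ], dtype=float)

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # Similarity between items based purely on who liked them.
    for j, item in enumerate(items[1:], start=1):
        print(f"sim(classical, {item}) = {cosine_similarity(R[:, 0], R[:, j]):.2f}")
    # classical/jazz comes out far more similar than classical/metal.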