Archive for the ‘Classification’ Category

Detecting Click Fraud in Online Advertising

Sunday, March 2nd, 2014

JMLR has an interesting paper summarizing the results of a contest to build the best model for click-fraud detection. The second-place entry described some nice feature engineering. The first-place entry did feature selection and then used gbm (gradient boosting), a really good ensemble algorithm.

 

Classification with inputs that change over time – P2P Loan Data

Saturday, October 6th, 2012

Predicting whether a loan will default or not is a tricky task. It may involve many variables and incomplete information, and time is a component: loans may perform for a while before they default, and some loans may even be late but then recover to the regular payment schedule. It’s an interesting application for statistics.

The LendingClub website, a service offering peer-to-peer lending, offers an interesting data set: historical data of loan performance as well as data for new loans. I’ve been playing around a bit with the data and built a model to predict whether a loan is a good investment. The LendingClub data is available for download, and a data dictionary can be found on the website as well.

First we need to define the outcome we want to predict. A loan can be in several states, among them “current”, “defaulted”, “late” or even on a “performing payment plan”. Conservatively, I defined all loans that were not “paid off” as bad. Loans that are “current” were excluded, as they can still default in the future. Loans that are “late” are considered bad, because the borrower ran into problems. The model I’m trying to build is basically for a conservative investor looking for loans that will simply be paid back without a hitch.

With the usual statistical techniques a model can be built, and its performance can be measured by 10-fold cross-validation or by evaluating the model on a hold-out set. The real result of a prediction will of course only be available after about 3 years, when a loan is fully paid off. As the measure to optimize I chose the AUC. 10-fold cross-validation estimates the AUC of my model at 0.698, which is not too bad.

The predictions implicitly make a few assumptions. The first is that the future performance of loans will be similar to the historical performance of similar loans. I’m assuming a stationary distribution and IID samples, which is not completely true in reality, but hopefully close enough 🙂 Also, inflation expectations were not taken into account, but I’m limiting my model to 36-month loans to make that more manageable.
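
The outcome definition described above can be sketched in a few lines of Python. This is only an illustration: the status strings (`"fully paid"`, `"current"`) and the function names are assumptions, and the real LendingClub export may spell its states differently.

```python
# Hypothetical status spellings; check the data dictionary for the real ones.
GOOD_STATUSES = {"fully paid"}
EXCLUDED_STATUSES = {"current"}


def label_loan(status):
    """Return 0 for a good loan, 1 for a bad one, None if the loan is excluded."""
    normalized = status.strip().lower()
    if normalized in EXCLUDED_STATUSES:
        return None  # outcome not known yet, the loan can still default
    if normalized in GOOD_STATUSES:
        return 0
    return 1  # late, defaulted, on a payment plan, ...


def build_labels(statuses):
    """Label every loan and drop the excluded ones."""
    labels = (label_loan(s) for s in statuses)
    return [y for y in labels if y is not None]
```

Treating every unknown status as bad keeps the definition conservative: a new state that shows up in the data is never silently counted as a good loan.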

I won’t go into the details of how I encoded the variables and which variables I’m using. I discovered that I can extract information out of the textual variables in the loans. The “Loan Description”, a free text field where potential borrowers can leave comments or answer questions, is quite predictive. The difficult part is using that information in practice. A loan is in “funding state” for two weeks, during which investors can ask questions and invest in the loan. Many loans get fully funded before the two-week period is over, some without any question or comment on the loan. New information may become available in the Loan Description field that may change the classification. That means, however, that the prediction may change over time, positively or negatively, after an investment decision has been made. Not ideal, but the variables are quite powerful, so I’m still looking for a good solution.

I made the ratings that my program produces for the LendingClub loans public. I will update them occasionally (i.e., whenever I feel like it). If you have suggestions on how to use the textual variables, leave a comment.

Alternative measures to the AUC for rare-event prognostic models

Tuesday, February 16th, 2010

How can one evaluate the performance of prognostic models in a meaningful way? This is a very basic and yet interesting problem, especially when predicting very rare events (base rates <10%). How reliable is the model’s forecast? The question is of particular importance when the stakes are high: think of criminal psychology, where models forecast the likelihood of recidivism for criminally insane people (Quinsey 1980). There are a variety of ways to evaluate a model’s predictive performance on a hold-out sample, and some are more meaningful than others. For example, error rates are only meaningful when compared against the base rate of the classes and the error rate of the trivial classifier (the one that always predicts the majority class). This often gets confusing with very imbalanced data sets or rare events. In this blog post, I’ll summarize a few techniques and alternative evaluation methods for predictive models that are particularly useful when dealing with rare events or low base rates in general.
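
To make the base-rate point concrete, here is a small Python sketch with made-up numbers (5 events in 100 cases; the data and the "model" are invented for illustration). It shows how a model that actually finds some events can have a worse error rate than the trivial classifier.

```python
def error_rate(y_true, y_pred):
    """Fraction of cases where the prediction differs from the truth."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def trivial_error_rate(y_true):
    """Error rate of always predicting the majority class."""
    base_rate = sum(y_true) / len(y_true)
    return min(base_rate, 1 - base_rate)


# Made-up data: 5 rare events followed by 95 non-events.
y_true = [1] * 5 + [0] * 95

# The trivial classifier never predicts an event.
always_negative = [0] * 100

# A hypothetical model that finds 2 of the 5 events but raises 4 false alarms.
model_pred = [1, 1, 0, 0, 0] + [1] * 4 + [0] * 91

print(error_rate(y_true, always_negative))  # trivial classifier
print(error_rate(y_true, model_pred))       # model: 3 misses + 4 false alarms
```

By error rate alone the trivial classifier (5 errors) beats the model (7 errors), although it never detects a single event.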

The Receiver Operating Characteristic (ROC) curve is a graphical measure that plots the true positive rate against the false positive rate for every possible cut-off, such that the user can decide where to cut when making the final classification decision. In order to summarize the curve in a single, reportable number, the area under the curve (AUC) is generally used.
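
As a minimal sketch of how the curve and its area can be computed (plain Python, no libraries; the function names are my own, and the code assumes labels coded 0/1 with both classes present):

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per cut-off.

    Cases are sorted by decreasing score; tied scores share one cut-off.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    pairs = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        cutoff = pairs[i][0]
        # Consume all cases tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == cutoff:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of events above non-events, independent of the base rate.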


ISC on the Future of Anti-Virus Protection

Friday, August 1st, 2008

An article on the Internet Storm Center discusses whether anti-virus software in its current state is a dead end. In my opinion it has been dead for quite a while now. Apart from the absolutely unusable state that anti-virus software is in, I think it’s protecting the wrong things. Most attacks (trojans, spyware) nowadays come through web-browser exploits and maybe instant messengers (see reports on ISC). So instead of scanning incoming emails, how about a behavior blocker for the web browser and the instant messenger? There are a couple of freeware programs (e.g. IEController [German]) out there that successfully put Internet Explorer, etc. into a sandbox; whatever the JavaScript exploit, known or unknown, the browser won’t be able to execute arbitrary files or write outside its cache directory. Why is there nothing like that in the commercial AV packages?

However, a few possibilities suggested in the article might be worth exploring. For example, they suggest Bayesian heuristics to identify threats, and machine learning techniques in general look like a promising direction. IBM AntiVirus (maybe not the current version anymore) has used neural networks on 4-byte sequences (n-grams) for boot-sector virus detection.

A couple of things to keep in mind, though:

  • Quality of the classifier (detection rate) should be measured with the area under the ROC curve (AUC), not the error rate that most people tend to use in spam-filter comparisons. The base rate of the “non-virus” class is pretty high; I have over 10,000 executables/libraries on my Windows machine. All (most?) of them non-malicious.
  • The tricky part is the feature extraction. While sequences of bytes or strings extracted from a binary might be a good start, advanced features like call graphs or imported API calls should be used as well. This is pretty tricky and time-consuming, especially when it has to be done for different types of executables (Windows scripts, x86 EXE files, .NET files etc.). De-obfuscation techniques, just like in the signature-based scanners, will probably be necessary before the features can be extracted.
  • Behavior blocking and sandboxes are probably easier, a better short-term fix, and more pro-active. This has been my experience with email-based attacks as well, back in the Mydoom days when a special MIME type auto-executed an attachment in Outlook. Interestingly, there are only two programs out there that sanitize emails (check MIME types, headers, rename executable attachments etc.) at the gateway level, a much better pro-active approach than simply detecting known threats. The first is MIMEDefang, a sendmail plugin. The other is impsec, based on procmail. CU Boulder was using impsec to help keep students’ machines clean (there were scalability issues with the procmail solution, though).
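
As a starting point for the byte-sequence features mentioned above, here is a minimal Python sketch that counts overlapping 4-byte n-grams in a binary. It is only an illustration under my own assumptions (function names, relative frequencies as features, a vocabulary chosen beforehand) and not how any particular AV product does it.

```python
from collections import Counter


def byte_ngrams(data, n=4):
    """Count the overlapping n-byte sequences in a bytes object."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_features(data, vocabulary, n=4):
    """Map a binary to a fixed-length vector of relative n-gram frequencies.

    `vocabulary` is a list of n-byte sequences selected beforehand,
    e.g. the most frequent n-grams in the training set.
    """
    counts = byte_ngrams(data, n)
    total = sum(counts.values()) or 1  # avoid dividing by zero on tiny inputs
    return [counts[gram] / total for gram in vocabulary]


# Usage with a made-up byte string instead of a real executable:
sample = b"ABCDABCD"
print(byte_ngrams(sample))
print(ngram_features(sample, [b"ABCD", b"ZZZZ"]))
```

In practice the file would be read with `open(path, "rb").read()`, and the de-obfuscation step mentioned above would have to come first.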

Debugging and Evaluating Predictive Models

Sunday, June 22nd, 2008

Speaking of the recent revelation that Moody’s mis-rated CPDOs due to a computer glitch (see also here): it is quite tricky to get models right all the way from construction to production. Interestingly, S&P, which gave identical ratings on many of them, stands by its independently developed model, which makes me wonder how thoroughly models are tested in practice. It also made me wonder whether I would catch such a bug while moving a model into production.

Most of the data-mining and modeling work is done in your favorite stats package (SPSS, SAS etc.), but for production use people have to re-implement whichever equation they came up with to produce their risk scores. When I’m doing data mining work I often use different tools from different authors or vendors to get the job done, as no single tool can do everything that I need. For example, I have had a lot of success with predictive modeling using Support Vector Machines. Some newer algorithms I have implemented in Matlab myself. That means I spend a lot of time converting data back and forth between different software (e.g. changing how a missing value is indicated). I usually do this with some Perl scripts I wrote, but the entire process is error-prone, especially given that the final model then has to be codified in some other language (Java, C, C#/.NET or whatever) so it can be incorporated into the project’s software. It takes a while to get it right, because more often than not an error is not obvious (read: I had bad experiences with subtle errors during black-box testing). The following is my checklist for debugging the process (probably not complete enough to catch everything):

  • Do the results mean what you think they mean? What values for classification are good or bad? (big/small scores, or +1/-1 …)
  • Features: when exporting/importing the data, is the order of the features the same? Is the classification label in the right place? This is a lot of fun when you export stuff from SPSS/R/STATA to Matlab (which does not support named-columns in a matrix – better get all those indexes right)
  • Were missing values treated the right way when building the model? There are many ways to deal with them, and the right one depends on the situation: MCAR (“missing completely at random”), MAR (“missing at random”), NMAR (“not missing at random”), non-response, and whether imputation (single or multiple) is used.
  • Did I deal with special values correctly? I’m not talking about the NULL value in the database, but “flag-values” such as 999 etc. to indicate a certain condition
  • When exporting/importing data, are special values (e.g. missing, flag and categorical values) handled correctly? Every program encodes them differently, especially when you use comma separated text (csv)
  • Is the scaling of the data the same? Are the scaled values of the new data larger/smaller than what the classifier was trained on? How are these cases handled?
  • Are the algorithm parameters the same, e.g. a kernel-sigma in a support vector model?
  • Is the new data from the same distribution? (When I hear “similar”, then it usually does not work in my experience 🙂 ) Check mean and variance for a start. Sometimes the difference can be subtle (e.g. a male-only model applied to females; this can work, depending on what was modeled). Was the data extracted the same way with the same SQL query? Was the query translated correctly into SQL? Was the result prepared the same way (recodes, scaling)?
  • In my code I check for each attribute if it is within the range of my training set. If not, then it’s either a bug (scaling?) or the model can’t reliably predict for this case.
  • Run some simple test cases through both your stats package and your production code and compare the results. I had a lot of success with white-box tests for testing recode tables etc.
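
Two of the checks above, recoding flag values and testing whether each attribute lies within the training range, can be sketched as follows. This is a minimal Python illustration, not my production code: the flag codes (`999`, `-1`) and the function names are assumptions, and rows are assumed to be plain lists of numbers.

```python
import math

# Hypothetical flag codes; use whatever your data dictionary lists.
FLAG_VALUES = {999, -1}


def recode_flags(row, flags=FLAG_VALUES):
    """Replace flag values by NaN so they are not mistaken for measurements."""
    return [math.nan if value in flags else value for value in row]


def fit_ranges(training_rows):
    """Record the (min, max) of every attribute, ignoring missing values."""
    ranges = []
    for column in zip(*training_rows):
        observed = [v for v in column if not math.isnan(v)]
        if observed:
            ranges.append((min(observed), max(observed)))
        else:
            # Nothing observed in training: any real value counts as out of range.
            ranges.append((math.nan, math.nan))
    return ranges


def out_of_range(row, ranges):
    """Return the indexes of attributes that fall outside the training range."""
    return [
        index
        for index, (value, (low, high)) in enumerate(zip(row, ranges))
        if not math.isnan(value) and not (low <= value <= high)
    ]
```

A non-empty result from `out_of_range` means either a bug upstream (scaling, column order) or a case the model cannot reliably predict; both are worth logging rather than scoring silently.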

As for the model evaluation, I’ve read some reports in the past (not financial scorings, though) where testing was done on the training set. Obviously model quality should be assessed on a hold-out data set that has not been used for training or parameter tuning. Model quality in the machine learning community is still often evaluated using the error rate, but lately the area under the Receiver Operating Characteristic curve has become popular (often abbreviated as AuROC, AUROC, AROC or ROC), which I found to be especially useful for imbalanced datasets. In meteorology a lot of thought has been put into the evaluation and comparison of the performance of different predictive models. Wilks Cost-Curves and Brier Skill-Scores look really interesting. Some models, although trained on a dichotomous variable, are really predicting a risk over time and should be evaluated using survival analysis (e.g. higher risk scores should lead to sooner failure etc.). In survival analysis a different version of the AuROC is used, called the concordance index. I’ll post some of my thoughts on all the evaluation scores some time in the future.
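
For reference, here is a minimal Python sketch of two of the scores mentioned above, using their textbook definitions: the Brier skill score relative to a forecast that always predicts the base rate, and Harrell’s concordance index for censored data. The function names are my own. The code assumes that both classes occur in the data and that at least one comparable pair exists; otherwise it divides by zero.

```python
def brier_score(y_true, probs):
    """Mean squared difference between forecast probability and outcome (0/1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def brier_skill_score(y_true, probs):
    """1 - BS / BS_ref, where the reference always forecasts the base rate."""
    base_rate = sum(y_true) / len(y_true)
    reference = brier_score(y_true, [base_rate] * len(y_true))
    return 1 - brier_score(y_true, probs) / reference


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk score orders correctly.

    A pair (i, j) is comparable when case i failed (events[i] is true)
    before time j. It is concordant when case i also has the higher risk
    score; tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable
```

A Brier skill score of 0 means the model is no better than forecasting the base rate, which makes it a natural fit for rare events; a concordance index of 0.5 plays the same role as an AUC of 0.5.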