- Markus Breitenbach - http://blog.markus-breitenbach.com -
Debugging and Evaluating Predictive Models
Posted By Markus On June 22, 2008 @ 5:05 pm In Classification, Data Mining | 1 Comment
Speaking of the recent revelation that [1] Moody’s mis-rated CPDOs due to a computer glitch (see also [2] here): it is quite tricky to get models right all the way from construction to production. Interestingly, S&P, which gave identical ratings on many of them, [3] stands by their independently developed model, which makes me wonder how much models are tested in practice, and whether I would catch such a bug myself while moving a model into production.
Most of the data-mining and modeling work is done in your favorite stats package (SPSS, SAS etc.), but for production use people have to re-implement whichever equation they came up with to produce their risk scores. When I’m doing data-mining work I often use different tools from different authors or vendors to get the job done, as no single tool can do everything that I need. For example, I have had a lot of success with predictive modeling using Support Vector Machines, and some newer algorithms I have implemented in Matlab myself. That means I spend a lot of time converting data back and forth between the formats of different software (e.g. changing how a missing value is indicated). I usually do this with some Perl scripts I wrote, but the entire process is error-prone, especially given that the final model then has to be codified in some other language (Java, C, C#/.NET or whatever) so it can be incorporated into the project’s software. It takes a while to get it right, because more often than not an error is not obvious (read: I have had bad experiences with subtle errors during black-box testing). The following is my check-list for debugging the process (probably not complete enough to catch everything):
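One check that fits this workflow is a parity test: score the same records with the stats package and with the production re-implementation, then compare the two outputs record by record. The following is only a minimal sketch in Python. The function name `find_mismatches`, the tolerance, and the convention that a missing score shows up as `None` or NaN are assumptions made for illustration.

```python
import math


def find_mismatches(reference, production, abs_tol=1e-9):
    """Return (index, reference_score, production_score) for every record
    on which the two implementations disagree.

    Assumption: a missing score is represented as None or NaN on either side.
    """
    if len(reference) != len(production):
        raise ValueError("score lists differ in length")

    mismatches = []
    for i, (ref, prod) in enumerate(zip(reference, production)):
        ref_missing = ref is None or math.isnan(ref)
        prod_missing = prod is None or math.isnan(prod)
        if ref_missing or prod_missing:
            # A score that is missing on one side must be missing on the other.
            if ref_missing != prod_missing:
                mismatches.append((i, ref, prod))
        elif not math.isclose(ref, prod, rel_tol=0.0, abs_tol=abs_tol):
            # Both scores are present but differ by more than the tolerance.
            mismatches.append((i, ref, prod))
    return mismatches
```

Running this over a few thousand records, including ones with missing inputs and extreme values, tends to surface conversion errors that a handful of hand-picked test cases would miss. An empty result is evidence of agreement on those records only, not proof that the port is correct.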
As for model evaluation, I’ve read some reports in the past (not financial scorings, though) where testing was done on the training set. Obviously, model quality should be assessed on a hold-out data set that has not been used for training or parameter tuning. Model quality in the machine learning community is still often evaluated using error rate, but lately the area under the [4] Receiver-Operator Characteristic curve has become popular (often abbreviated as AuROC, AUROC, AROC or simply ROC), which I have found to be especially useful for imbalanced datasets. In meteorology a lot of thought has gone into the [5] evaluation and comparison of the performance of different predictive models. Wilks Cost-Curves and Brier Skill-Scores look really interesting. Some models, although trained on a dichotomous variable, are really predicting a risk over time and should be evaluated using survival analysis (e.g. higher risk scores should lead to sooner failure). In survival analysis a different version of the AuROC, called the concordance index, is used. I’ll post some of my thoughts on all the evaluation scores some time in the future.
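To make three of these scores concrete, here is a minimal sketch in plain Python, meant to be run on hold-out data only. The function names are made up for illustration. The Brier skill score uses the base rate of the evaluated sample as its reference forecast, which is one common choice and an assumption here. The concordance index follows the usual pairwise definition (Harrell's C) and ignores pairs with tied times. The pairwise loops are quadratic, so this suits small evaluation sets rather than production use.

```python
def auroc(labels, scores):
    """Probability that a random positive scores higher than a random
    negative (ties count one half). Labels are 0 or 1."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(labels, probs):
    """Mean squared difference between forecast probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def brier_skill_score(labels, probs):
    """1 - BS / BS_ref, where the reference always forecasts the base rate
    of the evaluated sample (an assumption; other references are possible)."""
    base_rate = sum(labels) / len(labels)
    reference = brier_score(labels, [base_rate] * len(labels))
    if reference == 0.0:
        raise ValueError("reference score is zero: only one class present")
    return 1.0 - brier_score(labels, probs) / reference


def concordance_index(times, events, risks):
    """Among comparable pairs, the fraction in which the record with the
    higher risk score fails sooner (ties in risk count one half).
    events[i] is 1 for an observed failure and 0 for a censored record."""
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        for j in range(len(times)):
            # Comparable: i was observed to fail strictly before j's time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

For example, `auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` returns 0.75, because three of the four positive-negative pairs are ranked correctly. With no censoring and a binary outcome, the concordance index reduces to the same pairwise ranking idea as the AuROC.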
URL to article: http://blog.markus-breitenbach.com/2008/06/22/debugging-and-evaluating-predictive-models/
URLs in this post:
[1] Moody’s mis-rated CPDOs due to a computer glitch: http://www.ft.com/cms/s/0/0c82561a-2697-11dd-9c95-000077b07658.html?nclick_check=1
[2] here: http://www.nytimes.com/2008/04/27/magazine/27Credit-t.html?_r=2&oref=slogin&pagewanted=print
[3] stands by their independently developed model: http://calculatedrisk.blogspot.com/2008/05/which-ratings-model-is-broken.html
[4] Receiver-Operator Characteristic: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
[5] evaluation and comparison of the performance of different predictive models: http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.html#Methods%20for%20dichotomous%20forecasts