How can one evaluate the performance of prognostic models in a meaningful way? This is a basic yet interesting problem, especially in the context of predicting very rare events (base rates below 10%). How reliable is the model’s forecast? This question is of particular importance when it matters – think criminal psychology, where models forecast the likelihood of recidivism for criminally insane offenders (Quinsey, 1980). There are a variety of ways to evaluate a model’s predictive performance on a hold-out sample, and some are more meaningful than others. For example, error rates are only meaningful when you also consider the base rate of your classes and the performance of the trivial classifier. This often becomes confusing when you are dealing with very imbalanced data sets or rare events. In this blog post, I’ll summarize a few techniques and alternative evaluation methods for predictive models that are particularly useful when dealing with rare events or low base rates in general.
The Receiver Operating Characteristic (ROC) curve is a graphical measure that plots the true positive rate against the false positive rate, so that the user can decide where to set the cut-off for the final classification decision. To summarize the performance of the whole curve in a single, reportable number, the area under the curve (AUC) is generally used.
The AUC has many nice properties, such as being base-rate independent and being equivalent to other mathematically convenient statistics like the Mann-Whitney U statistic, which measures the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative one.
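The rank interpretation above can be made concrete with a short sketch (in Python here, although the original analysis was done in R; the function name is mine): the AUC is the fraction of positive/negative pairs in which the positive example receives the higher score, counting ties as half.

```python
from itertools import product

def auc_rank(scores_pos, scores_neg):
    """AUC as the probability that a positive example outranks a negative
    one (ties count as half), i.e. the normalized Mann-Whitney U statistic."""
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy risk scores: positives tend to score higher, so the AUC is above 0.5.
pos = [0.9, 0.8, 0.6]
neg = [0.7, 0.4, 0.3, 0.2]
print(auc_rank(pos, neg))  # 11 of 12 pairs are ranked correctly
```

For realistically sized samples one would of course use a library routine rather than the quadratic pairwise loop, but the pairwise view is exactly what makes the rare-event sensitivity discussed next easy to see.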
There are a few drawbacks to the AUC. For one, it also measures regions (cut points) that nobody would seriously consider using for a classification model, i.e., toward the extreme ends on either side of the ROC curve. In the case of very imbalanced data sets, it is also very sensitive to small changes in classification performance: re-ranking a single case can lead to a significant change in the AUC. This sensitivity becomes fairly obvious once you view the AUC as measuring pairwise rank comparisons between positive and negative examples. If there are very few examples of either class, the number of comparisons drops sharply, and each change in the ranking of items carries more weight.
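A small simulation illustrates this sensitivity (a hypothetical sketch, not data from the study): with only five positives among a hundred cases, moving a single positive case from the middle of the ranking to the top shifts the AUC by almost a tenth.

```python
# With few positives, re-ranking one case moves the AUC noticeably, because
# each positive contributes only n_neg of the n_pos * n_neg pairwise comparisons.

def auc_rank(pos, neg):
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

neg = [i / 100 for i in range(95)]      # 95 negatives scored 0.00 .. 0.94

pos_a = [0.96, 0.97, 0.98, 0.99, 0.50]  # one positive ranked mid-field
pos_b = [0.96, 0.97, 0.98, 0.99, 0.95]  # the same case re-ranked to the top

print(auc_rank(pos_a, neg), auc_rank(pos_b, neg))  # about 0.906 vs. 1.0
```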
Let us use a more concrete example of a rare-event prognostic model, drawn from the criminal justice context, where evaluating a model’s performance is important. The example I’m using here is a violent felony offense by a parolee, an event that is (thankfully) fairly rare and occurs for only about five percent of the parole population. So let’s say we collected 3300 cases with two years of follow-up (time from release to first offense or end of study) and had 177 failures (VFO, or violent offenses). First, note that this is a rather large sample with a long follow-up time – I’ve seen models built on far fewer cases with a much shorter follow-up period. Note also that the follow-up time influences the base rate in our setup – the longer the time frame, the higher the proportion of people showing recidivism (until it stabilizes). For this example, we build a simple logistic regression model predicting the outcome (class label) from a few standard predictors such as age, violent history, and so on.
First, let’s evaluate the model. Finding an accuracy of 94.7% on the hold-out set and an AUC of 0.78, we could be content and decide to use the model. Not so fast. The base rate is 5.3%, so our classifier’s accuracy is exactly what the trivial classifier – one that always predicts “no failure” – would achieve. It is not a trivial classifier, however, since the AUC is above 0.5, indicating a non-random predictor. If the classifier always predicted the same class, the AUC would be 0.5. So the model is doing something right.
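The arithmetic behind this point is worth spelling out (a sketch using the counts from the example; the post rounds the base rate to 5.3%): with a rare event, the trivial classifier’s accuracy is simply one minus the base rate, so a high accuracy on its own tells you almost nothing.

```python
# Accuracy of the trivial "never fails" classifier at the study's base rate.
n_total, n_fail = 3300, 177

base_rate = n_fail / n_total          # roughly 0.054
trivial_accuracy = 1 - base_rate      # roughly 0.946

print(f"base rate {base_rate:.1%}, trivial accuracy {trivial_accuracy:.1%}")
```

Any reported accuracy should therefore be compared against this baseline, not against 50%.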
An AUC of 0.78 is actually fairly good. Other models I have seen perform around 0.7, and some were built on only 300 cases. Given that we are dealing with a rare event and have 3300 cases, two years of follow-up, and 177 failures (violent offenses), this is a well-constructed study.
So what could possibly go wrong?
As we can see in the left plot, the curve is pretty flat in the range of 0.8 on the x-axis, because there are no cases there. This information is lost when only the AUC is reported. Large differences between two ROC curves can translate into very small differences in AUC (Dwyer, 1996), especially when the range of values produced by the risk scale is not fully continuous due to the sample.
There are other attributes of prediction quality that should be considered before putting a model into production:
- Resolution: the ability of the model to resolve the cases into classes of varying risk. There should be a difference between the overall base rate and the base rates within the risk classes.
- Reliability (calibration): there should be agreement between the predicted and observed probabilities for success and failure. You could also take this as a reliable measure of confidence in the predictions.
- Sharpness: the probabilities should be spread out and not clumped around the mean. Not all the cases should have 0 or 1 probability of failure.
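These three attributes are tied together by Murphy’s decomposition of the Brier score into reliability, resolution, and uncertainty (Brier = reliability − resolution + uncertainty). The sketch below (Python; the function and the toy data are mine, not from the study) computes the decomposition by grouping cases that received the same forecast probability:

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Murphy's decomposition: Brier = reliability - resolution + uncertainty.
    Cases are binned by their (here: exact) forecast probability."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)
    # Reliability: squared gap between forecast and observed rate per bin.
    reliability = sum(len(v) * (f - sum(v) / len(v)) ** 2
                      for f, v in bins.items()) / n
    # Resolution: how far the bins' observed rates stray from the base rate.
    resolution = sum(len(v) * (sum(v) / len(v) - base_rate) ** 2
                     for v in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# Toy rare-event forecasts: a low-risk class (0.1) and a high-risk class (0.5).
forecasts = [0.1] * 8 + [0.5] * 2
outcomes  = [0] * 8 + [1, 0]
rel, res, unc = brier_decomposition(forecasts, outcomes)
print(rel, res, unc)   # reliability 0.008, resolution 0.04, uncertainty 0.09
```

A good model has low reliability (small calibration gaps) and high resolution relative to the irreducible uncertainty of the base rate.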
A good way to examine resolution of a risk scale is to use survival analysis and examine the probability of failure for each risk group. In the example, we split the risk scores into three groups indicating low, medium, and high risk of failure.
In the plot we can see that there is very little discrimination between the medium and high groups at the 90 day mark. A better model would have no overlap in the confidence intervals at any point in time and would show a clear separation in risk for all groups.
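The failure-probability curves in such a plot are Kaplan-Meier (product-limit) survival estimates computed per risk group. The original plots were made in R; purely as an illustration of the estimator, here is a minimal hand-rolled version with hypothetical follow-up data for one risk group:

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.
    times: follow-up days per case; events: 1 = failure observed, 0 = censored."""
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        at_risk = sum(1 for ti in times if ti >= t)
        if deaths:
            surv *= 1 - deaths / at_risk  # step down at each failure time
        curve.append((t, surv))
    return curve

# Hypothetical high-risk group: failures at 90 and 180 days, rest censored.
times  = [90, 120, 180, 365, 365, 365]
events = [1,  0,   1,   0,   0,   0]
print(kaplan_meier(times, events))
```

The failure probability at each time point is one minus the survival estimate, and comparing these curves (with confidence intervals) across the low, medium, and high groups is what reveals whether the scale actually resolves risk.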
The reliability plot compares predicted probabilities with observed probabilities by binning the cases into groups of similar predicted probability. For example, all cases with a predicted failure probability of about 0.05 go into one bin. Each bin’s mean predicted probability is then plotted against the observed failure rate in that bin. A good, reliable predictor has the red line close to the diagonal (predicted probabilities are close to the observed ones).
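The binning step can be sketched as follows (the helper and toy data are hypothetical; the actual plots were produced in R):

```python
def reliability_curve(predicted, observed, n_bins=10):
    """Bin predicted probabilities into equal-width bins and pair each bin's
    mean prediction with the observed failure rate in that bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(predicted, observed):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, o))
    points = []
    for b in bins:
        if b:  # skip empty bins
            mean_pred = sum(p for p, _ in b) / len(b)
            obs_rate = sum(o for _, o in b) / len(b)
            points.append((mean_pred, obs_rate))
    return points

# Well-calibrated toy example: observed rates roughly match the predictions.
pred = [0.05] * 20 + [0.45] * 10
obs  = [0] * 19 + [1] + [0] * 5 + [1] * 5
print(reliability_curve(pred, obs))
```

Plotting these (mean prediction, observed rate) pairs against the diagonal gives the reliability plot described above.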
If the model had better sharpness, we would see a good spread in the histogram in the lower right corner. In this case most of our cases have low risk of failure, because the event is rare. A more ideal model would show better spread of the probabilities for the cases at elevated risk of failure.
The attribute plot is similar to the reliability plot, but shows confidence intervals for the predicted and observed probability bins. In addition, there are two lines indicating a model with no skill (not predicting better than the trivial classifier) and no resolution (predicting the baseline probability).
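The two reference lines have simple closed forms in the attributes diagram (Hsu & Murphy, 1986). As a sketch, using the base rate from the example: for base rate c, "no resolution" is the horizontal line o = c, and the "no skill" line (zero Brier skill score) lies halfway between it and the perfect-reliability diagonal, o = (p + c) / 2.

```python
# Reference lines of the attributes diagram for the example's base rate.
c = 177 / 3300                 # base rate, about 0.054

def no_resolution(p):
    """A forecast stuck at the base rate: observed rate is c everywhere."""
    return c

def no_skill(p):
    """Zero Brier skill: halfway between the diagonal and the base-rate line."""
    return (p + c) / 2

# Both lines intersect the perfect-reliability diagonal at p = c.
print(no_skill(c), no_resolution(c))
```

Points between the no-skill line and the diagonal contribute positive skill; points on the base-rate side of the no-skill line contribute negative skill.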
We can see that although our model performs well for many of the lower-probability classes, it performs fairly poorly for the higher-probability cases. This is again probably due to the lack of cases in that region, as we saw in the sharpness histogram.
This is a summary of a talk that William Dieterich, Tim Brennan, and I presented in a stats-session at the 2009 ASC conference. The second half of the talk will appear on this blog eventually.
References and further reading
Briggs, W. M. & Zaretzki, R. (2008). The skill plot: A graphical technique for evaluating continuous diagnostic tests. Biometrics, 64, 250 – 261.
Dwyer, A.J. (1996). In pursuit of a piece of the ROC. Radiology, 201, 621 – 625.
Hsu, W., & Murphy, A. H. (1986). The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. International Journal of Forecasting, 2, 285 – 293.
Murphy, A.H. (1993). What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather and Forecasting, 8, 281 – 293.
Quinsey, V. L. (1980). The baserate problem and the prediction of dangerousness: A reappraisal. Journal of Psychiatry & Law, 8, 329 – 340.
The plots in this blog post were made using the NCAR and Epi packages in R.