Interesting blog post on Differential Privacy. I wasn’t aware of this specific privacy model.
Archive for the ‘Data Mining’ Category
The UK is in the process of overhauling their overly stringent copyright laws. That’s an interesting development (see the Nature blog entry on the topic). One idea being discussed is to generally allow data and text mining without the copyright holders permission, which would usually be required for any kind of electronic processing.
Anand Rajaraman and Jeff Ullman wrote a book called Mining of Massive Datasets that can be downloaded for free (PDF, 340 pages, 2MB). It focuses on data mining of very large amounts of data that do not fit in main memory as found on the frequently on the web from an algorithmic point of view.
Interesting article: GraphLab: A New Framework for Parallel Machine Learning
From the abstract:
Designing and implementing efficient, provably correct parallel machine learning (ML) algorithms is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. We demonstrate the expressiveness of the GraphLab framework by designing and implementing parallel versions of belief propagation, Gibbs sampling, Co-EM, Lasso and Compressed Sensing. We show that using GraphLab we can achieve excellent parallel performance on large scale real-world problems.
Given all the talk about Map-Reduce, Hadoop etc. this seems like a logical next step to make scaling data mining to large data sets a lot easier.
I was a bit amused to read about this new algorithm that IBM research developed and that was sold as “energy efficient” in their press-release. This is good marketing, because the average journalist and reader might not understand the impact of the improvement. It just sounds a lot better to be green and save energy than to improve computational complexity…
How can one evaluate the performance of prognostic models in a meaningful way? This is a very basic and yet an interesting problem especially in the context of prediction of very rare events (base-rates <10%). How reliable is the model’s forecast? This is a good question and of particular importance when it matters – think criminal psychology where models forecast the likelihood of recidivism for criminally insane people (Quinsey 1980). There are a variety of ways to evaluate a model’s predictive performance on a hold out sample, and some are more meaningful than others. For example, when using error-rates one should keep in mind that they are only meaningful when you consider the base-rate of your classes and the trivial classifier as well. Often this gets confusing when you are dealing with very imbalanced data sets or rare events. In this blog post, I’ll summarize a few techniques and alternative evaluation methods for predictive models that are particularly useful when dealing with rare events or low base-rates in general.
The Receiver Operator Characteristic is a graphical measure that plots the true versus false positive rates such that the user can decide where to cut for making the final classification decision. In order to summarize the performance of the graph in a single, reportable number, the area under the curve (AUC) is generally used.
I just read another article discussing weather Risk Management tools had an impact on the current financial crisis. One of the most commonly used risk management measures is the Value-at-Risk (VaR) measure, a comparable measure that specifies a worst-case loss for some confidence interval. One of the major criticisms is (e.g. Nassim Nicholas Taleb, the author of the black swan) that the measure can be gamed. Risk can be hidden “in the rare event part” of the prediction and not surprisingly this seems to have happened.
Given that a common question during training with risk assessment software is “what do I do to get outcome/prediction x” from the software it should be explored how to safeguard in the software against users gaming the system. Think detecting multiple model evaluations with slightly changed numbers in a row…
Edit: I just found an instrument implemented as an Excel Spreadsheet. Good for prototyping something, but using that in practice is just asking people to fiddle with the numbers until the desired result is obtained. You couldn’t make it more user-friendly if you tried…
I just read that Credit Card Companies are adjusting Credit Scores based on shopping patterns in addition to credit-score and payment history. It seems they also consider which mortgage lender a customer uses and whether the customer owns a home in an area where housing prices are declining. All that to limit the growing credit card default rates.
That’s an interesting way to do it (from a risk modeling point of view) and I wonder how well it works in practice. There might also be some legal ramifications to this if it can be demonstrated that this practice (possibly unknowingly to them) discriminates e.g. against minorities.
I had written a post about the issues of converting models into something that is usable in production environments as most stats-packages don’t have friendly interfaces to integrate them into the flow of processing data. I worked on a similar problem involving a script written in SAS recently. To be specific, some code for computing a risk-score in SAS had to be converted into Java and I was confronted with having to figure out the semantics of SAS code. I found a software to convert SAS code into Java and I have to say I was quite impressed with how well it worked. Converting one language (especially one for which there is no published grammar or other specification) into another is quite a task – after a bit of back and forth with support we got our code converted and the Java code worked on the first try. I wish there would be similar converter for STATA, R and SPSS