Valentine’s day tango

February 15th, 2009

[Embedded video: EN TUS BRAZOS from et su on Vimeo]

Adversarial Scenarios in Risk Mismanagement

January 11th, 2009

I just read another article discussing whether risk management tools had an impact on the current financial crisis. One of the most commonly used risk management measures is Value-at-Risk (VaR), a comparable measure that specifies a worst-case loss at some confidence level. One of the major criticisms (raised e.g. by Nassim Nicholas Taleb, the author of The Black Swan) is that the measure can be gamed. Risk can be hidden “in the rare event part” of the prediction and, not surprisingly, this seems to have happened.
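To make the measure concrete: in its simplest historical-simulation form, VaR is just a percentile of observed losses. A minimal sketch (not tied to any particular product; function name and data are made up):

```python
import numpy as np

def historical_var(returns, confidence=0.99):
    """Historical Value-at-Risk: the loss level exceeded in only
    (1 - confidence) of the observed periods."""
    losses = -np.asarray(returns)  # a loss is a negative return
    return np.percentile(losses, confidence * 100)

# Simulated daily returns: mostly small moves around a slight drift.
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, 1000)
print(f"99% one-day VaR: {historical_var(returns):.4f}")
```

Note that everything beyond the chosen percentile is invisible to the number reported, which is exactly the "rare event part" where risk can be hidden.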

Given that a common question during training with risk assessment software is “what do I do to get outcome/prediction x from the software”, it should be explored how to safeguard the software against users gaming the system. Think detecting multiple model evaluations in a row with slightly changed numbers…
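A crude version of such a safeguard might look like the sketch below; the function name, tolerance and run length are all made up for illustration:

```python
import numpy as np

def flag_fiddling(evaluations, tol=0.05, run_length=3):
    """Flag a session in which `run_length` or more consecutive model
    evaluations differ only by small input tweaks -- a rough signal
    that someone is searching for a desired score.
    `evaluations` is a list of input vectors in chronological order."""
    run = 1
    for prev, cur in zip(evaluations, evaluations[1:]):
        diff = np.max(np.abs(np.asarray(cur) - np.asarray(prev)))
        run = run + 1 if diff <= tol else 1
        if run >= run_length:
            return True
    return False
```

In practice one would also want to log which inputs were tweaked, since repeatedly nudging a single field is more suspicious than re-entering a whole case.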

Edit: I just found an instrument implemented as an Excel spreadsheet. Good for prototyping something, but using it in practice is just asking people to fiddle with the numbers until the desired result is obtained. You couldn’t make gaming it more user-friendly if you tried…

Credit Card companies adjusting Credit Scores

December 22nd, 2008

I just read that credit card companies are adjusting customers’ credit scores based on shopping patterns, in addition to the traditional credit and payment history. It seems they also consider which mortgage lender a customer uses and whether the customer owns a home in an area where housing prices are declining. All this to limit the growing credit card default rates.

That’s an interesting way to do it (from a risk modeling point of view) and I wonder how well it works in practice. There might also be legal ramifications if it can be demonstrated that this practice discriminates (possibly unknowingly) against, e.g., minorities.

Kids, Games and Sociopaths

December 2nd, 2008

In a radio interview, Professor Marc Bekoff described some of the research on sociopathy and playing games. Research indicates that sociopaths don’t play and never learned how to play with other people. A lot of learning is done through play: boundaries and rules are established, and kids learn how to get along with other kids. And then you have a school banning the game of tag and other chase games, and even a Virginia school banning all touching between kids. I wonder if these schools thought about the consequences of such policies. The kids might turn out like the one in this story illustrating the negative effects of child fear mongering and overprotective parenting. What are kids in Virginia supposed to do? I guess we should have more video games in school then 🙂 (side note: see “player quits World of Warcraft (WoW)” as an extreme example of how excessive video gaming can mess up lives). Interesting times indeed…

Back from Conference of the American Society of Criminology (ASC 2008)

November 21st, 2008

I just got back from ASC 2008 (the conference of the American Society of Criminology). It’s the main conference for everything in criminology and has a wide international attendance. This was the first conference of this kind I attended, and it was quite different from what I’m used to. There were more than 20 tracks – yep, 20 talks going on at the same time. It’s impossible to pick and choose; the program was a book of a few hundred pages containing only the titles and names (no abstracts) of the sessions and talks. Wow… But still way too many talks. I think the conference would be better if there were a review process for the abstracts, as some of the talks didn’t quite match the advertised title.

However, in the sessions I attended, about two thirds of the presenters failed to show up. In one particular case I was interested in seeing a talk critical of a psychometric instrument I have worked with, and the presenters bailed even though we had seen them that morning in the conference hotel. That’s something I haven’t seen happen at computer science conferences at all. Some of the studies presented were a bit funny (small sample, no hold-out set, etc.). Overall I got one new idea out of it that could turn out to be interesting: a diversity measure for static recidivism risk models.

Unfortunately St. Louis was a bit boring. It has pretty parks, but tango dancing, for example, ends at 11pm (in Denver it goes until 2am – at the earliest). Oh well…

Deploying SAS code in production

November 1st, 2008

I had written a post about the issues of converting models into something usable in production environments, as most stats packages don’t have friendly interfaces for integrating them into the flow of processing data. I recently worked on a similar problem involving a script written in SAS. To be specific, some code for computing a risk score in SAS had to be converted into Java, and I was confronted with having to figure out the semantics of SAS code. I found a software package that converts SAS code into Java, and I have to say I was quite impressed with how well it worked. Converting one language (especially one for which there is no published grammar or other specification) into another is quite a task – after a bit of back and forth with support we got our code converted, and the Java code worked on the first try. I wish there were similar converters for Stata, R and SPSS 🙂

Photo-based CAPTCHAs

October 15th, 2008

I found an interesting article about solving photo-based CAPTCHAs (tell cats from dogs etc.). Turns out machine learning is quite capable of solving the Asirra CAPTCHA (apart from simply following the adopt-me link and deducing from the text whether it’s a cat or a dog).

Computer Models and the Mortgage Crisis

September 28th, 2008

Interesting article in a NYT blog: How Wall Street Lied to Its Computers

Can statistical models be intellectual property?

September 1st, 2008

Recently I had a fun discussion with Bill over lunch about intellectual property and how it might apply to statistical modeling work. Given that more and more companies make a living from predictions made with models they have built (churn prediction, credit scores and other risk models), we were wondering whether there are any means of protecting such models as intellectual property. For example, the ZETA model for predicting corporate bankruptcies is a closely guarded secret, with only the variables used having been published (Altman, E. I. (2000), Predicting financial distress of companies: revisiting the Z-Score and ZETA models). Obviously such a model is useful for lending and can make serious money for its user. Making decisions guided by a formula is becoming more popular. This might be something over which legal battles will be fought in the future.

Copyrighted works and patents often count toward what a company is worth should somebody acquire it. This means start-up companies would have a motivation to protect their models. A mathematical formula (e.g. a regression equation) cannot be patented, and copyright probably won’t apply either; even if it did, it’s trivial to build a formula that does essentially the same thing (e.g. multiply all the weights in the formula by 10). This leaves only trade secret protection, which means there is no recourse once the cat is out of the bag. Often it’s also the data-collection method that is kept secret – a company called Epagogix developed a method to judge the likely success of a movie from its script by scoring it against scales that they keep secret.
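The weight-scaling point is easy to demonstrate: two formally different scoring formulas can induce exactly the same decisions, so protecting one specific formula protects almost nothing. A toy sketch (weights, cutoffs and applicants are made up):

```python
# Model B is model A with every weight and the cutoff multiplied by 10;
# the two "different" formulas make identical approve/deny decisions.
weights_a, cutoff_a = [0.3, -1.2, 0.5], 1.0
weights_b, cutoff_b = [3.0, -12.0, 5.0], 10.0

def approve(x, weights, cutoff):
    # Linear risk score compared against a decision threshold.
    return sum(w * v for w, v in zip(weights, x)) >= cutoff

applicants = [[2.0, 0.1, 1.0], [1.0, 1.0, 0.2], [4.0, 0.5, 0.9]]
print([approve(x, weights_a, cutoff_a) == approve(x, weights_b, cutoff_b)
       for x in applicants])   # the two models agree on every applicant
```

Any order-preserving transformation of the score works the same way, so the space of "equivalent but textually different" models is effectively unlimited.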

Currently I don’t see any legal protection for this beyond trade secrets. And given that there are infinitely many ways to express the same scoring rules differently, formulating sensible rules for protecting this kind of intellectual property would be a fairly hard problem for lawyers and politicians.

Taxons, Taxometrics and the Number of Clusters

August 21st, 2008

In a survey paper, various methods for finding the number of clusters were compared (Dimitriadou et al., An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets, Psychometrika, 2002) – and there are plenty of methods. None of them works all the time. Finding the right number of clusters has been an open problem for quite a while, and the answer also depends on the application, e.g. whether finer- or coarser-grained clusters are of interest.

A similar problem occurs in psychopathology. Imagine some measurements taken from several people – some with and some without a mental illness. The question then becomes: are there two clusters or just one? Is the data simply continuous, or generated by a latent Bernoulli distribution? There is a whole body of literature dealing with the same problem from the psychology standpoint (for example: Schmidt, Kotov, Joiner, “Taxometrics: Toward a New Diagnostic Scheme for Psychopathology”, American Psychological Association). One of the more famous researchers is Paul Meehl, who developed several methods for detecting a taxon in data; his MAXCOV-HITMAX procedure, for instance, is for the detection of latent taxonic structures (i.e., structures in which the latent variable is not continuously, but rather Bernoulli, distributed).
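The core intuition of MAXCOV can be sketched in a few lines: slice the sample into windows along one indicator and compute the covariance of two other indicators within each window. This is a rough illustration of the idea only, not Meehl's exact procedure (window construction, cut choices and the simulated mixture are all made up here):

```python
import numpy as np

def maxcov_curve(x, y, z, n_windows=10):
    """Sketch of the MAXCOV idea: within windows along the 'input'
    indicator x, compute cov(y, z).  A peaked curve is read as
    evidence of a latent taxon; a flat curve as a continuous
    latent dimension."""
    order = np.argsort(x)
    windows = np.array_split(order, n_windows)
    return [float(np.cov(y[idx], z[idx])[0, 1]) for idx in windows]

# Simulated taxonic data: a 50/50 mixture of two latent classes,
# three indicators each, classes separated by 4 standard deviations.
rng = np.random.default_rng(1)
cls = rng.random(2000) < 0.5
data = np.where(cls[:, None],
                rng.normal(2, 1, (2000, 3)),
                rng.normal(-2, 1, (2000, 3)))
curve = maxcov_curve(data[:, 0], data[:, 1], data[:, 2])
```

Within nearly pure windows at the extremes of x the indicators are close to independent, so the covariance is near zero; in the mixed middle windows the class difference induces a positive covariance, producing the peak.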

My problem with Meehl’s methods (MAMBAC, MAXCOV, MAXEIG etc.) is that all the articles give only an intuitive explanation. Despite these being mathematical methods, there are no clear definitions of what a method considers to be a taxon, nor necessary or sufficient conditions for when the algorithm will detect one. Zoologists, for example, hold entire conferences on how to classify species, go through a lot of painful detail to classify them properly, and have seemingly endless debates on what constitutes a new species in the taxonomy. Yet I still wasn’t able to find a mathematical definition of what constitutes a taxon.

In addition, there seem to be problems when using MAXCOV with dichotomous indicators (Maraun et al., An Analysis of Meehl’s MAXCOV-HITMAX Procedure for the Case of Dichotomous Indicators, Multivariate Behavioral Research, Vol. 38, Issue 1, Jan. 2003); in this article they pretty much take the entire procedure apart and show that it often fails to indicate a taxon when one is there, or indicates a taxon when there is nothing.

I think the question of finding a taxon is strongly related to clustering, because both simply try to answer whether clusters exist in the data. However, in all the clustering literature I’ve read so far, clusters are generally defined as dense areas in a space and are found by maximizing or minimizing some criterion (mutual information etc.). What constitutes a cluster is often conveniently defined so that it fits the algorithm at hand. And then you still have to deal with, or at least acknowledge, the fact that the current notion of clustering has been proven impossible, in the sense that no clustering function can satisfy a few natural axioms simultaneously (An Impossibility Theorem for Clustering; Kleinberg; NIPS 15).

In a new paper, Generalization from Observed to Unobserved Features by Clustering (Krupka & Tishby; JMLR 9:339–370, 2008), the authors describe an idea that might change the way we view clustering. They show that (under certain conditions), given a clustering based on some features, the items will implicitly be clustered by yet unobserved features as well. As an intuitive example, imagine apples, oranges and bananas clustered by shape, color, size, weight etc. Once you have them clustered, you will be able to draw conclusions about a yet unobserved feature, e.g. the vitamin content. Because the work is oriented toward features, it might even offer a way around the impossibility theorem.
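The fruit intuition is easy to simulate. In this sketch (the features, numbers and the tiny k-means are all made up for illustration, and this is not the paper's actual setup), items are clustered on two observed features, and cluster membership then turns out to predict a feature never shown to the clustering algorithm:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means, just enough for a toy demonstration."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

# Observed features: weight and redness.  Unobserved: vitamin level.
rng = np.random.default_rng(2)
obs = np.vstack([rng.normal([150.0, 0.9], [5.0, 0.1], (50, 2)),   # fruit type 1
                 rng.normal([120.0, 0.2], [5.0, 0.1], (50, 2))])  # fruit type 2
vitamin = np.concatenate([rng.normal(5, 0.3, 50), rng.normal(9, 0.3, 50)])

labels = kmeans(obs, 2)
# Within each cluster the *unobserved* feature is far more homogeneous
# than in the pooled sample: membership predicts vitamin level.
within = np.mean([vitamin[labels == j].std() for j in (0, 1)])
print(f"within-cluster sd {within:.2f} vs pooled sd {vitamin.std():.2f}")
```

The effect of course relies on the observed and unobserved features being driven by the same underlying structure, which is exactly the condition the paper formalizes.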

This goes half-way toward a nicer definition of a taxon – or of what should constitute a cluster, for that matter: can we draw conclusions about features not used in the clustering process? If you cluster documents by topic (using bag-of-words), can you predict which other words will appear in an article? If you cluster genes, can you predict other genes from the cluster membership?

Re-clustering on only a subset of the features should also serve as a sanity check for clustering solutions (I had written about the McIntyre-Blashfield procedure and validating clustering solutions before). I think strong patterns should replicate with fewer features; at least they did in a clustering study I did recently 🙂 .
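One standard way to quantify how well such a replication worked is the Rand index: the fraction of point pairs on which two clusterings agree (same cluster in both, or different clusters in both). A minimal sketch with made-up toy labelings:

```python
from itertools import combinations

def rand_index(a, b):
    """Fraction of point pairs treated the same way by two clusterings
    (same cluster in both, or different clusters in both)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

labels_full   = [0, 0, 0, 1, 1, 1]   # clustering on all features (toy labels)
labels_subset = [0, 0, 1, 1, 1, 1]   # re-clustering on half the features
print(rand_index(labels_full, labels_subset))   # → 0.6666666666666666
```

A strong pattern should keep this close to 1 when re-clustering on feature subsets; the adjusted Rand index, which corrects for chance agreement, would be the more rigorous choice in a real study.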

I’ll keep pondering this idea and try to formalize it. Maybe I can come up with something usable for taxometrics and a means of getting the number of clusters…