The RAND Corporation just published an interesting paper exploring the benefits of using risk prediction to reduce the screening required at airports. You might have noticed various attempts to establish some kind of fast-lane or trusted-traveler program. Obviously this is a very sensitive topic and probably hard to get right. Screening certain groups of the population more than others (“profiling”) is generally frowned upon and also not a good idea in general (see “Strong profiling is not mathematically optimal for discovering rare malfeasors (on rare event detection)”), but what hasn’t been examined much is identifying people who can be considered more “safe” than others. The paper explores that idea and shows that, even under the assumption that the bad guys will try to subvert the program, there can be benefits to implementing it. The paper is a bit sparse on mathematical details. Certainly an interesting idea, though.
Archive for the ‘Society’ Category
Strong profiling is not mathematically optimal for discovering rare malfeasors (on rare event detection)
Sunday, January 10th, 2010
Just in time for the latest Christmas terror scare, I came across an interesting paper: “Strong profiling is not mathematically optimal for discovering rare malfeasors” (William H. Press; PNAS 106(6), pp. 1716–1719, www.pnas.org/cgi/doi/10.1073/pnas.0813202106). In the paper, the author investigates whether profiling by nationality or ethnicity can be justified mathematically and tries to answer the question of how much screening we must do, on average, to catch the bad guys in the crowd. Rare-event detection is hard as it is, and it’s interesting to see it examined from a sampling perspective. It’s an interesting and short read. Long story short, it shows that screening on a crude feature like nationality or ethnicity is not optimal (nor is any screening in proportion to, or stronger than, the prior probability of guilt) and wastes resources; the paper shows that sampling in proportion to the square root of the prior probability does better.
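The paper’s core result is easy to sketch numerically. Under a simplified model of repeated screening with replacement (my toy setup, not the paper’s full derivation), if person i is the malfeasor with prior probability p_i and is screened with probability q_i each round, the expected number of screenings until capture is Σ p_i/q_i — which is minimized by q ∝ √p, not q ∝ p:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical prior probabilities that each of 1000 people is the malfeasor.
p = rng.dirichlet(np.ones(1000))

def expected_screenings(q, p):
    # Sampling with replacement: person i is picked with probability q_i each
    # round, so the expected number of rounds to catch malfeasor i is 1/q_i;
    # averaging over the prior gives sum_i p_i / q_i.
    return np.sum(p / q)

uniform = np.full_like(p, 1 / p.size)
strong  = p / p.sum()                    # "strong profiling": screen in proportion to the prior
sqrt_q  = np.sqrt(p) / np.sqrt(p).sum()  # square-root biased sampling

for name, q in [("uniform", uniform), ("strong profiling", strong), ("sqrt sampling", sqrt_q)]:
    print(f"{name:18s} {expected_screenings(q, p):12.1f}")
```

The square-root allocation always comes out lowest (this follows from the Cauchy–Schwarz inequality); strong profiling over-samples the high-prior people and so wastes screenings.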
I just read another article discussing whether risk-management tools had an impact on the current financial crisis. One of the most commonly used risk-management measures is Value-at-Risk (VaR), a comparable summary measure that specifies a worst-case loss at some confidence level. One of the major criticisms (e.g. by Nassim Nicholas Taleb, the author of The Black Swan) is that the measure can be gamed: risk can be hidden “in the rare-event part” of the prediction, and, not surprisingly, this seems to have happened.
Given that a common question during training with risk-assessment software is “what do I do to get outcome/prediction x from the software?”, it should be explored how to safeguard the software against users gaming the system. Think detecting multiple model evaluations in a row with slightly changed numbers…
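A minimal sketch of what such a safeguard could look like — the class name, thresholds, and inputs below are all made up for illustration, not taken from any real risk-assessment product: flag a user once several consecutive evaluations use nearly identical inputs.

```python
import numpy as np

class GamingDetector:
    """Hypothetical sketch: flag repeated model evaluations with barely changed inputs."""

    def __init__(self, rel_tol=0.05, window=5):
        self.rel_tol = rel_tol  # how close two inputs must be to count as "fiddling"
        self.window = window    # how many near-duplicates in a row trigger a flag
        self.history = []

    def check(self, inputs):
        """Record one evaluation; return True if the last `window` inputs are all
        within rel_tol (relative) of the earliest one in that window."""
        x = np.asarray(inputs, dtype=float)
        self.history.append(x)
        recent = self.history[-self.window:]
        if len(recent) < self.window:
            return False
        base = recent[0]
        scale = np.abs(base) + 1e-9  # avoid division by zero
        return all(np.all(np.abs(r - base) / scale < self.rel_tol) for r in recent)

det = GamingDetector()
# A user nudging one input up by 1 each time, hunting for the desired prediction:
flags = [det.check([100 + i, 50]) for i in range(6)]
print(flags)  # the last evaluations get flagged
```

In a real product one would log the flag rather than block the evaluation, since legitimate what-if analysis looks similar; the point is only that the pattern is detectable.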
Edit: I just found an instrument implemented as an Excel spreadsheet. That’s fine for prototyping something, but using it in practice is just asking people to fiddle with the numbers until the desired result is obtained. You couldn’t make gaming the system more user-friendly if you tried…
Recently I had a fun discussion with Bill over lunch about intellectual property and how it might apply to statistical modeling work. Given that there are more and more companies making a living from producing predictions with a model they have built (churn prediction, credit scores and other risk models), we were wondering whether there are any means of protecting such models as intellectual property. For example, the ZETA model for predicting corporate bankruptcies is a closely guarded secret, with only the variables used having been published (Altman, E. I. (2000); “Predicting Financial Distress of Companies: Revisiting the Z-Score and ZETA Models”). Obviously this model is useful for lending and can make serious money for its user. Making decisions guided by a formula is becoming more popular. This might be something over which legal battles will be fought in the future.
Copyrighted works and patents often count towards what a company would be worth should somebody acquire it. This means there would be motivation for start-up companies to protect their models. A mathematical formula (e.g. a regression equation) cannot be patented, and copyright probably won’t apply either; even if it did, it’s trivial to build a formula that does essentially the same thing (e.g. multiply all the weights in the formula by 10). This leaves only trade-secret protection, which means there is no recourse once the cat is out of the bag. Often it’s also the data-collection method that is kept secret: a company called Epagogix developed a method to judge the success of movies from a script by scoring it against some scales that they keep secret.
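To make the multiply-by-10 point concrete, here is a toy scoring rule with hypothetical weights and its rescaled twin: textually they are different formulas, yet they make identical decisions on every input, which is why copyright on the literal expression would protect nothing.

```python
# Two "different" scoring formulas: formula B is formula A with every
# weight and the cutoff multiplied by 10. The weights are made up.
weights_a, cutoff_a = [0.2, -1.5, 3.0], 1.0
weights_b, cutoff_b = [2.0, -15.0, 30.0], 10.0

def score(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def decision(weights, cutoff, x):
    # Accept/reject decision: score at or above the cutoff.
    return score(weights, cutoff and x if False else x) >= cutoff

# Scores differ by a factor of 10, but the decisions always agree.
applicants = [[1.0, 0.5, 0.3], [2.0, 1.0, 0.1], [0.0, 0.2, 0.9]]
for x in applicants:
    assert decision(weights_a, cutoff_a, x) == decision(weights_b, cutoff_b, x)
```

Any order-preserving transformation of the score (rescaling, adding a constant, taking a logarithm of a positive score) yields the same decisions, so there are infinitely many "distinct" formulas encoding one scoring rule.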
Currently, I don’t see any legal protection for this with the exception of trade secrets. And given that there are infinitely many ways to express the same scoring rules differently, it would be a fairly hard problem for lawyers and politicians to formulate sensible rules establishing protection for this kind of intellectual property.
Pretty interesting short film: http://www.youtube.com/watch?v=bd4Gpi9ksXw
I’ve followed the discussion and introduction of the GPL v3 for a bit. One major change in the license is supposed to close the loophole commonly referred to as the “tivoization” of GPL software, i.e. mechanisms that prevent people from tinkering with a product they bought that includes GPL software. TiVo, in particular, accomplishes this by requiring a valid cryptographic signature for the code to run – the user has access to the code, but it’s of no use. One of the main ideas of the GPL was to give people the freedom to tinker, improve, and understand how something works. This got me thinking a bit about software that uses machine learning techniques.
For the sake of the argument, let’s assume that somebody releases a GPL version of a speech recognition system, or say an improved version of a GPL speech recognition system. While the algorithms would be in the open for everyone to see, two major components of speech recognition systems, the Acoustic Model and Language Model, do not have to be. The Acoustic Model is created by taking a very large number of audio recordings of speech and their transcriptions (Speech Corpus) and ‘compiling’ them into statistical representations of the sounds that make up each word. The Language Model is a very large file containing the probabilities of certain sequences of words in order to narrow down the search for what was said.
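To make concrete how a language model is just probabilities derived from data, here is a toy bigram model; the corpus is obviously illustrative (real systems are trained on enormous corpora), but the point stands: the code is trivial, and everything interesting lives in the training data.

```python
from collections import Counter

# Toy bigram language model: the "model" is nothing but conditional
# probabilities estimated from a (here, tiny and made-up) training corpus.
corpus = "the cat sat on the mat the cat ran".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
unigrams = Counter(corpus[:-1])             # counts of words with a successor

def p_next(word, nxt):
    # Maximum-likelihood estimate of P(nxt | word) from the corpus.
    return bigrams[(word, nxt)] / unigrams[word]

print(p_next("the", "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once
```

Publishing this source without the corpus would let you recompile the program, but not retrain it, extend its vocabulary, or understand why it prefers one transcription over another — which is exactly the situation described above for acoustic and language models.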
A big part of how well a speech recognition system works depends on the training. The author who improved upon the software should publish the training set as well; otherwise people won’t be able to tinker with the system or understand why the software works well.
The same would hold for things like a handwriting recognition system. One could publish it along with a model (a bunch of numbers) that makes the recognition work. It would be pretty hard for somebody to reverse-engineer what the training examples were and how training was conducted to make the system work. Getting the training data is the expensive part of building speech-recognition and handwriting-recognition systems.
Think SpamAssassin – what if the authors suddenly decided to no longer make their training corpus available? How would users be able to determine the weights for the different tests?
I don’t think this case is covered by the GPL (v3 or older) – however, I’m not a lawyer. Somebody could include the model in C code (i.e. define the weights of a neural net as a bunch of consts) and then argue that everything needed to compile the program from scratch is included, as per the requirements of the license. However, the source by itself wouldn’t allow anybody to understand or meaningfully change what the program is doing. With the growing importance of machine learning methods, just being able to recompile something won’t be enough. I think this should be taken into consideration by the open source community for GPL v3.01.
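A sketch of what such a program would look like (in Python rather than C for brevity; the constants are made up, standing in for a real trained model): the source is complete and runnable, yet the baked-in weights reveal nothing about the data or procedure that produced them.

```python
import math

# A "fully open-source" classifier whose learned parameters are hard-coded
# as constants. Anyone can run or recompile it, but nobody can retrain it
# or explain why these particular numbers work. (Values here are invented.)
W = [0.731, -1.204, 0.058]
B = -0.42

def predict(x):
    # Single logistic unit: sigmoid of the weighted sum plus bias.
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1 / (1 + math.exp(-z))  # probability-like output in (0, 1)

print(predict([1.0, 1.0, 1.0]))
```

The letter of the license is satisfied — all "source" is there — but the spirit (the freedom to study and modify the program's behavior) is not, since the behavior lives in W and B, not in the code.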
A (long) while ago there was a lot of talk about the IRS doing text mining for people who did not report income they got from e.g. eBay internet auctions. The German version of this attempt, called XPIDER (see also here and there), was just shown to be totally ineffective. After years of trying and spending millions, they did not identify conclusive evidence to go after a single tax evader. The German GAO (Bundesrechnungshof) is trying to find out what went wrong, why, and who misspent all that money. I wonder if the more sophisticated web-crawling program the Canadian tax agency is using (it’s called XENON, as reported by Wired) is similarly effective…
I found a most insightful piece on copyright law on YouTube using the Amen break as an example.
I just read an article on the Wired blog titled “AI Cited for Unlicensed Practice of Law”, citing a court ruling upholding the decision that the owner, through the expert system he developed, had given unlicensed legal advice. While an expert system is a clear-cut case (the system always does exactly what it was told [minus errors in the rules]; it just follows given rules and draws logical conclusions), this becomes more interesting in cases in which the machine learns or otherwise modifies its behavior over time. For example, let’s say I put an AI program online that interacts with people and learns over time. Should I be held responsible if the program does something bad? What if I was not the person who taught it that particular behavior? This will probably be a topic that the courts will have to figure out in the future. For one, people should not be able to hide behind actions their computer has taken. But what if it is reasonably beyond the capability of the individual to foresee what the AI has done?
This will probably end up being the next big challenge for courts, just like the internet has been. It is interesting how the internet has created legal problems just by letting people communicate more easily with each other: think trademark issues, advertising restrictions for tobacco, or copyright violations (fair use differs from country to country; what is legal in one might be illegal in another)…
Update: And it just started. Check out this article: Colorado Woman Sues To Hold Web Crawlers To Contracts
As I enjoy going out with friends and mingling a lot, I have noticed a very interesting trend lately. Some of my single friends will go and chat up some woman they find attractive (sometimes with success), and if the woman is not interested, she will tend to show them a ring on her finger and tell them that she unfortunately is taken. So far, so good. I was out with some friends of mine. We were just talking and observed some gentleman asking a woman out right off the bat. She showed her ring and politely turned the guy down. Chris ended up talking to this lovely lady later – we all met as part of some group going out, and Chris and she had a longer conversation. After a long conversation and lots of laughter, she excused herself for a minute. When she came back from the bathroom, her ring was gone. He didn’t notice it at first, but as she suddenly became a lot more flirty, he asked her about it straight up. Her answer: “It’s a fake ring. Just to keep guys in bars from hitting on me.” They talked a bit more about this and that, and she started hinting more and more that she would be very interested in going on a date with Chris. He thought about it, and walked away. Chris told me later that he just does not like to date women who lie; trust and honesty are important to him. His reasoning: if she is willing to lie (or use little “white lies”) right from the start, how can one expect that it does not get worse over time? What if she uses little white lies to get out of every situation she does not want to be in?
What do we learn from this? A fake wedding ring might keep all the drunk losers away, but it will possibly confuse or scare off Mr. Right when he comes along. I’ve heard similar stories from a couple of other guys; it just does not go over well.