CAPTCHAs – Not dead

August 14th, 2008

I recently attended a talk where the authors claimed that CAPTCHA technology (the squiggly letters they make you type in whenever you sign up for anything) is dead and defeated. I disagree. In the talk they demonstrated how to break a couple of “home-brew” CAPTCHA implementations they found on the internet. Most of them were – not surprisingly – not very good. I think this is almost comparable to people inventing their own encryption algorithms.

All the implementations they broke were either insecure (accepting a solution several times, hiding the answer in an invisible form field, etc.) or simply rendered numbers in images with little or no distortion. The audio CAPTCHA they broke was simply reading out numbers with a little bit of clicking noise in the background. Those are all very simple cases. What is supposed to make “real” CAPTCHAs hard is that they are hard to segment – compare the phpBB CAPTCHA with the one from Yahoo. In the latter you will have problems separating the letters before you can even run your OCR. A good audio CAPTCHA overlays music, chatter or other noise that is hard to separate from the code being read.

Just as with home-brew cryptography, it is probably a good idea to use an established technology (think reCAPTCHA) that was made by people with a background in OCR. Edit: Ars Technica has a nice recap of how well the reCAPTCHA project is coming along.

ISC on the Future of Anti-Virus Protection

August 1st, 2008

An article on the Internet Storm Center discusses whether anti-virus software in its current state is a dead end. In my opinion it has been dead for quite a while now. Apart from the absolutely unusable state that anti-virus software is in, I think it is protecting the wrong things. Most attacks (trojans, spyware) nowadays come through web-browser exploits and maybe instant messengers (see the reports on ISC). So instead of scanning incoming emails, how about a behavior blocker for the web browser and the instant messenger? There are a couple of freeware programs (e.g. IEController [German]) out there that successfully put Internet Explorer etc. into a sandbox; whatever JavaScript exploit hits it – known or unknown – the browser won’t be able to execute arbitrary files or write outside its cache directory. Why is there nothing like that in the commercial AV packages?

However, a few possibilities suggested in the article might be worth exploring, for example Bayesian heuristics to identify threats. Machine learning in general seems like a promising direction: IBM AntiVirus (maybe not the current version anymore) has used neural networks over 4-byte sequences (n-grams) for boot-sector virus detection.
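To make the n-gram idea concrete, here is a minimal sketch in Python (the file name is hypothetical, and this is not how IBM's system actually worked – just the general feature-extraction pattern): count overlapping 4-byte sequences in a binary and keep the most frequent ones as features for whatever classifier you like.

from collections import Counter

def byte_ngrams(path, n=4, top_k=500):
    # Count overlapping n-byte sequences in a binary file and return
    # the top_k most frequent ones as a sparse feature dictionary.
    with open(path, "rb") as f:
        data = f.read()
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return dict(counts.most_common(top_k))

# Hypothetical usage: in practice these dictionaries would be vectorized
# and fed to a classifier (naive Bayes, SVM, neural network, ...).
features = byte_ngrams("C:/Windows/notepad.exe")
print(len(features))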

A couple things to keep in mind, though:

  • Quality of the classifier (detection rate) should be measured with the Area under the ROC Curve (AUC), not the error rate that most people tend to use in spam-filter comparisons. The base rate of the “non-virus” class is very high; I have over 10,000 executables/libraries on my Windows machine, all (most?) of them non-malicious. (See the sketch after this list.)
  • The tricky part is the feature extraction. While sequences of bytes or strings extracted from a binary might be a good start, more advanced features like call graphs or imported API calls should be used as well. This is laborious and time-consuming, especially when it has to be done for different types of executables (Windows scripts, x86 EXE files, .NET files etc.). De-obfuscation techniques, just like in the signature-based scanners, will probably be necessary before the features can be extracted.
  • Behavior blocking and sandboxes are probably easier, a better short-term fix, and more pro-active. That matches my experience with email-based attacks back in the Mydoom days, when a special MIME type auto-executed an attachment in Outlook. Interestingly, there are only two programs I know of that sanitize emails (check MIME types and headers, rename executable attachments etc.) at the gateway level – a much better pro-active approach than simply detecting known threats. The first is Mimedefang, a sendmail plugin. The other is impsec, based on procmail. CU Boulder was using impsec to help keep students’ machines clean (there were scalability issues with the procmail solution, though).
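To illustrate the AUC point from the first bullet, here is a minimal sketch (Python with numpy/scipy; the class sizes are made up): with 10,000 benign files and 100 malicious ones, a “detector” that flags nothing has an error rate of about 1%, which sounds great – its AUC of 0.5 shows it is no better than guessing.

import numpy as np
from scipy.stats import rankdata

def auc(y_true, scores):
    # Area under the ROC curve via the Mann-Whitney U statistic;
    # average ranks (the rankdata default) handle tied scores.
    ranks = rankdata(scores)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Hypothetical, heavily imbalanced data set: 10,000 benign, 100 malicious.
y = np.array([0] * 10000 + [1] * 100)
flag_nothing = np.zeros(len(y))  # scores everything as benign

print("error rate:", (y != (flag_nothing > 0.5)).mean())  # about 0.01
print("AUC:", auc(y, flag_nothing))                       # 0.5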

The cloud obscuring the scientific method

July 12th, 2008

“All models are wrong, and increasingly you can succeed without them” — Peter Norvig, as quoted in Wired (updating George Box’s maxim)
“Sometimes…” — Me

In a Wired article about the petabyte age of data processing, the author claimed that given the enormous amounts of data and the patterns found by data mining, we are less and less dependent on scientific theory. This has been strongly disputed (see Why the cloud cannot obscure the Scientific Method), since the author simply ignores the fact that the patterns that are found are not necessarily exploitable – finding a group of genes that interact is a first step, but it won’t cure cancer. However, in machine translation or in placing advertising online one can succeed with little to no domain knowledge. That is, once somebody comes up with the right features to use (see Choosing the right features for Data Mining).

What would be interesting to develop, however, is a “meta-learning” algorithm that can abstract from simpler models and learn, for example, a differential equation. Let’s take data from several hundred physics experiments about heat distribution, conducted on different surfaces etc. We can probably learn a regression model for one particular experiment which predicts how the heat will distribute given the parameters of the experiment (material, surface etc.). The meta-learning algorithm would then look at these models and somehow come up with the heat equation. That would be something…
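Just to make the target concrete (this is the textbook form of the equation, stated here only for reference – no such meta-learner exists yet), the algorithm would ideally recover something like

\frac{\partial u}{\partial t} = \alpha \, \nabla^2 u

where u(x, t) is the temperature field and α is the thermal diffusivity of the material.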

Debugging and Evaluating Predictive Models

June 22nd, 2008

The recent revelation that Moody’s mis-rated CPDOs due to a computer glitch (see also here) shows how tricky it is to get models right from construction to production. Interestingly, S&P, which gave identical ratings to many of them, stands by its independently developed model, which makes me wonder how thoroughly models are tested in practice. It also made me wonder whether I would catch such a bug while moving a model into production.

Most of the data-mining and modeling work is done in your favorite stats package (SPSS, SAS etc.), but for production use people have to re-implement whichever equation they came up with to produce their risk scores. When I’m doing data-mining work I often use different tools from different authors or vendors to get the job done, as no single tool can do everything I need. For example, I have had a lot of success with predictive modeling using Support Vector Machines, and some newer algorithms I have implemented in Matlab myself. That means I spend a lot of time converting data back and forth between different software packages (e.g. how a missing value is indicated). I usually do this with some Perl scripts I wrote, but the entire process is error-prone, especially given that the final model then has to be codified in some other language (Java, C, C#/.NET or whatever) so it can be incorporated into the project’s software. It takes a while to get it right, because more often than not an error is not obvious (read: I have had bad experiences with subtle errors surviving black-box testing). The following is my checklist for debugging the process (probably not complete enough to catch everything):

  • Do the results mean what you think they mean? What values for classification are good or bad? (big/small scores, or +1/-1 …)
  • Features: when exporting/importing the data, is the order of the features the same? Is the classification label in the right place? This is a lot of fun when you export from SPSS/R/STATA to Matlab (which does not support named columns in a matrix – better get all those indexes right)
  • Were missing values treated the right way when building the model? There are many ways to deal with them, and the missingness mechanism matters: MAR (“missing at random”), MCAR (“missing completely at random”), NMAR (“not missing at random”), non-response, single vs. multiple imputation etc.
  • Did I deal with special values correctly? I’m not talking about the NULL value in the database, but “flag-values” such as 999 etc. to indicate a certain condition
  • When exporting/importing data, are special values (e.g. missing, flag and categorical values) handled correctly? Every program encodes them differently, especially when you go through comma-separated text (CSV)
  • Is the scaling of the data the same? Are the scaled values of the new data larger/smaller than what the classifier was trained on? How are these cases handled?
  • Are the algorithm parameters the same, e.g. a kernel-sigma in a support vector model?
  • Is the new data from the same distribution? (When I hear “similar”, it usually does not work, in my experience 🙂 ) Check mean and variance for a start. Sometimes the difference can be subtle (e.g. a male-only model applied to females; this can work, depending on what was modeled). Was the data extracted the same way, with the same SQL query? Was the query translated correctly into SQL? Was the result prepared the same way (recodes, scaling)?
  • In my code I check for each attribute whether it is within the range of my training set. If not, then it’s either a bug (scaling?) or the model can’t reliably predict for this case (see the range-check sketch after this list).
  • Run some simple test cases, computed with your stats package, through your production code. I have had a lot of success with white-box tests for recode tables etc. (a small sketch of this follows below as well).
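Here is a minimal sketch of the range check mentioned above (Python; the variable and feature names are made up, and my production code differs): remember the per-feature minimum and maximum seen during training and flag any new record that falls outside them.

import numpy as np

def fit_ranges(X_train):
    # Remember per-feature min/max of the training data.
    return X_train.min(axis=0), X_train.max(axis=0)

def out_of_range(x_new, ranges, feature_names):
    # Return the names of all features in a new record that fall outside
    # the training range -- a hint at a scaling/export bug, a stray flag
    # value (999 ...), or a case the model cannot be trusted on.
    lo, hi = ranges
    bad = (x_new < lo) | (x_new > hi)
    return [name for name, b in zip(feature_names, bad) if b]

# Hypothetical usage
X_train = np.array([[0.1, 20.0], [0.9, 35.0], [0.5, 28.0]])
ranges = fit_ranges(X_train)
print(out_of_range(np.array([0.4, 999.0]), ranges, ["score", "age"]))  # ['age']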
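And a sketch of the test-case idea (again Python, with a toy linear model standing in for the real one): export a handful of records together with the scores your stats package produced, and assert that the production re-implementation reproduces them.

import numpy as np

def production_score(record):
    # Hypothetical production re-implementation of the model
    # (a toy linear score, just for illustration).
    return 0.3 * record[0] - 1.2 * record[1] + 0.5

# Reference cases scored in the stats package, exported with their inputs.
reference_cases = [
    (np.array([1.0, 0.0]), 0.8),
    (np.array([0.0, 1.0]), -0.7),
    (np.array([2.0, 3.0]), -2.5),
]

for x, expected in reference_cases:
    assert abs(production_score(x) - expected) < 1e-6, (x, expected)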

As for model evaluation, I have read some reports in the past (not financial scorings, though) where testing was done on the training set. Obviously model quality should be assessed on a hold-out data set that has not been used for training or parameter tuning. Model quality in the machine learning community is still often evaluated using error rate, but lately the Area under the Receiver Operating Characteristic curve has become popular (often abbreviated as AuROC, AUROC, AROC or ROC), which I have found to be especially useful for imbalanced datasets. In meteorology a lot of thought has been put into evaluating and comparing the performance of different predictive models; Wilks cost curves and Brier skill scores look really interesting. Some models, although the predictor is trained on a dichotomous variable, are really predicting some risk over time – and should be evaluated using survival analysis (e.g. higher risk scores should lead to sooner failure etc.). In survival analysis a different version of the AuROC is used, called the concordance index. I’ll post some of my thoughts on all these evaluation scores some time in the future.
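Since the Brier score came up, it is simple enough to state here (a small sketch; using the climatological base rate as the reference forecast for the skill score is a common choice, but not the only one):

import numpy as np

def brier_score(p, o):
    # Mean squared difference between predicted probabilities p and
    # binary outcomes o (0/1); lower is better, 0 is a perfect forecast.
    return np.mean((p - o) ** 2)

def brier_skill_score(p, o):
    # Skill relative to always forecasting the base rate of o:
    # 1 = perfect, 0 = no better than the reference, < 0 = worse.
    reference = np.full_like(p, o.mean())
    return 1.0 - brier_score(p, o) / brier_score(reference, o)

# Hypothetical predictions and observed outcomes
o = np.array([1, 0, 0, 1, 0], dtype=float)
p = np.array([0.8, 0.2, 0.1, 0.6, 0.3])
print(brier_score(p, o), brier_skill_score(p, o))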

Cult of the Amateur

May 21st, 2008

I just read the book Cult of the Amateur and quite frankly disagree with a lot written in it. To make a long story short, the author’s world is one where there is a single transcendental reality: a truth purveyed by trained experts, journalists and professionals. Apparently the danger of dissolution is upon us through the radically relativistic truths of Wikipedia, where the community sets the agenda. I think things are far less black and white than he makes them seem. One particular example I very much disagree with is his criticism of recommender systems. The author claims that nobody will need to read movie reviews anymore once AI systems make the recommendations. I have to say that so far I have found movie ratings by experts – be it ratings on IMDB or a professional review in a newspaper – totally useless. I’ve liked films with low ratings, and hated others with high ratings. I have discovered so many things that I liked through recommender systems like Pandora. Yes, there are problems with recommender systems, but relying on an expert opinion for matters of taste can in my opinion never work out. People simply have different tastes, and I don’t think that one reviewer writing movie reviews for a particular newspaper speaks for the entire readership. For another wonderful example, using spaghetti sauce as an illustration, see the TED talk by Malcolm Gladwell. That said, there are some points in the book that deserve consideration. Anybody who has ever read through the comments (“noise”) on YouTube knows that there seems to be a mass infestation of stupidity out there – something that needs to be taken into account in all the Web 2.0 experiments. Stupidfilter anyone? 🙂

ART OF SEDUCTION: Not Pretty, Really

April 21st, 2008

Pretty interesting short-film: http://www.youtube.com/watch?v=bd4Gpi9ksXw

“Internal Server Error” when converting phpBB v2 to phpBB v3

March 25th, 2008

I’m hosting a little phpBB installation and had some problems with the conversion script that comes with phpBB for converting a forum from v2 to v3. It seems that the PHP timeout values at 1and1 are set too conservatively. I found that adding the following lines to the “install/install_convert.php” file does the trick (credit for this trick):

// Lift PHP's execution-time, memory and upload limits so the long-running
// conversion does not get cut off (some hosts may still ignore a few of these).
@set_time_limit(0);
@ini_set('memory_limit', '256M');
@ini_set('upload_max_filesize', '128M');
@ini_set('post_max_size', '256M');
@ini_set('max_input_time', '-1');
@ini_set('max_execution_time', '-1');
@ini_set('expect.timeout', '-1');
@ini_set('default_socket_timeout', '-1');

Having the conversion script reload also seems to help a bit…

Firewire and DRM

March 6th, 2008

An old security vulnerability in the Windows FireWire implementation has resurfaced. I wonder how long it will take until Microsoft figures out that if one can read/write arbitrary memory (including kernel memory), then one can probably also find the encryption keys to “secured media content” …

Using Psychological Domain Knowledge for the Netflix Challenge

February 28th, 2008

I read an article today about using psychological domain knowledge to improve recommender-system predictions. A very interesting idea…

VPN Tunnels from within VMWare (Windows XP and GRE weirdness)

February 12th, 2008

I was playing around with the VMware Player and a Windows XP image, trying to establish a VPN connection with Microsoft’s VPN client. It connected just fine and then got stuck at “Verifying Username and Password”. After a while it aborted with a time-out error (was it error 638 or 721?). It turns out that GRE (Generic Routing Encapsulation) does not deal well with multiple network address translations (in my case the VMware NAT network plus my DSL router). It worked once I switched the VM to bridged networking. This took me a couple of hours to figure out…