- August 1, 2008 10:25 pm: ISC on the Future of Anti-Virus Protection
ISC on the Future of Anti-Virus Protection
An article on the Internet Storm Center discusses whether Anti-Virus software in its current state is a dead end. In my opinion it has been dead for quite a while now. Apart from the absolutely unusable state that anti-virus software is in, I think it’s protecting the wrong things. Most attacks (trojans, spyware) nowadays come through web-browser exploits and maybe instant messengers (see reports on ISC). So instead of scanning incoming emails, how about a behavior blocker for the web browser and the instant messenger? There are a couple of freeware programs (e.g. IEController [German]) out there that successfully put Internet Explorer, etc. into a sandbox; whatever JavaScript exploit is used - known or unknown - the browser won’t be able to execute arbitrary files or write outside its cache directory. Why is there nothing like that in the commercial AV packages?
However, a few possibilities suggested in the article might be worth exploring. For example, they suggest Bayesian heuristics to identify threats, and machine learning techniques in general look like a promising direction. IBM AntiVirus (maybe not the current version anymore) has been using neural networks on 4-byte sequences (n-grams) for boot-sector virus detection.
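To make the n-gram idea concrete, here is a minimal sketch of turning raw bytes into 4-byte n-gram features that a classifier could consume. It is not IBM's actual method; the function names, the sample bytes and the two-entry vocabulary are all made up for illustration.

```python
from collections import Counter

def byte_ngrams(data, n=4):
    """Count every overlapping n-byte sequence in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def to_feature_vector(counts, vocabulary):
    """Relative frequency of each vocabulary n-gram (0.0 if it never occurs)."""
    total = sum(counts.values()) or 1  # avoid dividing by zero on tiny inputs
    return [counts[gram] / total for gram in vocabulary]

# Made-up bytes for illustration, not a real boot sector.
sample = bytes.fromhex("eb3c904d53444f53eb3c90")
counts = byte_ngrams(sample)

# Hypothetical vocabulary; in practice it would be chosen from training data.
vocabulary = [bytes.fromhex("eb3c904d"), bytes.fromhex("deadbeef")]
features = to_feature_vector(counts, vocabulary)
print(features)
```

The resulting vector is what would be fed to a neural network or a Bayesian classifier; choosing which n-grams go into the vocabulary is a separate feature-selection problem.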
A couple of things to keep in mind, though:
- Quality of the classifier (detection rate) should be measured with Area-under-ROC-Curve (AUC), not error rate as most people tend to do in spam-filter comparisons. The base rate of the “non-virus” class is pretty high; I have over 10,000 executables/libraries on my Windows machine, all (most?) of them non-malicious. With a base rate like that, a classifier that never flags anything still gets a very low error rate.
- The tricky part is the feature extraction. While sequences of bytes or strings extracted from a binary might be a good start, advanced features like call graphs or imported API calls should be used as well. Extracting them is time-consuming, especially when it has to be done for different types of executables (Windows scripts, x86 EXE files, .NET files etc.). De-obfuscation techniques, just like in the signature-based scanners, will probably be necessary before the features can be extracted.
- Behavior blocking and sandboxes are probably easier, a better short-term fix, and more proactive. This has been my experience with email-based attacks as well, back in the Mydoom days when a special MIME type auto-executed an attachment in Outlook. Interestingly, there are only two programs out there that sanitize emails (check MIME types, headers, rename executable attachments etc.) at the gateway level - a much better proactive approach than simply detecting known threats. The first is MIMEDefang, a sendmail plugin. The other is impsec, based on procmail. CU Boulder was using impsec to help keep students’ machines clean (there were scalability issues with the procmail solution, though).
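The point about AUC versus error rate can be illustrated with a small sketch. The data below is entirely hypothetical (990 clean files, 10 malicious), and both "scanners" are invented: one never flags anything, the other ranks most malicious files above clean ones but raises 30 false alarms.

```python
def error_rate(labels, scores, threshold=0.5):
    """Fraction of files misclassified when flagging scores >= threshold."""
    wrong = sum((s >= threshold) != bool(y) for y, s in zip(labels, scores))
    return wrong / len(labels)

def auc(labels, scores):
    """Probability that a random malicious file scores above a random
    clean one (ties count half); equivalent to the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test set: 990 clean files (0) followed by 10 malicious ones (1).
labels = [0] * 990 + [1] * 10

# Scanner 1 never flags anything.
always_clean = [0.0] * 1000

# Scanner 2 scores 30 clean files above the malicious ones (false alarms),
# the remaining clean files low, and all malicious files high.
ranker = [0.95] * 30 + [0.1] * 960 + [0.9] * 10

for name, scores in [("always_clean", always_clean), ("ranker", ranker)]:
    print(name, error_rate(labels, scores), round(auc(labels, scores), 4))
```

On this made-up data the error rate prefers the scanner that detects nothing (0.01 versus 0.03), while the AUC shows it is no better than chance (0.5 versus roughly 0.97).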
August 5, 2008 at 5:08 am
As an aside: Would you be suggesting AUC for evaluating performance on ill-balanced test sets in general? How does this information differ from the common 2×2 matrix of classified vs true labels?
August 6, 2008 at 12:11 am
Yes, in my opinion the AUC best represents the true performance on ill-balanced datasets. It summarizes it all in one number - how do you decide which model is better when comparing 2×2 matrices?
A nice summary of all sorts of measures (AUC, Brier scores etc.) and of how to determine how well a model actually works can be found here: http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.html#Methods%20for%20dichotomous%20forecasts
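As a rough sketch of how the 2×2 matrix and the AUC relate: each matrix is a snapshot at one score cutoff, while the ROC curve collects the snapshots at every cutoff and the AUC is the area under it. The six labels and the two models' scores below are invented for illustration.

```python
def confusion(labels, scores, threshold):
    """2x2 matrix as (tp, fp, fn, tn) when flagging scores >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if flagged and y:
            tp += 1
        elif flagged:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def roc_points(labels, scores):
    """(false-positive rate, true-positive rate) at every distinct cutoff."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp, fp, _, _ = confusion(labels, scores, t)
        points.append((fp / neg, tp / pos))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical data: two positives, four negatives, two competing models.
labels = [1, 1, 0, 0, 0, 0]
model_a = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
model_b = [0.7, 0.6, 0.8, 0.5, 0.2, 0.1]

for name, scores in [("A", model_a), ("B", model_b)]:
    print(name, confusion(labels, scores, 0.5),
          trapezoid_auc(roc_points(labels, scores)))
```

At a cutoff of 0.5 neither matrix clearly wins (model A misses one positive, model B raises one more false alarm), and the picture changes again at other cutoffs; the AUC compares the two models over all cutoffs at once.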