License Key Copy Protection

February 2nd, 2008

Long before I had a blog, I wrote about doing copy protection the right way (read: so that it takes some effort to remove). With the more recent programming frameworks a couple of things have changed. For one, all the new frameworks (.NET, Java) have cryptography support, which makes it far simpler to build a license key scheme based on digital signatures. Personally I like using 512-bit DSA (Digital Signature Algorithm) for this purpose, because the key is long enough to stop amateurs from computing the secret key, and the signature is short enough that, encoded as Base32, it can still be typed in by someone. In the software you obviously include only your public key, for verifying the signature of the license.
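To get a feeling for how long such a key is to type: a DSA signature with a 160-bit subgroup consists of two 160-bit numbers (r and s), i.e. 40 bytes, which Base32 turns into exactly 64 characters. The sketch below only shows the encoding step. It assumes the signature is the raw concatenation of r and s rather than a DER-wrapped encoding, the function names are made up for this illustration, and the placeholder bytes stand in for a real signature produced by whatever crypto library you use.

```python
import base64

SIGNATURE_BYTES = 40  # r and s, 160 bits each (assumes a raw r || s encoding)


def signature_to_license_key(signature: bytes, group: int = 8) -> str:
    """Encode a raw 40-byte signature as dash-separated Base32 groups."""
    if len(signature) != SIGNATURE_BYTES:
        raise ValueError("expected a 40-byte raw signature")
    # 40 bytes = 320 bits = 64 Base32 characters, so there is no '=' padding.
    text = base64.b32encode(signature).decode("ascii")
    return "-".join(text[i:i + group] for i in range(0, len(text), group))


def license_key_to_signature(key: str) -> bytes:
    """Undo the grouping; tolerate lower case, spaces and dashes from typing."""
    cleaned = key.replace("-", "").replace(" ", "").upper()
    return base64.b32decode(cleaned)


# Placeholder bytes, NOT a real signature.
placeholder_signature = bytes(range(SIGNATURE_BYTES))
key = signature_to_license_key(placeholder_signature)
print(key)
print(len(key.replace("-", "")))  # 64 characters to type
```

The decoded bytes would then be handed to the framework's signature verification together with the public key and the licensed name.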

One issue with byte-code languages is that they are fairly trivial to disassemble (and the result is very human-readable code). Microsoft even ships an MSIL Disassembler with .NET. Therefore the code needs to be obfuscated. One thing to try (not all obfuscators support this) is to make the names as human-unfriendly as possible. Renaming classes to “A” and “B” is nice, but renaming them to “XfGkoAlPPqzz” and “XfGkoBlPPqzz” makes the code even harder to read.

With a signature-based license key the hackers have only a few options left, since they can’t write a key generator. For one, they could patch a different public key into the software, one for which they know the private key, and generate signatures with this new key. It’s therefore important to use a hash of the real public key in other places in the program as well. For example, one could hash the public-key string with SHA/MD5 and use the result as the key for decrypting some data with a symmetric cipher such as AES. Another idea is to put the hash (or a checksum computed from some data and the public key) into saved user data in order to prevent data exchange between cracked and legitimate versions of the software.
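A minimal sketch of tying other parts of the program to the embedded verification key (the only key the shipped software contains), using only the Python standard library. The key bytes, the function names and the choice of SHA-256 and HMAC are assumptions made for this illustration. The AES step itself is left out because it needs a third-party library; the derived 32 bytes are simply what one would pass to it as the key.

```python
import hashlib
import hmac

# Hypothetical stand-in for the verification key embedded in the program.
REAL_KEY = b"hypothetical embedded verification key"


def derive_data_key(embedded_key: bytes) -> bytes:
    """Hash the embedded key; the 32-byte result can serve as an AES-256 key."""
    return hashlib.sha256(embedded_key).digest()


def tag_user_data(embedded_key: bytes, user_data: bytes) -> str:
    """Checksum over saved user data that depends on the embedded key."""
    return hmac.new(derive_data_key(embedded_key), user_data, hashlib.sha256).hexdigest()


def user_data_is_genuine(embedded_key: bytes, user_data: bytes, tag: str) -> bool:
    """True only if the file was written by a copy with the same embedded key."""
    return hmac.compare_digest(tag_user_data(embedded_key, user_data), tag)


saved = b"document contents"
tag = tag_user_data(REAL_KEY, saved)

# A copy with a patched-in key derives a different key, so it can neither
# decrypt the protected data nor read or write compatible files.
patched_key = b"key patched in by a cracker"
print(user_data_is_genuine(REAL_KEY, saved, tag))
print(user_data_is_genuine(patched_key, saved, tag))
```

The point of the design is that replacing the embedded key no longer touches just one spot: everything derived from its hash breaks along with it.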

The second option is to find where the accept/reject decision is made in the program and to patch that comparison. Note that API calls to the framework are fairly easy to find, even in obfuscated code. You should therefore keep the API calls (for the crypto functions as well as for displaying any kind of error message like “invalid license key”) far away from the comparison operation, so the hackers have to dig through more code.

Also, I found that using a command pattern makes for fairly unintelligible byte-code. Consider creating a class that encapsulates a command and has some Execute() method that can also return which command to execute next. If you put a couple of such commands in a List and execute them in a loop by invoking their respective Execute() methods, the control flow becomes a bit harder to follow. Consider writing commands that can change entries in some global hash table of strings, an If-Command, a Goto-Command, a MessageBox-Command, a Verify-Signature-Command etc. With all those implemented, you can encode a little program by putting commands in a list. The important thing now is to encode the accept/reject decision (and one or two important parts of your software unrelated to the license key) using the command pattern, i.e. to write little programs in your “command-pattern programming language”. The If-Command used for the accept/reject decision is then the same spot where every other condition in that code is tested, and can therefore not be patched without destroying other parts of the program. This forces the hackers to understand your command pattern and to figure out which bytes they have to change to turn this into an unconditional jump. I found this fairly easy, clean and straightforward to program (and debug) in a high-level language, yet fairly hard to understand in obfuscated byte-code.
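The command-pattern idea could look roughly like the following sketch. It is written in Python for brevity rather than in a byte-code language, all class names are made up for this illustration, the message command just stores its text instead of opening a dialog, and `fake_verify` is a stand-in for the real signature check, which would call into the crypto framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]  # the global hash table of strings


class Command:
    def execute(self, state: State, pc: int) -> int:
        """Run the command and return the index of the next command."""
        raise NotImplementedError


@dataclass
class VerifySignatureCommand(Command):
    verify: Callable[[str], bool]

    def execute(self, state: State, pc: int) -> int:
        ok = self.verify(state.get("license_key", ""))
        state["license_ok"] = "yes" if ok else "no"
        return pc + 1


@dataclass
class IfCommand(Command):
    key: str
    expected: str
    target: int  # jump here when state[key] == expected

    def execute(self, state: State, pc: int) -> int:
        return self.target if state.get(self.key) == self.expected else pc + 1


@dataclass
class GotoCommand(Command):
    target: int

    def execute(self, state: State, pc: int) -> int:
        return self.target


@dataclass
class MessageCommand(Command):
    text: str

    def execute(self, state: State, pc: int) -> int:
        state["message"] = self.text  # stand-in for a message box
        return pc + 1


def run(program: List[Command], state: State) -> State:
    """Execute commands until the program counter leaves the list."""
    pc = 0
    while 0 <= pc < len(program):
        pc = program[pc].execute(state, pc)
    return state


def fake_verify(key: str) -> bool:
    # Assumption for the demo only; NOT a real signature verification.
    return key == "VALID-DEMO-KEY"


def build_program() -> List[Command]:
    return [
        VerifySignatureCommand(fake_verify),           # 0
        IfCommand("license_ok", "yes", 4),             # 1: accept/reject decision
        MessageCommand("invalid license key"),         # 2
        GotoCommand(5),                                # 3: skip the accept branch
        MessageCommand("thank you for registering"),   # 4
    ]


print(run(build_program(), {"license_key": "VALID-DEMO-KEY"})["message"])
print(run(build_program(), {"license_key": "guess"})["message"])
```

Because every condition in such a program goes through the single comparison inside `IfCommand.execute`, patching that comparison would change every other branch as well, which is the effect described above.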

Registering Domains with Network Solutions

January 8th, 2008

After reading this article on Slashdot about NSI immediately registering every free domain that is searched for on their site, I went ahead and tried it myself. Indeed, seconds after I searched for two random domain names, both were registered (or locked). They even put a domain-parking page on them. Since this is all fully automated, I can’t help but wonder what would happen if somebody were to search for all sorts of trademarked names, especially ones from companies that are fairly aggressive in suing for trademark infringement. I wonder if they thought about that …

Joe-job …

January 7th, 2008

Seems like I’m having a good start to the new year: some spammers are spoofing from-addresses from one of my domains. 1500 bounces is probably just the beginning … 🙁

Back from NIPS 2007

December 11th, 2007

Just got back home from NIPS. The following papers I found pretty interesting:

  • Random Features for Large-Scale Kernel Machines
  • On Ranking in Survival Analysis: Bounds on the Concordance Index
  • Efficient Inference for Distributions on Permutations

I spent the workshop time in the Learning Problem Design workshop and in the Security workshop. I also dropped by “statistical networks” briefly, but there’s room for improvement in my current understanding of Gibbs sampling and the like. The consensus in the problem design workshop seemed to be that machine learning must become more modular. There was also agreement that applying machine learning in the real world requires some magic for transforming the problem into “features” and some more magic for transforming the prediction into something useful. That was stating the obvious a bit, but not much progress has been made on making ML more accessible in this respect. I wrote about choosing the right features before, but currently it’s more of an art than a science. One thing I took from the security workshop was that features must be cheap to construct (most detection apps must run in real time). This means we are interested in features the attacker can hardly influence (think of Received headers in spam emails, which cannot be suppressed), yet they must be easy to compute.

Also really cool were the “NIPS Elevator Process” joke paper about hungry scientists on the way to lunch (don’t confuse it with the Chinese restaurant process) and the party crashers at the Gatsby party. Sophie and some friends of hers simply joined the fun. The fun part was people taking her random answers about her research topic seriously 🙂 I got mistaken for one of the party crashers at one point, because my clothing didn’t fit in. I was actually planning on hitting a club in downtown Whistler, but didn’t get around to going in the end…

GMail Logout Strangeness

November 24th, 2007

I’m using many of the services Google has to offer, GMail being one of them. I’ve noticed a couple of times now that when I log out from Google’s single sign-on and then go back to GMail (by typing in the URL, not via the back button), I’m still logged in there, even though the Google main page and all of the other services show me as logged out. I can even access all sorts of old email, so it’s not some strange cache issue. I can’t quite reproduce it reliably, but it happens somewhat frequently.

I’m wondering whether Firefox does something strange in the way it clears cookies, or whether Google uses an extra authentication cookie for GMail that is not always deleted.

Artificial Addition (Overcoming BIAS)

November 23rd, 2007

I found the following article interesting: http://www.overcomingbias.com/2007/11/artificial-addi.html

KMail and GPG integration in Ubuntu (117440523 gpgme_op_decrypt_verify)

October 9th, 2007

After installing the various gpg-agent packages (gpgsm, gpgagent etc.) and still having no luck, a simple “sudo apt-get install pinentry-qt” did the trick (it installs the password-entry dialog). Note that you have to start the gpg-agent manually (eval `gpg-agent --daemon`) before starting KMail.

The GPL and Machine Learning Software – Should the GPL cover training data?

October 1st, 2007

I’ve followed the discussion and introduction of the GPL v3 for a bit. One major change in the license is supposed to close the loophole commonly referred to as the “tivoization” of GPL software, i.e. mechanisms that prevent people from tinkering with the product they bought which includes GPL software. Tivo, in particular, accomplishes this by requiring a valid cryptographic signature for the code to run – the user has access to the code, but it’s of no use. One of the main ideas of the GPL was to allow people the freedom to tinker, improve and understand how something works. This got me thinking a bit about software that uses machine learning techniques.

For the sake of the argument, let’s assume that somebody releases a GPL version of a speech recognition system, or say an improved version of a GPL speech recognition system. While the algorithms would be in the open for everyone to see, two major components of speech recognition systems, the Acoustic Model and Language Model, do not have to be. The Acoustic Model is created by taking a very large number of audio recordings of speech and their transcriptions (Speech Corpus) and ‘compiling’ them into statistical representations of the sounds that make up each word. The Language Model is a very large file containing the probabilities of certain sequences of words in order to narrow down the search for what was said.

A big part of how well the speech recognition system works relies on the training. The author who improved upon the software should publish the training set as well. Otherwise people won’t be able to tinker with the system or understand why the software works well.

The same would hold for things like a handwriting recognition system. One could publish it along with a model (a bunch of numbers) that makes the recognition work. It would be pretty hard for somebody to reverse engineer what the training examples were and how training was conducted to make the system work. Getting the training data is the expensive part of building speech-recognition and handwriting-recognition systems.

Think Spam-Assassin – what if the authors suddenly decide to not make their training corpus available anymore? How would users be able to determine the weights for the different tests?

I don’t think this case is covered by the GPL, v3 or older (however, I’m not a lawyer). Somebody could include the model in C code (i.e. define the weights of a neural net as a bunch of consts) and then argue that everything needed to compile the program from scratch is included, as the license requires. However, the source by itself wouldn’t allow anybody to understand or change (in a meaningful way) what the program is doing. With the growing importance of machine learning methods, just being able to recompile something won’t be enough. I think this should be taken into consideration by the open source community for GPL v3.01.
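To make the point concrete, here is a toy sketch of what “the model as a bunch of consts” can look like (in Python rather than C, to keep it short). The weights, the feature meanings and the function name are invented for this illustration and do not come from any real system.

```python
# Hypothetical trained weights, shipped as plain constants in the source.
WEIGHTS = (0.8, -1.3, 0.45)
BIAS = -0.2


def is_spam(features) -> bool:
    """Linear classifier: complete source code, yet the numbers explain nothing."""
    score = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return score > 0.0


print(is_spam((1, 0, 0)))
print(is_spam((0, 1, 0)))
```

Everything needed to build and run the program is there, but nothing in the file says which examples produced these numbers or how one would retrain them.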

I passed my PhD defense

September 19th, 2007

Hurray…

Advertising and Data Mining

August 30th, 2007

Lately I’ve become fascinated with the field of Advertising and Marketing. Honestly, I’ve never paid much attention to ads before, as most of them were either not interesting to me or just plain horrible. I just finished reading a good book about the choice of headlines, direct marketing, and how one actually tests what gets the most responses (“Tested Advertising Methods”, John Caples).

There’s obviously a lot that could be done with machine learning. For example, some kind of predictor that learns from previous ads (or possibly over ads from various companies) how successful they have been and then predicts return rates for new ads. Google and Yahoo probably do something like this already …