Archive for the ‘Coding / Programming’ Category

Programs stealing the input focus

Sunday, June 7th, 2009

Ok, this post is more of a rant. I’m one of those people that are a bit impatient when starting a program on my desktop. When I start up my Windows machine I click on several buttons in the “quicklaunch” bar to fire up what I’ll need to use – Outlook, R / SPSS /SAS, Winamp etc. So why do all sorts of dialogs pop up in my face while I am typing? Why does winamp have to pop up while I’m typing my email password? And why do they have to switch the input focus so that whatever I’ve happen to type now ends up in the wrong window? This is so annoying. Stealing the input focus is a known problem that has been written about countless times. It’s even against the GUI programming guidelines. “Do not steal the input focus” – what’s so difficult about that?

As a first consequence Norton Internet Security is now gone from my machine forever after it kept reminding me constantly – specifically with an uncanny accuracy when I was busy playing computer games – that I need to renew my anti-virus subscription or bad things will happen to my computer. And bad things did happen to my video game. But not anymore…

On the upside, there’s a carefully hidden option in the Windows XP Powertoys (TweakUI) that is supposed to prevent programs from stealing the input focus. It made things better, but doesn’t seem to work all the time.

Deploying SAS code in production

Saturday, November 1st, 2008

I had written a post about the issues of converting models into something that is usable in production environments as most stats-packages don’t have friendly interfaces to integrate them into the flow of processing data. I worked on a similar problem involving a script written in SAS recently. To be specific, some code for computing a risk-score in SAS had to be converted into Java and I was confronted with having to figure out the semantics of SAS code. I found a software to convert SAS code into Java and I have to say I was quite impressed with how well it worked. Converting one language (especially one for which there is no published grammar or other specification) into another is quite a task – after a bit of back and forth with support we got our code converted and the Java code worked on the first try. I wish there would be similar converter for STATA, R and SPSS :-)

License Key Copy Protection

Saturday, February 2nd, 2008

I had written long before I had a blog about doing copy protection the right way (read: so it requires some effort to remove it). With the more recent programing frameworks a couple of things changed. For one, all the new programing frameworks (.NET, Java) have cryptography support, which means it is far simpler to incorporate a license key scheme based on digital signatures. Personally I like using 512bit-DSA (Digital Signature Algorithm) for these purposes, because the key is long enough to stop amateurs from computing the secret key and short enough that a signature encoded as BASE32 could be typed in by someone. In the software you obviously only include your public key for verifying the signature of your license.

One issue with byte-code languages is that they are fairly trivial to disassemble (and produce very human readable code). Microsoft even included an MSIL Disassembler with .NET. Therefore the code needs to be obfuscated. One thing to try (not all obfuscater-programs support this) is to make the names as human-unfriendly as possible. Renaming classes to “A” and “B” is nice, but renaming them to “XfGkoAlPPqzz” and “XfGkoBlPPqzz” makes it even harder to read.

With a signature-based license key the hackers have only a few options left since they can’t write a key-generator. For one, they could patch a different key into the software for which they know the private key and generate signatures for this new key. It’s therefore important to use a hash of the real private key in some other ways in the program. For example, one could hash the private-key string with SHA/MD5 and use the result as a key for decrypting some data with a symmetric cypher such as AES. Another idea is to put the hash (or a checksum using the hash of some data and the public key) in some saved user data in order to prevent data-exchange between the cracked and legitimate versions of the software.

The second option is to find where the accept/reject decision is made in the program and to patch this comparison. Note that API-calls to the framework are fairly easy to find, even in obfuscated code. You should therefore have the API-call (for the crypto-functions as well as displaying any kind of error-message like “invalid license key”) far away from the comparison-operation so the hackers have to dig through more code. Also, I found that using a command-pattern makes for fairly unintelligible byte-code. Consider creating a class that encapsulates commands and has some Execute() method that can also return which command to execute next. Now, if you have a couple of commands in a List that are executed in a loop by invoking all the respective Execute() methods, then that is a bit harder to follow. Consider writing commands that can change entries in some global hash-table of strings, an If-Command, a Goto-Command, a MessageBox-Comand, a Verify-Signature-Command etc. With all those implemented, you can encode a little program by putting all those commands in a list. The important thing now is to encode the accept/reject decision (and one or two important parts of your software unrelated to the license-key) using the Command-Pattern, i.e. to write little programs in your “command-pattern-programing-language”. The If-Statement being used for the accept/reject will then be the same spot every other condition is tested in your code and can therefore not be patched without destroying other parts of the program. This forces the hackers to understand your command-pattern and to figure out what bytes they have to change to make this an unconditional jump. I found this fairly easy, clean and straight-forward to program (and debug) in a high-level language, yet fairly hard to understand in obfuscated byte-code.

Modeling Tree Structures and Hierarchies in a Data Warehouse

Friday, July 20th, 2007

I’m currently reading a book about Data Warehouse design (“Mastering Data Warehouse Design”, Claudia Imhoff, Nicholas Galemmo and Jonathan Geiger, Wiley). One thing I noticed is the incredibly inefficient way the authors encode trees in relational databases. Their suggestion is to model it with “pointers” to child-nodes which is incredibly inefficient to deal with in SQL, leads to recursive queries (unless proprietary SQL extensions are used) and you’ll have to write loads of self-joins. A much better way of encoding trees in SQL is based on nested sets. However, it always depends on what kind of queries you will run later. According to the book, those would be listing elements on the same level of the tree as well as retrieving sub-trees. This is something that I think is still kind of painful when using nested sets.

Here is my favorite solution for encoding trees in SQL:

CREATE TABLE TreeMagic (Mykey CHAR(10) PRIMARY KEY, FatherNode ChAR(10) NOT NULL, length INTEGER NOT NULL);

MyKey FatherNode Length
A   1
AA A 2
AB A 2
AAA AA 3
AAB AA 3
ABA AB 3
ABB AB 3

So key A is the root, AA and AB are child-nodes of A, AAA and AAB are child-nodes of AA, and so on. The cool part is traversing the tree on one level is easy due to the length-field and selecting subtrees is easy as well using the like-operator, which moves all the hard work into the B-Tree index. Inserting a new node is simpler than with set-based trees for which in the worst case you might have to increase the left/right numbers at a couple of other nodes. The path to each node is always fully known. Traversing the tree in DFS comes for free by lexicographical ordering of the keys with an “order by” clause. BFS is available when ordered by “length.FatherNode”. One operation the nested sets can do faster is determining the number child-nodes just with a bit of math. Either way I think this idea is way more efficient than the one proposed in the book :)

Edit: I just found a book that is full of patterns for modeling the most common structures in relational databases: “SQL for Smarties – Advanced SQL Programming” by Joe Celko (who also wrote the article about the nested-set method for trees). It contains various graph problems and how to manage them in relational databases. Looks interesting…

Safe Strings in PHP (2)

Sunday, July 1st, 2007

I wrote about the problems with PHP strings here and the possible solution I liked using a class encapsulating strings in PHP. I now worked out some details to make every string function in PHP work with the new “SafeString”-class. You can find the details and source here. This is still more a proof-of-concept and for all practical purposes would require the re-writing of a couple of things like database abstraction layers and such to return SafeStrings as well.

Safe Strings in PHP

Monday, May 7th, 2007

A while ago I read about an idea to make it easier to avoid common programming mistakes in PHP regarding the handling of strings. There are dozens of attacks that one must pay attention to when using strings: you have to escape your string one way when you embed it in an SQL statement, escape it in a different way when outputting it as part of a web-page (XSL attacks), and escape it in a third way when you output it as part of a HTTP-header. It’s not surprising that eventually somewhere something will be not escaped in the right way.

Wells suggests a SafeString class to encapsulate all Strings in a class with different access methods that automatically escape your string the right way. So if you were to output the string back to the user, you’d call a toHTML() method that properly escapes any HTML-tags and special characters embedded in the string. A method to access the raw string would be called “UnsafeRawString” to remind the programmer that the string contains “tainted” user-input. While it is still possible to do something wrong, these parts stick out in the code (for example, one might use String->toHTML() when using it in an SQL statement – obviously wrong, but much easier to find). See “Making Wrong Code look Wrong” for the underlying philosophy.

I really like the idea, but I see a couple of practical problems with this idea:

  • All strings, including Server variables and Super-Globals, should be automatically converted to the new String class. Otherwise the programmer has to constantly figure out if he/she is dealing with an encapsulated string or not.
  • You’d need a database abstraction layer that will return these kind of strings as results of queries.
  • All the existing PHP string operations (from strcmp to soundex) must be usable. This can be tricky, but interestingly PHP5 offers a way with __call to overload the object with arbitrarily named functions (see overload() function in PHP4). With some eval-magic this could be doable. Technically you wouldn’t want anybody to ever to work with the UnsafeRawString…

Sequential Sampling and Machine Learning

Monday, January 8th, 2007

In order to estimate an unknown quantity mu a common approach is to design an experiment that results in a random variable Z distributed within the interval [0,1]. The expectation E[Z]=μ can then be estimated by running this experiment independently, averaging the outcomes, and using Monte-Carlo techniques for the estimate. In (Dagum, Karp, Luby and Ross, SIAM Computing,1995) the AA algorithm (“Approximation Algorithm”) is introduced which, given epsilon and delta and independent experiments for the random variable Z, produces an estimate of the mean (or the true expectation) that is within the factor of 1+ε of μ with probability of success of at least 1-δ. Note that there are no distributional assumptions by the algorithm. This has a couple of applications in machine learning, for example in Bayesian Inference, Bayesian Networks and Boosting (Domingo and Watanabe, PAC-KDD, 2000).

The AA algorithm works in three steps. First, the stopping rule computes an initial estimate of the mean. Then, the variance is determined and, in the third step, additional samples are taken to approximate the expectation even further. A small improvement for the stopping rule in step one can be made as follows. The algorithm assumes a non-zero expectation and keeps sampling until the sum of the elements is larger than a constant determined by epsilon and delta (read the paper to see why that works). The problem is that the closer the elements are to zero, the more elements are needed.

Observe that the following holds for the mean:

With that one can improve the stopping rule as follows:

P.S.: To type Greek letters into WordPress use the html named entities such as ε for ε. That took me forever …

Google Co-Op

Thursday, October 26th, 2006

Google recently started their new Co-Op service (see here) which allows users to create their own personalized search-engines to whichever topic they desire. More importantly, it allows for inclusion of the search into the website of the whoever created the search-engine. Especially the later is nice as before it was only possible to either drive the user away to the Googles’ homepage, open it in an iframe or a new window – all stuff that webmasters wouldn’t want for one reason or the other. I just created a search-engine for machine-learning related materials. I’m still adding sites, but I’m getting better results for my personal searches already. I also wrote a module for the PHPNuke content management system to include whichever personal CoOp search-engine you created into the system. You can download the Search-The-Web-module for PHP-Nuke on my website. It even comes with English, German and French language-files (although the French translation might be a bit funny).