Archive for February, 2007

Validating patterns found by Data Mining techniques

Thursday, February 22nd, 2007

I just read in the news about a study showing, via data mining, that Libras live longer (a study from the Institute for Clinical Evaluative Sciences in Toronto). These guys did a study for fun, using data from 10 million Ontario residents to look for associations between various health problems and astrological signs. And they actually found associations: each of the twelve astrological signs had at least two medical disorders associated with it. However, they were not able to replicate the findings when looking for the same patterns in a hold-out set.

This is a nice example of why it is important to have a hold-out set, or ideally to try to redo a study with a different method on a different set of similar data. A couple of other validation methods have been proposed that are harder to carry out. In my data mining work with Tim we have used a cross-replication and cluster validation design from the book "Classification" by Gordon (1999). The original method was (I think) proposed by McIntyre and Blashfield ("A Nearest-Centroid Technique for Evaluating the Minimum Variance Clustering Procedure", Multivariate Behavioral Research 15, 1980; see also Milligan (1996), "Clustering Validation: Results and Implications for Applied Analyses", in Clustering and Classification, eds. Arabie, Hubert and De Soete, World Scientific, Singapore, p. 341). What I particularly like about the method is that it can be used to quantify "how much" of a replication you got. The method works like this (a small code sketch follows the list):

  • Divide the data randomly into two subsets, A and B
  • Do your exploratory clustering work on set A, partitioning it into k clusters
  • Use a classifier (Nearest Neighbor, for example) to assign each object in B to one of the discovered clusters
  • Apply the same clustering procedure to B, partitioning it into k clusters
  • Compare the two labelings obtained for B (the classification and the clustering) in a cross-tabulation and compute a kappa coefficient
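
Here is a minimal sketch of the procedure in Python, assuming k-means as the clustering step and scikit-learn's nearest-centroid classifier and kappa implementation (illustrative choices, not the exact tools from our work). Since cluster numbers are arbitrary, the two labelings of B have to be matched up before kappa means anything; the sketch matches greedily, while the Hungarian algorithm would be the careful choice.

    # Minimal sketch of the cross-replication design described above.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import NearestCentroid
    from sklearn.metrics import cohen_kappa_score

    def cross_replication_kappa(X, k, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Divide the data randomly into two subsets A and B.
        mask = rng.random(len(X)) < 0.5
        A, B = X[mask], X[~mask]
        # 2. Cluster A into k clusters.
        labels_A = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(A)
        # 3. Assign each object in B to the nearest discovered cluster.
        labels_B_classified = NearestCentroid().fit(A, labels_A).predict(B)
        # 4. Cluster B independently with the same procedure.
        labels_B_clustered = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(B)
        # 5. Cross-tabulate the two labelings of B. Cluster numbers are
        #    arbitrary, so greedily match each B-cluster to the classified
        #    label it overlaps with most before computing kappa.
        cont = np.zeros((k, k), dtype=int)
        for a, b in zip(labels_B_classified, labels_B_clustered):
            cont[a, b] += 1
        mapping = cont.argmax(axis=0)
        return cohen_kappa_score(labels_B_classified, mapping[labels_B_clustered])

A kappa near 1 means the structure found on A replicated almost perfectly on B; a kappa near 0 means the "clusters" were no better than chance.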

Given our tendency to find patterns in data (or to see shapes in clouds), I think it is important to use a procedure like the one above to double-check discovered patterns before any important decisions are made.

EM-Clustering when double isn’t enough…

Tuesday, February 13th, 2007

One of the more interesting developments in clustering (in my opinion) is the clustering of data that lies on a unit hypersphere. It sounds like a rare special case at first, but it appears quite frequently in real-life applications such as spectral embeddings in spectral clustering, some subdomains of bioinformatics data, or text data in TF-IDF representation. Such data can be analyzed as unit vectors on a d-dimensional hypersphere, or equivalently, the data are directional in nature. Spectral clustering techniques, for example, generate embeddings in their normalization step that constitute directional data and can result in different shapes on a hypersphere.
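
As a small illustration of the text case (the documents below are made up): after the usual L2 normalization, every TF-IDF document vector is a point on the unit hypersphere, and the dot product between two documents is exactly their cosine similarity.

    # TF-IDF vectors as directional data (illustrative example).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["clustering on the unit hypersphere",
            "spectral clustering and embeddings",
            "the von Mises-Fisher and Watson distributions"]
    # TfidfVectorizer L2-normalizes rows by default (norm='l2').
    X = TfidfVectorizer().fit_transform(docs).toarray()

    print(np.linalg.norm(X, axis=1))  # all 1.0: points on the unit sphere
    print(X @ X.T)                    # dot products = cosine similarities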

The first paper to suggest a good clustering algorithm for such data presented an Expectation Maximization (EM) algorithm for the von Mises-Fisher distribution (Banerjee et al., JMLR (6), 2005). Avleen and I started working on an extension of this that uses the Watson distribution, a distribution for directional data with more modeling capability than the simpler von Mises-Fisher distribution. We just published our results for the Watson EM clustering algorithm at the AISTATS 2007 conference, to be held in March (Matlab code will be available soon).
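
For the flavor of the von Mises-Fisher case, here is a rough Python sketch of the mixture-of-vMF EM, following the structure in Banerjee et al. and using their closed-form approximation for the concentration parameter kappa; the initialization and variable names are my own choices, and this bare-bones version inherits all the numerical fragility discussed in the next paragraph.

    # EM for a mixture of von Mises-Fisher distributions (sketch only).
    import numpy as np
    from scipy.special import ive   # exponentially scaled Bessel I_v

    def log_vmf_normalizer(kappa, d):
        # log c_d(kappa), with I_v evaluated in scaled form for stability;
        # in very high dimensions even this underflows double precision.
        v = d / 2.0 - 1.0
        return (v * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi)
                - (np.log(ive(v, kappa)) + kappa))

    def vmf_em(X, k, n_iter=50, seed=0):
        n, d = X.shape                            # rows must be L2-normalized
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(n, k, replace=False)]   # mean directions from data
        kappa = np.full(k, float(d))              # moderate start concentration
        pi = np.full(k, 1.0 / k)
        for _ in range(n_iter):
            # E-step: responsibilities from the vMF log-densities
            log_p = X @ (mu.T * kappa) + log_vmf_normalizer(kappa, d) + np.log(pi)
            log_p -= log_p.max(axis=1, keepdims=True)
            r = np.exp(log_p)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: mixing weights, mean directions, concentrations
            pi = r.mean(axis=0)
            S = r.T @ X                           # weighted resultant vectors
            norm = np.linalg.norm(S, axis=1)
            mu = S / norm[:, None]
            rbar = np.clip(norm / (n * pi), 1e-8, 1.0 - 1e-8)
            kappa = (rbar * d - rbar ** 3) / (1.0 - rbar ** 2)  # Banerjee approx.
        return mu, kappa, pi, r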

One problem with both algorithms is that they require a high-precision number representation in order to work well for high-dimensional problems such as bioinformatics data and text. Most prior work with directional data was limited to maybe three-dimensional cases, and most Kummer-function approximations (another problem we had to address) work well only in the lower-dimensional cases. In our AISTATS paper we only presented results for lower-dimensional embeddings, as we had problems getting the method to work on higher-dimensional data (also, the root solver involved is simply incapable of handling larger problems). We have been working on a speedup with some success, but I have to say that it was mostly the numerical problems that gave us a hard time.
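
To make the precision problem concrete: an arbitrary-precision library such as mpmath (an illustrative choice, not what we used for the paper) can evaluate the Kummer confluent hypergeometric function 1F1 at parameter magnitudes where the result no longer even fits into a double. The specific parameter values below are made up.

    # Arbitrary-precision evaluation of the Kummer function 1F1(a; b; z).
    from mpmath import mp, hyp1f1

    mp.dps = 50                        # 50 significant decimal digits
    a, b, kappa = 0.5, 500.0, 2000.0   # magnitudes typical of high dimensions
    print(hyp1f1(a, b, kappa))         # accurate here; values at this scale
                                       # overflow IEEE double precision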

More and more machine learning techniques require careful consideration of numerical problems (Support Vector Machines, my manifold clustering algorithm, etc.), and I run into numerical problems every other day. While trying to improve our Watson EM algorithm I found out that continued fractions have many desirable properties, such as a unique, finite representation for every rational number: such numbers can be represented exactly, with no numerical error. In the EM algorithm we use continued fractions to approximate the Kummer function. Maybe a more exact number representation for fractions could be built out of this?
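
A quick demonstration of the exactness property, using Python's built-in rational type (the helper names are mine): a rational number unrolls into a finite list of continued fraction terms, and folding the list back up recovers the number exactly, with no rounding anywhere.

    # Finite, exact continued fraction representation of rationals.
    from fractions import Fraction

    def to_cf(x: Fraction):
        # repeatedly split off the integer part and invert the remainder
        terms = []
        while True:
            q, r = divmod(x.numerator, x.denominator)
            terms.append(q)
            if r == 0:
                return terms
            x = Fraction(x.denominator, r)

    def from_cf(terms):
        # fold back up from the right: a0 + 1/(a1 + 1/(...))
        x = Fraction(terms[-1])
        for a in reversed(terms[:-1]):
            x = a + 1 / x
        return x

    r = Fraction(355, 113)    # the classic rational approximation of pi
    cf = to_cf(r)             # [3, 7, 16]
    assert from_cf(cf) == r   # exact round trip, no floating point involved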

So I started looking and found a tech report showing that this number representation can be nicely implemented in Prolog. It also explains how to add numbers in continued fraction representation, and so on, with arbitrary precision.

I haven’t found any papers yet that suggest a suitable hardware implementation of continued fractions to replace the IEEE floating-point numbers we use nowadays, but it can probably be done.

Artificial Intelligence and Sports

Thursday, February 8th, 2007

A couple of days ago Indianapolis won the Super Bowl – just as predicted by an Electronic Arts simulation. The simulation software had been fed the latest data about all the players involved, and the game AI fought it out. In the past some of the simulated outcomes were not that close to the final scores, but they still did a fairly decent job in 2005 and 2007.

There is more and more statistical decision making in basketball as well, the most famous example being the Miami-Orlando series in the 1997 playoffs.

Interesting…

Computer Security and Psychology

Friday, February 2nd, 2007

Bruce Schneier gave a speech about how human psychology affects computer security. Very true, as security software is often too cumbersome to use. Email encryption is still not commonplace, while SSL as end-to-end encryption is: it’s easy to use, and people have been trained to look for the little golden padlock in the corner before entering their credit card number. Yet I feel there are a couple of things that could be done to encourage people to pay more attention to computer security. In my opinion this isn’t happening because:

  1. Most people are good and assume that other people are good too. They hold the door open for the guy who left his badge in the car, they click on the “cool link”, they open email that looks like it might be from someone important.
  2. Most people see security problems as something that happens to someone else. Most breaches are never publicized, and some publicized breaches are so huge (millions of credit card numbers copied – yet nothing happens to them or anybody they know) that they reinforce the belief that problems are unlikely. We feel safe in a crowd.
  3. Most people believe they know what they are doing, and some people are pretty learning-resistant when it comes to computers. I’ve heard stories from companies in which the IT staff is supposed to do user training on top of the external training people received in the beginning (try getting accounting to explain to you over and over again how to file reimbursement claims). Maybe we really need a computer driver’s test, but then again drunk driving can kill people while drunk computing cannot.
  4. People get bored. Cry wolf too often, ask a person to be careful too many times in the face of a relatively low-probability event, and they become trained to click “Yes, I’m sure.” (This will be interesting with Windows Vista.) We are constantly bombarded with awareness programs, which makes IT-security awareness compete with many other awareness programs.
  5. There is no incentive. Most people (employees) face no consequences when their PC is infected or the company database gets stolen. People have the neighbor’s kid come over to remove all the spyware from the machine, and so on. Avoidable security problems like spyware turn into a “car maintenance problem”.

I think a lot can be done on the incentive side. In industry, a lot of experience has been gained with safety incentive programs to reduce accidents. I found a study cited on a website stating that reinforcing safe acts “removes the unwanted side effects with discipline and the use of penalties; it increases the employees’ job satisfaction; it enhances the relationship between the supervisor and employees” (McAfee and Winn 1989). Properly designed incentives have the approval of the people to whom they are addressed and are often preferred over other forms of safety motivation such as laws and policing. Some incentives could probably be created to educate users and teach them safer computing practices. For example, to make people think more carefully about following links in email (phishing!), one could send fake phishing emails; a user who clicks a link lands on a page explaining that this could have been a trap and that URLs should always be entered directly into the browser’s address bar. With specially crafted URLs in the emails it is possible to track who clicked and who didn’t (a small sketch of the idea follows below). Similar things could be done with harmless executable attachments. I think this is a direction that should be pursued.
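
A minimal sketch of the tracking part, with a made-up training URL and addresses; the point is only that each recipient gets a unique, unguessable token, so the landing page can tell who clicked.

    # Per-recipient tracking links for a fake-phishing training campaign.
    import secrets

    employees = ["alice@example.com", "bob@example.com"]  # made up
    token_to_user = {}

    for email in employees:
        token = secrets.token_urlsafe(16)   # unguessable per-user token
        token_to_user[token] = email
        link = f"https://training.example.com/lesson?t={token}"
        print(f"send to {email}: {link}")

    # The landing page looks up token_to_user[token] to record who
    # clicked, then shows the "this could have been a trap" lesson.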