Funny text ads in Gmail

April 25th, 2007

I just had a look at the spam folder of my Gmail account and noticed that Gmail shows spam recipes as advertisements there (“Savory Spam Crescents”). Amusing…

Copyright law

April 9th, 2007

I found a most insightful piece on copyright law on YouTube that uses the Amen break as an example.

Caffeine Rave party (by Triad Dragons) …

April 4th, 2007

Last Saturday (3/31/07) I went to the Caffeine Music and Arts Festival rave party (by Triad Dragons) in Littleton (Denver). The event started at 8 PM and was supposed to go until 5 AM. This must have been the worst-organized event I have ever been to. I bought tickets online, arrived at 9 PM, and stood with thousands of other people in line in the cold. Everybody already had their ticket, and it took two hours of standing in line just to be able to see the entrance. It was cold, and everybody (including me) was wearing thin stuff like t-shirts. This is ridiculous. At 11 PM I left, as it probably would have taken another hour or so (given the speed the line was moving) to get to the entrance from where I was. Given that they expected a couple of thousand people to attend, they had (from what I could see from where I was) only two (2!) people to pat down visitors for drugs and weapons. Maybe the rest were on break or whatever, but one would expect that they could afford enough staff at $35 a ticket. I will most certainly not attend the Caffeine Festival 2008, and I will let anybody who wants to hear about it know how things went in 2007. I am upset …

Back from AISTATS 2007

March 27th, 2007

Just got back home from AISTATS (Artificial Intelligence and Statistics). The conference was really interesting (more so than NIPS), and it’s unfortunate that it is held only every two years. Some of the invited talks were way over my head, but I learned a lot from other people’s work and got new ideas …

Some of the coolest papers were (an incomplete list, in no particular order; I still need to organize my notes 🙂 but there were way more papers of interest to me than at NIPS):

  1. Nonlinear Dimensionality Reduction as Information Retrieval
    • Jarkko Venna and Samuel Kaski
  2. Fast Low-Rank Semidefinite Programming for Embedding and Clustering
    • Brian Kulis, Arun Surendran, and John C. Platt
  3. Local and global sparse Gaussian process approximations
    • Edward Snelson, Zoubin Ghahramani
  4. A fast algorithm for learning large scale preference relations
    • Vikas Raykar, Ramani Duraiswami, and Balaji Krishnapuram
  5. Deep Belief Networks
    • Ruslan Salakhutdinov and Geoff Hinton
  6. Large-Margin Classification in Banach Spaces
    • Ricky Der and Daniel Lee

One thing I couldn’t help but notice was how much research is now focusing on semidefinite programs (SDPs), either for dimensionality reduction or for other purposes. Yet there are not many efficient ways to solve SDPs. One paper presented a method based on quasi-Newton gradient descent, but it’s probably not good enough yet for large problems.
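To make the idea concrete, here is a minimal sketch of the low-rank factorization trick that work in this direction builds on: replace the PSD matrix variable X with YYᵀ and optimize over Y directly. This toy uses fixed-step gradient ascent rather than quasi-Newton, and every detail (random cost matrix, the diag(X) = 1 constraint, step size, rank) is illustrative rather than taken from any of the papers above.

```python
# Hedged sketch of the low-rank factorization trick for SDPs:
# solve  max <C, X>  s.t.  diag(X) = 1,  X PSD,  via  X = Y Y^T.
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 4
C = rng.normal(size=(n, n))
C = (C + C.T) / 2                                  # random symmetric cost

Y = rng.normal(size=(n, r))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)      # unit rows => diag(X) = 1
for _ in range(500):
    G = 2 * C @ Y                                  # gradient of <C, Y Y^T>
    Y += 0.01 * G                                  # fixed-step ascent
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # project back onto constraint

X = Y @ Y.T
print("objective <C, X> =", np.sum(C * X))
```

The point of the factorization is that Y has only n·r entries instead of n², and the PSD constraint holds by construction, which is what makes first-order (or quasi-Newton) methods feasible at larger scales.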

Another interesting paper was about unsupervised deep belief nets that learn a structure of the data, which results in an interesting performance boost. The authors train a deep belief net (unsupervised) on the data and then train classifiers on its output; although all the results were compared only to linear techniques, they showed some impressive numbers. This reminded me of a similar idea I had a while ago that I never got to work: I tried to use label propagation methods to approximate a kernel matrix usable for SVMs and the like. It never worked, because my algorithm caused the SVMs to always overfit (despite being unsupervised – it took me a while to realize that doing something unsupervised is no guarantee that you won’t overfit your data). I’ll investigate some day what made all the difference in this case…
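The “unsupervised feature learning, then a classifier on top” recipe is easy to play with. Below is a minimal sketch using scikit-learn’s BernoulliRBM as a stand-in for a single DBN layer; a real deep belief net stacks and trains several such layers, so treat this only as an illustration of the pipeline, not the authors’ method.

```python
# Minimal sketch: unsupervised feature learning (one RBM layer standing in
# for a deep belief net), followed by a supervised classifier on its output.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixels to [0, 1] for the Bernoulli RBM
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=128, learning_rate=0.05,
                         n_iter=20, random_state=0)),  # unsupervised step
    ("clf", LogisticRegression(max_iter=1000)),        # supervised step
])
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```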

Another interesting bit was that approximating the matrix inverse by low-rank approximations leads to a significant loss of accuracy in Gaussian process error bars. This should be interesting for further research on speedups for these and other algorithms that require a matrix inversion (e.g., semi-supervised label propagation algorithms).
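The effect is easy to reproduce on toy data. The sketch below compares exact GP predictive variances against those computed from a Nyström low-rank approximation of the kernel matrix; the RBF kernel, randomly chosen landmark points, and all settings are illustrative, not the experiment from the talk.

```python
# Toy comparison: exact vs. low-rank (Nystrom) GP predictive variances.
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))   # training inputs
Xs = np.linspace(-3, 3, 5)[:, None]     # test inputs
noise = 0.1

K, Ks, Kss = rbf(X, X), rbf(X, Xs), rbf(Xs, Xs)

# Exact predictive variance: k** - k*^T (K + noise I)^{-1} k*
A = K + noise * np.eye(len(X))
var_exact = np.diag(Kss - Ks.T @ np.linalg.solve(A, Ks))

# Nystrom: K ~= K_nm K_mm^{-1} K_mn with m randomly chosen landmarks
m = 10
idx = rng.choice(len(X), m, replace=False)
Knm, Kmm = rbf(X, X[idx]), rbf(X[idx], X[idx])
K_lr = Knm @ np.linalg.solve(Kmm + 1e-8 * np.eye(m), Knm.T)
A_lr = K_lr + noise * np.eye(len(X))
var_lowrank = np.diag(Kss - Ks.T @ np.linalg.solve(A_lr, Ks))

print(np.c_[var_exact, var_lowrank])  # columns can differ noticeably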

eWRT Project closing…

March 19th, 2007

The eWrt Linux project is closing 🙁 Sad; I had a lot of fun modding my little Linksys router and even contributed a few patches. Hopefully somebody will volunteer to continue the project on SourceForge. I wish I had the time…

Comment Spam and more Spam …

March 11th, 2007

When I started my blog I was already aware of the comment spam problem and thus enabled a WordPress plugin to prevent comment spam (“did you pass math”). The other day a friend complained that when he wanted to comment on something and forgot to fill out the captcha field, his comment got lost (and pushing the back button made his browser lose everything he had typed up). And while reading through raw Apache logs, I saw somebody trying to post a comment and apparently not succeeding. So I turned the plugin off, and within a day I had 8 spam comments on my blog (which does not have a high PageRank and uses nofollow links; what’s the gain?)… So I’ll keep it turned on. There!

Spam is an interesting problem, because you have an “adversary” with a lot of resources who will do whatever it takes to get your attention, be it an email in your inbox or a comment with links on your blog. The more filters we build, even with machine learning, the more sophisticated the spammers become. It will probably be a driving force for classification research for some time to come. However, machine-learning filters are very expensive in CPU time and do not scale very well. Sander told me about the email server at his institute having a backlog of 40 gigabytes, i.e., 40 gigabytes of email sitting in the spool waiting to be scanned for spam and viruses. That this server was serving only about 50 users, and that probably 99% of the email in the spool was spam, illustrates the problem. Currently (in my opinion) mechanisms like greylisting are a better solution, simply because they scale better: they exploit “implementation issues” of the spam software and don’t require a CPU-intensive scan of every email. That is, until the next generation of spam bots adapts to those measures. Build a better spam filter and somebody will build a better spam.
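For readers who haven’t seen greylisting: the trick is to temporarily reject the first delivery attempt from an unknown (client IP, sender, recipient) triplet. A legitimate mail server retries after a delay, as the SMTP standard expects; most spam bots fire once and move on. Here is a minimal sketch of the idea (all names and the delay value are illustrative, not any particular implementation):

```python
import time

GREYLIST_DELAY = 300  # seconds a new triplet must wait before acceptance

seen = {}  # (client_ip, sender, recipient) -> timestamp of first attempt

def check(client_ip: str, sender: str, recipient: str) -> str:
    """Return an SMTP-style response for a delivery attempt."""
    triplet = (client_ip, sender, recipient)
    now = time.time()
    first = seen.setdefault(triplet, now)
    if now - first < GREYLIST_DELAY:
        # Temporary failure: a real MTA will retry later, most bots won't.
        return "451 4.7.1 Greylisted, please retry later"
    return "250 OK"

print(check("192.0.2.7", "a@example.com", "me@example.org"))  # first try: 451
```

Note that no content scanning happens at all, which is exactly why this scales so much better than running a classifier over every message.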

Artificial Intelligence Cited for Unlicensed Practice of Law

March 8th, 2007

I just read an article on the Wired blog titled “AI Cited for Unlicensed Practice of Law”, citing a court ruling that upheld the decision that the owner, through the expert system he developed, had given unlicensed legal advice. While an expert system is a clear-cut case (the system always does exactly what it was told [minus errors in the rules]; it just follows given rules and draws logical conclusions), this becomes more interesting in cases in which the machine learns or otherwise modifies its behavior over time. For example, let’s say I put an AI program online that interacts with people and learns over time. Should I be held responsible if the program does something bad? What if I was not the person who taught it that particular behavior? This will probably be a topic that the courts will have to figure out in the future. For one, people should not be able to hide behind actions their computer has taken. But what if it is reasonably beyond the capability of the individual to foresee what the AI would do?

This will probably end up being the next big challenge for courts, just like the internet has been. It is interesting how the internet has created legal problems simply by letting people communicate more easily with each other: think of trademark issues, advertising restrictions for tobacco, or copyright violations (fair use differs from country to country; what is legal in one might be illegal in another) …

Update: And it just started. Check out this article: Colorado Woman Sues To Hold Web Crawlers To Contracts

“I’m sorry, but I’m married …”

March 5th, 2007

As I enjoy going out with friends and mingling a lot, I have noticed a very interesting trend lately. Some of my single friends will go and chat up some woman they find attractive (sometimes with success), and if the woman is not interested, she will tend to show them a ring on her finger and tell them that she unfortunately is taken. So far, so good. I was out with some friends of mine. We were just talking and observed some gentleman asking a woman out right off the bat. She showed her ring and politely turned the guy down. Chris ended up talking to this lovely lady later; we all met as part of some group going out, and Chris and she had a longer conversation. After lots of conversation and laughter, she excused herself for a minute. Once she came back from the bathroom, her ring was gone. He didn’t notice it at first, but as she suddenly became a lot more flirty, he asked her about it straight up. Her answer: “It’s a fake ring. Just to keep guys in bars from hitting on me.” They talked for a bit more about this and that, and she started hinting more and more that she would be very interested in going on a date with Chris. He thought about it, and walked away. Chris told me later that he just does not like to date women who lie; trust and honesty are important to him. His reasoning was that if she is willing to lie (or: use little “white lies”) right from the start, how can one expect that it does not get worse over time? What if she uses little white lies to get out of every situation she does not want to be in?

What do we learn from this? A fake wedding ring might keep all the drunk losers away, but it will possibly confuse or scare off Mr. Right when he comes along. I’ve heard similar stories from a couple of other guys; it just does not go over well.

Validating patterns found by Data Mining techniques

February 22nd, 2007

I just read in the news about a study showing, by way of data mining, that Libras live longer (a study from the Institute for Clinical Evaluative Sciences in Toronto). These guys did a study for fun, using data from 10 million Ontario residents, looking for associations between various health problems and astrological signs. And they actually found associations! Each of the twelve astrological signs had at least two medical disorders associated with it. However, they were not able to replicate the findings when looking for the same patterns in the hold-out set.

This is a nice example of why it is important to have a hold-out set, or ideally to try to redo a study with a different method on a different set of similar data. A couple of other methods for validation have been proposed which are harder to carry out. In my data mining work with Tim we have used a cross-replication and cluster-validation design from the book “Classification” by Gordon (1999). The original method was (I think) proposed by McIntyre and Blashfield (“A nearest-centroid technique for evaluating the minimum variance clustering procedure”, Multivariate Behavioral Research 15, 1980; see also Milligan, “Clustering validation: results and implications for applied analyses”, in “Clustering and Classification”, eds. Arabie, Hubert, and De Soete, World Scientific, Singapore, 1996, p. 341). What I particularly like about the method is that it can be used to quantify in numbers “how much” of a replication you got. The method works like this (a code sketch follows the list):

  • Divide the data randomly into two subsets
  • Do your exploratory clustering work on set A, partitioning it into k many clusters
  • Use a classifier (nearest neighbor, for example) to assign each object in B to one of the clusters discovered in A
  • Use the same clustering procedure on B, partitioning it into k many clusters
  • Compare the two labelings obtained for B (the classification and the clustering) in a cross-tabulation and compute a kappa coefficient
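Here is a minimal sketch of the procedure under illustrative assumptions (k-means as the clustering procedure, a nearest-centroid classifier, synthetic blob data). One practical wrinkle: cluster numberings are arbitrary, so the two labelings of B have to be aligned (e.g., by Hungarian matching on the cross-tabulation) before kappa is meaningful.

```python
# Sketch of the cross-replication cluster validation design.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

X, _ = make_blobs(n_samples=600, centers=3, random_state=0)
A, B = train_test_split(X, test_size=0.5, random_state=0)  # 1) random split
k = 3

# 2) exploratory clustering on A
labels_A = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A)

# 3) classify each object in B into one of A's clusters
clf_labels = NearestCentroid().fit(A, labels_A).predict(B)

# 4) cluster B independently with the same procedure
clust_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(B)

# 5) align the arbitrary cluster numbers via the cross-tabulation,
#    then compute the kappa coefficient between the two labelings of B
C = confusion_matrix(clf_labels, clust_labels)
rows, cols = linear_sum_assignment(-C)          # maximize matched counts
mapping = {c: r for r, c in zip(rows, cols)}
aligned = np.array([mapping[l] for l in clust_labels])
print("kappa:", cohen_kappa_score(clf_labels, aligned))
```

A kappa near 1 means the structure found in A replicates in B; a kappa near 0 means the “clusters” are likely just noise, as in the astrology study.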

Given our tendency to find patterns in data (or see shapes in clouds) I think it is important to use a procedure like the above to double-check the patterns discovered before any important decisions are made.

EM-Clustering when double isn’t enough…

February 13th, 2007

One of the more interesting developments in clustering (in my opinion) is the clustering of data that lies on a unit hypersphere. It sounds like some rare special case at first, but it appears quite frequently in real-life applications, such as spectral embeddings in spectral clustering, some subdomains of bioinformatics data, or text data in TF-IDF representation. The data can be analyzed as unit vectors on a d-dimensional hypersphere, or, equivalently, the data are directional in nature. Spectral clustering techniques generate embeddings in the normalization step that constitute an example of directional data and can result in different shapes on the hypersphere.

The first paper to suggest a good clustering algorithm for such data presented an Expectation Maximization (EM) algorithm for the von Mises-Fisher distribution (Banerjee et al., JMLR (6), 2005). Avleen and I started to work on an extension of this that utilizes the Watson distribution, a distribution for directional data that has more modeling capability than the simpler von Mises-Fisher distribution. We just published our results for the Watson EM clustering algorithm at the AISTATS 2007 conference, to be held in March (Matlab code will be available soon).
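To give a feel for clustering on the sphere: spherical k-means is the hard-assignment special case of EM for a von Mises-Fisher mixture (equal concentrations, uniform weights). The sketch below is a toy illustration of that simpler relative, not the Watson EM algorithm from our paper; the data generation is made up.

```python
# Toy spherical k-means: cosine-similarity assignments, mean-direction updates.
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project onto sphere
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = (X @ centers.T).argmax(axis=1)        # E-step: best cosine
        for j in range(k):                             # M-step: mean direction
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                centers[j] = m / np.linalg.norm(m)
    return labels, centers

# Two noisy bundles of directions as toy data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([3, 0, 0], 0.5, (100, 3)),
               rng.normal([0, 3, 0], 0.5, (100, 3))])
labels, centers = spherical_kmeans(X, k=2)
print(np.bincount(labels))
```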

One problem with both algorithms is that they require a high-precision number representation in order to work well for high-dimensional problems such as bioinformatics data and text. Most prior work with directional data was limited to maybe three-dimensional cases, and most Kummer-function approximations (another problem we had to address) work well only in the lower-dimensional cases. In our AISTATS paper we only presented results for lower-dimensional embeddings, as we had some problems getting the method to work for higher-dimensional data (also, the root solver involved is simply incapable of handling larger problems). We have been working on a speedup with some success, but I have to say that it was mostly the numerical problems that gave us a hard time.

More and more machine learning techniques require careful consideration of numerical problems (support vector machines, my manifold clustering algorithm, etc.), and I run into numerical problems every other day. While trying to improve our Watson EM algorithm, I found out that continued fractions have many desirable properties, such as the unique, finite representation of rational numbers: such numbers can be represented exactly, with no numerical error. In the EM algorithm we use them to approximate the Kummer function. Maybe a more exact number representation for fractions can be made out of this?
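The exactness property is easy to demonstrate. Here is a small sketch (written in Python rather than the Prolog of the tech report mentioned below) that expands a rational number into its simple continued fraction and reconstructs it without any rounding error:

```python
# A rational number has a finite simple continued fraction expansion,
# so the round trip below is exact: no floating point involved.
from fractions import Fraction

def to_continued_fraction(x: Fraction) -> list[int]:
    """Return [a0, a1, a2, ...] with x = a0 + 1/(a1 + 1/(a2 + ...))."""
    terms = []
    while True:
        a, rest = divmod(x.numerator, x.denominator)
        terms.append(a)
        if rest == 0:
            return terms
        x = Fraction(x.denominator, rest)  # invert the fractional part

def from_continued_fraction(terms: list[int]) -> Fraction:
    x = Fraction(terms[-1])
    for a in reversed(terms[:-1]):
        x = a + 1 / x
    return x

cf = to_continued_fraction(Fraction(355, 113))  # a classic pi approximation
print(cf)                                       # [3, 7, 16]
print(from_continued_fraction(cf))              # 355/113, recovered exactly
```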

So I started looking and found a tech report showing that the continued fraction representation of numbers can be nicely implemented in Prolog. It also explains how to add numbers in continued fraction representation, and so on, with arbitrary precision.

I haven’t found any papers yet that suggest a suitable hardware implementation of continued fractions to replace the IEEE floating point numbers we use nowadays, but it can probably be done.