Archive for the ‘Machine Learning’ Category

Machine Learning Newsticker

Saturday, April 2nd, 2016

Shameless plug: I’ve fixed my scripts and my Machine Learning Newsticker (that reaches back to February 2005) is back in business. RSS Link Expect periodic updates on ML and AI related news items.


Strengths and Weaknesses of various Clustering Algorithms

Sunday, April 27th, 2014

Clustering is one of the most fascinating areas in the data mining field for me. For example, it can be shown that under some simple and fairly straightforward it is impossible to find a clustering function (Kleinberg’s impossibility theorem for clustering).That means that some trade-offs and relaxations have to be made in all the existing clustering algorithms. Every clustering algorithm will have a weakness and fail to cluster some data sets in a sensible manner.

The Fundamental Clustering Problem Suite (Ultsch, 2005) contains several synthetic data sets and a variety of clustering problems that algorithms should be able to solve. The data set contains the data and ground-truth labels as well as labels for k-Means, Single-Linkage, Wards-Linkage and a Self-Organizing map. I decided to play around with it a bit, converted the SOM activation matrix to labels using k-Means, added Spectral Clustering and Self-Tuning Spectral Clustering, and an EM Gaussian Mixture model. I was particularly curious how well Spectral Clustering would do. Determining the true number of clusters from the underlying data is an entirely different problem. Hence in all cases the number of clusters was specified unless otherwise noted. The regular Spectral Clustering used the Ng-Jordan-Weiss algorithm with a kernel sigma of 0.04 after linear scaling of the inputs. The Self-Tuning Spectral Clustering used k=15 nearest neighbors. I also used Random Forest’s “clustering”, an extension of the classification algorithm that will generate a distance matrix from the classification tree ensemble. The distance matrix was then used in single linkage to obtain cluster labels. Random Forest is a special case as it derives a distance matrix out of classification using gini – a metric built for classification, not clustering.



Detecting Click Fraud in Online Advertising

Sunday, March 2nd, 2014

JMLR has an interesting paper summarizing the results from a contest to build the best model for ClickFraud detection. The second place entry described some nice feature engineering that I found interesting. The first place did feature selection and then used gbm, a really good ensemble algorithm.


Cross-VM Side Channels and Their Use to Extract Private Keys

Sunday, October 28th, 2012

Cool application of machine learning in the security field: extracting private keys from virtual machines running on shared hardware by training a Support-Vector-Machine model to classify data bits collected.

Classification with inputs that change over time – P2P Loan Data

Saturday, October 6th, 2012

Predicting whether a loan will default or not is a tricky task. It may involve many variables, incomplete information and is a task that involves time as a component. Loans may also perform for a while before they default. Some loans may even be late, but recover back to the regular payment schedule. It’s an interesting application for statistics.

The LendingClub website, a service offering peer-to-peer lending, offers an interesting data set: historical data of loan performance as well as data for new loans. I’ve been playing around a bit with the data and built a model to predict whether a loan is a good investment. The LendingClub data is available for download. A data dictionary can be found on the website also.

First we need to define the outcome we want to predict. A loan can be in several states, some being “current”, others being “defaulted”, “late” or even on a “performing payment plan”. Conservatively, I defined all loans that were not “paid off” as bad. Loans that are “current” were excluded as they still can default in the future. Loans that are “late” are considered bad, because the borrower run into problems. The model I’m trying to built is basically for a conservative investor looking for loans that will simply be paid back without a hitch. With the usual statistical techniques a model can be built and the performance can be measured by 10-fold cross-validation or evaluating the model on a hold-out set. The real result of a prediction will of course only be available after about 3 years when a loan is fully paid off. As measure to optimize I chose the AUC metric. A 10-fold cross-validation estimates the performance of my model at 0.698 which is not too bad. The predictions implicitly make a few assumptions. The first one being that future performance of loans will be similar to historical performance of similar loans. I’m assuming a stationary distribution and the IID assumption – which is not completely true in reality, but hopefully close enough 🙂 Also, inflation expectations were not taken into account, but I’m limiting my model to 36 month loans to make that more manageable.

I won’t go into the details of how I encoded the variables and what variables I’m using. I discovered that I can extract information out of the textual variables in the loans. The “Loan Description”, a free text field where potential borrowers can leave comments or answer questions, is quite predictive. The difficult part is using that information in practice. A loan is in “funding state” for two weeks were investors can ask questions and invest in the loan. Many loans get fully funded before the two week period is over, some without any question or comment on the loan. New information may become available in the Loan Description field that may change the classification. That means, however, that the prediction may change over time – positively or negatively – after an investment decision has been. Not ideal, but the variables are quite powerful so I’m still looking for a good solution.

I made the ratings for the LendingClub loans my program produces public. I will update them occasionally (i.e., whenever I feel like it). If you have some suggestions on how to use the textual variables, leave a comment.

How Kinect body tracking works and how Machine Learning helped

Saturday, March 26th, 2011

Microsoft Research has published a paper explaining how the Kinect body tracking algorithm works [PDF]. This video shows how it all comes together. They trained a variation of Random Forests on the various pre-labeled images to identify the various body parts from a normal RBG camera and a depth-camera. The way they create many more training images from previously captured data is also interesting. The final system can run at 200 frames per second and it doesn’t need an initial calibration pose. Very interesting…

Elo Scores and Rating Contestants

Thursday, August 5th, 2010

Kaggle has a new and interesting competition on building a chess rating algorithm that performs better than the official Elo rating system currently in use. Entrants build their own rating systems based on the results of more than 65,000 historical chess games and then test their algorithms by predicting the results on a holdout set of 7,800 games.

Looks like an interesting problem. The only other thing that comes to my mind literature-wise is that Microsoft built and published their TrueSkill(tm) Ranking System for the XBox in order to match players with similar skills in online games. In the original paper at NIPS, the authors had shown that TrueSkill outperformed Elo.

GraphLab & Parallel Machine Learning

Sunday, July 11th, 2010

Interesting article:  GraphLab: A New Framework for Parallel Machine Learning

From the abstract:

Designing and implementing efficient, provably correct parallel machine learning (ML) algorithms is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. We demonstrate the expressiveness of the GraphLab framework by designing and implementing parallel versions of belief propagation, Gibbs sampling, Co-EM, Lasso and Compressed Sensing. We show that using GraphLab we can achieve excellent parallel performance on large scale real-world problems.

Given all the talk about Map-Reduce, Hadoop etc. this seems like a logical next step to make scaling data mining to large data sets a lot easier.

Alternative measures to the AUC for rare-event prognostic models

Tuesday, February 16th, 2010

How can one evaluate the performance of prognostic models in a meaningful way? This is a very basic and yet an interesting problem especially in the context of prediction of very rare events (base-rates <10%). How reliable is the model’s forecast? This is a good question and of particular importance when it matters – think criminal psychology where models forecast the likelihood of recidivism for criminally insane people (Quinsey 1980). There are a variety of ways to evaluate a model’s predictive performance on a hold out sample, and some are more meaningful than others. For example, when using error-rates one should keep in mind that they are only meaningful when you consider the base-rate of your classes and the trivial classifier as well. Often this gets confusing when you are dealing with very imbalanced data sets or rare events. In this blog post, I’ll summarize a few techniques and alternative evaluation methods for predictive models that are particularly useful when dealing with rare events or low base-rates in general.

The Receiver Operator Characteristic is a graphical measure that plots the true versus false positive rates such that the user can decide where to cut for making the final classification decision. In order to summarize the performance of the graph in a single, reportable number, the area under the curve (AUC) is generally used.


Spam Filtering by Learning a Pattern Language

Tuesday, January 26th, 2010

The New Scientist describes a new method for spam detection by learning patterns. This new method exploits the spammers most powerful weapon – the automatic generation of many, similar messages by automated means (i.e., some grammar in a formal language) – and turns it against them. The article reports that a pattern can reliably be learned from about 1000 examples captured from a bot, allowing the method to classify new messages accurately and with zero false positives. This sounds really exciting given my full spam-folder.

However, I’m a bit cautious. The article is a bit sparse on technical details, so I might make some wrong assumptions here. First, zero false positives reported is the discrimination of spam from that particular spam-grammar versus other messages. At least that’s how I understand it. Second, it seems from the article that they only learn from positive examples. Overall the technique sounds to me like they are learning a pattern language. Pattern languages are a class of grammars that overlap with linear and context-sensitive grammars (Chomsky hierarchy). Unfortunately they don’t have a real Wikipedia page so I’ll try to give a bit of background. The closest I can give for an example right now would be regular expressions with back-references. I’m not sure if this is an accurate description for all possible patterns, but it’s close enough for an example.

I don’t know how the specific technique mentioned in the article works in detail, but I’ve learned two things about learning grammars from text: (a) we can’t learn all linear or context-sensitive languages, only all pattern language grammars; (b) learning patterns without negative examples leads to over-generalization really really fast.

While I haven’t worked with learning grammars in a long while, the only algorithm of which I’m aware is the Lange-Wiehagen algorithm (Steffen Lange and Rolf Wiehagen; Polynomial-time inference of arbitrary pattern languages. New Generation Computing, 8(4):361-370, 1991). This algorithm is not a consistent learner, but can learn all pattern languages in polynomial time. There might be better ones available by now, but learning grammars is not that popular in the machine learning community right now. I’m sure there are some other interesting applications besides spam filtering. Maybe it’s time for a revival.

Overall, it sounds like a promising new anti-spam technique, but I’d like to see some more realistic testing done. There are some obvious ways for spammers to make learning these patterns harder, but either way I’m curious – maybe the inventors of this technique discovered a better way to learn patterns? Maybe by using some problem-specific domain knowledge?