The Fundamental Clustering Problem Suite (Ultsch, 2005) contains several synthetic data sets that pose a variety of clustering problems algorithms should be able to solve. The suite ships the data and ground-truth labels as well as labels for k-Means, Single Linkage, Ward’s Linkage and a Self-Organizing Map (SOM). I decided to play around with it a bit: I converted the SOM activation matrix to labels using k-Means and added Spectral Clustering, Self-Tuning Spectral Clustering, and an EM Gaussian Mixture model. I was particularly curious how well Spectral Clustering would do.

Determining the true number of clusters from the underlying data is an entirely different problem, so in all cases the number of clusters was specified unless otherwise noted. The regular Spectral Clustering used the Ng-Jordan-Weiss algorithm with a kernel sigma of 0.04 after linear scaling of the inputs. The Self-Tuning Spectral Clustering used k=15 nearest neighbors. I also used Random Forest “clustering”, an extension of the classification algorithm that generates a distance matrix from the classification tree ensemble; that distance matrix was then fed to single linkage to obtain cluster labels. Random Forest is a special case, as it derives its distance matrix from a classification task using Gini, a criterion built for classification, not clustering.
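To make the difference between the two spectral variants more concrete, here is a minimal sketch of the two affinity matrices they start from. It is not the code I ran: the function names are made up for illustration, “linear scaling” is assumed to mean min-max scaling to [0, 1], and distances are assumed to be Euclidean. The fixed-sigma version uses the Ng-Jordan-Weiss form exp(-d²/(2σ²)); the self-tuning version replaces the global sigma with local scales, exp(-d²/(σᵢσⱼ)), where σᵢ is the distance from point i to its k-th nearest neighbor.

```python
import math

def minmax_scale(points):
    """Rescale every coordinate to [0, 1] (assumed meaning of 'linear scaling')."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    return [[(p[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
             for d in dims] for p in points]

def pairwise_distances(points):
    """Full Euclidean distance matrix."""
    return [[math.dist(p, q) for q in points] for p in points]

def fixed_sigma_affinity(dist, sigma=0.04):
    """Gaussian kernel with one global sigma; the diagonal is set to 0."""
    n = len(dist)
    return [[0.0 if i == j else math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2))
             for j in range(n)] for i in range(n)]

def self_tuning_affinity(dist, k=15):
    """Gaussian kernel with a local scale per point (distance to the k-th neighbor).

    Assumes fewer than k exact duplicates per point, so no local scale is zero.
    """
    n = len(dist)
    # sorted(row)[0] is the point itself (distance 0), so index k is the k-th neighbor
    local = [sorted(row)[min(k, n - 1)] for row in dist]
    return [[0.0 if i == j else math.exp(-dist[i][j] ** 2 / (local[i] * local[j]))
             for j in range(n)] for i in range(n)]

# Toy data: one tight group and one loose group.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [7.0, 5.0], [5.0, 7.0]]
dist = pairwise_distances(minmax_scale(points))
A_fixed = fixed_sigma_affinity(dist, sigma=0.04)
A_local = self_tuning_affinity(dist, k=2)
```

On this toy data the fixed sigma of 0.04 makes the loose group look almost disconnected (`A_fixed[3][4]` is practically zero), while the local scaling keeps it connected. That is the kind of effect I suspect behind the differences between the two methods on some of the data sets below.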

**Results**

The Atom data set contains two clusters, essentially two balls contained within each other. The difficulty is the difference in variance and the fact that the classes are not linearly separable. Not surprisingly, k-Means cannot separate these clusters, as it would somehow have to choose the same mean for both clusters (the center of both balls). Single linkage just works. I’m a bit puzzled why Ward’s Linkage fails. SOM cannot separate the clusters (could this be an artifact of using k-Means to cluster the activation matrix?). Spectral Clustering was constructed specifically to deal with these kinds of cases, so its failure to separate the balls could stem from my using a constant kernel sigma on all data sets; the localized scaling of Self-Tuning Spectral Clustering just works. EM can’t model this well either, because the means (centers) of the two distributions are the same. Random Forest just fails: the second class needed for “classification” is obtained by randomly permuting the data, and on this data set that yields two overlapping classes.
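The argument about k-Means and single linkage can be checked on a toy version of the problem. The sketch below is an illustration only: it uses two concentric 2D rings as a stand-in for the 3D Atom data, and `ring`, `centroid` and `single_linkage` are small helpers written for this example, not functions from a library. Both rings have the same centroid, which is exactly what k-Means cannot represent, while single linkage (implemented here Kruskal-style with union-find) only ever follows nearest-neighbor chains.

```python
import math

def ring(radius, n):
    """n evenly spaced points on a circle around the origin (toy stand-in for Atom)."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def single_linkage(points, k):
    """Merge the two closest clusters until k remain; returns one label per point."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise distances, shortest first.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(len(points)) for j in range(i + 1, len(points)))
    clusters = len(points)
    for _, i, j in edges:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:           # the edge joins two different clusters
            parent[ri] = rj
            clusters -= 1
    return [find(i) for i in range(len(points))]

inner, outer = ring(1.0, 10), ring(5.0, 40)
labels = single_linkage(inner + outer, k=2)

# Both rings share (numerically) the same mean, so two distinct k-Means
# centers cannot describe them.
print(centroid(inner), centroid(outer))
# Single linkage puts each ring into its own cluster.
print(len(set(labels[:10])), len(set(labels[10:])), labels[0] != labels[10])
```

The gap between the rings (4) is much larger than the spacing of neighboring points on either ring, so every within-ring merge happens before any cross-ring merge would.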

The Chainlink data set contains two clusters, essentially two interlocked rings that are not linearly separable. K-Means has no way of separating them, as there is no point where the means could be placed that would result in a correct label assignment. Single Linkage just works and connects all the nearest neighbors into clusters. I’m not sure why Ward’s Linkage fails. SOM cannot separate the clusters (could this be an artifact of using k-Means to cluster the activation matrix? In the original paper the algorithm solved this. Maybe I did something wrong). Spectral Clustering was made for these kinds of manifold cases. I’m a bit surprised how well EM works in this case. I would have expected a result similar to k-Means.

The Engytime data set contains two Gaussian mixtures that are fairly close to each other. This data is primed for EM-style algorithms, and indeed EM performs best on this data set. K-Means performs reasonably well and picks up on the different variance. Single Linkage simply connects all the points into one big cluster and leaves two outliers to form the second cluster. I’m a bit surprised by the SOM and have no explanation why the clusters were separated the way they were. I would have thought that Gaussian Mixtures would have performed better on this one. Given that the data was generated using two Gaussians, this should have been a home run for the algorithm.

Golfball contains no clusters and is one giant blob. I chose k=6 in order to see what the algorithms would do. Single linkage is the only algorithm that produces a somewhat sensible result: it connects all points into one cluster and then leaves 5 points to form the remaining clusters. It would be evident from a dendrogram that there are no clusters in this data set. All other algorithms assign labels one way or the other; most produce evenly sized areas on the ball.

Hepta contains clearly defined clusters with different variances. The clusters are clearly separated, and hence none of the algorithms has a problem separating them. SOM mislabels a few cases and I’m not sure why this is.

Lsun contains clusters with different variances and inter-cluster distances. The hard part is to separate out the cluster on the bottom.

Target contains two clusters in the middle and a few outliers. The data is not linearly separable, but it’s also interesting to see how the algorithms deal with the outliers.

Tetra contains four dense, almost touching clusters.

The Two Diamonds data set contains two touching clusters. The cluster borders are defined by density. As one would expect, single linkage fails on this data set and simply lumps everything together. Ward’s Linkage was made to prevent exactly this kind of problem and, not surprisingly, performs better. The other algorithms have no problems picking up on the two dense blobs and separate them perfectly or close to perfectly.

The Wingnut data set contains two blocks and examines the density vs. distance trade-off of the algorithms. Every method that uses distance (or something that could be interpreted as such) will have the clusters “bleed” into each other.

**Note** that my results for SOM differ a bit from Ultsch’s original paper; quite possibly I did something wrong. I haven’t figured out yet what went wrong; still… fun weekend

*References*

**Ultsch, A.**: Clustering with SOM: U\*C. In *Proc. Workshop on Self-Organizing Maps, Paris, France* (2005), pp. 75-82


Basically, he built a classifier that predicts, from some innocuous (but possibly correlated) variables, the likelihood of somebody having a felony offense. The classifier isn’t meant to be used in practice (from eyeballing the Precision/Recall curve in the talk slides, I estimate an AUC of about 0.6-ish; not too great), but it was built to start a discussion. It turns out that courts have upheld the use of profiling in some cases as “reasonable suspicion,” a legal standard for the police to stop somebody and investigate. This could lead to “predictive policing” being taken even further in the future. Because the model outputs a score, Jim also discusses the trade-off of where the prediction of such a model may be actionable; he calls it the Tyranny/Anarchy Trade-Off (a catchy name

Having done statistical work in criminal justice before, I think predictive analysis can be helpful in many areas of policing and criminal justice in general (e.g., parole supervision). On the other hand, I find profiling and supporting a “reasonable suspicion” with statistical models unconvincing. I think the courts will have to figure out a minimum reliability standard for such predictors, and hopefully they’ll set the threshold far higher than what the ‘felony classifier’ is producing. There are just too many ways for using a statistical model for “reasonable suspicion” to go wrong. Even if variables of protected classes (gender, ethnicity, etc.) are not used directly, there may be correlated variables (hair color, income, geographic area), as discussed in the talk Jim gave. Even more problematic in my mind would be variables that never or hardly ever change, as they would lead to the same people being hassled over and over again. Also, the training data from which these models are built is biased, since everybody in it has by definition been arrested before. It’s beyond me how one can correct for this sample bias in a reliable way. Frankly, I don’t think policing by profiling (statistical or otherwise) can be done well, and hopefully courts will recognize that eventually.


Uninstalling the AVG toolbar component finally solved the Chrome start-page problem for me.

Man, what a piece of work. I’m not alone in thinking that AVG did something bad for the user here.

The LendingClub website, a peer-to-peer lending service, offers an interesting data set: historical data on loan performance as well as data for new loans. I’ve been playing around a bit with the data and built a model to predict whether a loan is a good investment. The LendingClub data is available for download, and a data dictionary can be found on the website as well.

First we need to define the outcome we want to predict. A loan can be in several states, some being “current”, others being “defaulted”, “late” or even on a “performing payment plan”. Conservatively, I defined all loans that were not “paid off” as bad. Loans that are “current” were excluded, as they can still default in the future. Loans that are “late” are considered bad, because the borrower ran into problems. The model I’m trying to build is basically for a conservative investor looking for loans that will simply be paid back without a hitch.

With the usual statistical techniques a model can be built, and its performance can be measured by 10-fold cross-validation or by evaluating the model on a hold-out set. The real result of a prediction will of course only be available after about 3 years, when a loan is fully paid off. As the measure to optimize I chose the AUC metric. A 10-fold cross-validation estimates the performance of my model at 0.698, which is not too bad. The predictions implicitly make a few assumptions, the first one being that future performance of loans will be similar to the historical performance of similar loans. I’m assuming a stationary distribution and IID samples, which is not completely true in reality, but hopefully close enough. Also, inflation expectations were not taken into account, but I’m limiting my model to 36-month loans to make that more manageable.
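To make the outcome definition and the metric a bit more tangible, here is a small sketch. It is not my actual pipeline: the status strings are hypothetical placeholders (the real values in the LendingClub export may be spelled differently), and `label_loan` and `auc` are illustrative helpers. The AUC is computed in its rank-based form, i.e. the probability that a randomly chosen good loan gets a higher score than a randomly chosen bad one.

```python
def label_loan(status):
    """Return 1 (good), 0 (bad) or None (excluded). Status strings are hypothetical."""
    status = status.strip().lower()
    if status == "current":
        return None  # outcome still unknown, loan may default later
    return 1 if status == "paid off" else 0  # everything else counts as bad

def auc(labels, scores):
    """Rank-based AUC; ties between a good and a bad loan count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one good and one bad loan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with made-up statuses and model scores.
statuses = ["Paid Off", "Late", "Current", "Defaulted", "Paid Off"]
model_scores = [0.9, 0.4, 0.7, 0.6, 0.5]

labeled = [(label_loan(st), sc) for st, sc in zip(statuses, model_scores)]
labeled = [(y, sc) for y, sc in labeled if y is not None]  # drop "current" loans
y_true, y_score = zip(*labeled)
print(auc(y_true, y_score))  # 0.75
```

In a 10-fold cross-validation the same `auc` function would be applied to the held-out tenth of the labeled loans in each round, with the ten values averaged at the end.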

I won’t go into the details of how I encoded the variables and which variables I’m using. I discovered that I can extract information from the textual variables in the loans. The “Loan Description”, a free-text field where potential borrowers can leave comments or answer questions, is quite predictive. The difficult part is using that information in practice. A loan is in “funding state” for two weeks, during which investors can ask questions and invest in the loan. Many loans get fully funded before the two-week period is over, some without any question or comment on the loan. New information may become available in the Loan Description field that may change the classification. That means, however, that the prediction may change over time, positively or negatively, after an investment decision has been made. Not ideal, but the variables are quite powerful, so I’m still looking for a good solution.

I made the ratings for the LendingClub loans my program produces public. I will update them occasionally (i.e., whenever I feel like it). If you have some suggestions on how to use the textual variables, leave a comment.

puzzlingOutcomesInControlledExperiments.pdf

Summary:

- Don’t make changes to your application if your average customer’s lifetime value will decline. Understand the change, consider alternative hypotheses, and watch several metrics. Ensure that your findings align with the long-term strategy, so that long-term growth is not sacrificed for short-term financial gain. Example: Bing once had a bug that served poor search results, so distinct queries went up 10% and CTR on advertisements went up 30%.
- Ensure that your statistical results are trustworthy. Incorrect results may cause bad ideas to be deployed; good ideas may be ruled out by mistake.
- An upward trend in a newly launched feature does not imply that users like the feature more (delayed effect and primacy effect).
- Often running an experiment longer does not provide extra statistical power. Pick a duration and stick to it. Do not stop tests early (unless you use algorithms that tell you when you have enough statistical confidence to stop your test).
- Re-run your experiment if you get surprising results. Investigating the underlying reasons is often worth it.
- Watch for carryover effects and run A/A experiments. If you use bucketing techniques to assign participants to experiments, re-run the experiment with a larger test group and with local randomization.
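As an illustration of the A/A advice above, here is a minimal sketch of what such a check could look like. It is not taken from the paper: the pooled two-proportion z-test is just one common choice of significance test, and the function names, sample sizes and conversion rate are made up. Both arms draw from the same conversion rate, so every “significant” result is a false positive, and their share should stay close to the chosen alpha.

```python
import math
import random

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, nothing to detect
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

def aa_false_positive_rate(runs=200, users=1000, rate=0.1, alpha=0.05, seed=0):
    """Share of simulated A/A experiments (identical arms) that look 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        a = sum(rng.random() < rate for _ in range(users))  # conversions in arm A
        b = sum(rng.random() < rate for _ in range(users))  # conversions in arm B
        hits += two_proportion_p_value(a, users, b, users) < alpha
    return hits / runs

print(two_proportion_p_value(100, 1000, 150, 1000))  # a real difference: small p-value
print(aa_false_positive_rate())                      # should be in the vicinity of alpha
```

If the share of significant A/A results on a real system is clearly above alpha, something in the assignment or the analysis is off (carryover effects are one candidate), and the results of the actual experiments should not be trusted yet.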