Chromebook Reboot Loop

December 2nd, 2014

My Chromebook (a Samsung XE500C21) ended up stuck in a reboot loop. Various keyboard shortcuts I found online didn’t work and I ended up having to wipe the disk and restore from a recovery image. Fortunately all data was in Google Drive anyways. I had to perform a Chromebook hard reset to get to the recovery screen. I wrote down the model number and then followed the recovery instructions for using a USB stick: Chromebook Recovery Image . I used the Linux based method. The tool found a suitable boot image, wrote it onto the designated USB stick. I did another hard reset and then inserted the USB stick with the image. The image installed itself without further intervention and after a reboot everything was fixed. All my settings, Chrome extensions etc. were synced to the machine after the first login. The only thing not synced was the wallpaper. Oh well…



Random Number Generation on low entropy computer systems

October 13th, 2014

Lately, I got interested in random number generation. Generating random numbers securely so they are suitable for cryptographic use is hard.

Linux introduced /dev/random, a block device to a random number generator in the kernel. The generator keeps an estimate of the number of bits of noise in the entropy pool. From this entropy pool, random numbers are created by using a secure hash function on the data in the pool . When read, the /dev/random device will only return random bytes within the estimated number of bits of noise in the entropy pool. The Operating System adds entropy from all sorts of sources, such as disk latency. However, if you are running Linux on an embedded device (or “internet of things”) or in a cloud instance, there may not be a “real” disk drive, and whatever latency is measured may not contain good enough entropy. Another important source of entropy is the clock-cycle timer (rdtsc) that is called at various instances to measure the time between interrupts and other frequently and somewhat randomly occurring events. The rdtsc instruction is often virtualized in cloud-machines which could make it harder to get good entropy.  Anyways… the problem of getting random numbers in low-entropy situations is important enough that Intel added an on-chip random number generator to their new processors (rdrand), but many cloud instances emulate older CPUs and embedded system don’t necessarily run with x86 chips that have rdrand.

So far, so good. For now, I don’t have an embedded system to experiment with, but a similar problem is with servers in the cloud. Can the rdtsc instruction still be used to get entropy in a virtual machine? I found this, played around a bit with the program listed and also learned about the existence of haveged (I’ll get back to that). I started a micro-instance (1 CPU, 512Mb RAM) in a popular cloud service and found that, yes indeed, the rdtsc instruction seems to be virtualized. The average tick count between millions of two consecutive rdtsc instructions is far too small to account for whatever else is going on on the machine (unless, of course, I somehow got a 512MB machine all by myself – which is unlikely). On the other hand, the output of the least significant bit of the difference of two consecutive rdtsc calls looked pretty random, but didn’t pass randomness tests. Adding von-Neumann whitening on the LSB obtained from two rdtsc calls is actually sufficiently random to pass FIPS 140-2 randomness tests (I used the implementation in rngtest) and only fails about 1 in 1000 (comparison: a test-run with entropy from /dev/random failed 1:10000). So in theory, this should be okay to use as an entropy source, but maybe it should still be combined with other sources of entropy.

Coming back to other already existing solutions. It turns out there’s an entropy gathering demon called haveged that can add entropy to the pool to remedy low-entropy conditions that can occur in servers, but possibly also on embedded devices and in cloud instance. The method exploits the modifications of the internal volatile hardware states as a source of uncertainty. Modern superscalar processors feature various hardware mechanisms which aim to improve performance: caches, branch predictors, TLBs and much more. The state of these components is also volatile and cannot be directly monitored in software. On the other hand, every invocation of the operating system modifies thousands of these binary volatile states. The general idea of extracting entropy from rdtsc still works, however the only thing that still makes me wonder a bit is that the full timer (not just the LSB) is used and that the whitening method is rather complex.

I learned a lot, but there’s still more to explore. Cloud machines probably have more going on than embedded systems, so I’m still not convinced that this will work on embedded devices. I’ll try to wrap my head around the whitening function in haveged. It’s still being developed and maybe ARM support will be added some day. Then the demon could be used on smart phones to improve the security of the random number generation.



Strengths and Weaknesses of various Clustering Algorithms

April 27th, 2014

Clustering is one of the most fascinating areas in the data mining field for me. For example, it can be shown that under some simple and fairly straightforward it is impossible to find a clustering function (Kleinberg’s impossibility theorem for clustering).That means that some trade-offs and relaxations have to be made in all the existing clustering algorithms. Every clustering algorithm will have a weakness and fail to cluster some data sets in a sensible manner.

The Fundamental Clustering Problem Suite (Ultsch, 2005) contains several synthetic data sets and a variety of clustering problems that algorithms should be able to solve. The data set contains the data and ground-truth labels as well as labels for k-Means, Single-Linkage, Wards-Linkage and a Self-Organizing map. I decided to play around with it a bit, converted the SOM activation matrix to labels using k-Means, added Spectral Clustering and Self-Tuning Spectral Clustering, and an EM Gaussian Mixture model. I was particularly curious how well Spectral Clustering would do. Determining the true number of clusters from the underlying data is an entirely different problem. Hence in all cases the number of clusters was specified unless otherwise noted. The regular Spectral Clustering used the Ng-Jordan-Weiss algorithm with a kernel sigma of 0.04 after linear scaling of the inputs. The Self-Tuning Spectral Clustering used k=15 nearest neighbors. I also used Random Forest’s “clustering”, an extension of the classification algorithm that will generate a distance matrix from the classification tree ensemble. The distance matrix was then used in single linkage to obtain cluster labels. Random Forest is a special case as it derives a distance matrix out of classification using gini – a metric built for classification, not clustering.


Read the rest of this entry »

Detecting Click Fraud in Online Advertising

March 2nd, 2014

JMLR has an interesting paper summarizing the results from a contest to build the best model for ClickFraud detection. The second place entry described some nice feature engineering that I found interesting. The first place did feature selection and then used gbm, a really good ensemble algorithm.


Big Data, Predictive Policing and the Tyranny/Anarchy Trade-Off

August 17th, 2013

Bloomberg had an interesting article today about a talk on the implications and privacy trade-offs of predictive policing and profiling. Jim Adler’s “felon classifier” is also described in his blog.

Basically, he built a classifier predicting from some innocuous (but possibly correlated variables) the likelihood of somebody having a felony offense. The classifier isn’t meant to be used in practice (from eye-balling the Precision/Recall curve in the talk slides, I estimate an AUC of about 0.6-ish; not too great), but it was built to start a discussion. It turns out that courts have upheld the use of profiling in some cases as “reasonable suspicion,” a legal standard for the police to stop somebody and investigate. This could lead to “predictive policing” being taken even further in the future. Due to the model outputting a score Jim also discusses the trade-off of where the prediction of such a model may be actionable – he calls it the Tyranny/Anarchy Trade-Off (a catchy name :)

Having done statistical work in criminal justice before, I think predictive analysis can be helpful in many areas of policing and criminal justice in general (e.g., parole supervision). On the other hand, I find profiling and supporting a “reasonable suspicion” from statistical models unconvincing. I think the courts will have to figure out a minimum reliability standard for such predictors, and hopefully they’ll set the threshold far higher than what the ‘felony classifier’ is producing. There’s just too many ways using a statistical model for “reasonable suspicion” to go wrong. Even if variables of protected classes (gender, ethnicity, etc.) are not used directly, there may be correlated variables (hair-color, income, geographic area) as discussed in the talk Jim gave. Even more problematic in my mind would be variables that do not or hardly ever change, as they would lead to the same people being hassled over and over again. Also the training data from which these models are built is biased since everybody in it by definition has been arrested before. It’s beyond me how one can correct for this sample bias in a reliable way. Frankly, I don’t think policing by profiling (statistical or otherwise) can be done well, and hopefully courts will recognize that eventually.


Removing AVG Secure Search from Chrome

May 11th, 2013

I use AVG as my anti-virus and today it installed the whole safe-search toolbar thing in the background while I was playing a video game. I’m pretty sure I didn’t consent to that, but whatever… Undoing the damage to my settings took quite a bit of work and I had trouble removing it from Chrome. First I followed all the steps outlined here, but pressing the home button in Chrome would still bring up the  “” website no matter what settings I changed.

Uninstalling the AVG toolbar component finally solved the Chrome start-page problem for me.

Man, what I piece of work. I’m not alone thinking that AVG did something bad for the user here.

On Statistical Significance Testing

March 26th, 2013

I found a good read about the fallacies of statistical significance testing. See also here.

Frequentist vs. Bayesian Inference

November 19th, 2012

I found a good article discussing the difference between the Frequentist and Bayesian approach to Inference.

Cross-VM Side Channels and Their Use to Extract Private Keys

October 28th, 2012

Cool application of machine learning in the security field: extracting private keys from virtual machines running on shared hardware by training a Support-Vector-Machine model to classify data bits collected.

Classification with inputs that change over time – P2P Loan Data

October 6th, 2012

Predicting whether a loan will default or not is a tricky task. It may involve many variables, incomplete information and is a task that involves time as a component. Loans may also perform for a while before they default. Some loans may even be late, but recover back to the regular payment schedule. It’s an interesting application for statistics.

The LendingClub website, a service offering peer-to-peer lending, offers an interesting data set: historical data of loan performance as well as data for new loans. I’ve been playing around a bit with the data and built a model to predict whether a loan is a good investment. The LendingClub data is available for download. A data dictionary can be found on the website also.

First we need to define the outcome we want to predict. A loan can be in several states, some being “current”, others being “defaulted”, “late” or even on a “performing payment plan”. Conservatively, I defined all loans that were not “paid off” as bad. Loans that are “current” were excluded as they still can default in the future. Loans that are “late” are considered bad, because the borrower run into problems. The model I’m trying to built is basically for a conservative investor looking for loans that will simply be paid back without a hitch. With the usual statistical techniques a model can be built and the performance can be measured by 10-fold cross-validation or evaluating the model on a hold-out set. The real result of a prediction will of course only be available after about 3 years when a loan is fully paid off. As measure to optimize I chose the AUC metric. A 10-fold cross-validation estimates the performance of my model at 0.698 which is not too bad. The predictions implicitly make a few assumptions. The first one being that future performance of loans will be similar to historical performance of similar loans. I’m assuming a stationary distribution and the IID assumption – which is not completely true in reality, but hopefully close enough :-) Also, inflation expectations were not taken into account, but I’m limiting my model to 36 month loans to make that more manageable.

I won’t go into the details of how I encoded the variables and what variables I’m using. I discovered that I can extract information out of the textual variables in the loans. The “Loan Description”, a free text field where potential borrowers can leave comments or answer questions, is quite predictive. The difficult part is using that information in practice. A loan is in “funding state” for two weeks were investors can ask questions and invest in the loan. Many loans get fully funded before the two week period is over, some without any question or comment on the loan. New information may become available in the Loan Description field that may change the classification. That means, however, that the prediction may change over time – positively or negatively – after an investment decision has been. Not ideal, but the variables are quite powerful so I’m still looking for a good solution.

I made the ratings for the LendingClub loans my program produces public. I will update them occasionally (i.e., whenever I feel like it). If you have some suggestions on how to use the textual variables, leave a comment.