Markus Breitenbach

The 80% Rule for ML Results

Markus — Sun, 26 May 2019 19:41:30 +0000

One remarkable thing that easily deceives us with machine learning is that how quickly one can see good, encouraging results.

When developing a new algorithm during my time in grad school I often had some bug in the implementation. The results despite the bug were still good, but not as good as the final, bug free version (as far as we know). Despite bugs, the system was learning and doing something.

When training a new model using data from a system that has a bug we still get a working model. Removing the bugs will often result in sizable performance improvements of the model, more than adding a new feature would have.

When starting new projects in my experience we quickly see great results. Often the simplest thing gets us about 80% to where we would have a new product or a launch-able new model. Demos often look interesting, the system doing something right and a few funny cases wrong. Often people become excited and expectations rise for when this new thing may ship.

I feel it’s important to test carefully, have a metric and specific criteria for launch that measure performance against a ground truth. This may sound obvious, but many new techniques (GANs etc.) there’s often no ground truth and performance is eyeballed. Again, the first 80% are easy. The remaining percentage points will take a while…

Deferring Decisions to AI

Markus — Wed, 14 Feb 2018 05:02:01 +0000

Would you let an AI make decisions for you? With all the buzz about AI and calls for its regulation (apparently AI will kill us all some day), I’ve been thinking about under what circumstances we might delegate decision making in certain areas to machines (call it AI). That’s not as crazy as it sounds, because mankind has been doing that for years already.

We’re deferring decisions on who gets to drive next at intersections to traffic lights. You may argue that that’s not AI, but there are several newer models that control the flow of traffic in clever ways, use various methods of detecting vehicles waiting and more. Not only are we deferring the decision there, we even automatically punish people that run red lights (photo red light cameras) and more so if harm to others is caused.

Credit scoring is, in general, a risk estimate for non-payment of the loans. In the past, credit reporting has been a deeply personal practice. Companies secured loans by asking well-regarded neighbors to vouch for the character of the borrower. These reports were obviously incredibly subjective as outlined here.

Today’s aircraft pilots are in charge of take-off and landing, and spend the rest of their time supervising the flight while the computer is in control of the trajectory of an aircraft. In general, autopilot is like an additional crew member helping out, allowing the pilot to keep an eye on the higher level things (oil pressure, engine, or check for better winds or a smoother altitude). Before autopilot (see here), the job of a pilot was way more exhausting, probably increasing the risk of accidents. Autopilot dates back to the early 1920s, and nowadays we probably can’t imagine flying without it.

These are some good examples of when decisions have been delegated to machines successfully. If you read the news you probably come across articles discussing self driving cars, discussing a variety of hypothetical scenarios of whether the cars should be able to solve moral dilemmas deciding who to run over if the brakes fail. Frankly I think these scenarios are too far fetched, and I doubt most humans are able to make good split second decisions in these scenarios either.

An interesting question now is who to believe in critical situations — do you listen to humans, or do you listen to the machine? That’s not easy to answer. For example, the Überlingen mid-air collision could have likely been avoided had the pilots followed the instructions of the automatic collision avoidance system instead of ground control.

Another domain to consider is psychological testing for mental health problems. Mental health problems are harder to diagnose than people give professionals credit for. A guy named Paul Meehl has done a lot of work in this area and, as early as 1954, postulated that mechanical decision making results in better and more reliable diagnosis compared to clinical diagnosis. Meta-analyses comparing clinical and mechanical prediction efficiency have added support to the hypothesis that algorithms outperform clinical predictions (1, 2).

Can we leave supervision of patients to machines? It seems humans aren’t doing a good enough job, and the FDA just approved a system that allows for constant patient monitoring, analyzed and delivered to hospital staff in real time, to help prevent unexpected deaths in the hospital. The clinical trials look very promising.

So with all these examples where delegating decisions to machines has already been working out well, at what point should we delegate? What if there are simple systems that match or outperform human decision making in every single instance? What if only most of the time? Should we replace humans whenever machines can do the job and outperform humans? Should humans just supervise and intervene? Self-driving cars anyone? For simplicity let’s consider a simple classification problem – say, sick vs healthy. Even if the machine isn’t 100% correct or it is hard to understand why a particular prediction was made, what if human supervision (or meddling with) makes for worse outcomes? At what point would it make sense to have the machine do it all? I don’t have a good, final answer to this question, but I think a lot of will depend on human perception of the machines performance.

I find this very similar to a business trying to hire competent people. Frequently we find ourselves in the situation of having to hire lawyers, accountants and various professionals for tasks we have very limited understanding of. Sure, you can hire a family lawyer to help prepare a prenuptial agreement, for example. If you hire a poor lawyer, you won’t know until later that the agreement is deemed unenforceable in court. Bad lawyers may summarize or explain the law to you in ways that are just slightly more efficient than reading it yourself (but at least they cannot be sued themselves for giving the wrong advice), give impractical advice and so on. Analogous to any other kind of expert you may hire for an area that you have no expertise in. I’d say in this day and age we use reviews on the internet, word of mouth, letters of recommendation and reputation of the college (in hiring situations) and other heuristics to identify the competent experts to entrust our problems to. A lot of hiring in real life is more about track record and personality fit than pure ability. I can see user experience being a factor in picking and trusting an AI as well.

Some people we hire have a fiduciary duty and can be held accountable, others don’t have to follow such a standard (licensed investment advisors vs. brokers, etc.). If you have your robot assistants gadget order new soap (without specifying which), then is it going to order one that is best for you, or one with the largest profit margin for the company subsidizing the software and cloud resources? These will be questions that will require more discussion in the near future.

I haven’t thought this through 100%, but I think it would be useful to refer to past performance of AI or algorithms. We can measure the performance of AI/prediction systems, compare the AIs performance to the performance of humans, and then decide. Unlike human decision making, AIs can be interrogated, tested, audited and refined on millions of examples if needed. Statistics gives us tools to test these systems properly, and if found to be better by some margin (how much?), I think they should be used in place of humans. While many people worry about the dangers of leaving decisions to algorithms and AI, it’s important to keep in mind the instances where AI and automation is already being successfully used in our lives.

Some Thoughts On Secure Random Numbers

Markus — Sun, 30 Oct 2016 21:53:33 +0000

Many protocols and cryptographic primitives such as DSA rely on secure random numbers. For example, the Signal messenger and it’s quite intricate protocol was recently analyzed and a paper about the analyses published[1]. The authors wrote that they “do not consider faulty/malicious random number generators” in their threat model, but also state that if “[…] generated randomness is compromised or bad, Signal offers no protection”. Secure random numbers are the most common weakness. A compromised random number generator used widely even allows for bulk data collection and decryption[2].

In my mind this problem is relatively easy to avoid. For one, protocols can be modified to not to leak raw output of the random number generator, which makes practical attacks on the rng much harder. For example, instead of sending a nonce consisting of raw rng output, which would give an eavesdropper information to determine the state of the rng, one could use the sha512 of the rng output. To ensure that this is correctly implemented the protocol participants can exchange the raw rng output later once secure communication has been established and ensure that indeed the nonce that was used was sha512(rng) and not raw rng output itself. For primitives that require random numbers for operation (e.g., signing) one can use the secure hash of the message and private-key for a unique, deterministic and secure random number (this is used by ed25519, IIRC). My point is that most of these problems seem avoidable and systems can be hardened to not completely fail even if the random number generator has been compromised.

The other point is that random numbers should be generated by the OS and incorporate as many different sources as possible. Linux, OpenBSD and other operating systems offer APIs that return secure random numbers collected from various sources of entropy in the system. The latest generation of x86 chips even has RNG instructions that generate secure random numbers (caveats: [3,4]). I think the more entropy from difference sources the better (caveat: [5]), and personally I think systems like HAVEGED can be helpful as an additional source of entropy. One counter argument that I heard was that nobody knows how good this cpu entropy really is, and that entropy extracted from hard-disk latency (and similar). However, for other sources of entropy we don’t know that either. The most sensible paper showing entropy extraction from disk drive latency[6] shows that technically only for certain drives, and I’m willing to wager that all of these results will look quite differently for modern SSD drives. As far as I know quantifying the goodness of entropy sources on computers is an open research problem. I think adding entropy from cpu sources like HAVEGED does is helpful, especially in situations like device startup and other situations where there’s not a lot of entropy available from other sources.

References

[1] Katriel Cohn-Gordon, Cas Cremers, Benjamin Dowling, Luke Garratt, and Douglas Stebila, “A Formal Security Analysis of the Signal Messaging Protocol”, https://eprint.iacr.org/2016/1013.pdf

[2] Daniel J. Bernstein, Tanja Lange, Ruben Niederhagen. “Dual EC: a standardized back door.” Pages 256–281 in The new codebreakers: essays dedicated to David Kahn on the occasion of his 85th birthday, edited by Peter Y. A. Ryan, David Naccache, Jean-Jacques Quisquater. Lecture Notes in Computer Science 9100, Springer, 2015. ISBN 978-3-662-49300-7.

[3] https://plus.google.com/+TheodoreTso/posts/SDcoemc9V3J

[4] Becker et.al., CHES’13 Proceedings of the 15th international conference on Cryptographic Hardware and Embedded Systems,Pages 197-214, DOI 10.1007/978-3-642-40349-1_12, http://sharps.org/wp-content/uploads/BECKER-CHES.pdf

[5] http://blog.cr.yp.to/20140205-entropy.html

[6] D. Davis, R. Ihaka, and P. Fenstermacher, “Cryptographic Randomness from Air Turbulience in Disk Drives,” Advances in Cryptology — CRYPTO ’94 Proceedings, Springer-Verlag, 1994, pp. 114–120.

Machine Learning Newsticker

Markus — Sat, 02 Apr 2016 23:26:16 +0000

Shameless plug: I’ve fixed my scripts and my Machine Learning Newsticker (that reaches back to February 2005) is back in business. RSS Link Expect periodic updates on ML and AI related news items.

Chromebook Reboot Loop

Markus — Tue, 02 Dec 2014 05:28:51 +0000

My Chromebook (a Samsung XE500C21) ended up stuck in a reboot loop. Various keyboard shortcuts I found online didn’t work and I ended up having to wipe the disk and restore from a recovery image. Fortunately all data was in Google Drive anyways. I had to perform a Chromebook hard reset to get to the recovery screen. I wrote down the model number and then followed the recovery instructions for using a USB stick: Chromebook Recovery Image . I used the Linux based method. The tool found a suitable boot image, wrote it onto the designated USB stick. I did another hard reset and then inserted the USB stick with the image. The image installed itself without further intervention and after a reboot everything was fixed. All my settings, Chrome extensions etc. were synced to the machine after the first login. The only thing not synced was the wallpaper. Oh well…

Random Number Generation on low entropy computer systems

Markus — Mon, 13 Oct 2014 06:42:52 +0000

Lately, I got interested in random number generation. Generating random numbers securely so they are suitable for cryptographic use is hard.

Linux introduced /dev/random, a block device to a random number generator in the kernel. The generator keeps an estimate of the number of bits of noise in the entropy pool. From this entropy pool, random numbers are created by using a secure hash function on the data in the pool . When read, the /dev/random device will only return random bytes within the estimated number of bits of noise in the entropy pool. The Operating System adds entropy from all sorts of sources, such as disk latency. However, if you are running Linux on an embedded device (or “internet of things”) or in a cloud instance, there may not be a “real” disk drive, and whatever latency is measured may not contain good enough entropy. Another important source of entropy is the clock-cycle timer (rdtsc) that is called at various instances to measure the time between interrupts and other frequently and somewhat randomly occurring events. The rdtsc instruction is often virtualized in cloud-machines which could make it harder to get good entropy. Anyways… the problem of getting random numbers in low-entropy situations is important enough that Intel added an on-chip random number generator to their new processors (rdrand), but many cloud instances emulate older CPUs and embedded system don’t necessarily run with x86 chips that have rdrand.

So far, so good. For now, I don’t have an embedded system to experiment with, but a similar problem is with servers in the cloud. Can the rdtsc instruction still be used to get entropy in a virtual machine? I found this, played around a bit with the program listed and also learned about the existence of haveged (I’ll get back to that). I started a micro-instance (1 CPU, 512Mb RAM) in a popular cloud service and found that, yes indeed, the rdtsc instruction seems to be virtualized. The average tick count between millions of two consecutive rdtsc instructions is far too small to account for whatever else is going on on the machine (unless, of course, I somehow got a 512MB machine all by myself – which is unlikely). On the other hand, the output of the least significant bit of the difference of two consecutive rdtsc calls looked pretty random, but didn’t pass randomness tests. Adding von-Neumann whitening on the LSB obtained from two rdtsc calls is actually sufficiently random to pass FIPS 140-2 randomness tests (I used the implementation in rngtest) and only fails about 1 in 1000 (comparison: a test-run with entropy from /dev/random failed 1:10000). So in theory, this should be okay to use as an entropy source, but maybe it should still be combined with other sources of entropy.

Coming back to other already existing solutions. It turns out there’s an entropy gathering demon called haveged that can add entropy to the pool to remedy low-entropy conditions that can occur in servers, but possibly also on embedded devices and in cloud instance. The method exploits the modifications of the internal volatile hardware states as a source of uncertainty. Modern superscalar processors feature various hardware mechanisms which aim to improve performance: caches, branch predictors, TLBs and much more. The state of these components is also volatile and cannot be directly monitored in software. On the other hand, every invocation of the operating system modifies thousands of these binary volatile states. The general idea of extracting entropy from rdtsc still works, however the only thing that still makes me wonder a bit is that the full timer (not just the LSB) is used and that the whitening method is rather complex.

I learned a lot, but there’s still more to explore. Cloud machines probably have more going on than embedded systems, so I’m still not convinced that this will work on embedded devices. I’ll try to wrap my head around the whitening function in haveged. It’s still being developed and maybe ARM support will be added some day. Then the demon could be used on smart phones to improve the security of the random number generation.

Strengths and Weaknesses of various Clustering Algorithms

Markus — Sun, 27 Apr 2014 23:51:26 +0000

Clustering is one of the most fascinating areas in the data mining field for me. For example, it can be shown that under some simple and fairly straightforward it is impossible to find a clustering function (Kleinberg’s impossibility theorem for clustering).That means that some trade-offs and relaxations have to be made in all the existing clustering algorithms. Every clustering algorithm will have a weakness and fail to cluster some data sets in a sensible manner.

The Fundamental Clustering Problem Suite (Ultsch, 2005) contains several synthetic data sets and a variety of clustering problems that algorithms should be able to solve. The data set contains the data and ground-truth labels as well as labels for k-Means, Single-Linkage, Wards-Linkage and a Self-Organizing map. I decided to play around with it a bit, converted the SOM activation matrix to labels using k-Means, added Spectral Clustering and Self-Tuning Spectral Clustering, and an EM Gaussian Mixture model. I was particularly curious how well Spectral Clustering would do. Determining the true number of clusters from the underlying data is an entirely different problem. Hence in all cases the number of clusters was specified unless otherwise noted. The regular Spectral Clustering used the Ng-Jordan-Weiss algorithm with a kernel sigma of 0.04 after linear scaling of the inputs. The Self-Tuning Spectral Clustering used k=15 nearest neighbors. I also used Random Forest’s “clustering”, an extension of the classification algorithm that will generate a distance matrix from the classification tree ensemble. The distance matrix was then used in single linkage to obtain cluster labels. Random Forest is a special case as it derives a distance matrix out of classification using gini – a metric built for classification, not clustering.

Results

The Atom Data set contains two clusters, essentially two balls contained within each other. The difficulty is the difference in variance and that the classes are not linearly separable. Not surprisingly, k-Means can not separate these clusters out as it would somehow have to choose the same mean for both clusters (the center of both balls). Single linkage just works. I’m a bit puzzled why Wards Linkage fails. SOM can not separate the clusters (could this be an artifact of using k-Means to cluster the activation matrix?). Spectral Clustering was constructed specifically to deal with these kind of cases. The failure to separate the balls could stem from me using a constant kernel sigma on all data sets. Using the localized scaling of self-tuning spectral clustering just works. EM can’t model this as well due to the mean (center) of the distributions being the same. Random Forests just fail, because the distance matrix that is generated from the random permutation of the data to obtain a second class for “classification” and distance matrix generation will generate two overlapping classes.

The Chainlink Data set contains two clusters, essentially two interlocked rings that are not linearly separable. K-Means has no way of separating this as there is no point the mean could be placed resulting into a correct label assignment. Single Linkage just works and connects all the nearest neighbors to clusters. I’m not sure why Ward’s Linkage fails. SOM can not separate the clusters (could this be an artifact of using k-Means to cluster the activation matrix? In the original paper the algorithm solved this. Maybe I did something wrong). Spectral Clustering was made for these kind of manifold cases. I’m a bit surprised how well EM works in this case. I would have expected a similar result to k-Means.

The Engytime data set contains two Gaussian mixtures that are fairly close to each other. This data is primed for EM style algorithms and indeed EM performs best on this data set. K-Means performs reasonably well and picks up on the different variance. Single Linkage simply connects all the points to one big cluster and leaves two outliers to form the second clusters. I’m a bit surprised by the SOM and have no explanation why the clusters we separated the way they were. I would have thought that Gaussian Mixtures would have performed better on this one. Given that the data was generated using two Gaussians this should have been a home run for the algorithm.

Golfball contains no clusters and is one giant blob. We choose k=6 in order to see what the algorithms would do. Single linkage is the only algorithm that produces a somewhat sensible result. It connects all points to one cluster and then leaves 5 points to form the remaining clusters. It would be evident from a dendrogram that there are no clusters in this data set. All other algorithms assign labels one way or the other, most produce evenly sized areas on the ball.

Hepta contains clearly defined clusters with different variances. The clusters are clearly separated and hence all algorithms have no problems separating these clusters. SOM mislabels a few cases and I’m not sure why this is.

Lsun contains different variances and inter cluster distances. The hard part is to separate out the cluster on the bottom.

Target contains two clusters in the middle and a few outliers. The data is not linearly separable, but it’s also interesting to see how the algorithms deal with the outliers.

Tetra contains four dense, almost touching clusters.

The two diamonds data set contains two touching clusters. The cluster borders defined by density. As one would expect, single linkage fails on this data set and simply lumps everything together. Wards Linkage was made to prevent exactly this kind of problem and not surprisingly performs better. The other algorithms have no problems picking up on the two dense blobs and separate them out perfectly or close to perfect.

The Wingnut data set contains two blocks and examines the density vs. distance trade-off of the algorithms. Every method that uses distance (or something that could be interpreted as such) will have the clusters “bleed” into each other.

Note that my results for SOM differ a bit from Ultsch’s original paper; quite possibly I did something wrong. I haven’t figured out yet what went wrong; still… fun weekend

References

Ultsch, A.: Clustering with SOM: U*C, In Proc. Workshop on Self-Organizing Maps, Paris, France, (2005) , pp. 75-82

Detecting Click Fraud in Online Advertising

Markus — Sun, 02 Mar 2014 21:40:45 +0000

JMLR has an interesting paper summarizing the results from a contest to build the best model for ClickFraud detection. The second place entry described some nice feature engineering that I found interesting. The first place did feature selection and then used gbm, a really good ensemble algorithm.

Big Data, Predictive Policing and the Tyranny/Anarchy Trade-Off

Markus — Sat, 17 Aug 2013 23:29:24 +0000

Bloomberg had an interesting article today about a talk on the implications and privacy trade-offs of predictive policing and profiling. Jim Adler’s “felon classifier” is also described in his blog.

Basically, he built a classifier predicting from some innocuous (but possibly correlated variables) the likelihood of somebody having a felony offense. The classifier isn’t meant to be used in practice (from eye-balling the Precision/Recall curve in the talk slides, I estimate an AUC of about 0.6-ish; not too great), but it was built to start a discussion. It turns out that courts have upheld the use of profiling in some cases as “reasonable suspicion,” a legal standard for the police to stop somebody and investigate. This could lead to “predictive policing” being taken even further in the future. Due to the model outputting a score Jim also discusses the trade-off of where the prediction of such a model may be actionable – he calls it the Tyranny/Anarchy Trade-Off (a catchy name

Having done statistical work in criminal justice before, I think predictive analysis can be helpful in many areas of policing and criminal justice in general (e.g., parole supervision). On the other hand, I find profiling and supporting a “reasonable suspicion” from statistical models unconvincing. I think the courts will have to figure out a minimum reliability standard for such predictors, and hopefully they’ll set the threshold far higher than what the ‘felony classifier’ is producing. There’s just too many ways using a statistical model for “reasonable suspicion” to go wrong. Even if variables of protected classes (gender, ethnicity, etc.) are not used directly, there may be correlated variables (hair-color, income, geographic area) as discussed in the talk Jim gave. Even more problematic in my mind would be variables that do not or hardly ever change, as they would lead to the same people being hassled over and over again. Also the training data from which these models are built is biased since everybody in it by definition has been arrested before. It’s beyond me how one can correct for this sample bias in a reliable way. Frankly, I don’t think policing by profiling (statistical or otherwise) can be done well, and hopefully courts will recognize that eventually.

Removing AVG Secure Search from Chrome

Markus — Sat, 11 May 2013 23:56:57 +0000

I use AVG as my anti-virus and today it installed the whole safe-search toolbar thing in the background while I was playing a video game. I’m pretty sure I didn’t consent to that, but whatever… Undoing the damage to my settings took quite a bit of work and I had trouble removing it from Chrome. First I followed all the steps outlined here, but pressing the home button in Chrome would still bring up the “mysearch.avg.com” website no matter what settings I changed.

Uninstalling the AVG toolbar component finally solved the Chrome start-page problem for me.

Man, what I piece of work. I’m not alone thinking that AVG did something bad for the user here.