February 2012
M T W T F S S
« Jan    
 12345
6789101112
13141516171819
20212223242526
272829  

Author Archive

Will 2012 be the year of Big Data?

Interesting view on that here.

UK plans to exempt data mining from copyright laws

The UK is in the process of overhauling their overly stringent copyright laws. That’s an interesting development (see the Nature blog entry on the topic).  One idea being discussed is to generally allow data and text mining without the copyright holders permission, which would usually be required for any kind of electronic processing.

Risk Assessment of Rare Events in adversarial Scenarios

The RAND corporation just published an interesting paper exploring the benefits of using risk prediction to reduce the screening required at airports. You might have noticed various attempts to establish some kind of fast-lane or trusted traveler program. Obvious this is a very sensitive topic and probably hard to get right. Screening certain groups of the population more than others (”profiling”) is generally frowned upon and also not a good idea in general (see “Strong profiling is not mathematically optimal for discovering rare malfeasors on rare event detection“), but what hasn’t been examined much is identifying people that can be considered more “safe” than others. The paper explores that idea and shows that even under the assumption that the bad guys will try and subvert this program that there can be benefits to implementing this solution. The paper is a bit sparse on mathematical details. Certainly an interesting idea, though.

Paper: Assessing the Security Benefits of a Trusted Traveler Program in the Presence of Attempted Attacker Exploitation and Compromise

How Kinect body tracking works and how Machine Learning helped

Microsoft Research has published a paper explaining how the Kinect body tracking algorithm works [PDF]. This video shows how it all comes together. They trained a variation of Random Forests on the various pre-labeled images to identify the various body parts from a normal RBG camera and a depth-camera. The way they create many more training images from previously captured data is also interesting. The final system can run at 200 frames per second and it doesn’t need an initial calibration pose. Very interesting…

European Court of Justice ruling (indirectly) on what cannot be used in Insurance Risk Models

Insurers cannot charge different premiums to men and women because of their gender, the European Court of Justice (ECJ) has ruled.

I’m not sure what to think of it. For one, insurance is not about fairness; it’s about risk. An insurance company should be able to use whatever reliable information for determining the true risk to help price policies. From what I’ve read it seems that young men cost ~50% more to insure than young women. This might not be true on an individual level, but it is true across the entire pool people. On the other hand, if all reliable information could be used, then health insurance would naturally be more expensive for people with, e.g., known genetic disorders if it were purely about risk. That wouldn’t be fair either. Legislating what can and cannot be used in what circumstances will be a hard trade off. In the intermediate term this ruling will probably lead to models that are using all sorts of things to work around this ruling in order to get an adequate risk score.

Mining of Massive Datasets

Anand Rajaraman and Jeff Ullman wrote a book called Mining of Massive Datasets that can be downloaded for free (PDF, 340 pages, 2MB). It focuses on data mining of very large amounts of data that do not fit in main memory as found on the frequently on the web from an algorithmic point of view.

Edit:Fixed URL

Ideas on communicating risks and probabilities to the general public

I found an interesting article on how to communicate risks and probabilities to the public.

Birthday Paradox

Here’s an interesting real world example for the Birthday Paradox: Lottery number combination repeats itself. Obligatory XKCD link.

Elo Scores and Rating Contestants

Kaggle has a new and interesting competition on building a chess rating algorithm that performs better than the official Elo rating system currently in use. Entrants build their own rating systems based on the results of more than 65,000 historical chess games and then test their algorithms by predicting the results on a holdout set of 7,800 games.

Looks like an interesting problem. The only other thing that comes to my mind literature-wise is that Microsoft built and published their TrueSkill™ Ranking System for the XBox in order to match players with similar skills in online games. In the original paper at NIPS, the authors had shown that TrueSkill outperformed Elo.

GraphLab & Parallel Machine Learning

Interesting article:  GraphLab: A New Framework for Parallel Machine Learning

From the abstract:

Designing and implementing efficient, provably correct parallel machine learning (ML) algorithms is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. We demonstrate the expressiveness of the GraphLab framework by designing and implementing parallel versions of belief propagation, Gibbs sampling, Co-EM, Lasso and Compressed Sensing. We show that using GraphLab we can achieve excellent parallel performance on large scale real-world problems.

Given all the talk about Map-Reduce, Hadoop etc. this seems like a logical next step to make scaling data mining to large data sets a lot easier.