You are currently browsing the archives for the Statistics category.
| M | T | W | T | F | S | S |
|---|---|---|---|---|---|---|
| « Jan | ||||||
| 1 | 2 | 3 | 4 | 5 | ||
| 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | ||||
- Advertising (1)
- Artificial Intelligence (AI) (13)
- Classification (3)
- Clustering (1)
- Coding / Programming (8)
- Cryptography (1)
- Data Mining (22)
- Economy / Investing (1)
- ewrt linux (2)
- Fixing Stuff (8)
- Machine Learning (32)
- Math (2)
- Politics (3)
- Predictive Modeling (5)
- Psychology (3)
- Ramblings (26)
- Random (9)
- Security (16)
- Society (13)
- Sociology (4)
- spam (3)
- Statistics (20)
- January 28, 2012 4:56 pm: Will 2012 be the year of Big Data?
- August 14, 2011 10:41 pm: UK plans to exempt data mining from copyright laws
- June 21, 2011 3:26 am: Risk Assessment of Rare Events in adversarial Scenarios
- March 26, 2011 7:57 pm: How Kinect body tracking works and how Machine Learning helped
- March 1, 2011 11:58 am: European Court of Justice ruling (indirectly) on what cannot be used in Insurance Risk Models
- December 11, 2010 8:35 pm: Mining of Massive Datasets
- December 4, 2010 2:28 pm: Ideas on communicating risks and probabilities to the general public
- October 17, 2010 5:48 pm: Birthday Paradox
- August 5, 2010 1:06 am: Elo Scores and Rating Contestants
- July 11, 2010 8:56 pm: GraphLab & Parallel Machine Learning
Blogroll
Uncategorized
Useful Links
- January 2012
- August 2011
- June 2011
- March 2011
- December 2010
- October 2010
- August 2010
- July 2010
- June 2010
- February 2010
- January 2010
- November 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009
- January 2009
- December 2008
- November 2008
- October 2008
- September 2008
- August 2008
- July 2008
- June 2008
- May 2008
- April 2008
- March 2008
- February 2008
- January 2008
- December 2007
- November 2007
- October 2007
- September 2007
- August 2007
- July 2007
- June 2007
- May 2007
- April 2007
- March 2007
- February 2007
- January 2007
- December 2006
- November 2006
- October 2006
- September 2006
- August 2006
Archive for the Statistics Category
Risk Assessment of Rare Events in adversarial Scenarios
June 21, 2011 3:26 am by Markus.
The RAND corporation just published an interesting paper exploring the benefits of using risk prediction to reduce the screening required at airports. You might have noticed various attempts to establish some kind of fast-lane or trusted traveler program. Obvious this is a very sensitive topic and probably hard to get right. Screening certain groups of the population more than others (”profiling”) is generally frowned upon and also not a good idea in general (see “Strong profiling is not mathematically optimal for discovering rare malfeasors on rare event detection“), but what hasn’t been examined much is identifying people that can be considered more “safe” than others. The paper explores that idea and shows that even under the assumption that the bad guys will try and subvert this program that there can be benefits to implementing this solution. The paper is a bit sparse on mathematical details. Certainly an interesting idea, though.
Posted in Predictive Modeling, Society, Statistics, Security | Print | No Comments »
European Court of Justice ruling (indirectly) on what cannot be used in Insurance Risk Models
March 1, 2011 11:58 am by Markus.
I’m not sure what to think of it. For one, insurance is not about fairness; it’s about risk. An insurance company should be able to use whatever reliable information for determining the true risk to help price policies. From what I’ve read it seems that young men cost ~50% more to insure than young women. This might not be true on an individual level, but it is true across the entire pool people. On the other hand, if all reliable information could be used, then health insurance would naturally be more expensive for people with, e.g., known genetic disorders if it were purely about risk. That wouldn’t be fair either. Legislating what can and cannot be used in what circumstances will be a hard trade off. In the intermediate term this ruling will probably lead to models that are using all sorts of things to work around this ruling in order to get an adequate risk score.
Posted in Statistics | Print | No Comments »
Ideas on communicating risks and probabilities to the general public
December 4, 2010 2:28 pm by Markus.
I found an interesting article on how to communicate risks and probabilities to the public.
Posted in Statistics | Print | No Comments »
Birthday Paradox
October 17, 2010 5:48 pm by Markus.
Here’s an interesting real world example for the Birthday Paradox: Lottery number combination repeats itself. Obligatory XKCD link.
Posted in Statistics | Print | No Comments »
Elo Scores and Rating Contestants
August 5, 2010 1:06 am by Markus.
Kaggle has a new and interesting competition on building a chess rating algorithm that performs better than the official Elo rating system currently in use. Entrants build their own rating systems based on the results of more than 65,000 historical chess games and then test their algorithms by predicting the results on a holdout set of 7,800 games.
Looks like an interesting problem. The only other thing that comes to my mind literature-wise is that Microsoft built and published their TrueSkill™ Ranking System for the XBox in order to match players with similar skills in online games. In the original paper at NIPS, the authors had shown that TrueSkill outperformed Elo.
Posted in Statistics, Machine Learning | Print | No Comments »
Alternative measures to the AUC for rare-event prognostic models
February 16, 2010 11:56 pm by Markus.
How can one evaluate the performance of prognostic models in a meaningful way? This is a very basic and yet an interesting problem especially in the context of prediction of very rare events (base-rates <10%). How reliable is the model’s forecast? This is a good question and of particular importance when it matters - think criminal psychology where models forecast the likelihood of recidivism for criminally insane people (Quinsey 1980). There are a variety of ways to evaluate a model’s predictive performance on a hold out sample, and some are more meaningful than others. For example, when using error-rates one should keep in mind that they are only meaningful when you consider the base-rate of your classes and the trivial classifier as well. Often this gets confusing when you are dealing with very imbalanced data sets or rare events. In this blog post, I’ll summarize a few techniques and alternative evaluation methods for predictive models that are particularly useful when dealing with rare events or low base-rates in general.
The Receiver Operator Characteristic is a graphical measure that plots the true versus false positive rates such that the user can decide where to cut for making the final classification decision. In order to summarize the performance of the graph in a single, reportable number, the area under the curve (AUC) is generally used.
Posted in Statistics, Classification, Data Mining, Machine Learning | Print | 2 Comments »
Strong profiling is not mathematically optimal for discovering rare malfeasors (on rare event detection)
January 10, 2010 5:37 pm by Markus.
Just in time for the latest Christmas terror scare, I came across an interesting paper: “Strong profiling is not mathematically optimal for discovering rare malfeasors” (William H. Press; PNAS 106(6), p. 1716-1719 www.pnas.org/cgi/doi/10.1073/pnas.0813202106). In the paper, the author investigates whether profiling by nationality or ethnicity can be justified mathematically and tries to answer the question of how much screening must we do, on average, to catch the bad guys in the crowd. Rare events detection is hard as it is, and it’s interesting to see a look from the sampling perspective. It’s an interesting and short read. Long story short, it shows that using an indiscriminate feature like nationality or ethnicity is not optimal (as is any screening at least in proportion to a prior probability) and wastes resources.
Posted in Math, Society, Statistics | Print | 2 Comments »
Adversarial Scenarios in Risk Mismanagement
January 11, 2009 4:31 pm by Markus.
I just read another article discussing weather Risk Management tools had an impact on the current financial crisis. One of the most commonly used risk management measures is the Value-at-Risk (VaR) measure, a comparable measure that specifies a worst-case loss for some confidence interval. One of the major criticisms is (e.g. Nassim Nicholas Taleb, the author of the black swan) that the measure can be gamed. Risk can be hidden “in the rare event part” of the prediction and not surprisingly this seems to have happened.
Given that a common question during training with risk assessment software is “what do I do to get outcome/prediction x” from the software it should be explored how to safeguard in the software against users gaming the system. Think detecting multiple model evaluations with slightly changed numbers in a row…
Edit: I just found an instrument implemented as an Excel Spreadsheet. Good for prototyping something, but using that in practice is just asking people to fiddle with the numbers until the desired result is obtained. You couldn’t make it more user-friendly if you tried…
Posted in Predictive Modeling, Society, Statistics, Data Mining, Machine Learning | Print | No Comments »
Credit Card companies adjusting Credit Scores
December 22, 2008 10:27 am by Markus.
I just read that Credit Card Companies are adjusting Credit Scores based on shopping patterns in addition to credit-score and payment history. It seems they also consider which mortgage lender a customer uses and whether the customer owns a home in an area where housing prices are declining. All that to limit the growing credit card default rates.
That’s an interesting way to do it (from a risk modeling point of view) and I wonder how well it works in practice. There might also be some legal ramifications to this if it can be demonstrated that this practice (possibly unknowingly to them) discriminates e.g. against minorities.
Posted in Statistics, Data Mining, Machine Learning | Print | No Comments »
Deploying SAS code in production
November 1, 2008 9:48 pm by Markus.
I had written a post about the issues of converting models into something that is usable in production environments as most stats-packages don’t have friendly interfaces to integrate them into the flow of processing data. I worked on a similar problem involving a script written in SAS recently. To be specific, some code for computing a risk-score in SAS had to be converted into Java and I was confronted with having to figure out the semantics of SAS code. I found a software to convert SAS code into Java and I have to say I was quite impressed with how well it worked. Converting one language (especially one for which there is no published grammar or other specification) into another is quite a task - after a bit of back and forth with support we got our code converted and the Java code worked on the first try. I wish there would be similar converter for STATA, R and SPSS ![]()
Posted in Predictive Modeling, Coding / Programming, Statistics, Data Mining, Machine Learning | Print | 1 Comment »