You are currently browsing the archives for the Data Mining category.
| M | T | W | T | F | S | S |
|---|---|---|---|---|---|---|
| « Apr | ||||||
| 1 | 2 | 3 | 4 | |||
| 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| 26 | 27 | 28 | 29 | 30 | 31 | |
- Advertising (1)
- Artificial Intelligence (AI) (8)
- Coding / Programming (6)
- Cryptography (1)
- Data Mining (10)
- ewrt linux (2)
- Fixing Stuff (5)
- Machine Learning (18)
- Math (1)
- Politics (2)
- Psychology (3)
- Ramblings (18)
- Random (6)
- Security (11)
- Society (9)
- Sociology (3)
- spam (2)
- Statistics (9)
- April 21, 2008 1:38 am: ART OF SEDUCTION: Not Pretty, Really
- March 25, 2008 2:25 am: "Internal Server Error" when converting phpBB v2 to phpBB v3
- March 6, 2008 1:29 am: Firewire and DRM
- February 28, 2008 10:46 pm: Using Psychological Domain Knowledge for the Netflix Challenge
- February 12, 2008 1:24 am: VPN Tunels from within VMWare (Windows XP and GRE weirdness)
- February 2, 2008 5:59 pm: License Key Copy Protection
- January 8, 2008 8:34 pm: Registering Domains with Network Solutions
- January 7, 2008 10:22 pm: Joe-job ...
- December 11, 2007 1:37 am: Back from NIPS 2007
- November 24, 2007 1:03 am: GMail Logout Strangeness
Blogroll
Useful Links
Archive for the Data Mining Category
Back from NIPS 2007
December 11, 2007 1:37 am by Markus.
Just got back home from NIPS. The following papers I found pretty interesting:
- Random Features for Large-Scale Kernel Machines
- On Ranking in Survival Analysis: Bounds on the Concordance Index
- Efficient Inference for Distributions on Permutations
The workshop time I spend in the Learning Problem design and in the Security workshop. I also dropped by in “statistical networks” briefly, but there’s room for improvement in my current understanding of Gibbs sampling and the like. The consensus in the problem design workshop seemed to be that machine learning must become more modular. Also there was agreement that the application of machine learning in the real world requires some magic for transforming the problem into “features” and some more magic for transforming the prediction into something useful. It was stating the obvious a bit, however not much progress has been made in this area of making ML more accessible. I wrote about choosing the right features before, but currently it’s more of an art than a science. One thing I took from the security workshop was that features must be easily constructed (most detection apps must run in real-time). This means we are interested in features the attacker can hardly influence (think received-headers in Spam-emails that can not be suppressed), yet they must be easily to compute.
Also really cool was the “NIPS Elevator Process” joke-paper about hungry scientists on the way to lunch (don’t confuse it with the Chinese restaurant process) and the party crashers at the Gatsby party. Sophie and some friends of hers simply joined the fun. The fun part was people taking her random answers for her research topic seriously
I got mistaken for one of the party crashers at one point, because I didn’t fit in with my clothing. I was actually planning on hitting a club in downtown Whistler, but didn’t get around to go in the end…
Posted in Data Mining, Machine Learning | Print | No Comments »
Human Intuition vs. Statistical Models
August 25, 2007 10:01 pm by Markus.
I just came across a very interesting book announcement for “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres, a professor Yale Law School and econometrician. In the book (I haven’t read it yet, but I will) the author argues that intuition is losing ground to statistical methods and data mining. According to the Amazon abstract he gives examples from the airline industry, medical diagnostics and even online dating services showing that a statistical model will outperform human intuition.
That machines can outperform human judgement has been known for quite some time. For example, in the field of psychology the diagnosis of mental disorders is more or less standardized by them DSM. There was a very interesting meta-analysis that showed that a mechanical predictor always outperformed the human psychologist. To be specific: Grove, W.M., Zald, D.H., Hallberg, A.M., Lebow, B., Snitz, E., & Nelson, C. (2000). Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12, 19–30. To quote from the Abstract: “On average, mechanical-prediction techniques were about 10% more accurate than clinical predictions. Depending on the specific analysis, mechanical prediction substantially outperformed clinical prediction in 33%-47% of studies examined. Although clinical predictions were often as accurate as mechanical predictions, in only a few studies (6%-16%) were they substantially more accurate. Superiority for mechanical-prediction techniques was consistent, regardless of the judgment task, type of judges, judges’ amounts of experience, or the types of data being combined.”
I’m a little bit skeptical about using data crunching to decide important questions (as in life and death questions). In general it seems like a good idea, but it always comes down to how you model the data and how you model the question to be answered. In many cases this might be obvious, in others not so much. The art is then to model the data, not the application of the algorithm or technique. It reminds me a bit of a class about formal program verification I took back in Darmstadt. Stefan, the TA of the class, and I had an argument about the use of practicability of program verification. He gave the unix find utility as an example for which you can show - more or less - easily that the program will terminate while enumerating all the files in all the directories in the system, and how find can be nicely modeled with a well-founded relation to show the termination of the algorithm. I objected that I could set a symbolic link to a uper-level directory (which is why find does not by default follow them) and could make find go in circles. Stefan conceded, “Oh well, I guess then the model was wrong…”. Similar things have happened in e.g. Cryptography, where a finite-state model (sorry, lost the citation somewhere; I’m not quite sure if that was the Usenix paper from the Stanford guys I read or something else) showed that the SSL protocol (Secure Socket Layer) is secure. Later the protocol was broken nonetheless (Schneier, Bruce; Wagner, David; Analysis of the SSL 3.0 protocol).
I think that with the wrong model you can show a lot of good things about anything. Once you abstract from the real world and build a model you might just have ignored that little most important feature. Maybe it is time for a best-practices in data modeling and data mining (there are already some books out there for some specific domains) …
Posted in Data Mining, Machine Learning, Ramblings | Print | No Comments »
Modeling Tree Structures and Hierarchies in a Data Warehouse
July 20, 2007 9:25 pm by Markus.
I’m currently reading a book about Data Warehouse design (”Mastering Data Warehouse Design”, Claudia Imhoff, Nicholas Galemmo and Jonathan Geiger, Wiley). One thing I noticed is the incredibly inefficient way the authors encode trees in relational databases. Their suggestion is to model it with “pointers” to child-nodes which is incredibly inefficient to deal with in SQL, leads to recursive queries (unless proprietary SQL extensions are used) and you’ll have to write loads of self-joins. A much better way of encoding trees in SQL is based on nested sets. However, it always depends on what kind of queries you will run later. According to the book, those would be listing elements on the same level of the tree as well as retrieving sub-trees. This is something that I think is still kind of painful when using nested sets.
Here is my favorite solution for encoding trees in SQL:
CREATE TABLE TreeMagic (Mykey CHAR(10) PRIMARY KEY, FatherNode ChAR(10) NOT NULL, length INTEGER NOT NULL);
| MyKey | FatherNode | Length |
|---|---|---|
| A | 1 | |
| AA | A | 2 |
| AB | A | 2 |
| AAA | AA | 3 |
| AAB | AA | 3 |
| ABA | AB | 3 |
| ABB | AB | 3 |
So key A is the root, AA and AB are child-nodes of A, AAA and AAB are child-nodes of AA, and so on. The cool part is traversing the tree on one level is easy due to the length-field and selecting subtrees is easy as well using the like-operator, which moves all the hard work into the B-Tree index. Inserting a new node is simpler than with set-based trees for which in the worst case you might have to increase the left/right numbers at a couple of other nodes. The path to each node is always fully known. Traversing the tree in DFS comes for free by lexicographical ordering of the keys with an “order by” clause. BFS is available when ordered by “length.FatherNode”. One operation the nested sets can do faster is determining the number child-nodes just with a bit of math. Either way I think this idea is way more efficient than the one proposed in the book
Edit: I just found a book that is full of patterns for modeling the most common structures in relational databases: “SQL for Smarties - Advanced SQL Programming” by Joe Celko (who also wrote the article about the nested-set method for trees). It contains various graph problems and how to manage them in relational databases. Looks interesting…
Posted in Coding / Programming, Data Mining | Print | No Comments »
Text Mining for Tax Evaders on eBay
June 12, 2007 3:32 pm by Markus.
A (long) while ago there was a lot of talk about the IRS doing text-mining for people that did not report income they got from e.g. ebay internet auctions. The German version of this attempt is called XPIDER (see also here and there) was just shown to be totally ineffective. After years of trying and spending millions they did not identify conclusive evidence to go after a single tax-evader. The German GAO (Bundesrechnungshof) is trying to find out what went wrong, why and who misspend all that money. I wonder if the more sophisticated web-crawling program the Canadian IRS is using (it’s called XENON as reported by Wired) is similarly effective …
Posted in Society, Data Mining | Print | No Comments »
Choosing the right features for Data Mining
June 1, 2007 12:02 pm by Markus.
I’m fascinated by a common problem in data mining: how do you pick variables that are indicative of what you are trying to predict? It is simple for many prediction tasks; if you are predicting some strength of a material, you sample it at certain points using your using your understanding of physics and material sciences. It is more fascinating with problems that we would believe to understand, but don’t. For example, what makes a restaurant successful and guarantees repeat business? The food? The pricing? A friendly wait-staff? Turns out a very big indicator for restaurant success is the lighting. I admit that lighting didn’t make it into my top ten list… If you now consider what is asked on your average “are you satisfied with our service” questionnaire you can find in a various restaurant chains, then I don’t recall seeing anything about the ambiente on it. We are asking the wrong questions.
There are many other problems in real life just like this. I read a book called Blink and the point the author is trying to make is that making subconscious decisions are easy to make - once you know what to look for. More information is not always better. This holds for difficult problems such as judging the effectiveness of teachers (IIRC seeing the first 30 seconds of a videotape of him/her entering a classroom is as indicative as watching hours of recorded lectures). Same holds true for prediction problems about relationships - how can you predict if a couple will still be together 15 years later? Turns out there are four simple indicators to look for, and you can do it in maybe 2 minutes of watching a couple… The book is full of examples like that, but does not provide a way to “extract the right features”. I have similar problems with the criminology stuff I’m working on; while we get pretty good results using features suggested by the criminology literature I’m wondering if we have the right features. I’m still thinking that we could improve our results if we had more data - or the “right” data I should say (it should be obvious that more is not better by now). How do you pick the features for problems? Tricky question…
There is only data mining system that does not have this problems: recommender systems. Using recommender systems can avoid the problem as they do not rely on particular features to predict, but exploit correlations in “liking”. A classical example was that people that like classical music often like jazz as well - something you wouldn’t easily be able to predict from features you extract from the music. I wonder if we could reframe some prediction problems in ways more similar to recommender systems, or maybe make better use of meta-information in certain problems. What I mean with “meta-information” is easily explained with an example: Pagerank. It is so successful in web-scale information retrieval because it does not bother with trying to figure out if a page is relevant by keyword ratios and what not, but simply measuring the popularity by how many important pages link to it (before link-spam became a problem that is). I wish something simple like that would be possible for every problem ![]()
Posted in Data Mining, Machine Learning, Psychology | Print | 2 Comments »
Validating patterns found by Data Mining techniques
February 22, 2007 2:50 pm by Markus.
I just read in the news about a study that showed with data mining that libras live longer (study from the Institute for Clinical Evaluative Sciences in Toronto). These guys did a study for fun using data from 10 million Ontario residents looking for associations between various health problems and their astrological signs. And they actually found associations! Each of the twelve astrological signs had at least two medical disorders associated with them. However, they were not able to replicate it when looking for the same pattern in the hold-out set.
This is a nice example of why it is important to have a hold-out set, or ideally to try to redo a study with a different method on a different set of similar data. A couple of other methods for validating have been proposed which are harder to do. In my data mining work with Tim we have used a cross-replication and cluster validation design from the book “Classification” by Gordon (1999). The original method (I think) was proposed by McIntyre and Blashfield (”A nearest-Centroid technique for evaluating the minimum variance clustering procedure”, Multivariate Behavioral Research 15, 1980; see also Milligan 1996 “Clustering Validation: results and implications for applied analysis, in “Clustering and Classification”; eds. Arabie, Hubert and De Soete, World Scientific, Singapore, p. 341). What I particularly like about the method is that it can be used to quantify in numbers “how much” of a replication you got. The method works like this:
- Divide the data randomly into two subsets
- Do your exploratory clustering work on set A, partitioning it into k many clusters
- Use a classifier (Nearest Neighbor for example) to assign each object in B to each discovered cluster
- Use the same clustering procedure on B, partitioning it into k many clusters
- Compare the two labelings obtained for B (the classification and clustering) in a cross-tabulation, compute a kappa-coefficient
Given our tendency to find patterns in data (or see shapes in clouds) I think it is important to use a procedure like the above to double-check the patterns discovered before any important decisions are made.
Posted in Statistics, Data Mining | Print | 1 Comment »
Artificial Intelligence and Sports
February 8, 2007 3:59 pm by Markus.
A couple of days ago Indianapolis won the Superbowl - just as predicted by an Electronic Arts Simulation. The simulation software had been fed with the latest data about all the players involved and they had the game AI fight it out. In the past some of the simulated outcomes were not that close to the final scores, but they still did a fairly decent job in 2005 and 2007.
There is more and more statistical decision making in baseball as well, the most famous example being the Miami-Orlando series in the 1997 playoffs.
Interesting…
Posted in Statistics, Data Mining, Machine Learning, Artificial Intelligence (AI) | Print | No Comments »
Just got back from NIPS 2006
December 12, 2006 4:49 pm by Markus.
Just got back in town from the NIPS conference. I’ve been to a couple of Machine Learning conferences before, but this was my first time at NIPS. A couple of papers were very interesting (you can download them at books.nips.cc) :
- Manifold Denoising
Matthias Hein, Markus Maier - Fundamental Limitations of Spectral Clustering Methods
Boaz Nadler, Meirav Galun - Learning with Hypergraphs: Clustering, Classification, and Embedding
Dengyong Zhou, Jiayuan Huang, Bernhard Schoelkopf - Recursive Attribute Factoring
David Cohn, Deepak Verma, Karl Pfleger
However, I found the single-track style of the conference boring at times. My interest in the latest results from fMRIs etc. is low right now, so at times there was nothing to do, but mingle or just do nothing. At ICML there is always at least one conference-track that is interesting to me. The poster sessions at NIPS were very interesting, though.
The workshops were more interesting than the conference. Only the room-sizes were misallocated. Some workshops (the one with the big rooms) were rather empty, and the ones I attended were overcrowded. And, of course, the traditional workshop summarys at the end of the workshop were funny. The ones that stuck out in my mind the most were Man vs. Bird from the Acoustic Processing Workshop and the novel applications for the non-linear dimensionality reduction with their swiss-roll video. I got a few new ideas from the workshops that maybe will work out.
Also there were no T-Shirts. At this years ICML plenty of free t-shirts were given out - unfortunately during the reception, which forced everybody to carry their T-Shirts around during the entire reception (it looked very amusing, though) - yet at NIPS all we got was a mug…
Last, but not least, I’ve heard about the legendary NIPS partys from my friends and had some high expectations :-). Friday night I attended the GURU-party from Garry’s Unbelievable Research Unit, Saturday was the legendary Gatsby-party. Both partys were rather disappointing, so I actually went to check out the nightlife in Whistler instead. The bars and clubs I found were pretty quiet as well. Uh well… I’ve heard from people that Whistler had less people than last year around that time.
Posted in Data Mining, Machine Learning, Ramblings | Print | No Comments »
Data Mining of Social Networks
September 30, 2006 9:25 pm by Markus.
I just returned from the ECML Data Mining Workshop and one talk I found particularly interesting. In the talk Network-based marketing: Identifying likely adopters via consumer networks (S. Hill, F. Provost and C. Volinsky) presenter reported on a successfull marketing campaign. Rough summary from the talk: A phone company was introducing a new service and from past experience they had twenty-something marketing segments for people that were likely to buy that they would write to, call or otherwise inform of the new service. Since the phone company has access to the call records they extracted a list of friends these likely buyers frequently call and started marketing to these people as well. The cool part is that the response rate from the friends was about 3 times higher than the likely-buyers response rate (or was it even the buy-rate). That said, so many companies now started to collect (or have available to them) social networks data, e.g. Skype (now EBay), Google (GMail invites), MySpace, Facebook etc. Most likely this will change the ways of advertising quite a bit. Sidenote: the companys lawyers felt this is legal, because the company owns that call data. Interesting how this is legaly different from the NSA-survailance-program the US government has been doing.
Posted in Data Mining | Print | No Comments »
Data mining used to find new materials
August 27, 2006 6:56 pm by Markus.
I just read an Eureka Alert (see also ZDNet’s blog)mentioning that a couple of researchers at MIT found new, potentially useful crystal structures with AI and Data Mining techniques. You can find the abstract of their paper here. I’ve seen randomness and Genetic Algorithms around alot lately (such as the self-reconfigurable-modular-robot/) and a robot that can do bioinformatics experiments (DNA sequencing) all by himself (link?). I think that this is a very useful application of AI. However, it is only an application of the scientific knowledge. It’s fast testing based on the current physical models and insights. It automates science to an extend, but does not come up with new insights. It’s more data without more people to add an interpretation. For example, it took a few years before somebody found an application for Teflon.
I haven’t seen this around (will search again), but what would be really interesting is an algorithm that can form a new hypothesis (e.g. a differential equation) based on outcomes from Physics experiments. An algorithms that explains the data and forms a theory. It’s probably harder to build than regression algorithms…
Posted in Data Mining, Machine Learning, Artificial Intelligence (AI), Ramblings | Print | No Comments »