July « 2007 « Markus Breitenbach

Archive for July, 2007

A strange end for a night…

Sunday, July 29th, 2007

Last night I was out with a couple of friends at our favorite club. I had the time of my life and met some interesting new people. When we left the club, a girl got hit by a car right in front of our eyes. I’ve seen quite a bit of “shit going down” in my life now (3 fires, one guy getting beat up in an alley, 2 times one or more people getting hit by a car, once getting hit by a car myself) and what I find interesting is the different reactions by people. Last night some guys freaked, some were calm, some tried to help, some girls just started crying. Most people just stared in some kind of apathy and did nothing. Some “tough guys”, who I wouldn’t have expected that from, freaked out. In all the instances I’ve been involved with I’ve stayed calm and just did what was necessary (including the one where I got hit). That said, 911 operators are kinda slow – why do they ask if the person hit by the car is OK when I already told them that the person is seriously injured? What a weird end for a good night – I’m still shaking a bit…

Posted in Ramblings | No Comments »

Modeling Tree Structures and Hierarchies in a Data Warehouse

Friday, July 20th, 2007

I’m currently reading a book about Data Warehouse design (“Mastering Data Warehouse Design”, Claudia Imhoff, Nicholas Galemmo and Jonathan Geiger, Wiley). One thing I noticed is the incredibly inefficient way the authors encode trees in relational databases. Their suggestion is to model it with “pointers” to child-nodes which is incredibly inefficient to deal with in SQL, leads to recursive queries (unless proprietary SQL extensions are used) and you’ll have to write loads of self-joins. A much better way of encoding trees in SQL is based on nested sets. However, it always depends on what kind of queries you will run later. According to the book, those would be listing elements on the same level of the tree as well as retrieving sub-trees. This is something that I think is still kind of painful when using nested sets.

Here is my favorite solution for encoding trees in SQL:

CREATE TABLE TreeMagic (Mykey CHAR(10) PRIMARY KEY, FatherNode ChAR(10) NOT NULL, length INTEGER NOT NULL);

MyKey	FatherNode	Length
A		1
AA	A	2
AB	A	2
AAA	AA	3
AAB	AA	3
ABA	AB	3
ABB	AB	3

So key A is the root, AA and AB are child-nodes of A, AAA and AAB are child-nodes of AA, and so on. The cool part is traversing the tree on one level is easy due to the length-field and selecting subtrees is easy as well using the like-operator, which moves all the hard work into the B-Tree index. Inserting a new node is simpler than with set-based trees for which in the worst case you might have to increase the left/right numbers at a couple of other nodes. The path to each node is always fully known. Traversing the tree in DFS comes for free by lexicographical ordering of the keys with an “order by” clause. BFS is available when ordered by “length.FatherNode”. One operation the nested sets can do faster is determining the number child-nodes just with a bit of math. Either way I think this idea is way more efficient than the one proposed in the book 🙂

Edit: I just found a book that is full of patterns for modeling the most common structures in relational databases: “SQL for Smarties – Advanced SQL Programming” by Joe Celko (who also wrote the article about the nested-set method for trees). It contains various graph problems and how to manage them in relational databases. Looks interesting…

Posted in Coding / Programming, Data Mining | No Comments »

What Machine Learning Papers to read …

Friday, July 13th, 2007

Laura just pointed me to this system, best described as:

I have a routine problem that sometimes paper titles are not enough to tell me what papers to read in recent conferences, and I often do not have time to read abstracts fully. This collection of scripts is designed to help alleviate the problem. Essentially, what it will do is compare what papers you like to cite with what new papers are citing. High overlap means the paper is probably relevant to you. Sure there are counter-examples, but overall I have found it useful (eg., it has suggested papers to me that are interesting that I would otherwise have missed). Of course, you should also read through titles since that is a somewhat orthogonal source of information.

http://www.cs.utah.edu/~hal/WhatToSee/

I have the same problem. And wow… I will have a lot to read this weekend.

Posted in Artificial Intelligence (AI), Machine Learning, Statistics | No Comments »

Safe Strings in PHP (2)

Sunday, July 1st, 2007

I wrote about the problems with PHP strings here and the possible solution I liked using a class encapsulating strings in PHP. I now worked out some details to make every string function in PHP work with the new “SafeString”-class. You can find the details and source here. This is still more a proof-of-concept and for all practical purposes would require the re-writing of a couple of things like database abstraction layers and such to return SafeStrings as well.

Posted in Coding / Programming, Security | 1 Comment »

Markus Breitenbach

Archive for July, 2007

A strange end for a night…

Modeling Tree Structures and Hierarchies in a Data Warehouse

What Machine Learning Papers to read …

Safe Strings in PHP (2)

Pages

Archives

Categories