Modeling Tree Structures and Hierarchies in a Data Warehouse

I’m currently reading a book about Data Warehouse design (“Mastering Data Warehouse Design”, Claudia Imhoff, Nicholas Galemmo and Jonathan Geiger, Wiley). One thing I noticed is the inefficient way the authors encode trees in relational databases. Their suggestion is to model a tree with “pointers” to child-nodes. That is awkward to deal with in SQL: it leads to recursive queries (unless proprietary SQL extensions are used), and you’ll have to write loads of self-joins. A much better way of encoding trees in SQL is based on nested sets. However, the right choice always depends on what kind of queries you will run later. According to the book, those would be listing the elements on the same level of the tree as well as retrieving sub-trees. I think listing a level is still kind of painful when using nested sets.

Here is my favorite solution for encoding trees in SQL:

CREATE TABLE TreeMagic (MyKey CHAR(10) PRIMARY KEY, FatherNode CHAR(10) NOT NULL, Length INTEGER NOT NULL);

MyKey	FatherNode	Length
A		1
AA	A	2
AB	A	2
AAA	AA	3
AAB	AA	3
ABA	AB	3
ABB	AB	3

So key A is the root, AA and AB are child-nodes of A, AAA and AAB are child-nodes of AA, and so on. Each key is the key of its father node plus one more character, so the key itself spells out the full path from the root, and Length is the depth of the node. The cool part is that traversing the tree on one level is easy thanks to the Length field, and selecting subtrees is easy as well using the LIKE operator with a prefix pattern, which moves all the hard work into the B-Tree index. Inserting a new node is simpler than with set-based trees, for which in the worst case you have to increase the left/right numbers of a number of other nodes. Traversing the tree in DFS order comes for free from the lexicographical ordering of the keys with an “order by” clause. BFS order is available when ordering by Length, FatherNode. One operation that nested sets can do faster is determining the number of descendants of a node, which takes just a bit of math on its left/right numbers. Either way, I think this idea is way more efficient than the one proposed in the book 🙂
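Here is a minimal sketch of the queries described above, run against an in-memory SQLite database from Python so it can be tried without a server. A few things are my assumptions rather than part of the original table: the root stores an empty string as its FatherNode (the column is NOT NULL), the helper `keys` and the new node ABC are made up for illustration, and the ancestor query at the end is just one possible way to use the fact that the path is encoded in the key. SQLite does not pad CHAR(10) values; on a database that pads with blanks the LIKE patterns may need trimming.

```python
import sqlite3

# In-memory database; the table definition mirrors the one in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TreeMagic ("
    " MyKey CHAR(10) PRIMARY KEY,"
    " FatherNode CHAR(10) NOT NULL,"
    " Length INTEGER NOT NULL)"
)

# Sample rows from the post; the root's father is stored as '' (assumption).
rows = [
    ("A", "", 1),
    ("AA", "A", 2), ("AB", "A", 2),
    ("AAA", "AA", 3), ("AAB", "AA", 3),
    ("ABA", "AB", 3), ("ABB", "AB", 3),
]
conn.executemany("INSERT INTO TreeMagic VALUES (?, ?, ?)", rows)


def keys(sql, params=()):
    """Hypothetical helper: run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql, params)]


# All nodes on one level: filter on Length.
level3 = keys("SELECT MyKey FROM TreeMagic WHERE Length = ? ORDER BY MyKey", (3,))

# Subtree rooted at AA: a prefix match on the key.
subtree = keys("SELECT MyKey FROM TreeMagic WHERE MyKey LIKE ? ORDER BY MyKey", ("AA%",))

# Depth-first (pre-order) traversal: plain lexicographic order of the keys.
dfs = keys("SELECT MyKey FROM TreeMagic ORDER BY MyKey")

# Breadth-first traversal: level by level, siblings grouped by father.
bfs = keys("SELECT MyKey FROM TreeMagic ORDER BY Length, FatherNode, MyKey")

# Path from the root to AAB: every stored key that is a prefix of 'AAB'.
path = keys(
    "SELECT MyKey FROM TreeMagic WHERE ? LIKE (MyKey || '%') ORDER BY Length",
    ("AAB",),
)

# Inserting a node touches no other row (unlike renumbering in nested sets).
conn.execute("INSERT INTO TreeMagic VALUES (?, ?, ?)", ("ABC", "AB", 3))
subtree_ab = keys("SELECT MyKey FROM TreeMagic WHERE MyKey LIKE ? ORDER BY MyKey", ("AB%",))

print(level3)      # ['AAA', 'AAB', 'ABA', 'ABB']
print(subtree)     # ['AA', 'AAA', 'AAB']
print(dfs)         # ['A', 'AA', 'AAA', 'AAB', 'AB', 'ABA', 'ABB']
print(bfs)         # ['A', 'AA', 'AB', 'AAA', 'AAB', 'ABA', 'ABB']
print(path)        # ['A', 'AA', 'AAB']
print(subtree_ab)  # ['AB', 'ABA', 'ABB', 'ABC']
```

Note that the prefix trick only works as long as the keys contain no LIKE wildcards (% or _), and that CHAR(10) caps the depth of the tree at ten levels with one character per level.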

Edit: I just found a book that is full of patterns for modeling the most common structures in relational databases: “SQL for Smarties – Advanced SQL Programming” by Joe Celko (who also wrote the article about the nested-set method for trees). It contains various graph problems and how to manage them in relational databases. Looks interesting…

This entry was posted on Friday, July 20th, 2007 at 9:25 pm and is filed under Coding / Programming, Data Mining.


Markus Breitenbach
