Monday, December 28, 2009

Notes on SortedSets in Java

One of the very common data structures used in HyperGraphDB is that of an ordered set of things, usually atoms as represented by their UUID handles. In particular, incidence sets are ordered sets of atom handles and there are usually a lot of them cached in an application, they are frequently traversed, intersected etc. So I needed to pay more attention to the data structure used for such sets.

A first attempt at an optimization was the trie data structure. A trie, however, very quickly proved to be a memory hog and in retrospect was a rather dull idea especially given the fact that currently UUIDs are randomly generated, and there's not really a big advantage in the number of low-level comparisons (which a trie is normally good for) even though our implementation of UUIDs forces a lexicographical ordering, unlike Java's which just compares 2 long numbers.

So next stop, our beloved binary trees, balanced if possible. In Java an ordered set is abstracted in the java.util.SortedSet interface. And Java provides a standard implementation based on red-black trees, which is just fine for most purposes. But if you look at the source code of Java's TreeSet class, you'd realize that it actually uses the TreeMap implementation which means that there would be an extra pointer stored with each element of your set (because TreeMap implements a sorted associative array whereas we just want a set). This extra pointer is one pointer too much. In addition , I though it would be nice to get rid of the parent pointers stored in each tree node as well!

And then I found the wonderful new work by one of the people who originally worked on that data structure and coined the term red-black trees, prof. Robert Sedgewick, a variation on the original theme called Left Leaning Red Black Trees (LLRB trees). A very accessible and detailed presentation shows how those trees are isomorphic to 2-3 or to 2-3-4 trees with a change of a few lines of code. The isomorphism is the following: a 4-node (a node with 4 children) is one whose two children are colored red, while a 3-node is one where exactly one of the children is red. So in a 3-node, you can have either a red left child or red right child. If you eliminate the latter, as a global constraint to be maintained by all operations on the tree, you have a perfect correspondence between the red-black tree and a 2-3-4 tree. That's where the left-leaning part of the name comes from: 3-nodes have a red child only on the left. If you always split 4-nodes (i.e. don't allow two children to be red), you end up with a correspondence with a 2-3 tree. It's beautiful :)

LLRB trees don't require parent pointers and the implementation is considerably shorter than that of a standard red-black tree. The implementation offered by prof. Sedgewick has some minor problems in the 2-3-4 version (confirmed by him in a personal email) which I've fixed in the HyperGraphDB implementation - essentially deletion failed when a 4-node resulted in two consecutive red links while going down the tree. Those were hard to discover and resolve, several methods needed adjustments, but the fix doesn't lengthen the implementation significantly.

Even though LLRB's implementation needs recursion and it does many rotations going up and down the tree, it performs as well as Java's TreeSet implementation. As a curiosity, while benchmarking the LLRB code against Java's TreeSet, I realized how expensive a typecast can be. The results showed that TreeSet's lookup was twice as fast. It turned out that was because of a '(Comparable)key' typecast in my implementation, that TreeSet bypasses by having one lookup method for user provided Comparator and another when the elements are themselves Comparable.

The LLRB implementation in HyperGraphDB might be of interest to you as it actually implements Java's SortedSet interface (except for subset views and I would be grateful if somebody has the patience to contribute that code...). It can be obtained from HyperGraphDB's svn: Note that this code has a dependency on HyperGraphDB's HGRandomAccessResult which extends java.util.Iterator and is implemented instead of a plain iterator, but you can easily scrap that part if you don't need it, or replicate it as your own interface in your application. Also, this class is thread-safe, which is an overhead you might want to get rid off as well (just remove the ReadWriteLock and all calls to it).

Now, when thinking about a data structure for ordered sets, trees seem the most natural choice. It's a no brainer, right? Well, no, it depends on actual usage. What if you are not planning on making a lot of inserts, compared to lookups and iteration over elements? Again, this is the use-case in HyperGraphDB: we want the smallest possible memory footprint and efficiency during traversals, lookups and intersections. This is because looking up links in the graph is essentially done by intersecting incidence sets (i.e. ordered sets of atoms). No matter what the algorithm is, intersections work by using some sort of cursor on the set and examining nearby elements. With an array, doing an is just an integer increment, but with a tree it's a lookup that can send you in a completely difference place on the tree. That has a huge performance impact if you are doing a lot of iterations over your dataset. So a SortedSet implementation backed by an plain array might be a good competitor. And in fact it is! Lookups take approximately the same time between trees and arrays, but iteration is an order of magnitude faster with arrays. In addition, I measured that for up to about 10000 elements, inserts in a tree-based implementation are not faster than the array-based one (that's with 32bit Java 5 on a 64bit, 4x2.4GHZ CPU Windows Vista machine).

The result of all this is that now HyperGraphDB uses a simple array-based implementation of the SortedSet to manage in-memory atom sets. This is also available from the SVN repository: Here too, you may need to get rid of HGRandomAccessResult and locks if you want to use that code.