Weight Balanced Trees
Searching for self balancing trees on the internet will yield plenty of results for red-black and AVL trees, but almost none for weight balanced trees. This is unfortunate, as weight balanced trees provide the same logarithmic performance as red-black and AVL trees, with one important benefit: they can also index elements in logarithmic time.
For instance, finding the i'th element of an AVL tree will take O(n) time. But with a weight balanced tree, it will only take O(log(n)) time.
The full implementation can be downloaded here: WBTree.py
And this is an example of using the tree, with its console output shown in the comments:
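(The original listing isn't reproduced here, so the snippet below is a minimal sketch. The add(), remove(), and search() names follow the operations described later in this article; WBTree.py's exact API may differ.)

from WBTree import WBTree  # the implementation linked above

tree = WBTree()
for value in (6, 3, 9, 3, 7):
    tree.add(value)

node = tree.search(7)
print(node.value)                  # console output: 7

tree.remove(tree.search(3))        # remove() takes a node (see the Removing section)
print(tree.search(3) is not None)  # console output: True (one duplicate 3 remains)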
The tree also accepts custom sorting functions, and can be toggled to accept (or reject) duplicate elements.
Performance
We've compared WBTree's performance against various other self balancing trees. The data was generated using speedtest.py in testing.zip using the python3 runtime. Our own timing is denoted by "Dee".
Timing was conducted by:
1. Adding 100,000 nodes to a tree
2. Searching for each node
3. Removing all nodes from the tree
If an implementation didn't support searching or removing, those timings were omitted.
Algorithm | Rank | Add (s) | Remove (s) | Search (s)
Pathak AVL | 11 | 2.25 | - | -
Yoo RB | 10 | 0.82 | - | -
Girish AVL | 9 | 5.27 | 3.55 | -
Reid RB | 8 | 1.46 | - | 0.77
Stanisl RB | 7 | 1.28 | 1.78 | 1.19
Dee WBT | 6 | 1.82 | 1.63 | 0.49
Pgrafov AVL | 5 | 1.95 | 1.39 | 0.46
Mozman RB | 4 | 1.21 | 1.98 | 0.25
Mozman AVL | 3 | 1.47 | 1.02 | 0.26
Msingh RB | 2 | 1.02 | 0.84 | 0.61
Dee AVL | 1 | 1.03 | 0.77 | 0.47
Dee WBT is the algorithm this article focuses on, and it clearly doesn't perform as well as the top-ranked trees. There are a few reasons for this:
• It allows for a custom search function, which slows searching down.
• Rebalancing is slightly slower due to updating weights and how it compares sibling nodes.
It's important to remember that the weight balanced tree is still bounded by O(log(n)) for all of the normal add(), remove(), and search() operations. So even though its relative performance isn't good, it will never become pathologically bad.
Explanation
Imagine keeping track of the residents in a city. As people decide to move to the city, or move elsewhere, we'd like to update our ledger of residents and also query basic information about them as fast as possible. An array would work, but its O(n) insertion and removal time would make it extremely slow as more people decide to move to the city. This is where self balancing trees come in. Most of the common operations we'll perform will take O(log(n)) time, so even a city of 1 million people will only need about 20 operations to update its ledger.
All of the common self balancing trees have a similar structure: nodes with up to 2 children each, a root node, and some balancing mechanism.
Where the types of trees differ is in their balancing mechanism:
Red-Black: Traveling to any descendant node will cross the same number of black nodes.
AVL: The heights of child nodes will be kept within ±1 of each other.
Weight: The weights of child nodes will be kept within a ratio of 2.5.
In our weight balanced tree we define the weight as the number of descendants of a node, plus the node itself. Mathematically, we can define the weight recursively as left.weight + right.weight + 1. We also define null nodes to have a weight of 0.
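To keep the later code snippets self-contained, here is a minimal sketch of the node structure these definitions imply. The parent link is used by the traversal and indexing sections; WBTree.py's actual fields may differ slightly.

class Node:
    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent  # used when walking up the tree
        self.left = None      # null children are represented by None
        self.right = None
        self.weight = 1       # a lone node has no descendants

def weight(node):
    # Null nodes are defined to have a weight of 0.
    return node.weight if node else 0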
We will omit drawing null nodes from now on in order to avoid clutter. If a node is drawn without a child, then it can be assumed that the child is null.
The tree will focus on maintaining two properties: the relative weights of 2 children should have a ratio of 2.5 or less, and nodes will be in ascending order from left to right. For instance, in the tree below, all children to the left of the root node are less than 7, and all children to the right are greater than 7.
The importance of a self balancing tree is that we may add and remove nodes from the tree while maintaining a logarithmic bound on the tree's height and a sorted ordering on the nodes. This allows us to dynamically track and query data relatively quickly even on large data sets.
We are now prepared to implement the various operations that will maintain the tree's logarithmic height, and also the operations that can make use of the tree once it's populated.
Rotations
Rotations are the fundamental operation that trees use to rebalance nodes. They work by swapping a node's position with one of its children, raising one child and lowering the other in the process. There are two types of rotations that can be performed: left rotations and right rotations.
Let us consider a right rotation on the node A below.
Note that node B will take the place of node A, wherever A was in the tree. It can be seen that this right rotation has the effect of shifting the majority of nodes from the left side of the tree to the right. By doing this, the relative weights of 2 children can be kept close, and it is the crux of the tree's logarithmic performance. The rotation above is what we'll use to rebalance a left-leaning node. The rotation also preserves the sorted order of the nodes by maintaining the following order:
x is to the left of B
B is to the left of y
y is to the left of A
A is to the left of z
Likewise, a left rotation on node A looks like so:
Left rotations can conversely be used to rebalance a node which is leaning to the right.
The python code to perform these rotations is given below. Note that the weights of nodes A and B need to be recalculated after each rotation, since their children's weights may have changed.
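(A sketch of both rotations, written against the Node fields above; the reference implementation may organize this differently. The caller is responsible for re-linking the returned node into its parent.)

def rotate_right(a):
    # B replaces A, and B's right subtree (y) becomes A's new left child.
    b = a.left
    a.left = b.right
    if a.left:
        a.left.parent = a
    b.parent = a.parent
    b.right = a
    a.parent = b
    # Recalculate weights bottom-up: A first, since it is now B's child.
    a.weight = weight(a.left) + weight(a.right) + 1
    b.weight = weight(b.left) + weight(b.right) + 1
    return b

def rotate_left(a):
    # The mirror image of rotate_right().
    b = a.right
    a.right = b.left
    if a.right:
        a.right.parent = a
    b.parent = a.parent
    b.left = a
    a.parent = b
    a.weight = weight(a.left) + weight(a.right) + 1
    b.weight = weight(b.left) + weight(b.right) + 1
    return b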
We will show a concrete example of how rotations are used to rebalance a tree in the next section.
Balancing
As we add and remove nodes from the tree, the tree will become lopsided, with some branches becoming extremely heavy and some extremely light. In order to maintain an O(log(n)) worst-case performance bound on tree operations, the tree will need to be rebalanced periodically.
Ideally, in a perfectly balanced tree, all children will have the same weight. However, since we're working with a tree that can be updated dynamically, we will have to allow some slack in balancing the tree in order to balance in a timely manner. That is, if we're too strict with rebalancing, the act of rebalancing will take longer than any time we might save by using a self balancing tree.
For our tree, we will define a node to be balanced if its left and right child weights are within a ratio of 2.5. That is, 1 / 2.5 ≤ left.weight / right.weight ≤ 2.5. However, defining the actual inequality in python is fairly tricky, and even slight errors can lead to trees that become imbalanced. The actual inequality we will use to detect an imbalance only involves integer math:
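(The original inequality isn't reproduced here. The version below is one plausible integer-only balance test in the spirit of the 2.5 ratio, not necessarily the one WBTree.py uses; in particular the +1 offsets are an assumption, chosen so that a node with a single tiny child, where the raw ratio is infinite, isn't flagged when no rotation could fix it.)

def is_imbalanced(a, b):
    # True when subtree a is too heavy relative to subtree b.
    # 2.5 = 5/2, so the ratio test becomes pure integer math.
    # NOTE: this exact form, especially the +1 offsets, is an assumption.
    return 2 * (weight(a) + 1) > 5 * (weight(b) + 1)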
Choosing other values could lead to the balancing invariant being broken over time, or lead to trees with worse balancing ratios. Hirai et al. cover research into other invariants that can be used.
To actually rebalance a node, we will make use of rotations. Note from the previous section that left and right rotations shift weights differently. In particular, rotating to the left tilts the node's children to the left, and rotating to the right tilts them to the right. In some situations, where the middle node (denoted by y above) is too heavy, simply performing one rotation won't balance the tree. In those situations, we'll need to perform a double rotation. Thus, for a node leaning to the left, we have the following cases:
Single and double rotations will handle every type of imbalance we'll encounter, so there's no need to introduce any more. We'll first show what a single rotation looks like:
Single rotations can fail to balance the node if the middle node is too heavy. If this is the case, then we must left rotate the node's left child before actually rotating the node. The example below shows what would happen if we didn't make this correction.
Rotating the left child beforehand results in a correctly balanced tree.
The cases where a node is leaning to the right are handled exactly the same, only all "left" and "right" references are swapped, including rotations.
The code below is our full rebalancing function. In order to make sure all node weights are correct, we need to rebalance from the node we've added (or removed) all the way up to the root.
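(A sketch of that upward walk, using the helpers above and assuming a WBTree wrapper object that stores the root node.)

def rebalance(tree, node):
    # Walk from the changed node up to the root, recomputing weights and
    # rotating wherever the 2.5 ratio is violated.
    while node:
        node.weight = weight(node.left) + weight(node.right) + 1
        parent = node.parent
        was_left = parent is not None and parent.left is node
        if is_imbalanced(node.left, node.right):
            # Leaning left: double rotate if the middle grandchild is heavier.
            if weight(node.left.right) > weight(node.left.left):
                node.left = rotate_left(node.left)
            node = rotate_right(node)
        elif is_imbalanced(node.right, node.left):
            # Leaning right: the mirror image of the case above.
            if weight(node.right.left) > weight(node.right.right):
                node.right = rotate_right(node.right)
            node = rotate_left(node)
        # Re-link the (possibly new) subtree root to its parent.
        if parent is None:
            tree.root = node
        elif was_left:
            parent.left = node
        else:
            parent.right = node
        node = parent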
Searching
When defining self balancing trees, we stated that any child to the left of a node will satisfy node.value >= child.value, and any child to the right will satisfy node.value <= child.value. While this property allows us to search for nodes, it also presents a problem, since the tree can have multiple nodes with the same value. In such a case, which node should we return when searching for a value?
For the purposes of this article, we will have the search function return the last node going by sorted order. Thus, the rules it will follow are:
1. Start at the root node.
2. If node.value <= value, then move to the node's right child.
3. Otherwise, move to the node's left child.
4. Continue until we reach a null node.
5. Return the last node with node.value = value.
For example, assume we want to search for value = 5 in the following tree.
The python code below shows how our basic search function works.
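(A sketch following the rules above; for simplicity it compares values directly with <= rather than through the custom sorting function the full implementation supports.)

def search(tree, value):
    # Walk down from the root, remembering the last matching node we pass.
    node, match = tree.root, None
    while node:
        if node.value <= value:
            if node.value == value:
                match = node
            node = node.right
        else:
            node = node.left
    return match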
In WBTree.py, the search function has been expanded to allow us to query different ranges in the tree. They include:
• Return the least node = value.
• Return the least node > value.
• Return the least node >= value.
• Return the greatest node = value.
• Return the greatest node < value.
• Return the greatest node <= value.
Adding
Adding nodes is as simple as finding where the node would be if we searched for it, adding it as a child, and then balancing up. Because we defined our search function to go right whenever node.value <= value, this will stably (in the sorting sense) add the node to the tree.
For instance, assume we want to add value = 4 to the following tree:
We would then balance from node 3 upwards until we reach the root node.
The code for adding values is given below. Note that this function will always add a new value. Implementations that require duplicate additions to replace old values will need to modify the add() function.
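(A sketch of add(), again assuming the WBTree wrapper and the helpers above.)

def add(tree, value):
    # Descend as in search(), going right on ties so insertion is stable.
    parent, node = None, tree.root
    while node:
        parent = node
        node = node.right if node.value <= value else node.left
    child = Node(value, parent)
    if parent is None:
        tree.root = child
    elif parent.value <= value:
        parent.right = child
    else:
        parent.left = child
    rebalance(tree, child)  # restore weights and ratios up to the root
    return child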
Traversal
Given a particular node in the tree, we may be interested in finding the next or previous node going by sorted order. That is, if this tree was a sorted array, what would the next and previous elements be? Answering this question will help us not only with navigating a tree, but also with removing nodes in the next section.
Finding next(node) for some node can be split into 2 cases:
1. If node.right is not null, take node.right's left children until we reach a node with no left child.
2. If node.right is null, follow the node up the tree until we move up a left branch.
We can illustrate these movements in the following diagram:
This procedure yields the following python function for next():
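(A sketch of the two cases; the name matches the article, though at module level it shadows Python's builtin next().)

def next(node):
    # Case 1: the successor is the leftmost node of the right subtree.
    if node.right:
        node = node.right
        while node.left:
            node = node.left
        return node
    # Case 2: climb until we arrive at a parent from its left branch.
    while node.parent and node.parent.right is node:
        node = node.parent
    return node.parent  # None when node was the greatest element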
The prev() function is exactly the same as next(), except we swap all instances of node.left with node.right:
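def prev(node):
    # next() with every left/right reference swapped.
    if node.left:
        node = node.left
        while node.right:
            node = node.right
        return node
    while node.parent and node.parent.left is node:
        node = node.parent
    return node.parent  # None when node was the least element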
Removing
Removing a node involves replacing all links to the node with links to its successor, next(node), and then rebalancing. It would be simpler to swap the node's value with that of its successor, and then remove the successor node, but this would invalidate any external pointers a user might have to the successor. Indeed, a user might find it odd that the node that was supposed to be removed still exists, while another node has been deallocated without warning.
There are 3 cases to consider when removing a node.
1. The node has no right child.
2. The successor is the right child of the node.
3. The successor is a distant descendant of the node.
If the node has a null child, then we simply replace the node with its other child and rebalance from the node's parent upwards. For instance, consider removing node 5 in the diagram below.
If a right child exists, then we know that the next node will be a child on the right-hand side of the node. We also know that, by definition, the next node will have no left child. Thus, for case 2, if the next node is the node's right child, then we can replace the node with the next node without having to worry about shuffling around either node's children.
For case 3, the next node will be a distant child of the node. So, we must first remove the next node (as in case 1), and then replace the node with the next node.
The complete python code for removing a given node is shown below.
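(A sketch of the three cases. The replace_child() helper is hypothetical, introduced here just to repoint parent links; the reference implementation may arrange this differently.)

def remove(tree, node):
    def replace_child(parent, old, new):
        # Hypothetical helper: repoint parent's link (or the root) to 'new'.
        if parent is None:
            tree.root = new
        elif parent.left is old:
            parent.left = new
        else:
            parent.right = new
        if new:
            new.parent = parent

    if node.right is None:
        # Case 1: no successor below the node; promote the left child.
        replace_child(node.parent, node, node.left)
        rebalance(tree, node.parent)
        return
    succ = node.right
    while succ.left:
        succ = succ.left
    if succ is node.right:
        # Case 2: the successor is the right child; it absorbs node's left subtree.
        replace_child(node.parent, node, succ)
        succ.left = node.left
        if succ.left:
            succ.left.parent = succ
        rebalance(tree, succ)
    else:
        # Case 3: splice the successor out (it has no left child), then
        # drop it into node's position and adopt both of node's children.
        start = succ.parent  # the deepest structural change
        replace_child(succ.parent, succ, succ.right)
        replace_child(node.parent, node, succ)
        succ.left, succ.right = node.left, node.right
        if succ.left:
            succ.left.parent = succ
        succ.right.parent = succ
        rebalance(tree, start)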
Indexing
The biggest benefit of balancing by weights (instead of by height or color, as AVL and red-black trees do) is that weights grant us the ability to index nodes in O(log(n)) time, even as we are modifying the tree.
We'll index nodes in the same way that we would index values if they were sorted from least to greatest. So index 0 would be the smallest value in the tree, and index n-1 would be the greatest. We also need to remember that children to the left of the node have values equal to or less than the node. Putting all that together, we'll follow these simple rules:
1. If index > left.weight, go to the right and subtract index -= left.weight + 1.
2. If index = left.weight, return the current node.
3. Otherwise, go to the left.
Using the tree below, we can show how to find the 10'th element in the tree. Note that all of the numbers we'll be using will refer to weights.
The code for this ends up being fairly simple.
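(A sketch following the three rules above.)

def node_at(tree, index):
    # Steer down the tree using left-subtree weights.
    node = tree.root
    while node:
        left_weight = weight(node.left)
        if index > left_weight:
            index -= left_weight + 1
            node = node.right
        elif index == left_weight:
            return node
        else:
            node = node.left
    return None  # the index was out of range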
To find the index of a specific node, we simply do the opposite: we trace the path from a node to the root, and every time our node is a right child, we add the weight of the parent's left child plus 1 to our index. A sketch is shown below.
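def index_of(node):
    # Everything in the node's own left subtree precedes it in sorted order...
    index = weight(node.left)
    # ...as does the parent and the parent's left subtree, every time we
    # climb up a right branch.
    while node.parent:
        if node.parent.right is node:
            index += weight(node.parent.left) + 1
        node = node.parent
    return index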
Time Complexity
Proving that the weights of two children will always stay within a ratio of 2.5 is fairly difficult. We will instead take this balance as fact, and simply prove that the height of the tree is bounded by h ≤ 2.07 * log2(n).
Let N be the weight of a node, and let L and R be the weights of its two children. For simplicity, we'll assume N = L + R. Our first step is to bound the weight of either child by N. Suppose L is the heavier child: the balance invariant gives L ≤ 2.5 * R = 2.5 * (N - L), so 3.5 * L ≤ 2.5 * N, and therefore L ≤ N / 1.4.
Thus, the maximum weight of a child is 1 / 1.4 times the weight of the parent. We now need to see at what height the weight of a child becomes 1. Starting from a root of weight n, a node at depth h has weight at most n / 1.4^h, and setting n / 1.4^h = 1 gives h = log(n) / log(1.4) ≤ 2.07 * log2(n).
Thus the height is logarithmically bounded, as desired.
Operational Time Complexity
Now, we can show that the worst-case time complexity for our operations is also O(log(n)).
Rebalancing only moves up the tree, which is O(log(n)), and performs 1 or 2 rotations per node, which are O(1). Thus rebalancing is O(log(n)).
Searching only moves down the tree, which is O(log(n)), and performs one comparison per node, which is O(1). Thus searching is O(log(n)).
In the worst case, next/prev must travel all the way up or down the tree, which is O(log(n)). Averaged over a full traversal of the tree, however, each next/prev call performs only about 2 operations.
Adding a node involves searching for a node's destined position, which is O(log(n)), and rebalancing, which is O(log(n)). Thus adding nodes is O(log(n)).
Removing a node involves at most finding the node, which is O(log(n)), finding the next node, which is O(log(n)), swapping, which is O(1), and balancing, which is O(log(n)). Thus, removing a node is O(log(n)).
Finding a node given an index involves comparing weights and moving down the tree, which is O(log(n)). Finding an index given a node only moves up the tree, which is also O(log(n)). Thus indexing is O(log(n)).
Notes
Better balancing is possible by avoiding ratios entirely. Instead, we can attempt a double or single rotation and see which one results in a better balance. More research is needed to determine when this balancing should be performed and what its worst-case balance is. I've left some notes in testing.zip.
Unrolling rebalance() speeds up the tree dramatically, but I omitted it from the article and reference implementation because it's much more complicated. The unrolled version can also be found in testing.zip.
For environments with finite integer sizes, performing the balance comparison without overflows is possible, but it's slower and obscures the reference implementation too much.
I've also implemented an AVL tree for education and to compare speeds. Its ability to terminate early gives it a speed advantage over weight balanced trees.
I'd like to implement set theory operations in the future. Until I actually need them, I don't know what I'd like the API to look like, so I'll hold off for now.