Over the past year I’ve been heavily involved with writing compilers, working with HTML DOMs, working with decision trees, etc. Trees really are everywhere in the programming world, if you step outside of the regular business logic coding. So I’d like to share a few tips and tricks when working with complex systems that involve trees.
1. Use IDs + arenas, not pointers
Often I see people implementing trees directly with pointers, i.e. every Node contains its content, a pointer to the parent + one or more pointers to children. This is bad, because of several reasons:
1. Pointers can’t be checked to point to a valid tree – maybe the parent ore the children are already deleted, and the pointer is invalid – who knows?
2. Pointers can’t be checked to belong to a specific tree or be bounds-checked (because with pointers, there are no “bounds”).
3. Pointers lead to lots of cache misses when transforming or iterating trees in a linear fashion – want to update some information in the whole tree? Tough luck, says your CPUs L2 cache.
Effects of different data layouts on the L2 cache
The two most common operations on a tree are, at least in my experience:
● Check if the parent has some kind of property
● Do some operation on all children, regardless of hierarchy
Imagine iterating hierarchically through all the nodes. This means that the nodes are potentially in an unstructured order – i.e. when iterating in the order of Node 1 – 5 – 10 – 13, less nodes are cached when the content of the node is directly attached to the hierarchy information. Especially as nodes get bigger (> 100 bytes), this can become increasingly important.
Lastly, doing this allows you to re-use the hierarchy of the tree (or parts of it) independent of the content, in-between different trees. Now you might not say that this is much data, but imagine a 40-byte “header” (containing the IDs for the parent, next child, last child, previous and next sibling @ 8 bytes each) for an 200-byte node content. If you have 10.000 nodes, this means, you have 10.000 * 40 = 390KB of duplicated information per tree. Now imagine having five or six different trees – that’s 2 megabytes of redundancy.
4. IDs are as efficient as pointers and can do everything that pointers can – in the end, pointers are just numbers. When building release versions of your program, simply un-check “bounds-checking” and boom, same performance as pointers.
5. It allows you to diff trees (to avoid heavy computation of sub-trees) and cache information. With pointers (which can be random numbers), it is very hard to cache them.
2. Don’t mutate trees, just make new ones
Mutating trees can very easily lead to bugs – even when working with bounds-checked IDs instead of pointers – instead of getting a segfault, now you get an out-of-bounds exception: not much better. So my suggestion towards managing many trees and subtrees is to try building a “data pipeline”, as far as your resources allow it (mutating node content in-place is less expensive than creating a new node content container).
But here is where private fields and getters get important: By making the arena private, it allows you to only allow indexing the current tree by a certain ID type. Meaning, a ClipNodeTree can only be indexed by a ClipNodeId, a HtmlNodeTree only be indexed by a HtmlNodeId, and so on. If your language of choice doesn’t allow for overloading the index operator – simply create a getter method that can do the same (indexing into the tree). This way you avoid using the wrong index type on different trees or using an outdated index.
More advanced concepts include generational IDs (where each ID and tree also carries its “generation” and outdated IDs are ignored). But don’t get stuck on trying to manage only one tree. It’s nearly impossible.
3. Create passes
When processing complicated tree structures, it’s a good idea to break down the process into “passes”.
Creating passes to manage complexity in data pipelines
For example, when creating the layout for a styled tree, there are usually seperate passes:
1. Calculating how much space is absolutely required for a certain node (the minimium content width)
2. Calculating the minimum and maximum widths of each item. While a basic solver takes just min-width, width and max-width into account, it’s also important to look at the preferred width of items – for example, a 1920×1080 image might “want” to be 1920 pixels wide, but if the max-width is constrained to 700px, then the image can only be 700px wide.
3. Maximizing the width of items. This was probably the hardest part – when distributing the remaining width among items, how do you respect max-width constraints? Because if you distribute the space and you go over the max-width of an item, then you have to “pull back” all other items again, which could then violate the min-width of some other node, quickly leading to an O(n²) situation. However, this is fairly easily solvable if you remember, that a max-width just means “max width” – it’s not the preferred or exact width. Meaning, a width of 0px technically satisfies any max-width constraint. So the solution to this problem is to simply minimize all constraints (setting items that don’t have a min-width to 0px), then distribute the space until no max-width are violated.
4. Layouting the height – this has to be done after the width, since items like text and images depend on the width, in order to solve how high they are. For the aforementioned image, the height will be 700 / 1920 * 1080 = 393 pixels (so that the aspect ratio is preserved). Text, on the other hand, has a natural height, but you can only determine the height of a text if you know the width.
While this applies specifically to layout trees, the takeaway for solving these kinds of problems is fairly simple: Minimize first, then maximize later.
Struct sharing is a technique I hadn’t heard about until I read this very informative post on how browsers render a page. Although this page is rather outdated, it does provide invaluable information for working with trees.
Struct sharing allows you to re-use node content is among nodes in a tree. For example, if you have a list of styles that apply to a lot of nodes, then you can seperate those (shared) styles out and keep only one copy of them in a cache. This is different to reference-counted styles, since cascading may merge styles together, so it can only work if two nodes have the exact same style.
It also depends on how large your nodes data is and if you can re-use components. Is it useful in practice? Maybe, it depends on how much memory you have available.
These are my random notes on my hands-on experiences when working with trees, specifically layout and rendering trees. Not all of it is applicable to all applications, but I hope you enjoyed this post anyways. Have a nice day!