How to implement a prefix tree?
- How does one implement a prefix tree in a memory-efficient way? If I
understand correctly a prefix tree contains one node for every prefix in
the collection of words, with the links between the nodes being the letter
that are actually present. I.e: a prefix tree for a,ab,ad,bc would be:
[Start] ---- a ---- ab----[Terminator]
| | ---- ad----[Terminator]
| ---- [Terminator]
-----b ---- bc----[Terminator]
Now, the question is how to implement it efficiently. You can have for
each node a vector of pointers, one for each character, as well as a place
reserved for indicating that the word can terminate at this point. That
would take 1 bit (for the terminator) and [Number of Characters] *
[Pointer Width] (for the pointers) memory. If we have a 32-bit machine and
256 characters that would imply a 1024 bytes plus one bit per node, which
is quite a lot.
Another way I can think of is to have an option for a binary tree of the
sub-nodes, where empty nodes are pruned. But I wonder how the
professionals do it.
Shlomi Fish shlomif@...
Home Page: http://t2.technion.ac.il/~shlomif/
An apple a day will keep a doctor away. Two apples a day will keep two
- On Sat, Oct 25, 2003, Shlomi Fish wrote about "[hackers-il] How to implement a prefix tree?":
> Another way I can think of is to have an option for a binary tree of theRadix-trees (or Tries, or whatever your favorite book calls them) are
> sub-nodes, where empty nodes are pruned. But I wonder how the
> professionals do it.
well-known data structures, and your favorite algorithm book is bound to
have several algorithms to choose from.
The main reason why the data structure you suggested "sucks" is because
each one of your node is full, carrying pointers to all 256 characters.
This is rarely needed in practice (it depends on your application of
course; I'm thinking of a text-processing application, where character-
based radix trees like you suggested are most often used).
Another problem your data structure had was its use of 32 bit pointers,
which are often an overkill (unless you expect a billion nodes in your
tree) and are hard to "overload" with extra information (where do you plan to
put that extra "terminator" bit?).
In Hspell, the main data structure is indeed a radix-tree (a tree where
a child is chosen based on the next character in a string), but one that
was optimized to use a small amount of memory. Instead of each node having
256 children, our data-structure has 4 types of nodes: leaf nodes (that take
only 4 bytes of memory), small nodes (with only two children, take 16 bytes),
medium nodes (with o8 children, takes 44 bytes) and full nodes (with the 29
possible children and taking 120 bytes). We never have 256 children because
we only deal with Hebrew letters and a few more.
Nodes that are not full have a table of which characters they contain - making
them slightly slower to search but still very quick (and most importantly,
O(length of search string), just like hash tables that this radix-tree was
meant to replace).
Another important observation was that I don't actually need 32-bit pointers
in the tree, because we'll never have billions of entries in the tree (we
expect to have about a million). Instead, we use indices, and overload the
highest two bits with bits to determine the type of node we're dealing with
(we're left with 30 bits for our values and indices - more than enough for
Using this sort of implementation, Hspell, with its 257,185 word dictionary
takes about 26 bytes on average per node (a total of 3400K), or about
12 bytes per word in the dictionary. This is quite better than any hash-table
implementation I can think of, and is extremely fast to build, which is why
I chose this data structure for Hspell. Note that that 12 bytes per word figure
already includes a 30-bit value for each word (we use this to hold a bitmap of
which prefixes can be used on this word), so the data-structure actually takes
just 8 bytes per word in addition to the values. This is pretty good (I think).
Since Hspell is free software, you are free to look at the code (check out
the file c/dict_radix.c.
As you can guess, however, it's possible to do better. If we had more than
4 types of nodes, we could saved more memory because less space is wasted
by empty children. With 3 bits overloaded, we could have 8 types of nodes;
with 4 bits overloaded (leaving 28 bits for values and/or indices, still
enough for most applications) we could have 16 types of nodes. But code
doing all these cases quickly gets extremely ugly.
A more reasonable method, that I haven't tried myself but saw in books, is
ternary trees: when you have the next input character "g", you don't decide
what to do based on just one node; Instead, say the node holds a character "e".
You now go to one of three possible nodes based on whether "g" is smaller than,
equal or larger than "e". Unless you chose the "equal" case, you continue
with the same input character without going forward to the next one.
Check out your favorite algorithm for the ternary tree algorithm, or if
you can't find it ask me for a reference.
Nadav Har'El | Saturday, Oct 25 2003, 30 Tishri 5764
Phone: +972-53-790466, ICQ 13349191 |A cat has claws ending its paws. A
http://nadav.harel.org.il |sentence has a pause ending its clause.
- Robert Sedgewick, the same guy who wrote my favorite algorithms book,
has a paper on Ternary trees and related stuff. Check out:
I also found a rumor that such an algorithm is already included in
"libiberty", whatever that is; Check out:
> A more reasonable method, that I haven't tried myself but saw inbooks, is
> ternary trees: when you have the next input character "g", you don'tdecide
> what to do based on just one node; Instead, say the node holds acharacter "e".
> You now go to one of three possible nodes based on whether "g" issmaller than,
> equal or larger than "e". Unless you chose the "equal" case, youcontinue
> with the same input character without going forward to the next one.Tishri 5764
> Check out your favorite algorithm for the ternary tree algorithm, or if
> you can't find it ask me for a reference.
> Good luck.
> Nadav Har'El | Saturday, Oct 25 2003, 30
> nyh@m... |-----------------------------------------clause.
> Phone: +972-53-790466, ICQ 13349191 |A cat has claws ending its paws. A
> http://nadav.harel.org.il |sentence has a pause ending its