Loading ...
Sorry, an error occurred while loading the content.

How to implement a prefix tree?

Expand Messages
  • Shlomi Fish
    How does one implement a prefix tree in a memory-efficient way? If I understand correctly a prefix tree contains one node for every prefix in the collection of
    Message 1 of 3 , Oct 25, 2003
    • 0 Attachment
      How does one implement a prefix tree in a memory-efficient way? If I
      understand correctly a prefix tree contains one node for every prefix in
      the collection of words, with the links between the nodes being the letter
      that are actually present. I.e: a prefix tree for a,ab,ad,bc would be:

      [Start] ---- a ---- ab----[Terminator]
      | |
      | | ---- ad----[Terminator]
      | |
      | ---- [Terminator]
      |
      -----b ---- bc----[Terminator]

      Now, the question is how to implement it efficiently. You can have for
      each node a vector of pointers, one for each character, as well as a place
      reserved for indicating that the word can terminate at this point. That
      would take 1 bit (for the terminator) and [Number of Characters] *
      [Pointer Width] (for the pointers) memory. If we have a 32-bit machine and
      256 characters that would imply a 1024 bytes plus one bit per node, which
      is quite a lot.

      Another way I can think of is to have an option for a binary tree of the
      sub-nodes, where empty nodes are pruned. But I wonder how the
      professionals do it.

      Regards,

      Shlomi Fish

      ----------------------------------------------------------------------
      Shlomi Fish shlomif@...
      Home Page: http://t2.technion.ac.il/~shlomif/

      An apple a day will keep a doctor away. Two apples a day will keep two
      doctors away.

      Falk Fish
    • Nadav Har'El
      ... Radix-trees (or Tries, or whatever your favorite book calls them) are well-known data structures, and your favorite algorithm book is bound to have several
      Message 2 of 3 , Oct 25, 2003
      • 0 Attachment
        On Sat, Oct 25, 2003, Shlomi Fish wrote about "[hackers-il] How to implement a prefix tree?":
        > Another way I can think of is to have an option for a binary tree of the
        > sub-nodes, where empty nodes are pruned. But I wonder how the
        > professionals do it.

        Radix-trees (or Tries, or whatever your favorite book calls them) are
        well-known data structures, and your favorite algorithm book is bound to
        have several algorithms to choose from.

        The main reason why the data structure you suggested "sucks" is because
        each one of your node is full, carrying pointers to all 256 characters.
        This is rarely needed in practice (it depends on your application of
        course; I'm thinking of a text-processing application, where character-
        based radix trees like you suggested are most often used).
        Another problem your data structure had was its use of 32 bit pointers,
        which are often an overkill (unless you expect a billion nodes in your
        tree) and are hard to "overload" with extra information (where do you plan to
        put that extra "terminator" bit?).

        In Hspell, the main data structure is indeed a radix-tree (a tree where
        a child is chosen based on the next character in a string), but one that
        was optimized to use a small amount of memory. Instead of each node having
        256 children, our data-structure has 4 types of nodes: leaf nodes (that take
        only 4 bytes of memory), small nodes (with only two children, take 16 bytes),
        medium nodes (with o8 children, takes 44 bytes) and full nodes (with the 29
        possible children and taking 120 bytes). We never have 256 children because
        we only deal with Hebrew letters and a few more.
        Nodes that are not full have a table of which characters they contain - making
        them slightly slower to search but still very quick (and most importantly,
        O(length of search string), just like hash tables that this radix-tree was
        meant to replace).

        Another important observation was that I don't actually need 32-bit pointers
        in the tree, because we'll never have billions of entries in the tree (we
        expect to have about a million). Instead, we use indices, and overload the
        highest two bits with bits to determine the type of node we're dealing with
        (we're left with 30 bits for our values and indices - more than enough for
        our needs).

        Using this sort of implementation, Hspell, with its 257,185 word dictionary
        takes about 26 bytes on average per node (a total of 3400K), or about
        12 bytes per word in the dictionary. This is quite better than any hash-table
        implementation I can think of, and is extremely fast to build, which is why
        I chose this data structure for Hspell. Note that that 12 bytes per word figure
        already includes a 30-bit value for each word (we use this to hold a bitmap of
        which prefixes can be used on this word), so the data-structure actually takes
        just 8 bytes per word in addition to the values. This is pretty good (I think).

        Since Hspell is free software, you are free to look at the code (check out
        the file c/dict_radix.c.

        As you can guess, however, it's possible to do better. If we had more than
        4 types of nodes, we could saved more memory because less space is wasted
        by empty children. With 3 bits overloaded, we could have 8 types of nodes;
        with 4 bits overloaded (leaving 28 bits for values and/or indices, still
        enough for most applications) we could have 16 types of nodes. But code
        doing all these cases quickly gets extremely ugly.
        A more reasonable method, that I haven't tried myself but saw in books, is
        ternary trees: when you have the next input character "g", you don't decide
        what to do based on just one node; Instead, say the node holds a character "e".
        You now go to one of three possible nodes based on whether "g" is smaller than,
        equal or larger than "e". Unless you chose the "equal" case, you continue
        with the same input character without going forward to the next one.
        Check out your favorite algorithm for the ternary tree algorithm, or if
        you can't find it ask me for a reference.

        Good luck.

        --
        Nadav Har'El | Saturday, Oct 25 2003, 30 Tishri 5764
        nyh@... |-----------------------------------------
        Phone: +972-53-790466, ICQ 13349191 |A cat has claws ending its paws. A
        http://nadav.harel.org.il |sentence has a pause ending its clause.
      • nyharel
        Robert Sedgewick, the same guy who wrote my favorite algorithms book, has a paper on Ternary trees and related stuff. Check out:
        Message 3 of 3 , Oct 26, 2003
        • 0 Attachment
          Robert Sedgewick, the same guy who wrote my favorite algorithms book,
          has a paper on Ternary trees and related stuff. Check out:

          http://www.cs.princeton.edu/~rs/strings/paper.ps

          I also found a rumor that such an algorithm is already included in
          "libiberty", whatever that is; Check out:
          http://gcc.gnu.org/ml/gcc-patches/2001-04/msg00840.html

          > A more reasonable method, that I haven't tried myself but saw in
          books, is
          > ternary trees: when you have the next input character "g", you don't
          decide
          > what to do based on just one node; Instead, say the node holds a
          character "e".
          > You now go to one of three possible nodes based on whether "g" is
          smaller than,
          > equal or larger than "e". Unless you chose the "equal" case, you
          continue
          > with the same input character without going forward to the next one.
          > Check out your favorite algorithm for the ternary tree algorithm, or if
          > you can't find it ask me for a reference.
          >
          > Good luck.
          >
          > --
          > Nadav Har'El | Saturday, Oct 25 2003, 30
          Tishri 5764
          > nyh@m... |-----------------------------------------
          > Phone: +972-53-790466, ICQ 13349191 |A cat has claws ending its paws. A
          > http://nadav.harel.org.il |sentence has a pause ending its
          clause.
        Your message has been successfully submitted and would be delivered to recipients shortly.