Loading ...
Sorry, an error occurred while loading the content.

Re: Speller data structures

Expand Messages
  • Bram Moolenaar
    ... I now (finally!) had time to look closer at this code. I must say it looks good. The extra tricks to share the tails of the tree help a lot to keep the
    Message 1 of 3 , Jun 2, 2005
    • 0 Attachment
      Olaf Seibert wrote:

      > Recently I wrote about a trie data structure for spelling word lists.
      > There was some doubt as to the memory efficiency. Therefore I did a
      > small test with the Polish wordlist, which was reputed to be 60
      > megabytes. The one I found was only 40 but was 60 in some expanded form
      > for Aspell. My file was only 1 megabyte.

      I now (finally!) had time to look closer at this code. I must say it
      looks good. The extra tricks to share the tails of the tree help a lot
      to keep the storage low. For the languages that I tried the storage is
      far less than what was used until now.

      The code is a bit low on comments, thus it's difficult to understand
      what exactly is going on. Hopefully I didn't understand something
      wrong.

      Main drawback that I see is the linear search to lookup a character in a
      list of siblings. Since most checked words will find a match, this will
      be much slower than using a hash lookup.

      I think the linear search can be replaced by a binary search. The
      siblings are already sorted, this just requires storing the siblings in
      a way to allow the binary search. This probably disallows sharing
      siblings, but that is a small price to pay.

      The code becomes a lot simpler, since all the stuff for affix
      compression can be omitted. And the trouble with word vs non-word
      characters is gone. This will avoid bugs and complexity.

      Of course there are still a lot of things to do, such as figuring out an
      efficient way to store multi-byte text, support for regions, flags for
      allcaps, etc.

      --
      How To Keep A Healthy Level Of Insanity:
      9. As often as possible, skip rather than walk.

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
      \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///
    Your message has been successfully submitted and would be delivered to recipients shortly.