Loading ...
Sorry, an error occurred while loading the content.

Re: Speller data structures

Expand Messages
  • Bram Moolenaar
    ... That s a good result. The current file size for the Polish Vim .spl file is about 3 Mbyte. That includes flags and handling of non-word characters, thus
    Message 1 of 3 , May 15, 2005
    • 0 Attachment
      Olaf Seibert wrote:

      > Recently I wrote about a trie data structure for spelling word lists.
      > There was some doubt as to the memory efficiency. Therefore I did a
      > small test with the Polish wordlist, which was reputed to be 60
      > megabytes. The one I found was only 40 but was 60 in some expanded form
      > for Aspell. My file was only 1 megabyte.

      That's a good result. The current file size for the Polish Vim .spl
      file is about 3 Mbyte. That includes flags and handling of non-word
      characters, thus it's not completely comparable.

      I'll have a better look at the code later. It looks like you could
      store a character as an int at a node to support Unicode. That should
      not increase the memory use much (struct size is often rounded to 4 bytes
      anyway).

      > (Actually, using Polish is kind-of cheating. Languages with fewer words,
      > or languages that have less regular word endings, have a far lower
      > compression ratio. Dutch or English wordlists probably are about the
      > same size, on disk and in memory, as this Polish list of 3.073.375
      > words.)

      Polish has many words that are alike. Thus this test may give a wrong
      impression. Can you do the same for English and/or Dutch?

      Before this could be used in Vim there would still be a lot of work
      (esp. for handling non-word characters). I'm not sure if it's worth the
      try to see if this approach works better than the current
      implementation. The Trie code doesn't look much simpler.

      - Bram

      --
      "Hegel was right when he said that we learn from history that man can
      never learn anything from history." (George Bernard Shaw)

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
      \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///
    • Bram Moolenaar
      ... I now (finally!) had time to look closer at this code. I must say it looks good. The extra tricks to share the tails of the tree help a lot to keep the
      Message 2 of 3 , Jun 2, 2005
      • 0 Attachment
        Olaf Seibert wrote:

        > Recently I wrote about a trie data structure for spelling word lists.
        > There was some doubt as to the memory efficiency. Therefore I did a
        > small test with the Polish wordlist, which was reputed to be 60
        > megabytes. The one I found was only 40 but was 60 in some expanded form
        > for Aspell. My file was only 1 megabyte.

        I now (finally!) had time to look closer at this code. I must say it
        looks good. The extra tricks to share the tails of the tree help a lot
        to keep the storage low. For the languages that I tried the storage is
        far less than what was used until now.

        The code is a bit low on comments, thus it's difficult to understand
        what exactly is going on. Hopefully I didn't understand something
        wrong.

        Main drawback that I see is the linear search to lookup a character in a
        list of siblings. Since most checked words will find a match, this will
        be much slower than using a hash lookup.

        I think the linear search can be replaced by a binary search. The
        siblings are already sorted, this just requires storing the siblings in
        a way to allow the binary search. This probably disallows sharing
        siblings, but that is a small price to pay.

        The code becomes a lot simpler, since all the stuff for affix
        compression can be omitted. And the trouble with word vs non-word
        characters is gone. This will avoid bugs and complexity.

        Of course there are still a lot of things to do, such as figuring out an
        efficient way to store multi-byte text, support for regions, flags for
        allcaps, etc.

        --
        How To Keep A Healthy Level Of Insanity:
        9. As often as possible, skip rather than walk.

        /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
        /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
        \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
        \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///
      Your message has been successfully submitted and would be delivered to recipients shortly.