Loading ...
Sorry, an error occurred while loading the content.
 

Speller data structures

Expand Messages
  • Rhialto
    Recently I wrote about a trie data structure for spelling word lists. There was some doubt as to the memory efficiency. Therefore I did a small test with the
    Message 1 of 3 , May 9 2:21 PM
      Recently I wrote about a trie data structure for spelling word lists.
      There was some doubt as to the memory efficiency. Therefore I did a
      small test with the Polish wordlist, which was reputed to be 60
      megabytes. The one I found was only 40 but was 60 in some expanded form
      for Aspell. My file was only 1 megabyte.

      I got the aspell polish dictionary from http://www.kurnik.pl/dictionary/
      (alt-aspell-pl-20050509.tar.bz2). I used aspell instead of myspell
      because I did have aspell installed but did not want to get the whole
      bulk of OpenOffice to get myspell.

      I started with an 8 MB file:

      -rw-r--r-- 1 rhialto wheel 8195901 May 9 06:49 pl.cwl

      I installed it and got a 60M pl.rws file:

      -rw-r--r-- 1 rhialto wheel 62668800 May 9 22:28 pl.rws

      Using "aspell -d pl dump master >words" I extracted the plain word list
      (about 40 M).

      -rw-r--r-- 1 rhialto wheel 40526289 May 9 22:41 words

      I compiled it to my trie data:

      alt-aspell-pl-20050509$ ./maketrie <words >words2
      Growing trie: 10000 -> 20000 nodes
      Growing trie: 20000 -> 40000 nodes
      Growing trie: 40000 -> 80000 nodes
      Growing trie: 80000 -> 160000 nodes
      Growing trie: 160000 -> 320000 nodes
      Growing trie: 320000 -> 640000 nodes
      Growing trie: 640000 -> 1280000 nodes
      Growing trie: 1280000 -> 2560000 nodes
      Growing trie: 2560000 -> 5120000 nodes
      Growing trie: 5120000 -> 10240000 nodes
      trie_fill: 8195820 nodes in use for 3073375 words 40526289 characters (0.202235)
      trie_compress: 7925612 nodes freed, 270208 left (0.032969)
      Overall compressed size: 0.006667
      270208 nodes in trie, 317619 nodes saved: ok (due to nodes for shared siblings)
      0.087919 cq 0.103345 nodes / word, 13.186249 chars / word
      47411 shared siblings, 152782 unshared siblings
      133627 shared children, 117425 unshared children
      using 1588095 bytes
      (would have used 2431872 bytes with explicit siblings)

      Stored on disk this takes just over 1 megabyte:

      -rw-r--r-- 1 rhialto wheel 1064376 May 9 22:32 bindump
      -rw-r--r-- 1 rhialto wheel 40526289 May 9 22:46 words2

      and in memory, as reported by the maketrie program, it takes 1.588.095
      bytes. You have to read it into memory all at once, the file is not
      random-accessible.

      Words2 is the same size and has the same number of lines (a quick sanity
      check), except it is sorted.

      $ ./readtrie </dev/null >words3
      test 1a: 0
      test 1A: 0
      test 2: 0

      -rw-r--r-- 1 rhialto wheel 40526289 May 9 22:50 words3

      $ diff words2 words3
      $

      (Actually, using Polish is kind-of cheating. Languages with fewer words,
      or languages that have less regular word endings, have a far lower
      compression ratio. Dutch or English wordlists probably are about the
      same size, on disk and in memory, as this Polish list of 3.073.375
      words.)

      -Olaf.
      --
      ___ Olaf 'Rhialto' Seibert -- rhialto/at/falu.nl
      \X/ Hi! I'm a signature virus! Copy me to your .signature to help me spread!
    • Bram Moolenaar
      ... That s a good result. The current file size for the Polish Vim .spl file is about 3 Mbyte. That includes flags and handling of non-word characters, thus
      Message 2 of 3 , May 15 6:52 AM
        Olaf Seibert wrote:

        > Recently I wrote about a trie data structure for spelling word lists.
        > There was some doubt as to the memory efficiency. Therefore I did a
        > small test with the Polish wordlist, which was reputed to be 60
        > megabytes. The one I found was only 40 but was 60 in some expanded form
        > for Aspell. My file was only 1 megabyte.

        That's a good result. The current file size for the Polish Vim .spl
        file is about 3 Mbyte. That includes flags and handling of non-word
        characters, thus it's not completely comparable.

        I'll have a better look at the code later. It looks like you could
        store a character as an int at a node to support Unicode. That should
        not increase the memory use much (struct size is often rounded to 4 bytes
        anyway).

        > (Actually, using Polish is kind-of cheating. Languages with fewer words,
        > or languages that have less regular word endings, have a far lower
        > compression ratio. Dutch or English wordlists probably are about the
        > same size, on disk and in memory, as this Polish list of 3.073.375
        > words.)

        Polish has many words that are alike. Thus this test may give a wrong
        impression. Can you do the same for English and/or Dutch?

        Before this could be used in Vim there would still be a lot of work
        (esp. for handling non-word characters). I'm not sure if it's worth the
        try to see if this approach works better than the current
        implementation. The Trie code doesn't look much simpler.

        - Bram

        --
        "Hegel was right when he said that we learn from history that man can
        never learn anything from history." (George Bernard Shaw)

        /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
        /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
        \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
        \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///
      • Bram Moolenaar
        ... I now (finally!) had time to look closer at this code. I must say it looks good. The extra tricks to share the tails of the tree help a lot to keep the
        Message 3 of 3 , Jun 2, 2005
          Olaf Seibert wrote:

          > Recently I wrote about a trie data structure for spelling word lists.
          > There was some doubt as to the memory efficiency. Therefore I did a
          > small test with the Polish wordlist, which was reputed to be 60
          > megabytes. The one I found was only 40 but was 60 in some expanded form
          > for Aspell. My file was only 1 megabyte.

          I now (finally!) had time to look closer at this code. I must say it
          looks good. The extra tricks to share the tails of the tree help a lot
          to keep the storage low. For the languages that I tried the storage is
          far less than what was used until now.

          The code is a bit low on comments, thus it's difficult to understand
          what exactly is going on. Hopefully I didn't understand something
          wrong.

          Main drawback that I see is the linear search to lookup a character in a
          list of siblings. Since most checked words will find a match, this will
          be much slower than using a hash lookup.

          I think the linear search can be replaced by a binary search. The
          siblings are already sorted, this just requires storing the siblings in
          a way to allow the binary search. This probably disallows sharing
          siblings, but that is a small price to pay.

          The code becomes a lot simpler, since all the stuff for affix
          compression can be omitted. And the trouble with word vs non-word
          characters is gone. This will avoid bugs and complexity.

          Of course there are still a lot of things to do, such as figuring out an
          efficient way to store multi-byte text, support for regions, flags for
          allcaps, etc.

          --
          How To Keep A Healthy Level Of Insanity:
          9. As often as possible, skip rather than walk.

          /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
          /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
          \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
          \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///
        Your message has been successfully submitted and would be delivered to recipients shortly.