Loading ...
Sorry, an error occurred while loading the content.

RE: Spell checking using a Bloom filter

Expand Messages
  • Bram Moolenaar
    ... When going through the tree you only need a pointer to the first character. Then follow the nodes until a possible end-of-word is found (NUL byte). This
    Message 1 of 6 , Jun 21, 2005
    • 0 Attachment
      Servatius Brandt wrote:

      > The advantage of the trie is that you can read its contents. If "et"
      > fails as a whole word, you can do it the other way round: look what
      > continuations of "et" are stored in the trie (maybe just "et al.") and
      > check whether one of those is actually used in the text. I didn't look
      > on the spell code very deeply so far, but I bet that's what you are
      > actually doing.

      When going through the tree you only need a pointer to the first
      character. Then follow the nodes until a possible end-of-word is found
      (NUL byte). This position is remembered, first the search in the tree
      continues to find the longest match. Then all the remembered positions
      are checked for an actual match (checking the case, region, affixes and
      a non-word character following), starting with the longest possible word
      and continuing for shorter words if there is no full match.

      This is very efficient, especially since most words are good and will be
      found quickly.

      > > It's indeed surprising how efficient the trie works for some languages.
      > [...]
      > > Polish indeed has lots of prefixes and suffixes. The problem of
      > > handling them separately is that it becomes very complex. My previous
      > > implementation with a hash table ran into this problem. Also keep in
      > > mind that affixes can have non-word characters (French uses "d'" and
      > > "l'" a lot).
      >
      > A colleage of mine is Czech, and he says all slavic languages use lots
      > of prefixes and suffixes. Czech declination uses seven cases,
      > Lithuanian even fifteen. So what's special with Polish?
      >
      > Does it matter whether the spell files make use of the .aff file to
      > define the prefixes and suffixes or whether all the combinations are
      > simply listed in the .dic file?

      For Vim it doesn't matter, since when using a .dic and .aff file
      internally all combinations are generated and put in the tree. Reading
      a list of words would result in exactly the same tree. If the files
      specify the same words, of course.

      Thus for someone maintaining the word list he can edit the .dic and .aff
      files if that's easier than maintaining a full word list. There are
      tools to convert one to the other, thus you can change your mind and/or
      verify the results. You can also put words in the .dic file without an
      affix, thus you can still list words if you want to.

      --
      Laughing helps. It's like jogging on the inside.

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
      \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///
    Your message has been successfully submitted and would be delivered to recipients shortly.