RE: Spell checking using a Bloom filter
- Servatius Brandt wrote:
> The advantage of the trie is that you can read its contents. If "et"When going through the tree you only need a pointer to the first
> fails as a whole word, you can do it the other way round: look what
> continuations of "et" are stored in the trie (maybe just "et al.") and
> check whether one of those is actually used in the text. I didn't look
> on the spell code very deeply so far, but I bet that's what you are
> actually doing.
character. Then follow the nodes until a possible end-of-word is found
(NUL byte). This position is remembered, first the search in the tree
continues to find the longest match. Then all the remembered positions
are checked for an actual match (checking the case, region, affixes and
a non-word character following), starting with the longest possible word
and continuing for shorter words if there is no full match.
This is very efficient, especially since most words are good and will be
> > It's indeed surprising how efficient the trie works for some languages.For Vim it doesn't matter, since when using a .dic and .aff file
> > Polish indeed has lots of prefixes and suffixes. The problem of
> > handling them separately is that it becomes very complex. My previous
> > implementation with a hash table ran into this problem. Also keep in
> > mind that affixes can have non-word characters (French uses "d'" and
> > "l'" a lot).
> A colleage of mine is Czech, and he says all slavic languages use lots
> of prefixes and suffixes. Czech declination uses seven cases,
> Lithuanian even fifteen. So what's special with Polish?
> Does it matter whether the spell files make use of the .aff file to
> define the prefixes and suffixes or whether all the combinations are
> simply listed in the .dic file?
internally all combinations are generated and put in the tree. Reading
a list of words would result in exactly the same tree. If the files
specify the same words, of course.
Thus for someone maintaining the word list he can edit the .dic and .aff
files if that's easier than maintaining a full word list. There are
tools to convert one to the other, thus you can change your mind and/or
verify the results. You can also put words in the .dic file without an
affix, thus you can still list words if you want to.
Laughing helps. It's like jogging on the inside.
/// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
/// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ Project leader for A-A-P -- http://www.A-A-P.org ///
\\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///