Re: Spell suggestions with postponed prefixes
- Moshe Kaminsky wrote:
> * Bram Moolenaar <Bram@...> [30/06/05 00:29]:
> > I have now also implemented using postponed prefixes for making
> > suggestions. Currently only Hebrew uses this.
> > Please give it a try and see how well it works. I didn't do any
> > specific scoring for prefixes, since I don't know how to do that.
> > It does look like the suggestions are valid words or combinations with
> > allowed prefixes. But you better check that no suggestions are actually
> > wrongly spelled words.
> I checked it on several examples, it looks fine. I noticed that getting
> a suggestion which is a rare word is quite rare... I guess it can be
The mechanism simply considers every valid word, including prefixes.
> I don't know if my original suggestions are used, but I saw the code a
> few days ago, and it appears to me there is an implementation which is
> both simpler and should give better results: Simply run the state machine
> with both trees, the prefixes and the words, and when there is a word
> split, continue each time with the other tree. When passing from the
> word tree, there should be an actual splitting and a penalty, but not
> vice versa (when the prefixes are not postponed, both can point to the
> same tree). This way, the length of the prefix need not be considered,
> since long prefixes will be penalised anyway for being rare.
The edit distance from the badly spelled word is computed, and only words
with a small distance are used. This doesn't take the length of the
prefix into account; it doesn't matter where the prefix stops and the
basic word starts. Would it be good to give a penalty to longer
prefixes? You would need to experiment with this, using a list of
actual misspellings. Just trying a few artificial misspellings may give
a wrong impression.
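
To make the question about a prefix penalty concrete, here is a minimal
sketch in Python. It is not how Vim does it: it brute-forces every
prefix + word combination instead of walking the word trees, and the
names suggest(), max_dist and prefix_penalty are made up for this
example. With prefix_penalty left at 0 the prefix length has no
influence on the score, which matches the behaviour described above;
a non-zero value is one way to experiment with penalising longer
prefixes.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = cur
    return prev[-1]


def suggest(bad, words, prefixes, max_dist=2, prefix_penalty=0.0):
    """Return (score, candidate) pairs, best score first.

    Candidates are every word on its own and every prefix + word.
    prefix_penalty is a hypothetical per-letter cost for the prefix.
    """
    scored = []
    for prefix in [""] + list(prefixes):
        for word in words:
            candidate = prefix + word
            dist = edit_distance(bad, candidate)
            if dist <= max_dist:  # only keep words with a small distance
                scored.append((dist + prefix_penalty * len(prefix), candidate))
    return sorted(scored)
```

The word list and prefixes are placeholders; a real test would use the
Hebrew word list together with a list of actual misspellings, comparing
the ranking with prefix_penalty=0 against a few non-zero values.
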
> > It's in the snapshot that I will upload in a couple of hours.
> > I also implemented a different method for sound folding. It's simpler
> > and faster. I'm using it for Dutch to try out. Should be simple to add
> > to any language.
> Is it correct that the (original) SAL mechanism is mainly useful when
> there are combinations of several letters that sound like one? When
> there are only several letters such that one sounds similar to another
> (like c and k), the new method is equivalent to the original? Also, what
> is the advantage/disadvantage in this case over specifying, say,
> REP c k
> (I guess it saves writing when there are more than two that sound the
> same, but is there any other difference?)
The SAL mechanism is only useful if you can turn a word into its
"sound-a-like" equivalent. For English the mechanism is to leave out
all vowels and do some tricks with "th", "gh", etc. In Dutch I would
have "sch" sound the same as "s". In general the length of the
sound-a-like word is much less, thus more words look alike.
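
As a rough illustration of why the sound-a-like form is shorter and
makes more words look alike, here is a small Python sketch. The rules
are invented for the example ("sch" to "s" as mentioned for Dutch, "c"
to "k" from the question above, and dropping all vowels); they are not
the actual SAL items of any spell file, and sound_fold() and
group_by_sound() are hypothetical names.

```python
def sound_fold(word, rules=(("sch", "s"), ("c", "k")), vowels="aeiou"):
    """Fold a word to a crude sound-a-like form.

    Multi-letter rules are applied first, in order, then all vowels
    are dropped. Both the rules and the vowel set are assumptions
    made for this example.
    """
    folded = word.lower()
    for src, dst in rules:
        folded = folded.replace(src, dst)
    return "".join(ch for ch in folded if ch not in vowels)


def group_by_sound(words):
    """Map each folded form to the list of words that produce it."""
    groups = {}
    for word in words:
        groups.setdefault(sound_fold(word), []).append(word)
    return groups
```

For example, sound_fold("school") and sound_fold("sool") both give
"sl", so a misspelling that drops the "ch" still lands in the same
group as the intended word.
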
Using REP items with single letters isn't very useful, since that will
be tried anyway. It's only that they may get a slightly better score
that way. It counts a lot more when replacing several characters at once.
hundred-and-one symptoms of being an internet addict:
248. You sign your letters with your e-mail address instead of your name.
/// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
/// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ Project leader for A-A-P -- http://www.A-A-P.org ///
\\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///