Re: Speller data structures
- Olaf Seibert wrote:
> Recently I wrote about a trie data structure for spelling word lists.I now (finally!) had time to look closer at this code. I must say it
> There was some doubt as to the memory efficiency. Therefore I did a
> small test with the Polish wordlist, which was reputed to be 60
> megabytes. The one I found was only 40 but was 60 in some expanded form
> for Aspell. My file was only 1 megabyte.
looks good. The extra tricks to share the tails of the tree help a lot
to keep the storage low. For the languages that I tried the storage is
far less than what was used until now.
The code is a bit low on comments, thus it's difficult to understand
what exactly is going on. Hopefully I didn't understand something
Main drawback that I see is the linear search to lookup a character in a
list of siblings. Since most checked words will find a match, this will
be much slower than using a hash lookup.
I think the linear search can be replaced by a binary search. The
siblings are already sorted, this just requires storing the siblings in
a way to allow the binary search. This probably disallows sharing
siblings, but that is a small price to pay.
The code becomes a lot simpler, since all the stuff for affix
compression can be omitted. And the trouble with word vs non-word
characters is gone. This will avoid bugs and complexity.
Of course there are still a lot of things to do, such as figuring out an
efficient way to store multi-byte text, support for regions, flags for
How To Keep A Healthy Level Of Insanity:
9. As often as possible, skip rather than walk.
/// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
/// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ Project leader for A-A-P -- http://www.A-A-P.org ///
\\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///