39720 Re: Speller data structures
May 15, 2005

Olaf Seibert wrote:
> Recently I wrote about a trie data structure for spelling word lists.
> There was some doubt as to the memory efficiency. Therefore I did a
> small test with the Polish wordlist, which was reputed to be 60
> megabytes. The one I found was only 40 but was 60 in some expanded form
> for Aspell. My file was only 1 megabyte.

That's a good result. The current file size for the Polish Vim .spl
file is about 3 Mbyte. That includes flags and handling of non-word
characters, thus it's not completely comparable.

I'll have a better look at the code later. It looks like you could
store a character as an int at a node to support Unicode. That should
not increase the memory use much (struct size is often rounded to 4 bytes).
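
To make the point about storing the character as an int concrete, here is a rough sketch in C. It is not taken from Olaf's code or from Vim; the names narrow_node, wide_node, trie_insert and trie_contains are made up for illustration, and the sibling/child layout with a 0 node as end-of-word marker is only one possible design. On common platforms where pointers are aligned to 4 or 8 bytes, the compiler pads the char field, so both structs usually end up the same size, but the C standard does not guarantee that.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical layouts, for comparison only. */
struct narrow_node {
    char ch;                      /* one byte, followed by padding */
    struct narrow_node *sibling;  /* next alternative at this position */
    struct narrow_node *child;    /* first node for the next character */
};

struct wide_node {
    int ch;                       /* Unicode code point; 0 ends a word */
    struct wide_node *sibling;
    struct wide_node *child;
};

/* Find the node for ch in the sibling list at *head, adding it if
 * it is missing.  Returns NULL when out of memory. */
static struct wide_node *find_or_add(struct wide_node **head, int ch)
{
    struct wide_node *n;

    for (n = *head; n != NULL; n = n->sibling)
        if (n->ch == ch)
            return n;
    n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->ch = ch;
    n->child = NULL;
    n->sibling = *head;           /* push on the front of the list */
    *head = n;
    return n;
}

/* Insert a 0-terminated array of code points.  Returns 0 on success,
 * -1 when out of memory. */
int trie_insert(struct wide_node **root, const int *word)
{
    struct wide_node **level = root;

    assert(root != NULL && word != NULL);
    for (;;) {
        struct wide_node *n = find_or_add(level, *word);

        if (n == NULL)
            return -1;
        if (*word == 0)
            return 0;             /* terminator node stored */
        level = &n->child;
        ++word;
    }
}

/* Returns 1 when the word was inserted before, 0 otherwise. */
int trie_contains(const struct wide_node *root, const int *word)
{
    const struct wide_node *n = root;

    for (;;) {
        while (n != NULL && n->ch != *word)
            n = n->sibling;
        if (n == NULL)
            return 0;
        if (*word == 0)
            return 1;             /* matched the terminator node */
        n = n->child;
        ++word;
    }
}
```

Comparing sizeof(struct narrow_node) with sizeof(struct wide_node) on the target compiler shows whether the wider character field costs anything there.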

> (Actually, using Polish is kind-of cheating. Languages with fewer words,
> or languages that have less regular word endings, have a far lower
> compression ratio. Dutch or English wordlists probably are about the
> same size, on disk and in memory, as this Polish list of 3.073.375

Polish has many words that are alike. Thus this test may give a wrong
impression. Can you do the same for English and/or Dutch?

Before this could be used in Vim there would still be a lot of work
(esp. for handling non-word characters). I'm not sure if it's worth the
effort to see whether this approach works better than the current
implementation. The Trie code doesn't look much simpler.
"Hegel was right when he said that we learn from history that man can
never learn anything from history." (George Bernard Shaw)
/// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
/// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ Project leader for A-A-P -- http://www.A-A-P.org ///
\\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///