When running the spell checker in NT Pro, I am, quite blindly, adding more and more words to my user dictionaries (UDTs). Consequently, those UDTs keep growing and slow down the spell checker.
So I'm looking for a solution to automatically clean a UDT of low-frequency words. The target may be "frequency >= 2". That is, any entry in the UDT must occur at least twice in a corresponding reference corpus.
From my own experience, it's rather useless to approach this by checking the word list entry by entry with ^$StrCount$, like
^!If ^$StrCount("word";^$GetText$;1;1)$ > 1 etc...
Once you are dealing with some thousands of words and a large reference corpus, this will be awfully slow, since the whole corpus gets rescanned once per word.
Also, in my view, the NT Text Statistics is no tool for that job because it doesn't distinguish between upper and lower case (2 "report" + 3 "Report" are counted as 5 "report").
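Just to illustrate the case issue: a counter that keys on the exact string keeps "report" and "Report" apart. A minimal Python sketch (the sample text is made up):

```python
from collections import Counter

# Counter keys on the exact string, so "report" and
# "Report" are tallied as two different entries.
corpus = "Report report Report Report report"
counts = Counter(corpus.split())

print(counts["Report"])  # 3
print(counts["report"])  # 2
```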
So my questions are...
1. Do you have any solution or, at least, any basic idea how to do that job?
2. Is anyone interested in exchanging experiences in this field?
I've advanced that job a little bit with a solution that doesn't count single words but is based on comparing a UDT with a reference corpus processed with TextSTAT, see
It turns a given UDT into a new UDT on that "frequency >= 2" condition. Even with some thousands of words, it's done within a few seconds. But it could probably do with some testing and improvement.
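I don't have a polished clip version to post, but the core idea, building the whole frequency table in one pass and then filtering the UDT against it, can be sketched in Python (function name and sample data are my own invention, not part of the clip):

```python
import re
from collections import Counter

def filter_udt(udt_words, corpus_text, min_freq=2):
    """Keep only UDT entries that occur at least min_freq times
    in the reference corpus, matched case-sensitively."""
    # One pass over the corpus builds the complete frequency table,
    # instead of rescanning the whole text once per dictionary word.
    counts = Counter(re.findall(r"\w+", corpus_text))
    # Counter returns 0 for words absent from the corpus.
    return [w for w in udt_words if counts[w] >= min_freq]

# Hypothetical example data:
udt = ["NoteTab", "TextSTAT", "typo"]
corpus = "NoteTab clips process text. NoteTab users like TextSTAT."
print(filter_udt(udt, corpus))  # ['NoteTab']
```

The point of the one-pass table is the speed difference the post describes: per-word counting costs a corpus scan for every entry, while the table is built once and each lookup is then immediate.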
Thanks for any idea!