On Sat, 11 May 2013 08:10:48 -0400, Jim Henry <jimhenry1973@...
>George Corley, I think it was, suggested a less arbitrary way to
>filter out the archaic words and specialized jargon than simply
>declaring a certain date cut-off or marking certain semantic domains
>off-limits. He suggested taking a large corpus of recent texts and
>looking for the set of most frequent words that constitute 90% (or
>80%, or whatever) of those texts. That would give you an idea of the
>core vocabulary of a specific language -- the set of words that many
>or most speakers use frequently -- without using arbitrary
>cross-linguistic standards like the Swadesh List. You can set the
>figure to 95% or even 99%, as long as you use the same figure for all
>the languages whose corpora you're comparing.
Indeed, that's probably one of the less arbitrary ways to actually generate a number in practice. But in view of other figures I've seen, I suspect any of those thresholds will yield drastic undercounting compared to the kind of numbers you'd like for the size of a speaker's mental lexicon (though maybe 99% begins to get close). For instance, a figure often bandied around is that knowing 500 hanzi will allow you to read 90% of the characters in a Chinese newspaper -- but usually by people who don't appreciate that this 90% includes all the grammatical and closed-class words and a swathe of basic lexis, but probably not the interesting word or two in the headline that you actually care about.
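To make the coverage idea concrete, here is a minimal sketch of the count being discussed: the smallest number of most-frequent word types whose occurrences together make up a given fraction of a corpus. The function name core_vocab_size and the toy sentence are my own illustration, and it assumes the corpus has already been tokenized (which is exactly where the word-boundary decisions come in).

```python
from collections import Counter

def core_vocab_size(tokens, coverage):
    """Return how many of the most frequent word types are needed
    to account for at least `coverage` (0..1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk the types from most to least frequent, accumulating tokens.
    for n_types, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= coverage * total:
            return n_types
    return len(counts)

# Toy "corpus"; a real comparison would need millions of tokens.
tokens = "the the the the cat cat sat on a mat".split()
half = core_vocab_size(tokens, 0.5)   # "the" + "cat" cover 6 of 10 tokens
full = core_vocab_size(tokens, 1.0)   # every type is needed
```

Even on this toy input the pattern shows up: two types cover half the text, while the remaining half of the coverage takes four more.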
In fact, I wonder how much variation there would be from language to language in the rate at which this number of words varies with the cutoff -- e.g. the exponent if you fit the growth to a power law. That seems like it should be even less subject to irrelevant factors.
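As a rough sketch of how one might estimate such an exponent: I haven't said which power law I mean, so the following simply assumes the vocabulary size N needed for coverage c grows like N(c) = A * (1 - c)^(-alpha), i.e. as a power of the uncovered fraction, and fits alpha by ordinary least squares on a log-log scale. Both that functional form and the name fit_power_law_exponent are assumptions for illustration, not an established method.

```python
import math

def fit_power_law_exponent(cutoffs, sizes):
    """Estimate alpha in the assumed model N(c) = A * (1 - c) ** (-alpha)
    by least-squares regression of log N on log(1 - c)."""
    xs = [math.log(1.0 - c) for c in cutoffs]   # cutoffs must be < 1
    ys = [math.log(n) for n in sizes]           # sizes must be > 0
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The regression slope is -alpha, so flip the sign.
    return -sxy / sxx

# Synthetic check: sizes generated from the assumed model with alpha = 1.5.
cutoffs = [0.80, 0.90, 0.95, 0.99]
sizes = [100.0 * (1.0 - c) ** -1.5 for c in cutoffs]
alpha = fit_power_law_exponent(cutoffs, sizes)
```

With real corpora the sizes would come from counting word types at each cutoff, and how well the points fall on a straight line in log-log space would itself show whether a power law is a reasonable description.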
>Of course, that still leaves some arbitrary decisions about marking
>word boundaries in your corpus before you parse it. And for some
>languages, a larger corpus will be available than for others. But I
>think it should give a less arbitrary, more comparable method of
>comparing different languages than simply counting entries in
>dictionaries, when the lexicographers working with different languages
>may have been using very different design principles and had different
>resources available to them.
Another concern with corpus methods is that, if you want to disregard inflectional morphology when deciding what counts as the same word (and surely you do?), you still need a good stemmer for the language in question. But even if you accept that, it's not possible to completely avoid judgment calls on what's inflection and what's derivation; and even where there are no judgment calls, the distinction might rely too much on semantic understanding for software to be able to make it. (Drifting completely away from objectivity, my own proclivity would be to answer this question fuzzily. E.g. once you know the English word "build", you don't get the nominal sense of "building" completely for free, but neither is it a wholly separate word of its own; perhaps it should count as one half of a word?)