On 11 May 2013 15:36, H. S. Teoh <hsteoh@...
> On Sat, May 11, 2013 at 01:53:10PM -0700, Gary Shannon wrote:
>> On Sat, May 11, 2013 at 12:08 PM, Alex Fink <000024@...> wrote:
>> > For instance, it's a number bandied around that knowing 500 hanzi
>> > will allow you to read 90% of the characters in a Chinese newspaper
>> > -- but usually by people who don't appreciate the fact that this
>> > includes all the grammatical and closed-class words, and a swathe
>> > of basic lexis, but probably not the interesting word or two in the
>> > headline you care about.
>> For example, if you know the most common 28 words in English you can
>> read 50% of everything written. But what does THAT mean if 50% means
>> that you can read only 50% of each sentence?
>> Or, if you get really ambitious you can learn 732 words and read 90%
>> of everything written in English. If you want to be able to read 99.9%
>> of everything written in English you will need to learn 2090 words.
>> (These figures are from my own million-word corpus taken from 20th
>> century fiction and non-fiction on Gutenberg.org.)
>> So what does it really mean to say you can read 90% by knowing 732
>> words?
>> Maybe the only meaningful measure of lexicon size is how many words
>> you must know to cover some specified x% of the whole of the written
>> corpus. That's a very different number for Toki Pona than it is for
>> English. That way you could talk meaningfully about a specific
>> language's "90% coverage lexicon", and its "98% coverage lexicon", and
>> so on.
> The problem with these percentages is that they obscure a basic fact of
> information theory: the most information is conveyed by the most unusual
> or outstanding bits. The stuff that's repeated almost everywhere has
> very low information content. So if I can understand 50% of the most
> common words in a given text, but most of that 50% is just grammatical
> words, then I actually *don't* understand 50% of the information
> conveyed by that text, but far less, probably only 5% or so. OTOH, if,
> of that 50% that I understand, 40% are content words, then I may have a
> far better understanding of the information conveyed by the text, even
> if I'm ignorant of most of the grammatical particles and constructions.
That's not a problem for counting how much vocabulary a particular
language has. It *is* a problem for counting how much of a particular
language's vocabulary you need to know, which might be the more
interesting question anyway.
The last few weeks of my semantics class* that just ended in April
were largely concerned with how to determine how much and exactly
which vocabulary it is most essential to teach/learn for various
purposes: "general service", general academic discourse, reading
subject-field specific texts, etc. (most of the class were TESOL
students, so this was rather an important topic for them).
The vast majority of research on the topic is of the "which words make
up some percentage of the text, ordered by frequency" variety, with a
very little bit of supporting "how much comprehension do you get for a
particular level of coverage", and very, very little on "what counts
as a word" (which is surprisingly variable among different studies,
and contributes to different vocabulary researchers getting somewhat
different numbers).
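In case a concrete version of that frequency-coverage calculation is useful, here's a minimal Python sketch of the "x% coverage lexicon" idea: count tokens, take word types in descending frequency order, and see how many you need before their tokens reach the target share of the corpus. (The toy corpus here is my own invention, obviously, not anyone's million-word one, and the function name is just a placeholder.)

```python
from collections import Counter

def coverage_lexicon_size(tokens, target):
    """Smallest number of word types, taken in descending frequency
    order, whose tokens make up at least `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (word, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)

# Toy corpus; the real studies use corpora of millions of tokens.
text = ("the cat sat on the mat and the dog sat on the log "
        "while the cat and the dog watched the rain").split()
print(coverage_lexicon_size(text, 0.5))  # → 3: "the" plus two more types cover half
```

Run against a real frequency list, this is exactly the computation behind "500 hanzi cover 90% of a newspaper" or "28 English words cover 50% of everything written".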
There is a general understanding that there's some group of words that
especially needs to be taught/studied explicitly because they're
important for comprehension but not frequent enough to be picked up
casually, but no real general agreement as to what those are or how
best to determine them. Frustratingly, there is no cutoff point at
which learning more vocabulary starts to massively improve
comprehension, or at which learning more vocabulary suddenly stops
paying off: the graphs that come out of the few existing
vocabulary-level vs. comprehension studies have annoyingly gentle
slopes.
I suspect that measuring the information content of different words in
a language would not really get you drastically different results from
just counting frequencies and coverages, since information content
should be roughly inversely proportional to frequency. But as far as I
know, that's never actually been done, so who knows; measuring
information coverage rather than just straight token-count coverage
might turn up some interesting things.
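To make that speculation concrete, here's a rough sketch (mine; the toy corpus and word choices are invented for illustration). It scores each word type by its Shannon surprisal, -log2 of its relative frequency, and compares plain token coverage with surprisal-weighted "information coverage" for a set of known words. On a suitably skewed distribution, knowing only the frequent grammatical words buys you noticeably less information coverage than token coverage, which is Teoh's point above in numerical form.

```python
import math
from collections import Counter

# Toy corpus: a few very frequent grammatical words plus rarer content words.
tokens = (["the"] * 40 + ["of"] * 20 + ["a"] * 15
          + ["kestrel"] * 2 + ["moult"] * 2 + ["tundra"] * 1
          + ["migration"] * 3 + ["plumage"] * 2)

counts = Counter(tokens)
total = sum(counts.values())
# Shannon surprisal of each word type, estimated from corpus frequency.
surprisal = {w: -math.log2(n / total) for w, n in counts.items()}

def coverages(known):
    """(token coverage, surprisal-weighted coverage) for a known-word set."""
    token_cov = sum(counts[w] for w in known) / total
    info_total = sum(n * surprisal[w] for w, n in counts.items())
    info_cov = sum(counts[w] * surprisal[w] for w in known) / info_total
    return token_cov, info_cov

tok, info = coverages({"the", "of", "a"})
print(f"token coverage: {tok:.0%}, information coverage: {info:.0%}")
```

On this corpus the three grammatical words cover about 88% of tokens but a noticeably smaller share of the total surprisal, so the "I can read 88% of this text" claim overstates how much of its content you actually get.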