Re: Typical lexicon size in natlangs

  • Logan Kearsley
    Message 1 of 59, May 11, 2013
      On 11 May 2013 15:36, H. S. Teoh <hsteoh@...> wrote:
      > On Sat, May 11, 2013 at 01:53:10PM -0700, Gary Shannon wrote:
      >> On Sat, May 11, 2013 at 12:08 PM, Alex Fink <000024@...> wrote:
      >> >
      >> > For instance, it's a number bandied around that knowing 500 hanzi
      >> > will allow you to read 90% of the characters in a Chinese newspaper
      >> > -- but usually by people who don't appreciate the fact that this
      >> > includes all the grammatical and closed-class words, and a swathe
      >> > of basic lexis, but probably not the interesting word or two in the
      >> > headline you care about.
      >> For example, if you know the most common 28 words in English you can
      >> read 50% of everything written. But what does THAT mean if 50% means
      >> that you can read only 50% of each sentence?
      >> Or, if you get really ambitious you can learn 732 words and read 90%
      >> of everything written in English. If you want to be able to read 99.9%
      >> of everything written in English you will need to learn 2090 words.
      >> (These figures are from my own million-word corpus taken from 20th
      >> century fiction and non-fiction on Gutenberg.com.)
      >> So what does it really mean to say you can read 90% by knowing 732
      >> words?
      >> Maybe the only meaningful measure of lexicon size is how many words
      >> you must know to cover some specified x% of the whole of the written
      >> corpus. That's a very different number for Toki Pona than it is for
      >> English. That way you could talk meaningfully about a specific
      >> language's "90% coverage lexicon", and its "98% coverage lexicon", and
      >> so on.
      > [...]
      > The problem with these percentages is that they obscure a basic fact of
      > information theory: the most information is conveyed by the most unusual
      > or outstanding bits. The stuff that's repeated almost everywhere has
      > very low information content. So if I can understand 50% of the most
      > common words in a given text, but most of that 50% is just grammatical
      > words, then I actually *don't* understand 50% of the information
      > conveyed by that text, but far less, probably only 5% or so. OTOH, if
      > of that 50% that I understand 40% are content words, then I may have a
      > far better understanding of the information conveyed by the text, even
      > if I'm ignorant of most of the grammatical particles and constructions.

      That's not a problem for counting how much vocabulary a particular
      language has. It *is* a problem for counting how much of a particular
      language's vocabulary you need to know, which might be more
      enlightening anyway.

      The last few weeks of my semantics class* that just ended in April
      were largely concerned with how to determine how much and exactly
      which vocabulary it is most essential to teach/learn for various
      purposes- "general service", general academic discourse, reading
      subject-field specific texts, etc. (most of the class was TESOL
      students, so this was rather an important topic for them).

      The vast majority of research on the topic is of the "which words make
      up some percentage of the text, ordered by frequency" variety, with a
      very little bit of supporting "how much comprehension do you get for a
      particular level of coverage", and very very little "what counts as a
      word" (which is surprisingly variable among different studies, and
      contributes to different vocabulary researchers getting somewhat
      different results).
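
      The frequency-ordered coverage counts that this kind of research
      relies on can be sketched in a few lines of Python. This is only an
      illustration, not the method of any particular study: the function
      name coverage_lexicon_size, the whitespace tokenizer, and the toy
      sentence are all assumptions, and a real count would first have to
      settle what counts as a word.

```python
from collections import Counter


def coverage_lexicon_size(tokens, target):
    """Smallest number of distinct word types, taken in descending
    frequency order, whose tokens make up at least `target` of the text."""
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the types from most to least frequent, accumulating tokens.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= target * total:
            return rank
    return len(counts)  # only reached for an empty token list


# Toy corpus (hypothetical): 13 tokens, 8 distinct types.
sample = "the cat sat on the mat and the dog sat on the log".split()
print(coverage_lexicon_size(sample, 0.5))  # 3 types cover half the tokens
print(coverage_lexicon_size(sample, 0.9))  # 7 types are needed for 90%
```

      Even in this tiny sample the last stretch of coverage costs far more
      word types than the first half, which is the pattern the quoted
      corpus figures show at scale.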

      There is a general understanding that there's some group of words that
      especially needs to be taught/studied explicitly because they're
      important for comprehension but not frequent enough to be picked up
      casually, but no real general agreement as to what those are or how
      best to determine them. Frustratingly, there is no cutoff point at
      which learning more vocabulary starts to massively improve
      comprehension, or at which learning more vocabulary suddenly stops
      paying off- the graphs that come out of the few existing
      vocabulary-level vs. comprehension studies have annoyingly gentle
      slopes.

      I suspect that measuring the information content of different words in
      a language would not really get you drastically different results from
      just counting frequencies and coverages, since information content
      should be roughly inversely proportional to frequency. But as far as I
      know, that's never actually been done, so who knows: measuring
      information coverage rather than just straight token count coverage
      might turn up some interesting things.
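
      One way to make "information coverage" concrete, as a rough sketch:
      take a word's information content to be its unigram surprisal, -log2
      of its relative frequency in the corpus (an assumption; nothing in
      this thread fixes a definition), and compare the share of tokens a
      reader knows with the share of total bits those tokens carry. The
      function name and the toy data below are hypothetical.

```python
import math
from collections import Counter


def token_and_information_coverage(tokens, k):
    """Return (token_coverage, information_coverage) for a reader who
    knows only the k most frequent word types in `tokens`."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    total = sum(counts.values())
    # Assumed measure: unigram surprisal in bits, -log2(relative frequency).
    bits = {w: -math.log2(n / total) for w, n in counts.items()}
    total_bits = sum(counts[w] * bits[w] for w in counts)
    known = [w for w, _ in counts.most_common(k)]
    token_cov = sum(counts[w] for w in known) / total
    # A text made of a single repeated type carries zero bits under this model.
    info_cov = (
        sum(counts[w] * bits[w] for w in known) / total_bits
        if total_bits > 0
        else 0.0
    )
    return token_cov, info_cov


# Toy corpus (hypothetical): one very common word and four rare ones.
toy = ["a"] * 4 + ["b", "c", "d", "e"]
print(token_and_information_coverage(toy, 1))  # (0.5, 0.25)
```

      In this toy corpus, knowing the single most frequent word covers half
      the tokens but only a quarter of the bits, which is the gap H. S. Teoh
      describes. Whether real corpora would rank words any differently
      under the two measures is the open question.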

    • Juanma Barranquero
      Message 59 of 59, May 20, 2013
        On Mon, May 20, 2013 at 5:32 PM, Anthony Miles <mamercus88@...> wrote:

        > Even in an impoverished environment humans or something like them will expand vocabulary.

        Sure. But this thread discusses "typical lexicon size", and Gary
        Shannon and H. S. Teoh proposed a "bootstrap lexicon size" as a
        meaningful measure. And I'm just pointing out that I don't think it
        would be a good metric, because if you use it for many languages, and
        the resulting size varies, let's say, between X-10% and X+10% for some
        X, that does not offer any insight about the *typical* lexicon size of
        the languages so tested. Systems of vastly different complexity can
        arise from similarly simple foundations (cellular automata are a clear
        example of that).
