
196779 Re: Typical lexicon size in natlangs

  • Gary Shannon
    May 11, 2013
      On Sat, May 11, 2013 at 12:08 PM, Alex Fink <000024@...> wrote:
      >
      > For instance, it's a number bandied around that knowing 500 hanzi will allow you
      > to read 90% of the characters in a Chinese newspaper -- but usually by people
      > who don't appreciate the fact that this includes all the grammatical and closed-class
      > words, and a swathe of basic lexis, but probably not the interesting word or two
      > in the headline you care about.

      For example, if you know the most common 28 words in English you can
      read 50% of everything written. But what does THAT mean if 50% means
      that you can read only 50% of each sentence?

      Or, if you get really ambitious you can learn 732 words and read 90%
      of everything written in English. If you want to be able to read 99.9%
      of everything written in English you will need to learn 2090 words.
      (These figures are from my own million-word corpus taken from 20th
      century fiction and non-fiction on Gutenberg.com.)

      So what does it really mean to say you can read 90% by knowing 732 words?

      Maybe the only meaningful measure of lexicon size is how many words
      you must know to cover some specified x% of the whole of the written
      corpus. That's a very different number for Toki Pona than it is for
      English. That way you could talk meaningfully about a specific
      language's "90% coverage lexicon", and its "98% coverage lexicon", and
      so on.
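
For anyone who wants to try this on their own corpus, here is a rough sketch of how an "x% coverage lexicon" size could be computed. It is only an illustration, not the script behind the figures above: the function name `coverage_lexicon_size` and the crude tokenizer (lowercase runs of letters and apostrophes) are assumptions made for this example, and a different tokenizer will give different numbers.

```python
import re
from collections import Counter


def coverage_lexicon_size(text, coverage):
    """Return how many of the most frequent word types are needed so that
    their tokens make up at least `coverage` (a fraction from 0 to 1) of
    all word tokens in `text`.

    Hypothetical helper for illustration; the tokenizer is deliberately crude.
    """
    # Assumption: a "word" is a run of lowercase letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0

    target = coverage * len(tokens)
    running = 0
    # Walk word types from most to least frequent, accumulating token counts.
    for rank, (_, count) in enumerate(Counter(tokens).most_common(), start=1):
        running += count
        if running >= target:
            return rank
    return len(set(tokens))


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat"
    for pct in (0.25, 0.5, 1.0):
        print(pct, coverage_lexicon_size(sample, pct))
```

Running the same function over corpora in two different languages, with the same coverage value, would give the comparable "90% coverage lexicon" sizes described above.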


      --gary