
Re: Typical lexicon size in natlangs

  • H. S. Teoh
    Message 1 of 59, May 11, 2013
      On Sat, May 11, 2013 at 01:53:10PM -0700, Gary Shannon wrote:
      > On Sat, May 11, 2013 at 12:08 PM, Alex Fink <000024@...> wrote:
      > >
      > > For instance, it's a number bandied around that knowing 500 hanzi
      > > will allow you to read 90% of the characters in a Chinese newspaper
      > > -- but usually by people who don't appreciate the fact that this
      > > includes all the grammatical and closed-class words, and a swathe
      > > of basic lexis, but probably not the interesting word or two in the
      > > headline you care about.
      >
      > For example, if you know the most common 28 words in English you can
      > read 50% of everything written. But what does THAT mean if 50% means
      > that you can read only 50% of each sentence?
      >
      > Or, if you get really ambitious you can learn 732 words and read 90%
      > of everything written in English. If you want to be able to read 99.9%
      > of everything written in English you will need to learn 2090 words.
      > (These figures are from my own million-word corpus taken from 20th
      > century fiction and non-fiction on Gutenberg.com.)
      >
      > So what does it really mean to say you can read 90% by knowing 732
      > words?
      >
      > Maybe the only meaningful measure of lexicon size is how many words
      > you must know to cover some specified x% of the whole of the written
      > corpus. That's a very different number for Toki Pona than it is for
      > English. That way you could talk meaningfully about a specific
      > language's "90% coverage lexicon", and its "98% coverage lexicon", and
      > so on.
      [...]
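
      Such a coverage lexicon can be computed mechanically from any
      word-frequency list. A minimal sketch in Python (the corpus file
      name and the tokenisation are placeholders, not Gary's actual
      setup):

          import re
          from collections import Counter

          def coverage_lexicon_size(text, coverage):
              """Smallest number of distinct words, taken in frequency order,
              whose combined token count reaches `coverage` of all tokens."""
              tokens = re.findall(r"[a-z']+", text.lower())
              counts = Counter(tokens)
              total = len(tokens)
              running = 0
              for rank, (word, n) in enumerate(counts.most_common(), 1):
                  running += n
                  if running / total >= coverage:
                      return rank
              return len(counts)

          # "corpus.txt" is a hypothetical plain-text sample.
          text = open("corpus.txt", encoding="utf-8").read()
          for c in (0.50, 0.90, 0.999):
              size = coverage_lexicon_size(text, c)
              print(f"{c:.1%} coverage lexicon: {size} words")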

      The problem with these percentages is that they obscure a basic fact of
      information theory: the most information is conveyed by the most unusual
      or outstanding bits. The stuff that's repeated almost everywhere has
      very low information content. So if I understand 50% of the words
      in a given text, but most of that 50% is just grammatical words,
      then I actually *don't* understand 50% of the information conveyed
      by that text, but far less, probably only 5% or so. OTOH, if 40 of
      those 50 percentage points are content words, then I may have a far
      better understanding of the information conveyed by the text, even
      if I'm ignorant of most of the grammatical particles and
      constructions.
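
      In information-theoretic terms, each word contributes its
      self-information, -log2 p(w): a word making up 5% of all tokens
      carries only about 4 bits per occurrence, while a genuinely rare
      content word carries three to four times as many. A toy
      illustration in Python, with made-up frequencies chosen only for
      their rough orders of magnitude:

          from math import log2

          # Made-up unigram probabilities, illustrative only.
          freqs = {
              "the": 0.05,           # very common grammatical word
              "in": 0.02,
              "was": 0.01,
              "woman": 0.0002,       # everyday content word
              "murdered": 0.00001,   # rare content word
              "offender": 0.000004,
          }

          for word, p in freqs.items():
              print(f"{word:10s} p = {p:<8g} self-information = {-log2(p):4.1f} bits")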

      For example, given the English sentence:

      Last week in an upscale neighbourhood in downtown Manhattan a
      woman was brutally murdered by a suspected sex offender, thought
      to be dangerously armed.

      If I only know the most common grammatical words, then it would read
      like this to me:

      **** **** in an ******* ************* in ******** ********* a
      woman was ******** ******** by a ********* *** ********, *******
      to be *********** *****.

      The text is essentially opaque. But if I *didn't* know common words like
      "in", "an", "by", etc., but do recognise some of the keywords, what I
      comprehend might be something like:

      Last week ** ** ******* neighbourhood ** ******** Manhattan *
      woman *** ******** murdered ** * ********* *** offender, *******
      ** ** *********** armed.

      I can understand the gist of the text far better, even if the specific
      details are incomprehensible to me. Note also that in the latter case I
      only recognized 8 words, yet understood more than in the first
      case, where 10 words were recognized but almost zero information
      was conveyed.
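
      Both renderings can be reproduced mechanically by masking every
      word outside the reader's vocabulary; in the Python sketch below
      the two vocabulary sets are simply the words left visible in the
      examples above:

          import re

          SENTENCE = ("Last week in an upscale neighbourhood in downtown "
                      "Manhattan a woman was brutally murdered by a suspected "
                      "sex offender, thought to be dangerously armed.")

          def mask_unknown(sentence, known):
              """Replace each word not in `known` with same-length asterisks."""
              def hide(m):
                  w = m.group(0)
                  return w if w.lower() in known else "*" * len(w)
              return re.sub(r"[A-Za-z]+", hide, sentence)

          grammar_words = {"in", "an", "a", "woman", "was", "by", "to", "be"}
          content_words = {"last", "week", "neighbourhood", "manhattan", "woman",
                           "murdered", "offender", "armed"}

          print(mask_unknown(SENTENCE, grammar_words))   # first rendering
          print(mask_unknown(SENTENCE, content_words))   # second rendering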


      T

      --
      Answer: Because it breaks the logical sequence of discussion.
      Question: Why is top posting bad?
    • Juanma Barranquero
      Message 59 of 59, May 20, 2013
        On Mon, May 20, 2013 at 5:32 PM, Anthony Miles <mamercus88@...> wrote:

        > Even in an impoverished environment humans or something like them will expand vocabulary.

        Sure. But this thread discusses "typical lexicon size", and Gary
        Shannon and H. S. Teoh proposed a "bootstrap lexicon size" as a
        meaningful measure. And I'm just pointing out that I don't think it
        would be a good metric, because if you use it for many languages, and
        the resulting size varies, let's say, between X-10% and X+10% for some
        X, that does not offer any insight about the *typical* lexicon size of
        the languages so tested. Systems of vastly different complexity can
        arise from similarly simple foundations (cellular automata are a clear
        example of that).
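
        (To make the cellular-automaton point concrete: the two
        elementary rules in the Python sketch below are equally simple
        foundations, each just an eight-entry lookup table, yet one
        settles into a plain expanding checkerboard while the other
        produces the famously chaotic Rule 30 pattern. Illustrative
        only.)

            def step(cells, rule):
                """One generation of an elementary cellular automaton;
                `rule` is a Wolfram rule number (0-255)."""
                n = len(cells)
                return [(rule >> (cells[(i - 1) % n] * 4
                                  + cells[i] * 2
                                  + cells[(i + 1) % n])) & 1
                        for i in range(n)]

            def run(rule, width=63, generations=20):
                cells = [0] * width
                cells[width // 2] = 1          # single live cell in the middle
                for _ in range(generations):
                    print("".join(".#"[c] for c in cells))
                    cells = step(cells, rule)

            run(250)   # simple rule, simple outcome: a regular checkerboard
            run(30)    # equally simple rule, chaotic outcome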

        J