
Re: Typical lexicon size in natlangs

  • Alex Fink
    Message 1 of 59 , May 11, 2013
      On Sat, 11 May 2013 08:10:48 -0400, Jim Henry <jimhenry1973@...> wrote:

      >George Corley, I think it was, suggested a less arbitrary way to
      >filter out the archaic words and specialized jargon than simply
      >declaring a certain date cut-off or marking certain semantic domains
      >off-limits. He suggested taking a large corpus of recent texts and
      >looking for the set of most frequent words that constitute 90% (or
      >80%, or whatever) of those texts. That would give you an idea of the
      >core vocabulary of a specific language -- the set of words that many
      >or most speakers use frequently -- without using arbitrary
      >cross-linguistic standards like the Swadesh List. You can set the
      >figure to 95% or even 99%, as long as you use the same figure for all
      >the languages whose corpora you're comparing.

      Indeed, that's probably one of the less arbitrary ways to actually generate a number in practice. But in view of other figures I've seen, I suspect any of those thresholds will probably yield drastic undercounting compared to the kind of numbers you'd like for speakers' mental lexicon size (though maybe 99% begins to get close). For instance, it's a number bandied around that knowing 500 hanzi will allow you to read 90% of the characters in a Chinese newspaper -- but usually by people who don't appreciate the fact that this includes all the grammatical and closed-class words, and a swathe of basic lexis, but probably not the interesting word or two in the headline you care about.
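The coverage-cutoff method described above is straightforward to sketch in code: sort word types by frequency and take the smallest prefix whose cumulative token count reaches the chosen threshold. A minimal Python illustration on a toy corpus (the corpus and threshold here are invented for demonstration):

```python
from collections import Counter

def coverage_vocab(tokens, coverage=0.90):
    """Return the smallest set of most-frequent word types whose
    combined token frequency covers the given fraction of the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    cumulative = 0
    vocab = []
    for word, freq in counts.most_common():
        vocab.append(word)
        cumulative += freq
        if cumulative / total >= coverage:
            break
    return vocab

# Toy corpus: a few high-frequency function words dominate,
# just as closed-class words do in the newspaper example.
corpus = "the cat sat on the mat the dog sat on the rug".split()
print(len(coverage_vocab(corpus, 0.90)))  # → 6
```

Even in this tiny example, "the" alone buys a third of the coverage, which is the undercounting effect the hanzi figure illustrates: the threshold is mostly satisfied by grammatical words long before the rare, interesting ones appear.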

      In fact, I wonder how much variation there would be from language to language in the rate at which this number of words varies with the cutoff -- e.g. the exponent if you fit the growth to a power law. That seems like it should be even less subject to irrelevant factors.

      >Of course, that still leaves some arbitrary decisions about marking
      >word boundaries in your corpus before you parse it. And for some
      >languages, a larger corpus will be available than for others. But I
      >think it should give a less arbitrary, more comparable method of
      >comparing different languages than simply counting entries in
      >dictionaries, when the lexicographers working with different languages
      >may have been using very different design principles and had different
      >resources available to them.

      Another concern with corpus methods is that, if you want to disregard inflectional morphology when deciding what counts as the same word (and surely you do?), you still need a good stemmer for the language in question. But if you accept that, it's not possible to completely avoid judgment calls on what's inflection and what's derivation; or even if there are no judgment calls, it might rely too much on semantic understanding for the software to be able to do it. (Drifting completely away from objectivity, my own proclivity would be to answer this question fuzzily. E.g. once you know the English word "build", you don't get the nominal sense of "building" completely for free, but neither is it a wholly separate word of its own; perhaps it should count as one half of a word?)
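The fuzzy-counting idea above could be operationalised as fractional weights on derived forms: inflection contributes 0, a fully opaque word contributes 1, and semi-predictable derivations something in between. A toy sketch, where the specific weights and word pairs are entirely hypothetical:

```python
# Invented weights for illustration: how much of a "new word" each
# derived form costs once its base is known.
DERIVATION_WEIGHT = {
    ("build", "building"): 0.5,   # nominal sense partly, not wholly, free
    ("run", "runs"): 0.0,         # pure inflection: same word
    ("happy", "happiness"): 0.25, # fairly transparent derivation
}

def fuzzy_lexicon_size(known_bases, observed_forms):
    """Count each base word as 1, and each (base, derived-form) pair
    at its derivation weight; unlisted pairs count as whole new words."""
    size = float(len(known_bases))
    for base, form in observed_forms:
        size += DERIVATION_WEIGHT.get((base, form), 1.0)
    return size

print(fuzzy_lexicon_size({"build", "run", "happy"},
                         [("build", "building"), ("run", "runs")]))  # → 3.5
```

Of course, assigning those weights is exactly the semantic judgment call the paragraph worries software cannot make; the sketch only shows that, given weights, the arithmetic is trivial.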

    • Juanma Barranquero
      Message 59 of 59 , May 20, 2013
        On Mon, May 20, 2013 at 5:32 PM, Anthony Miles <mamercus88@...> wrote:

        > Even in an impoverished environment humans or something like them will expand vocabulary.

        Sure. But this thread discusses "typical lexicon size", and Gary
        Shannon and H. S. Teoh proposed a "bootstrap lexicon size" as a
        meaningful measure. And I'm just pointing out that I don't think it
        would be a good metric, because if you use it for many languages, and
        the resulting size varies, let's say, between X-10% and X+10% for some
        X, that does not offer any insight about the *typical* lexicon size of
        the languages so tested. Systems of vastly different complexity can
        arise from similarly simple foundations (cellular automata are a clear
        example of that).
