
Re: Typical lexicon size in natlangs

  • MorphemeAddict
    Message 1 of 59, May 11, 2013
      This makes me wonder how many words/characters one needs to know beyond the
      basic structural/core words that occur in any or all contexts, essentially
      the grammar words. How many characters in Chinese (Mandarin/Putonghua)
      represent 'empty', structural/grammar words? And what are they?


      On Sat, May 11, 2013 at 3:08 PM, Alex Fink <000024@...> wrote:

      > On Sat, 11 May 2013 08:10:48 -0400, Jim Henry <jimhenry1973@...>
      > wrote:
      > >George Corley, I think it was, suggested a less arbitrary way to
      > >filter out the archaic words and specialized jargon than simply
      > >declaring a certain date cut-off or marking certain semantic domains
      > >off-limits. He suggested taking a large corpus of recent texts and
      > >looking for the set of most frequent words that constitute 90% (or
      > >80%, or whatever) of those texts. That would give you an idea of the
      > >core vocabulary of a specific language -- the set of words that many
      > >or most speakers use frequently -- without using arbitrary
      > >cross-linguistic standards like the Swadesh List. You can set the
      > >figure to 95% or even 99%, as long as you use the same figure for all
      > >the languages whose corpora you're comparing.
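[The threshold method proposed above can be sketched in a few lines. This is a minimal illustration, not anyone's actual tooling: the toy corpus, the `coverage_vocab_size` name, and whitespace tokenization are all placeholder assumptions; a real comparison would need a serious tokenizer per language.]

```python
from collections import Counter

def coverage_vocab_size(tokens, threshold=0.90):
    """Smallest number of most-frequent word types whose combined
    token count covers `threshold` of the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (word, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= threshold:
            return rank
    return len(counts)

# Toy corpus: 13 tokens, 7 types; the top 3 types cover >= 50%.
corpus = "the cat sat on the mat and the dog sat on the cat".split()
print(coverage_vocab_size(corpus, 0.5))  # → 3
```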
      > Indeed, that's probably one of the less arbitrary ways to actually
      > generate a number in practice. But in view of other figures I've seen, I
      > suspect any of those thresholds will probably yield drastic undercounting
      > compared to the kind of numbers you'd like for speakers' mental lexicon
      > size (though maybe 99% begins to get close). For instance, it's a number
      > bandied around that knowing 500 hanzi will allow you to read 90% of the
      > characters in a Chinese newspaper -- but usually by people who don't
      > appreciate the fact that this includes all the grammatical and closed-class
      > words, and a swathe of basic lexis, but probably not the interesting word
      > or two in the headline you care about.
      > In fact, I wonder how much variation there would be from language to
      > language in the rate at which this number of words varies with the cutoff
      > -- e.g. the exponent if you fit the growth to a power law. That seems like
      > it should be even less subject to irrelevant factors.
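[Alex's power-law suggestion can be made concrete as follows. This is a hedged sketch under one particular assumption: that vocabulary size grows as a power of 1/(1 - cutoff), since the count diverges as the coverage cutoff approaches 1. The `power_law_exponent` name and the regression-on-log-log-axes approach are illustrative choices, not anything from the thread.]

```python
import math

def power_law_exponent(cutoffs, vocab_sizes):
    """Fit vocab_size ~ a * (1/(1-cutoff))**b by ordinary least
    squares on log-log axes and return the exponent b."""
    xs = [math.log(1.0 / (1.0 - c)) for c in cutoffs]
    ys = [math.log(v) for v in vocab_sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Synthetic data lying exactly on a power law with exponent 2:
# cutoffs 0.8, 0.9, 0.95 give 1/(1-c) = 5, 10, 20, squared below.
print(power_law_exponent([0.8, 0.9, 0.95], [25, 100, 400]))  # → 2.0
```

Comparing this fitted exponent across languages, rather than the raw counts, would indeed factor out an arbitrary choice of threshold, though it still inherits every tokenization decision made upstream.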
      > >Of course, that still leaves some arbitrary decisions about marking
      > >word boundaries in your corpus before you parse it. And for some
      > >languages, a larger corpus will be available than for others. But I
      > think it should give a less arbitrary, more comparable method of
      > >comparing different languages than simply counting entries in
      > >dictionaries, when the lexicographers working with different languages
      > >may have been using very different design principles and had different
      > >resources available to them.
      > Another concern with corpus methods is that, if you want to disregard
      > inflectional morphology when deciding what counts as the same word (and
      > surely you do?), you still need a good stemmer for the language in
      > question. But if you accept that, it's not possible to completely avoid
      > judgment calls on what's inflection and what's derivation; or even if there
      > are no judgment calls, it might rely too much on semantic understanding for
      > the software to be able to do it. (Drifting completely away from
      > objectivity, my own proclivity would be to answer this question fuzzily.
      > E.g. once you know the English word "build", you don't get the nominal
      > sense of "building" completely for free, but neither is it a wholly
      > separate word of its own; perhaps it should count as one half of a word?)
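[Alex's fuzzy half-a-word idea amounts to replacing a set count with a weighted sum. A minimal sketch, assuming a hand-assigned weight table; the `fuzzy_lexicon_size` name and the 0.5 weight for derived "building" are taken directly from his example, everything else is illustrative.]

```python
def fuzzy_lexicon_size(words, weights):
    """Sum per-word 'lexical weights': 1.0 for a fully independent
    word, a fraction for a derivation partly predictable from its
    base (e.g. nominal 'building' from 'build' counted as 0.5)."""
    return sum(weights.get(w, 1.0) for w in words)

words = ["build", "building", "run"]
weights = {"building": 0.5}  # derived, partly free once "build" is known
print(fuzzy_lexicon_size(words, weights))  # → 2.5
```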
      > Alex
    • Juanma Barranquero
      Message 59 of 59, May 20, 2013
        On Mon, May 20, 2013 at 5:32 PM, Anthony Miles <mamercus88@...> wrote:

        > Even in an impoverished environment humans or something like them will expand vocabulary.

        Sure. But this thread discusses "typical lexicon size", and Gary
        Shannon and H. S. Teoh proposed a "bootstrap lexicon size" as a
        meaningful measure. And I'm just pointing out that I don't think it
        would be a good metric, because if you use it for many languages, and
        the resulting size varies, let's say, between X-10% and X+10% for some
        X, that does not offer any insight about the *typical* lexicon size of
        the languages so tested. Systems of vastly different complexity can
        arise from similarly simple foundations (cellular automata are a clear
        example of that).
