Re: "English has the most words of any language"

  • Jyri Lehtinen
    Message 1 of 47, Mar 19, 2013
      > One way to measure that would be to determine the 80% and 90% vocabulary of
      > English and compare it to the size of such vocabularies in other languages.
      > That'd give you a rough sense of the size of the lexicon, it seems to me.
      > I got the idea from this article, which puts the 80% vocabulary of English
      > at 2400 lemmas (I have no idea where that figure comes from). If so, that
      > puts English more or less on par with most modern languages.
      >

      Extracting the words responsible for the top 90% or so of discourse
      would certainly provide a much more robust measure of lexicon "size"
      than attempting to come up with a usable definition for including a
      given word in the so-called total lexicon. But I don't think it
      really measures the same thing we mean intuitively when talking about
      the size of a language's lexicon. It rather measures the number of
      basic lexical items one needs in a given situation. If you consider
      compounds and derivations to be independent lexical items, I'd expect
      all languages to end up displaying roughly similar numbers here. In
      that case you'd really be measuring how much overloading of meaning a
      language has for its words, at least if you've managed to analyse
      equivalent discourse events.
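
      A quick sketch of what I mean, in Python (assuming "tokens" is
      already a lemmatised list of strings; the lemmatisation itself is
      the language-specific part and isn't shown):

      from collections import Counter

      def coverage_vocabulary(tokens, coverage=0.90):
          # Count lemma frequencies and walk down from the most frequent
          # lemma until the requested share of the tokens is covered.
          counts = Counter(tokens)
          total = sum(counts.values())
          running = 0
          for rank, (lemma, freq) in enumerate(counts.most_common(), start=1):
              running += freq
              if running / total >= coverage:
                  return rank  # number of lemmas covering e.g. 90% of tokens
          return len(counts)

      # e.g. coverage_vocabulary(tokens, 0.80) for the 80% vocabulary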

      For a large sample of natural discourse you could also try to use a
      maximum statistic, namely the total number of distinct lexical items
      within the data. That sounds much less robust to the size of the
      dataset and the discourse topics it includes than the 90% vocabulary
      approach, however.
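
      A sketch of why I'd distrust it: counting distinct types over larger
      and larger slices of the same data makes the dependence on sample
      size visible (again assuming the same lemmatised "tokens" list):

      def type_counts_by_sample(tokens, steps=5):
          # Return (sample size, number of distinct lexical items) for
          # progressively larger prefixes of the data.
          results = []
          for i in range(1, steps + 1):
              prefix = tokens[: len(tokens) * i // steps]
              results.append((len(prefix), len(set(prefix))))
          return results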

      The way I would proceed would be to label each word (or whatever
      definition of a lexical item we'd be using) with its frequency in a
      dataset of fixed length and analyse the full distribution. In other
      words, I'd plot a histogram of the words across different frequency
      bins. From the breadth of the histogram you could see the number of
      core words used in basic discourse. There would also be the
      low-frequency tail of the distribution, which would reveal how much
      specialised vocabulary the speakers are comfortable using.
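
      A rough sketch of that histogram, binning per-word frequencies on a
      log scale (the log-spaced bins are just my choice, made so that the
      long low-frequency tail stays visible):

      import math
      from collections import Counter

      def frequency_histogram(tokens, n_bins=10):
          # Number of distinct words in each log-spaced frequency bin;
          # bin 0 holds the rare, specialised words, the top bins the
          # high-frequency core vocabulary.
          counts = Counter(tokens)
          log_max = math.log(max(counts.values()) + 1)
          bins = [0] * n_bins
          for freq in counts.values():
              idx = min(int(math.log(freq) / log_max * n_bins), n_bins - 1)
              bins[idx] += 1
          return bins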

      -Jyri
    • Patrick Dunn
      Message 47 of 47, Mar 27, 2013
        > Argue what you will, then.
        >
        > Padraic

        Thanks. Your permission means a lot to me.

        --Patrick

        --
        Second Person, a chapbook of poetry by Patrick Dunn, is now
        available for order from Finishing Line Press
        <http://www.finishinglinepress.com/NewReleasesandForthcomingTitles.htm>
        and Amazon
        <http://www.amazon.com/Second-Person-Patrick-Dunn/dp/1599249065/ref=sr_1_2?ie=UTF8&qid=1324342341&sr=8-2>.