Re: Chomskybot

  • Dustfinger Batailleur
    Message 1 of 14, Apr 1 8:50 AM
      http://en.wikipedia.org/wiki/Zipf's_law

      The above sounds convenient for this purpose.
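
      A minimal sketch of Zipf-weighted word sampling, with a made-up
      mini-lexicon and exponent s = 1 (Python):

      import random

      # Hypothetical mini-lexicon; rank r gets weight 1/r (Zipf's law, s = 1).
      lexicon = ["na", "tok", "zhol", "mirau", "quessit", "fandorel"]
      weights = [1.0 / rank for rank in range(1, len(lexicon) + 1)]

      def random_word():
          # random.choices draws proportionally to the supplied weights.
          return random.choices(lexicon, weights=weights, k=1)[0]

      print(" ".join(random_word() for _ in range(12)))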


      On 1 April 2013 11:47, Gary Shannon <fiziwig@...> wrote:

      > I think to pass for a real language the easiest way to approach a
      > conlangbot would be to have it build a lexicon of a couple thousand words,
      > making a clear distinction between word types (parts of speech). Each word
      > should have a specific, but randomly generated frequency of occurrence.
      >
      > Then there should be an actual generative grammar that would build random
      > sentences in a systematic way. There would not have to be a systematic
      > phonology since it would only be a written language, but there would need
      > to be a systematic orthography and morphology.
      >
      > The grammar might even include systematic construction of noun cases, verb
      > tenses, grammatical gender, structure of compound sentences, dependent
      > clauses, formulaic structures like "AS large AS..." or "should never have
      > been..." etc. The lexicon should also include common idioms and set
      > phrases.
      > .
      > In other words, it would essentially be a complete conlang, but without any
      > semantic content. It would be all grammar and no meaning.
      >
      > --gary
      >
      >
      > On Sun, Mar 31, 2013 at 9:13 PM, Daniel Myers <doc@...> wrote:
      >
      > >
      > > I wrote a program quite a while back that generated nonsense based on
      > > text from Lovecraft's Cthulhu mythos. I had code that deconstructed the
      > > source text and built new words based on letter and syllable patterns,
      > > and then used the new words to construct sentence fragments. Those
      > > fragments were then used to build text the same way as done by the
      > > Chomskybot.
      > >
      > > I suspect that it would pass at least some of the statistical tests for
      > > a language, but its output is indeed gibberish.
      > >
      > > http://www.medievalcookery.com/bookofrefreshments/cgi/pnakotic.pl
      > >
      > > The text generator is currently using the same sentence fragments
      > > over and over, but I could probably combine the code for all the
      > > steps pretty easily. Here's a sample of the output:
      > >
      > > Zadai tsag nagn, coggoth ign zhol nkothai zophkaul gnuntath ph'ngla
      > > shaagai wgua schi unai bung mnihi sauth yoth ngegn ygguthu nilabbai
      > > nkaattath tsun ki yuttaong bung glath. R'loih untor mnath cfigar
      > > wgobbiggua bui gho ph'nglac shuak paghyaem: shoghoir yiddah gnichos
      > > untognaghal cthol nkobhoggighagl untognaghal hun gnichos ngharibbor
      > > gnichos chagh coth gnuntath iteg ph'nglonthoiphki kagoqiggak sagl pnafl
      > > yalha ithara gnosti yaggar ghoth. Cfagua, yalha chogn pna yokeh nuh,
      > > waag untognaghal n'gaulattoa gnichos tastoa gnichos untognaghal wua
      > > llenao schoh'nong ghag glar lauth wgua fezaddogogg zos ghogai. Nagons
      > > tah gho ph'nglac tayazoa nuggal wguth pna yarlah ghabbugn gnichos
      > > coggoth rhuk nkur cothlegn zolommoggir llokang r'lyeh ugn gnuntath
      > > bothal mglw'noba choghoquth gafh ki aghaoh unai og ybbaufh lornar unai
      > > ki fu. Llo hagn, shoghoir saghon fohang mglw'nethoith nonagg cfto ki hib
      > > zadai op unai ki zoltgm.
      > >
      > > - Doc
      > >
      > >
      > > > -------- Original Message --------
      > > > From: Gary Shannon <fiziwig@...>
      > > > Date: Sun, March 31, 2013 11:55 pm
      > > >
      > > > I'd be interested in seeing a conlangbot that put out text that
      > > > passes all statistical tests for an actual language, but isn't.
      > > > Kinda like a Voynich generator, but with a known alphabet.
      > > >
      > > > --gary
      > > >
      > > >
      > > > On Sun, Mar 31, 2013 at 7:33 PM, Padraic Brown <elemtilas@...> wrote:
      > > >
      > > > > --- On Sun, 3/31/13, Dustfinger Batailleur <dustfinger42@...> wrote:
      > > > >
      > > > > > This is actually amazing.
      > > > > >
      > > > > > http://en.wikipedia.org/wiki/Chomskybot
      > > > >
      > > > > Yeah. That is amazingly comprehensible. If this bot puts out
      > > > > linguistic gibberish, then one wonders what sorts of rubbish the
      > > > > actual linguist put out? ;))
      > > > >
      > > > > Padraic
      > > > >
      > > > >
      > >
      >
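
      A rough sketch of the letter-and-syllable recombination step Doc
      describes above -- a guess at the approach, not his actual code, with a
      made-up syllabifier and seed text (Python):

      import random
      import re

      # Hypothetical stand-in for the Lovecraft source corpus.
      source = "cthulhu fhtagn wgah nagl rlyeh shoggoth nyarlathotep"

      # Crude syllabifier: consonant cluster plus vowel run, or a final coda.
      chunks = []
      for word in source.split():
          chunks += re.findall(r"[^aeiou]*[aeiou]+|[^aeiou]+$", word)

      def new_word():
          # Glue two to four harvested chunks into a novel word.
          return "".join(random.choice(chunks) for _ in range(random.randint(2, 4)))

      print(" ".join(new_word() for _ in range(8)))
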
    • Garth Wallace
      Message 2 of 14, Apr 1 8:56 AM
        On Mon, Apr 1, 2013 at 8:47 AM, Gary Shannon <fiziwig@...> wrote:
        > I think to pass for a real language the easiest way to approach a
        > conlangbot would be to have it build a lexicon of a couple thousand words,
        > making a clear distinction between word types (parts of speech). Each word
        > should have a specific, but randomly generated frequency of occurrence.
        >
        > Then there should be an actual generative grammar that would build random
        > sentences in a systematic way. There would not have to be a systematic
        > phonology since it would only be a written language, but there would need
        > to be a systematic orthography and morphology.
        >
        > The grammar might even include systematic construction of noun cases, verb
        > tenses, grammatical gender, structure of compound sentences, dependent
        > clauses, formulaic structures like "AS large AS..." or "should never have
        > been..." etc. The lexicon should also include common idioms and set phrases.
        > .
        > In other words, it would essentially be a complete conlang, but without any
        > semantic content. It would be all grammar and no meaning.

        That's what I was thinking. I'm not sure you'd even need to come up
        with a morphology though. You could just use lists of static
        wordforms. It wouldn't be all that difficult actually.
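
        A minimal sketch along those lines, with a made-up toy grammar and
        static wordform lists -- no morphology at all (Python):

        import random

        grammar = {
            "S":  [["NP", "VP"]],
            "NP": [["det", "noun"], ["det", "adj", "noun"]],
            "VP": [["verb", "NP"], ["verb"]],
        }
        words = {
            "det":  ["ka", "su"],
            "adj":  ["melith", "vorn"],
            "noun": ["dakar", "ilu", "preth"],
            "verb": ["zomai", "hestu"],
        }

        def expand(symbol):
            if symbol in words:                       # terminal: pick a wordform
                return [random.choice(words[symbol])]
            production = random.choice(grammar[symbol])
            return [w for part in production for w in expand(part)]

        print(" ".join(expand("S")).capitalize() + ".")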
      • Alex Fink
        Message 3 of 14, Apr 1 9:01 AM
          On Mon, 1 Apr 2013 08:47:49 -0700, Gary Shannon <fiziwig@...> wrote:

          >I think to pass for a real language the easiest way to approach a
          >conlangbot would be to have it build a lexicon of a couple thousand words,
          >making a clear distinction between word types (parts of speech). Each word
          >should have a specific, but randomly generated frequency of occurrence.
          >
          >Then there should be an actual generative grammar that would build random
          >sentences in a systematic way. There would not have to be a systematic
          >phonology since it would only be a written language, but there would need
          >to be a systematic orthography and morphology.
          >
          >The grammar might even include systematic construction of noun cases, verb
          >tenses, grammatical gender, structure of compound sentences, dependent
          >clauses, formulaic structures like "AS large AS..." or "should never have
          >been..." etc. The lexicon should also include common idioms and set phrases.
          >.
          >In other words, it would essentially be a complete conlang, but without any
          >semantic content. It would be all grammar and no meaning.

          What you describe here is quite cool -- and fits right in with my random-language-generation megaproject; semantics is hàrd so something approximating this is probably the furthest I'll ever get in actuality -- but would probably be an immense lot of work.

          I think that to pass for a real language, the _easiest_ way would be something like projects you've suggested around here earlier: pick a random passage in a random natlang from the internet or something, then massage it by enough randomly generated local rewrite rules (think sound changes, but they don't have to be plausible sound changes; 'swap "b" and "ch"' is fine too) until any resemblance to the original has entirely receded from detectability.

          Alex
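
          A minimal sketch of this massage-by-rewrite-rules idea; the rule
          generator and the passage are placeholders (Python):

          import random

          def random_rules(n):
              # Each rule rewrites one random letter or digraph as another --
              # no phonological plausibility required.
              pool = list("abcdefghijklmnopqrstuvwxyz") + ["ch", "sh", "th"]
              return [(random.choice(pool), random.choice(pool)) for _ in range(n)]

          def mangle(text, rules):
              for old, new in rules:    # apply in order, like sound changes
                  text = text.replace(old, new)
              return text

          passage = "the quick brown fox jumps over the lazy dog"  # any natlang text
          print(mangle(passage, random_rules(15)))
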
        • Gary Shannon
          Message 4 of 14, Apr 1 10:37 AM
            On Mon, Apr 1, 2013 at 9:01 AM, Alex Fink <000024@...> wrote:

            > On Mon, 1 Apr 2013 08:47:49 -0700, Gary Shannon <fiziwig@...> wrote:
            >
            > >I think to pass for a real language the easiest way to approach a
            > >conlangbot would be to have it build a lexicon of a couple thousand words,
            > >making a clear distinction between word types (parts of speech).
            > >Each word should have a specific, but randomly generated frequency
            > >of occurrence.
            > ----
            >
            > What you describe here is quite cool -- and fits right in with my
            > random-language-generation megaproject; semantics is hàrd so something
            > approximating this is probably the furthest I'll ever get in actuality --
            > but would probably be an immense lot of work.
            >
            > I think that to pass for a real language, the _easiest_ way would be
            > something like projects you've suggested around here earlier: pick a random
            > passage in a random natlang from the internet or something, then massage it
            > by enough randomly generated local rewrite rules (think sound changes, but
            > they don't have to be plausible sound changes; 'swap "b" and "ch"' is fine
            > too) until any resemblance to the original has entirely receded from
            > detectability.
            >
            > Alex
            >

            Another easy approach would be like "MadLibs". Just have a collection of a
            few hundred sentences, or phrases with some "conjunctions" that could be
            used to paste simple ones together into longer ones. Then just fill the
            empty slots from a random lexicon with the appropriate frequency
            distribution.

            [name], [job title] who is the [adj] [occupation] for [organization] where
            the [event] occurred, told me that [noun] was less [adjective] this month
            as [event] continued, saying it was [comparative adj] than the [noun]
            [verb-ed] on a [noun].

            Then, of course, you could construct similar templates using nonsense words
            with slots to be filled from one of several different lexical classes based
            on Zipf-type frequency distribution.

            Ur [class 1] myogin da [class 3a] sen [class 2], [class 2] [class 4], num
            da [class 1] ka+[root 7]+ya min.

            --gary
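
            A minimal sketch of the template filler; the templates and lexicon
            here are made up, and the word choice could be Zipf-weighted as
            sketched earlier (Python):

            import random

            # Hypothetical templates with numbered lexical-class slots.
            templates = [
                "Ur [1] myogin da [3] sen [2], [2] [4], num da [1] min.",
                "Sen [2] da [1] hos [4] ur [3] min.",
            ]
            lexicon = {
                "1": ["brag", "tolu", "hesk"], "2": ["andi", "vor"],
                "3": ["quel", "mish"],         "4": ["senno", "kip"],
            }

            def fill(template):
                out = template
                for slot, choices in lexicon.items():
                    while "[" + slot + "]" in out:
                        # fill each occurrence of the slot independently
                        out = out.replace("[" + slot + "]", random.choice(choices), 1)
                return out

            print(fill(random.choice(templates)))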
          • Alex Fink
            Message 5 of 14, Apr 1 11:32 AM
              On Mon, 1 Apr 2013 10:37:18 -0700, Gary Shannon <fiziwig@...> wrote:

              >On Mon, Apr 1, 2013 at 9:01 AM, Alex Fink <000024@...> wrote:
              >
              >> On Mon, 1 Apr 2013 08:47:49 -0700, Gary Shannon <fiziwig@...> wrote:
              >>
              >> >I think to pass for a real language the easiest way to approach a
              >> >conlangbot would be to have it build a lexicon of a couple thousand words,
              >> >making a clear distinction between word types (parts of speech).
              >> >Each word should have a specific, but randomly generated frequency
              >> >of occurrence.
              >> ----
              >>
              >> I think that to pass for a real language, the _easiest_ way would be
              >> something like projects you've suggested around here earlier: pick a random
              >> passage in a random natlang from the internet or something, then massage it
              >> by enough randomly generated local rewrite rules (think sound changes, but
              >> they don't have to be plausible sound changes; 'swap "b" and "ch"' is fine
              >> too) until any resemblance to the original has entirely receded from
              >> detectability.
              >
              >Another easy approach would be like "MadLibs". Just have a collection of a
              >few hundred sentences, or phrases with some "conjunctions" that could be
              >used to paste simple ones together into longer ones. Then just fill the
              >empty slots from a random lexicon with the appropriate frequency
              >distribution.
              >
              >[name], [job title] who is the [adj] [occupation] for [organization] where
              >the [event] occurred, told me that [noun] was less [adjective] this month
              >as [event] continued, saying it was [comparative adj] than the [noun]
              >[verb-ed] on a [noun].
              >
              >Then, of course, you could construct similar templates using nonsense words
              >with slots to be filled from one of several different lexical classes based
              >on Zipf-type frequency distribution.
              >
              >Ur [class 1] myogin da [class 3a] sen [class 2], [class 2] [class 4], num
              >da [class 1] ka+[root 7]+ya min.

              Mm, maybe you could dig up a copy of old Racter <https://en.wikipedia.org/wiki/Racter> and gibberish its data files appropriately.

              But here are some ways I might try to tell those apart, at least with large corpora (of course, the larger the corpus, the harder it is to pass):
              - Can you see the sentence types repeating? :-p
              - Are there correlations between pairs of content words which tend to appear in proximity? Natlang passages will have topics, so you're liable to get a bunch of food words together, or of political vocab together, or whatnot.

              Alex
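
              The second test could be automated roughly as below: a ratio of
              observed to independence-predicted pair frequency (the scoring
              choice is mine, not Alex's; Python):

              from collections import Counter
              from itertools import combinations

              def cooccurrence_scores(sentences):
                  # How much oftener does a word pair co-occur than chance predicts?
                  word_counts, pair_counts = Counter(), Counter()
                  for sent in sentences:
                      words = set(sent.lower().split())
                      word_counts.update(words)
                      pair_counts.update(combinations(sorted(words), 2))
                  n = len(sentences)
                  return {(a, b): (c / n) / ((word_counts[a] / n) * (word_counts[b] / n))
                          for (a, b), c in pair_counts.items() if c > 1}

              # Gibberish should hover near 1 everywhere; real text shows
              # topical pairs (food words, political vocab) well above 1.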
            • Ralph DeCarli
              Message 6 of 14, Apr 1 1:48 PM
                On Sun, 31 Mar 2013 20:55:35 -0700
                Gary Shannon <fiziwig@...> wrote:

                > I'd be interested in seeing a conlangbot that put out text that
                > passes all statistical tests for an actual language, but isn't.
                > Kinda like a Voynich generator, but with a known alphabet.
                >
                > --gary
                >
                That sounds a little like DadaDodo.

                From the website:

                "DadaDodo is a program that analyses texts for word probabilities,
                and then generates random sentences based on that. Sometimes these
                sentences are nonsense; but sometimes they cut right through to the
                heart of the matter, and reveal hidden meanings."

                http://www.jwz.org/dadadodo/

                Ralph
                --
                omnivore@... ==> Ralph L. De Carli

                Have you heard of the new post-neo-modern art style?
                They haven't decided what it looks like yet.
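
                That description amounts to a word-level Markov chain; a
                minimal sketch of that idea, not DadaDodo's actual code
                (hypothetical corpus; Python):

                import random
                from collections import defaultdict

                def build_chain(text):
                    # Record which words were seen following each word.
                    chain = defaultdict(list)
                    words = text.split()
                    for a, b in zip(words, words[1:]):
                        chain[a].append(b)
                    return chain

                def babble(chain, length=20):
                    word = random.choice(list(chain))
                    out = [word]
                    for _ in range(length - 1):
                        nexts = chain[word]
                        if not nexts:              # dead end: restart anywhere
                            nexts = list(chain)
                        word = random.choice(nexts)
                        out.append(word)
                    return " ".join(out)

                corpus = "the old man saw the last man and the old man left"
                print(babble(build_chain(corpus)))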
              • Gary Shannon
                Message 7 of 14, Apr 1 2:32 PM
                  On Mon, Apr 1, 2013 at 11:32 AM, Alex Fink <000024@...> wrote:

                  > ---
                  >
                  > Mm, maybe you could dig up a copy of old Racter <
                  > https://en.wikipedia.org/wiki/Racter> and gibberish its data files
                  > appropriately.
                  >
                  > But here are some ways I might try to tell those apart, at least with
                  > large corpora (of course, the larger the corpus, the harder it is to pass):
                  > - Can you see the sentence types repeating? :-p
                  > - Are there correlations between pairs of content words which tend to
                  > appear in proximity? Natlang passages will have topics, so you're liable
                  > to get a bunch of food words together, or of political vocab together, or
                  > whatnot.
                  >
                  > Alex
                  >

                  Maybe content words could be grouped into "topics" (where topic does not
                  refer to anything real, of course, but just to a chance association between
                  groups of words). Of course if this type of reasoning is carried too far
                  then the "gibberish" might start to become meaningful, just because it
                  meets the statistical criteria for meaningful text. Then the task would be
                  to discover the meaning. Kind of like picking up a book in some language
                  you know nothing about and trying to figure out the meaning without a
                  dictionary or reference grammar.

                  One way to construct conlang gibberish automatically would be to collect
                  three-word sequences from a source language and then look for collocations
                  shared between groups of words. For example, "the old man", "the last man",
                  "the unhappy man",... would imply that {old, last, unhappy} constitutes a
                  set of interchangeable words. We can call it category 1. Now a template
                  "the [1] man" could be created. Further analysis might reveal that "man"
                  also has many other words that occur in similar contexts (say category 2)
                  so that the template can be generalized to read "the [1] [2]", and so on.
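
                  A minimal sketch of this category discovery, keying each word
                  on the two-word frame around it (toy input; Python):

                  from collections import defaultdict

                  def discover_categories(text):
                      # Group words that fill the middle slot of the same frame.
                      words = text.lower().split()
                      frames = defaultdict(set)
                      for left, mid, right in zip(words, words[1:], words[2:]):
                          frames[(left, right)].add(mid)
                      return {f: mids for f, mids in frames.items() if len(mids) > 1}

                  sample = "the old man saw the last man and the unhappy man left"
                  print(discover_categories(sample))
                  # -> {('the', 'man'): {'old', 'last', 'unhappy'}}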

                  By cross-referencing the data in every way imaginable you should be able to
                  come up with sub-templates that might look like "[3] [12] [6]" where those
                  three numbers refer to three word categories. Then that pattern could be
                  concatenated with another pattern that shared the last two category
                  numbers, e.g. "[12] [6] [9]" giving the pattern "[3] [12] [6] [9]", and so
                  on until an "end-of-sentence" flag was found in a pattern. Then words could
                  be extracted from the lexicon and plugged into the template to create a
                  "sentence".

                  This, of course, is just a second order Markov chain, but with the
                  generated strings being a list of word categories rather than actual words.
                  Then a second pass would select words from each given category to generate
                  the actual text. That second pass could filter for "topic".
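
                  A minimal sketch of the two passes; the category sentences and
                  lexicon below are made-up stand-ins for the analysis output,
                  with 0 as the end-of-sentence flag (Python):

                  import random
                  from collections import defaultdict

                  cat_sentences = [[3, 12, 6, 9, 0], [3, 12, 6, 2, 9, 0], [12, 6, 9, 2, 0]]
                  lexicon = {2: ["na"], 3: ["ur", "ka"], 6: ["min"],
                             9: ["da"], 12: ["sen"], 0: ["."]}

                  # First pass: second-order chain over categories.
                  chain = defaultdict(list)
                  for sent in cat_sentences:
                      for a, b, c in zip(sent, sent[1:], sent[2:]):
                          chain[(a, b)].append(c)

                  def generate():
                      cats = list(random.choice(list(chain)))
                      while cats[-1] != 0 and (cats[-2], cats[-1]) in chain:
                          cats.append(random.choice(chain[(cats[-2], cats[-1])]))
                      # Second pass: pick a word per category (could filter by "topic").
                      return " ".join(random.choice(lexicon[c]) for c in cats)

                  print(generate())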

                  Using this method a bogus hybrid conlang could be generated automatically
                  by scanning a large sample of a source text in language A (say Hungarian)
                  to extract the sentence-level word chain statistics, and a second language
                  B (say Spanish) to extract the syllable-level word generation statistics.
                  Then the bogolang would have the large scale statistical properties of
                  Hungarian with the word-level orthographic properties of Spanish.

                  --gary