
Re: Hyperlinking a dictionary to a corpus

  • Arthaey Angosii
    Message 1 of 13 , Nov 30, 2005
      Emaelivpeith Gary Shannon:
      > That looks nice. What is in the parentheses? I'm just
      > seeing square blocks like I don't have a necessary
      > font.

      In the brackets, you mean? That's Unicode IPA for the lexemes.

      > The first link to samples didn't work, but I got there
      > from the link on the dictionary page. It looks nice.

      The link should work now; it just took a bit for the DNS changes to go through.

      > The thing that would concern me is where words might
      > have two or more slightly different meanings, or
      > perhaps there are two or more ways to use a particular
      > word. That would require at least a couple of
      > sentences in those cases.

      Indeed. If you look at, for example, _-ad_
      (http://dictionary.arthaey.com/#-ad), it has 3 distinct meanings. Two
      of those meanings have example sentences, although admittedly they are
      not numbered to go with the numbered meanings.


      --
      AA
      http://conlang.arthaey.com/

      (Gmail WARNING: watch the Reply-To!)
    • Jim Henry
      Message 2 of 13 , Dec 1, 2005
        On 11/30/05, Gary Shannon <fiziwig@...> wrote:

        > To that end, I've written a program to create my
        > dictionary page and a sample sentence page from a
        > dictionary file and a corpus of sentences. It shows a

        I'd like to do that, & I've made vague plans for it
        -- the script that generates the lexicon nxcgtx.htm
        from the tab-delimited file nxcgtx.rdb puts in
        id attributes on the <DT> tags so I can put in
        links to individual dictionary entries later, from
        words in the texts; and all the newer sample
        texts and sample sentences in the grammar
        documents have link anchors all over the place
        to be linked to from the dictionary.
        But actually automatically
        generating all those links will take a fair
        bit of coding. It might be easier to
        create a special file of just sample sentences
        and generate the links to that rather than
        try to write a script that will search the grammar,
        semantics and translation documents for
        sample sentences, find the nearest link anchor to each, and
        generate links to be merged with the dictionary
        entries for each morpheme. I'll try out some different
        methods, perhaps.
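
A minimal sketch of the id-attribute step described above, in Python. It is not the actual script: it assumes the tab-delimited file has just two columns (word, gloss) and no header row, and the function name `lexicon_to_html` and the sample words are made up for illustration.

```python
import html

def lexicon_to_html(rdb_text):
    """Build a <DL> whose <DT> tags carry id attributes usable as link targets.

    Assumption: each non-blank line is 'word<TAB>gloss'.
    """
    out = ["<DL>"]
    for line in rdb_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed rows
        word, gloss = fields[0], fields[1]
        # The id is the word itself, so a text can link to lexicon.htm#word.
        out.append(f'<DT id="{html.escape(word, quote=True)}">{html.escape(word)}</DT>')
        out.append(f"<DD>{html.escape(gloss)}</DD>")
    out.append("</DL>")
    return "\n".join(out)

print(lexicon_to_html("kax\tdog\nmiu\tcat\n"))
```

Words containing spaces would need a different id scheme, since HTML ids may not contain whitespace.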

        --
        Jim Henry
        http://www.pobox.com/~jimhenry/gzb/gzb.htm
        ...Mind the gmail Reply-to: field
      • Gary Shannon
        Message 3 of 13 , Dec 1, 2005
          --- Jim Henry <jimhenry1973@...> wrote:

          > On 11/30/05, Gary Shannon <fiziwig@...> wrote:

          <snip>

          > It might be easier to
          > create a special file of just sample sentences
          > and generate the links to that rather than
          > try to write a script that will search the grammar,
          > semantics and translation documents for
          > sample sentences,

          Yes, this is what I did. I cut and pasted the sample
          sentences into a single corpus file, and then
          generated from that a sample sentence file that
          repeats each sentence as many times as there are words
          in the sentence. Thus, using English as an example, "I
          will run" will generate three separate lines in the
          samples file:

          i - I will run
          run - I will run
          will - I will run

          The anchor names are the words themselves so the
          dictionary entry for "run" only has to have the anchor
          "#run" added to the link to the single samples file.
          At this anchor will be all the sample sentences that
          contain the word "run". You can see the actual sample
          sentences file at http://fiziwig.com/concord.html
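
The scheme above could be sketched roughly like this in Python. This is not the program behind the page linked above, only an illustration: the names `build_concordance` and `concordance_lines` are hypothetical, words are found with a simple `\w+` match, and the cap of 20 samples per word comes from the description earlier in the thread.

```python
import re
from collections import defaultdict, deque

MAX_SAMPLES = 20  # adjustable cap on sample sentences per word

def build_concordance(sentences, max_samples=MAX_SAMPLES):
    """Map each lowercased word to the most recent sentences containing it."""
    # deque(maxlen=...) drops the oldest sentence once the cap is reached,
    # so newer sentences push older ones out.
    index = defaultdict(lambda: deque(maxlen=max_samples))
    for sentence in sentences:
        # set(): a word repeated within one sentence lists that sentence once
        # (a choice made for this sketch).
        for word in set(re.findall(r"\w+", sentence.lower())):
            index[word].append(sentence)
    return index

def concordance_lines(index):
    """One 'word - sentence' line per entry, sorted by word (the anchor name)."""
    return [f"{word} - {s}" for word in sorted(index) for s in index[word]]

for line in concordance_lines(build_concordance(["I will run"])):
    print(line)
```

In the HTML version each word would become an anchor name, so the dictionary entry only needs a link ending in `#word`.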

          --gary
        • Jörg Rhiemeier
          Message 4 of 13 , Dec 2, 2005
            Hallo!

            Gary Shannon wrote:

            > I had a nifty idea I thought I'd share in case anyone
            > cared to borrow it for their own conlang.
            >
            > I learn best by example and analogy so when I look up
            > a word in a dictionary I like to have ready access to
            > a handful of sentences that use that word so I can see
            > how the word actually behaves in the wild.

            That's a nice idea! An example often says more than a
            thousand words in a dictionary entry.

            > To that end, I've written a program to create my
            > dictionary page and a sample sentence page from a
            > dictionary file and a corpus of sentences. It shows a
            > maximum of 20 sample sentences (number is adjustable)
            > for a given word and the newer sentences push older
            > ones out of sight as time goes on, so that the sample
            > sentences continue to represent the most current usage
            > patterns.
            >
            > Here's what the hyperlinked Elomi dictionary (a work
            > in progress with a few bugs yet) looks like:
            > http://fiziwig.com/lexicon.html

            Good work!

            Greetings,

            Jörg.
          • Carsten Becker
            Message 5 of 13 , Dec 2, 2005
              On Thu, 01 Dec 2005, 12:43 CET, Gary Shannon wrote:

              > To that end, I've written a program to create my
              > dictionary page and a sample sentence page from a
              > dictionary file and a corpus of sentences. It shows a
              > maximum of 20 sample sentences (number is adjustable)
              > for a given word and the newer sentences push older
              > ones out of sight as time goes on, so that the sample
              > sentences continue to represent the most current usage
              > patterns.

              I haven't had a look at your dictionary yet, but do you
              think it would also be possible to program this using a MySQL
              database and PHP? It sounds like a nifty idea. How does the
              program know, though, how to assemble example sentences?
              I.e. AFAIU your Elomi is isolating and rather simplistic
              (at least it seems so at first sight), but what about
              more complex agglutinating or even inflecting languages?

              To link words in example sentences, you'd need a script that
              would split at *morpheme* boundaries and when the respective
              morpheme already exists in the dictionary, a link to this
              one is provided. Well, but I don't quite know how to tell a
              program how to split on morpheme boundaries the way the
              sentence is meant ... for example, when there's a prefix
              _a-_ and there are words beginning with _a_, all first a's
              of such words would be linked to the entry for that prefix
              ...

              Carsten

              --
              Keywords: dictionary, programming

              "Miranayam cepauarà naranoaris."
              (Calvin nay Hobbes)
            • Gary Shannon
              Message 6 of 13 , Dec 2, 2005
                --- Carsten Becker <naranoieati@...>
                wrote:


                <snip>

                > I.e. AFAIU your Elomi is isolating and rather
                > simplistic
                > (at least it seems so at first sight), but what
                > about
                > more complex agglutinating or even inflecting
                > languages?

                Elomi, which, for the record, is the invention of
                Larry Sulky, (I just play with it for fun because it
                sounds neat and is easy to learn) does have that
                advantage. I'm not sure how I'd do it for more complex
                languages. My own Latin-like Tazhu has all kinds of
                case endings and verb conjugations that would make a
                project like that a nightmare. I imagine the program
                would need some kind of cross reference tables to
                relate one form of the word to all the other forms so
                that (to use a Latin example) the table would have to
                equate "cogitare", "cogito", "cogitas", "cogitat",
                "cogitamus", etc., etc. to each other so that they
                would all generate the same HTML anchor name, probably
                "#cogitare".

                For an agglutinating language it would be a real
                nightmare! I don't even want to think about that one
                ;-)
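
For the inflected case, the cross-reference table idea might look something like the following Python sketch. The table holds only the Latin forms named above; the names `FORM_TO_LEMMA` and `anchor_for` are hypothetical, and falling back to the form itself when it is not in the table is an assumption of this sketch.

```python
# Every listed form points at its dictionary headword (lemma).
FORM_TO_LEMMA = {
    "cogitare": "cogitare",
    "cogito": "cogitare",
    "cogitas": "cogitare",
    "cogitat": "cogitare",
    "cogitamus": "cogitare",
}

def anchor_for(form, table=FORM_TO_LEMMA):
    """Return the HTML anchor shared by all forms of the same word."""
    key = form.lower()
    # Forms missing from the table are assumed to be their own headword.
    return "#" + table.get(key, key)

print(anchor_for("cogitamus"))  # all listed forms share one anchor
```

The hard part is filling the table, not using it: each inflected form has to be listed by hand or generated from paradigm rules.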

                --gary
              • Roger Mills
                Message 7 of 13 , Dec 2, 2005
                  Gary Shannon wrote:
                  > Elomi, which, for the record, is the invention of
                  > Larry Sulky, (I just play with it for fun because it
                  > sounds neat and is easy to learn) does have that
                  > advantage. I'm not sure how I'd do it for more complex
                  > languages. My own Latin-like Tazhu has all kinds of
                  > case endings and verb conjugations that would make a
                  > project like that a nightmare. I imagine the program
                  > would need some kind of cross reference tables to
                  > relate one form of the word to all the other forms so
                  > that (to use a Latin example) the table would have to
                  > equate "cogitare", "cogito", "cogitas", "cogitat",
                  > "cogitamus", etc., etc. to each other so that they
                  > would all generate the same HTML anchor name, probably
                  > "#cogitare".

                  Probably, yes. And shouldn't things like "morior, loquor" have not only
                  their own links, but also a link to a general discussion of deponent verbs??
                  It do get complicated.
                  >
                  > For an agglutinating language it would be a real
                  > nightmare! I don't even want to think about that one
                  > ;-)

                  I was looking at Kash today, to see how/what I might exemplify, and decided
                  that _every_ word probably doesn't need an example. Further, what to do about
                  sandhi changes (e.g. most suffixes have -C(vl)V, -(nasal)C(vd)V and -C(vl)rV
                  allomorphs, which are actually discussed on the Morphology page-- so perhaps
                  just a link to that). Otherwise I think I'll do a large bunch of
                  "organized" example sentences for some of the knottier words (like _mesa_
                  'one' which can be used in various constructions and meanings-- 'one of...',
                  'single, sole', 'first (ordinal)', 'first of all (sentential adv.)' etc.).
                  Other matters, like compound verbs, double accusatives etc. are or ought to
                  be discussed in the Syntax portion, which I haven't worked on for a long
                  time-- so links to that too. Ditto the as yet unwritten "Colloquial"
                  section.

                  More work for the winter season............:-))
                • Arthaey Angosii
                  Message 8 of 13 , Dec 3, 2005
                    Emaelivpeith Carsten Becker:
                    > How does the
                    > program know, though, how to assemble example sentences?
                    > I.e. AFAIU your Elomi is isolating and rather simplistic
                    > (at least it seems so at first sight), but what about
                    > more complex agglutinating or even inflecting languages?
                    >
                    > To link words in example sentences, you'd need a script that
                    > would split at *morpheme* boundaries and when the respective
                    > morpheme already exists in the dictionary, a link to this
                    > one is provided. Well, but I don't quite know how to tell a
                    > program how to split on morpheme boundaries the way the
                    > sentence is meant

                    Asha'ille is agglutinating, so I had a similar problem to what you
                    describe. I decided not to solve it programmatically, but rather to
                    add some minimal markup to my source text.

                    For example, the "word" |riyëvjosöte| (which means "Can you
                    understand?") consists of 4 morphemes and 2 ablauts (dunno if those
                    count as their own morphemes). To make a computer-generated
                    interlinear out of that agglutination (hehe, I like that word :P ), I
                    mark up the text I feed into the program:

                    Riy[e]{ë}v|["]|[-]j[-]|[-]o|[-]s[ó]{ö}te|["]|

                    Now, that looks pretty nasty, but then, a program that could easily
                    sort it all out 100% of the time would also look pretty nasty. ;)

                    I use brackets "[]" to write how the morpheme exists in the
                    dictionary, ignoring any surface changes. I use braces "{}" to write
                    surface changes that do not show up as such in the dictionary. I use
                    pipes "|" to mark morpheme boundaries. So, the above breaks down into:

                    Riy[e]{ë}v| -- looked up as "riyev", displayed in the interlinear as "riyëv"
                    ["]| -- looked up as ", not displayed in the interlinear
                    [-]j[-]| -- looked up as "-j-", displayed as just "j"
                    [-]o| -- looked up as "-o", displayed as "o"
                    [-]s[ó]{ö}te| -- looked up as "-sóte", displayed as "söte"
                    ["]| -- same as previous ["]

                    I think that Asha'ille is just simple enough, even though
                    agglutinating, that I could have written a program to figure it out
                    without the manual markup. But I had thought at the time that others
                    might want to use my script, or that I might come up with another,
                    more complicated language. So I stay with my manual markup. :)
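
For anyone curious how such markup could be read back, here is a small Python sketch based only on the three rules described above (brackets for dictionary-only text, braces for surface-only text, pipes for morpheme boundaries). It is not the actual script: `parse_marked_word` is a hypothetical name, and lowercasing both forms is an assumption taken from the "riyev" / "riyëv" line of the breakdown.

```python
import re

# One token is [dictionary-only], {surface-only}, or plain text shared by both.
TOKEN = re.compile(r"\[([^\]]*)\]|\{([^}]*)\}|([^\[\]{}]+)")

def parse_marked_word(marked):
    """Return a (lookup_form, display_form) pair for each morpheme."""
    morphemes = []
    for segment in marked.split("|"):
        if not segment:
            continue  # the trailing pipe leaves an empty final segment
        lookup, display = [], []
        for dict_only, surface_only, plain in TOKEN.findall(segment):
            lookup.append(dict_only + plain)      # brackets + plain text
            display.append(surface_only + plain)  # braces + plain text
        morphemes.append(("".join(lookup).lower(), "".join(display).lower()))
    return morphemes

for pair in parse_marked_word('Riy[e]{ë}v|["]|[-]j[-]|[-]o|[-]s[ó]{ö}te|["]|'):
    print(pair)
```

A morpheme whose display form comes out empty, like the `["]` entries, would simply be left out of the interlinear.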


                    --
                    AA
                    http://conlang.arthaey.com/
                  • Jim Henry
                    Message 9 of 13 , Dec 3, 2005
                      On 12/2/05, Carsten Becker <naranoieati@...> wrote:
                      > On Thu, 01 Dec 2005, 12:43 CET, Gary Shannon wrote:

                      > database and PHP? It sounds like a nifty idea. How does the
                      > program know, though, how to assemble example sentences?
                      > I.e. AFAIU your Elomi is isolating and rather simplistic
                      > (at least it seems so at first sight), but what about
                      > more complex agglutinating or even inflecting languages?

                      > To link words in example sentences, you'd need a script that
                      > would split at *morpheme* boundaries and when the respective
                      > morpheme already exists in the dictionary, a link to this
                      > one is provided. Well, but I don't quite know how to tell a
                      > program how to split on morpheme boundaries the way the
                      > sentence is meant ... for example, when there's a prefix
                      > _a-_ and there are words beginning with _a_, all first a's
                      > of such words would be linked to the entry for that prefix

                      Unless the conlang has a self-segregating morphology,
                      you probably need to go through the example sentences
                      file and manually mark them up with hyphens or some other
                      divider characters between morphemes. Then your
                      conversion script would produce links around
                      the morphemes and delete the hyphen characters.
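
As a rough illustration of that conversion step, a Python sketch follows. It assumes morphemes consist of word characters, that hyphens appear only as morpheme dividers, and that each morpheme's dictionary anchor is the morpheme itself; `link_morphemes`, the `lexicon.htm` file name and the sample words are hypothetical.

```python
import html
import re

# A word is one or more morphemes joined by hyphens.
WORD = re.compile(r"\w+(?:-\w+)*")

def link_morphemes(sentence, lexicon_url="lexicon.htm"):
    """Wrap each morpheme in a link to its entry and drop the hyphens."""
    def link_word(match):
        parts = match.group(0).split("-")
        # Joining with "" is what deletes the divider characters.
        return "".join(
            f'<a href="{lexicon_url}#{html.escape(p, quote=True)}">{html.escape(p)}</a>'
            for p in parts
        )
    # Punctuation and spaces between words are left untouched.
    return WORD.sub(link_word, sentence)

print(link_morphemes("ka-lo mi."))
```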

                      My sample sentences in gzb already have the hyphens
                      between morphemes; the tricky bit is that the
                      base version of the lexicon is in gzb's ASCII orthography
                      and the sample sentences are in Unicode. So
                      the conversion script needs to convert the sample
                      sentences from Unicode back to ASCII in memory
                      to match them to the dictionary's link anchors,
                      and vice versa.
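
The two-way conversion could be handled with a single mapping table, as in this Python sketch. The table here is NOT gzb's orthography; it borrows the Esperanto x-convention purely as a stand-in, and the names `TO_ASCII`, `to_ascii` and `to_unicode` are hypothetical. Only lowercase letters are covered.

```python
# Stand-in mapping (Esperanto x-convention), not the real gzb table.
TO_ASCII = {"ĉ": "cx", "ĝ": "gx", "ĥ": "hx", "ĵ": "jx", "ŝ": "sx", "ŭ": "ux"}

def to_ascii(text):
    """Unicode spelling -> ASCII spelling, to match dictionary anchors."""
    return text.translate(str.maketrans(TO_ASCII))

def to_unicode(text):
    """ASCII spelling -> Unicode spelling, for display."""
    for uni, asc in TO_ASCII.items():
        text = text.replace(asc, uni)
    return text

print(to_ascii("ĉevalo"), to_unicode("cxevalo"))
```

The simple `replace` loop is only safe when no ASCII sequence is a substring of another; an orthography with overlapping digraphs would need longest-match-first handling.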

                      Today I modified the script that formats glossed
                      sentences so it will automatically link each word
                      to the dictionary entry. I did that for the new
                      "Danti and the Donkey" text, but I don't think
                      I'll be redoing all the other texts that way
                      because most of them have had hand-edited
                      corrections to the HTML version that didn't
                      go back into the ASCII sources.

                      I need to make it also write output to a second
                      stream, in a format like

                      <gzb word> <target doc>.htm#<anchor for sample sentence>

                      then the output of the other stream can be appended
                      to a table of cross-references that can be merged
                      with the lexicon table and used by the lexicon
                      HTMLization script. I've started working
                      on that but it's not finished, and it will only
                      work on new glosses; I'll need another script
                      to generate anchor lists for existing sentences.
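
A sketch of what that second stream and the merge step might look like in Python, using the line format given above. It assumes words contain no whitespace; `xref_lines`, `merge_xrefs` and the sample document and anchor names are hypothetical.

```python
from collections import defaultdict

def xref_lines(words, target_doc, anchor):
    """One '<word> <target doc>.htm#<anchor>' line per word in a sentence."""
    return [f"{w} {target_doc}.htm#{anchor}" for w in words]

def merge_xrefs(lines):
    """Collect the cross-reference lines into a word -> list-of-links table."""
    table = defaultdict(list)
    for line in lines:
        word, target = line.split(None, 1)  # split on the first whitespace run
        if target not in table[word]:       # skip duplicates on re-runs
            table[word].append(target)
    return dict(table)

lines = xref_lines(["ka", "lo"], "danti", "s1") + xref_lines(["ka"], "danti", "s2")
print(merge_xrefs(lines))
```

The resulting table could then be joined to the lexicon table by word when the lexicon page is generated.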

                      --
                      Jim Henry
                      http://www.pobox.com/~jimhenry/gzb/gzb.htm
                      ...Mind the gmail Reply-to: field