Generic Dictionary Software

  • Logan Kearsley
    Message 1 of 13, Jul 1, 2012
      I finally got around to finishing up last Monday's really long episode
      of Conlangery, and got all excited about the talk of dictionaries.

      A bunch of people I work with recently attended the CALICO conference
      (https://calico.org/) on using technology in language instruction, and
      apparently one of the results of discussions there is that a bunch of
      NLP researchers want to put together a common API for NLP tools and
      resources, and one important type of resource is dictionaries. This of
      course means having a standard format for representing dictionary
      information. I immediately thought, "dang, conlangers have put a lot
      of work into trying to come up with good dictionary formats, I wonder
      how a bunch of NLP researchers are going to do," and then had a
      heckuva time trying to coherently explain the difficulty of
      reusing dictionary formats between wildly different languages.

      Given my current educational & employment situation, there's a good
      chance that I may end up doing research on parts of this imagined API
      in the fall, for which I could get money and/or academic credit. And
      if I don't end up researching dictionary formats, *somebody* will
      eventually, and if they do a good job then whatever standard is
      defined for that ought to make a great basis for a storage format for
      some conlanging dictionary tools. Of course, there's still
      output/display formatting to worry about, but I find that once you've
      got an appropriate data structure, figuring out how to lay out and
      style the data for human consumption is comparatively trivial.
      (If I *don't* end up on that sort of project, and nobody else gets
      around to it fast enough, I may be sufficiently interested to go the
      Kickstarter route on this.)

      So, I've now got both conlanging and serious academic interest in
      trying to solve this dictionary problem, and the programming skills to
      back it up. There's no need to try tricking linguists into thinking
      it's for them to drum up support, because it really genuinely is for
      them, but for conlangers, too.

      I took down notes of what sorts of things were requested from the
      participants in Conlangery, and got this basic list of features that
      relate to actual stored data:

      A lexical entry system that has fields for
      1) multiple definitions
      2) usage examples
      3) references to translations / texts that use the word
      4) tags

      Definitions for tags with information like "this is a guiding
      metaphor, things that work within this metaphor have this tag"
      Question- what all should be included in the description of a lexical
      item? This is the biggest point of difference between languages, I
      think.
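
      A minimal sketch of the entry structure listed above, assuming Python
      dataclasses. Every class and field name here is hypothetical, chosen
      only to illustrate the four kinds of stored data plus tag definitions:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TagDefinition:
    """Describes what a tag means, e.g. a guiding metaphor."""
    name: str
    description: str


@dataclass
class LexicalEntry:
    """One headword with the four kinds of stored data from the list."""
    headword: str
    definitions: List[str] = field(default_factory=list)
    usage_examples: List[str] = field(default_factory=list)
    # Identifiers of translations / texts that use the word.
    text_references: List[str] = field(default_factory=list)
    # Names of TagDefinition objects that apply to this entry.
    tags: List[str] = field(default_factory=list)


def entries_with_tag(entries, tag_name):
    """Return the headwords of all entries carrying the given tag."""
    return [e.headword for e in entries if tag_name in e.tags]


# Hypothetical example data.
time_is_money = TagDefinition(
    "TIME-IS-MONEY", "Guiding metaphor: time is treated as a currency."
)
entries = [
    LexicalEntry("spend", definitions=["to pay out", "to pass (time)"],
                 tags=["TIME-IS-MONEY"]),
    LexicalEntry("river", definitions=["a large stream"]),
]
```

      What else belongs in the entry beyond these fields is exactly the open
      question; this sketch deliberately leaves word-class data out.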

      And then these features for a program:
      A lexical entry editing system that's configurable for different
      languages (configurable how exactly?)
      Automatic searching for references & adding them to lexical entries.
      Generating English-Lang, Lang-English formatted output. (only works if
      entries contain sufficient data to match definitions in one language
      to entries in another language)
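
      The caveat about sufficient data can be made concrete. The sketch
      below assumes each entry stores short English glosses (an assumption,
      not something settled in this thread) and inverts a Lang-English
      mapping into an English-Lang one. The headwords are made up:

```python
from collections import defaultdict


def reverse_index(lexicon):
    """Invert a headword -> glosses mapping into gloss -> headwords.

    Both the glosses and the headwords under each gloss come out sorted,
    so the result can be printed directly as the English-Lang half.
    """
    index = defaultdict(list)
    for headword, glosses in lexicon.items():
        for gloss in glosses:
            index[gloss].append(headword)
    return {gloss: sorted(words) for gloss, words in sorted(index.items())}


# Hypothetical Lang-English data: each headword has one or more glosses.
lexicon = {
    "tavo": ["fish"],
    "meli": ["good", "sweet"],
    "sun": ["good"],
}
english_to_lang = reverse_index(lexicon)
```

      Entries that only carry long prose definitions would not invert
      usefully this way, which is the limitation noted above.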

      SIL Shoebox has also been mentioned as "terrible"; for people who have
      actually used it, what exactly is wrong with it?

      Any ideas for other things that ought to be included, both in terms of
      data that ought to be storeable and features that a program should
      have for manipulating it?

      -l.
    • George Corley
      Message 2 of 13, Jul 1, 2012
        What you have listed there is probably mandatory for most languages. I
        would add word class (/ part of speech) -- which needs user-defined values,
        and possibly an optional etymology variable. In addition, I think it is
        absolutely essential to have, in addition to these stock variables, the
        option to create user-defined variables. One of the biggest problems with
        any standardized dictionary format is that different languages have
        different needs. In one language, you may have a number of irregular forms
        that you'll need to turn on for certain words -- and said irregular forms
        need to be labelled appropriately -- but the software can't possibly list
        all the linguistic terminology it would need for that in a way that would
        in any way be easy to use.

      • George Corley
        Message 3 of 13, Jul 1, 2012
          On Sun, Jul 1, 2012 at 4:19 AM, George Corley <gacorley@...> wrote:

          > What you have listed there is probably mandatory for most languages. I
          > would add word class (/ part of speech) -- which needs user-defined values,
          > and possibly an optional etymology variable. In addition, I think it is
          > absolutely essential to have, in addition to these stock variables, the
          > option to create user-defined variables. One of the biggest problems with
          > any standardized dictionary format is that different languages have
          > different needs. In one language, you may have a number of irregular forms
          > that you'll need to turn on for certain words -- and said irregular forms
          > need to be labelled appropriately -- but the software can't possibly list
          > all the linguistic terminology it would need for that in a way that would
          > in any way be easy to use.


          Oh, I meant to say more: One language could need those irregular forms, but
          another might need gender information for nouns, or note what case an
          adposition takes. Still another might need a numeral classifier listed for
          every noun, or case alignment or argument structure information for every
          verb. It all depends on the language.

          Also, as regards Shoebox: It actually has much of what you'd need for a
          good dictionary. However, I have always found SIL's software to be quite
          buggy, and William has mentioned on the show several times that the use of
          those slash-codes makes it unnecessarily difficult. That is an interface
          problem, of course, but it's worth noting that ease-of-use is
          something desperately needed for dictionary software.
        • Wm Annis
          Message 4 of 13, Jul 1, 2012
            On Sun, Jul 1, 2012 at 3:08 AM, Logan Kearsley <chronosurfer@...> wrote:
            > Generating English-Lang, Lang-English formatted output. (only works if
            > entries contain sufficient data to match definitions in one language
            > to entries in another language)

            There may be more than one target translation language. Many of
            SIL's dictionaries for languages of PNG, for example, also have
            definitions in Tok Pisin. Here's a Bambara lexicon that uses both
            English and French:

            http://www.bambara.org/lexique/lexicon/main.htm

            The only conlang dictionary project I know that would use this is
            Na'vi.

            --
            William S. Annis
            www.aoidoi.org • www.scholiastae.org
          • Петр Рихардович Кларк
            Message 5 of 13, Jul 1, 2012
              On Sunday, 01 July, 2012 12:08:57 Logan Kearsley wrote:
              > So, I've now got both conlanging and serious academic interest in
              > trying to solve this dictionary problem, and the programming skills to
              > back it up. There's no need to try tricking linguists into thinking
              > it's for them to drum up support, because it really genuinely is for
              > them, but for conlangers, too.
              Great! I was recently thinking of doing something similar, but I simply
              don't have the time to tackle it. I think you will find yourself very popular
              if you manage to pull it off!

              > Definitions for tags with information like "this is a guiding
              > metaphor, things that work within this metaphor have this tag"
              > Question- what all should be included in the description of a lexical
              > item? This is the biggest point of difference between languages, I
              > think.
              See my answer ↓ below.

              > And then these features for a program:
              > A lexical entry editing system that's configurable for different
              > languages (configurable how exactly?)
              The editing system should be customizable, so that it only displays
              those fields that are needed for the language in question. For example,
              different languages have different parts of speech and different parts of speech
              will have different fields. So the user should be able to customize what parts
              of speech there are, and then customize the different "forms" or array of fields
              that are displayed with the various parts of speech.
              So, for example, if I were customizing it for English, I would at a
              minimum need to create three different forms: one each for nouns, verbs, and
              everything else. (Or, more likely, the forms for articles, conjunctions,
              adjectives, and adverbs would be similar, if not identical, to each other.)
              When I go to create a new entry, I am given a choice of forms, based on part
              of speech. So I choose "Verb", and the form for verbs appears, with fields for
              the infinitive, past tense, past participle, and their pronunciations,
              definition, normal usage, idiomatic usage, etc.
              Clearly, you can't predict what languages will need which fields, so
              either have a domain-specific language to create the fields, or simply have the
              user define the fields; the program doesn't even need to know what "past
              participle" means, just that there's a field for it, and that field is displayed
              with the form for verbs.
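
              A rough sketch of the "user defines the fields" option, assuming
              a plain mapping from part of speech to field names. The
              configuration and the function name are hypothetical; the point
              is that the program treats field names as opaque labels:

```python
# Hypothetical per-language configuration: part of speech -> ordered fields.
ENGLISH_FORMS = {
    "Noun": ["singular", "plural", "pronunciation", "definition"],
    "Verb": ["infinitive", "past tense", "past participle", "pronunciation",
             "definition", "normal usage", "idiomatic usage"],
    "Other": ["form", "pronunciation", "definition"],
}


def blank_entry(forms, part_of_speech):
    """Return an empty entry with exactly the fields configured for the POS.

    The program never interprets a field name such as "past participle";
    it only knows which form the field is displayed with.
    """
    if part_of_speech not in forms:
        raise KeyError(f"no form configured for {part_of_speech!r}")
    fields = {name: "" for name in forms[part_of_speech]}
    return {"part of speech": part_of_speech, **fields}


verb_entry = blank_entry(ENGLISH_FORMS, "Verb")
```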

              > Any ideas for other things that ought to be included, both in terms of
              > data that ought to be storeable and features that a program should
              > have for manipulating it?
              Cross-platform. No question. It could be either stand-alone or
              browser-based; if browser-based, it needs to be able to run locally without requiring
              a lot of setup -- in other words, if I want to run it on my desktop, I don't
              need to set up Apache or something. Python has built-in web server
              capabilities, others probably are similar. There shouldn't be a lot of
              external dependencies, and any dependencies should be very forgiving, i.e., I
              don't need the latest and greatest extreme bleeding edge, but can use a
              library from three or four years ago. But it's best if it's as self-contained
              as possible.
              The default data format should be plain text (not a binary blob or
              database), preferably something easy for humans like JSON or YAML. XML is
              another option, but the general trend seems to be to avoid it unless
              absolutely necessary.
              Should have the ability to include images.
              Should be able to create a style sheet for exporting to HTML, ODF, and
              PDF, so that we can have good-looking dictionaries without needing to
              post-process them.
              :Peter
            • taliesin the storyteller
              Message 6 of 13, Jul 1, 2012
                On 2012-07-01 10:08, Logan Kearsley wrote:
                > A bunch of people I work with recently attended the CALICO conference
                > (https://calico.org/) on using technology in language instruction, and
                > apparently one of the results of discussions there is that a bunch of
                > NLP researchers want to put together a common API for NLP tools and
                > resources, and one important type of resource is dictionaries.

                Let me guess, the only languages these NLPers know well are safely
                Indo-European?

                They already know of GOLD I hope? http://linguistics-ontology.org/
                TEI? http://www.tei-c.org/index.xml

                > SIL Shoebox has also been mentioned as "terrible"; for people who have
                > actually used it, what exactly is wrong with it?

                I use it a lot. It takes a bit to get used to and it is clearly built by
                dictionary makers, not programmers or interface experts. When looked at
                with CS-goggles, it is not systematic enough, there are too many manual
                operations and it is too easy to put an element in the wrong order so
                that converting to other media becomes hard.

                On the other hand, a CS-based dictionary program would probably simplify
                too much and hide or throw away the very complexity that a linguist
                lives to analyse. There's a reason Shoebox comes out of the box with two
                different marker hierarchies, one where subentries (derived words) are
                found before/above the sense numbers, and one where they go
                after/beneath the sense-numbers.

                Here's a test-case for you: Greenlandic. There's this paper dictionary
                at the local library, it is *different*. I think I uploaded a scan of a
                page sometime...

                Some books:
                "A Handbook of Lexicography: The Theory and Practice of
                Dictionary-Making" by Bo Svensén
                "Dictionaries: The Art and Craft of Lexicography" by Sidney I. Landau
                "The Oxford Guide to Practical Lexicography" by B. T. Sue Atkins
                "Practical Lexicography: A Reader" by Thierry Fontenelle


                t.
              • David McCann
                Message 7 of 13, Jul 1, 2012
                  On Sun, 1 Jul 2012 19:32:33 +0400
                  Петр Рихардович Кларк <pyotr.klark@...> wrote:

                  > The default data format should be plain text (not a binary
                  > blob or database), preferably something easy for humans like
                  > JSON or YAML. XML is another option, but the general trend seems to
                  > be to avoid it unless absolutely necessary.

                  The OED uses SGML.
                • Amanda Babcock Furrow
                  Message 8 of 13, Jul 1, 2012
                    On Sun, Jul 01, 2012 at 02:08:57AM -0600, Logan Kearsley wrote:

                    > Any ideas for other things that ought to be included, both in terms of
                    > data that ought to be storeable and features that a program should
                    > have for manipulating it?

                    Do you have an Athabaskan dictionary such as Young and Morgan's _Analytical
                    Lexicon of Navajo_ or Jetté and Jones' _Koyukon Athabaskan Dictionary_
                    handy? (These are pricy!) The different organizational approach required
                    by a language where you can't actually list all the possible words would
                    probably be an eye-opener for any dictionary-formatting effort.

                    tylakèhlpë'fö,
                    Amanda
                  • Logan Kearsley
                    Message 9 of 13, Jul 2, 2012
                      On 1 July 2012 02:19, George Corley <gacorley@...> wrote:
                      > What you have listed there is probably mandatory for most languages.  I
                      > would add word class (/ part of speech) -- which needs user-defined values,
                      > and possibly an optional etymology variable.

                      Oh, yes. I mentioned wrt BRAT the usefulness of configurable POS
                      information, and that certainly applies here. So, we'll need some kind
                      of configuration format for POS information.

                      >  In addition, I think it is
                      > absolutely essential to have, in addition to these stock variables, the
                      > option to create user-defined variables.

                      In order to fit the NLP-research goal, whatever the dictionary stores
                      has to be enough that the semantics are transparent- and the data
                      therefore useful- to a machine. That pretty much means no arbitrary
                      extra fields outside of a well-defined schema. But the point of a
                      configuration system for POS data would be to provide a meta-schema by
                      which the semantics of some language-specific POS fields could be
                      defined, so if there's need for any other extra variables, some sort
                      of configuration system can probably be devised to describe them.

                      > In one language, you may have a number of irregular forms
                      > that you'll need to turn on for certain words -- and said irregular forms
                      > need to be labelled appropriately -- but the software can't possibly list
                      > all the linguistic terminology it would need for that in a way that would
                      > in any way be easy to use.

                      I was thinking that a list of paradigm slots could be part of the
                      definition of any particular word class. It would be really neat if we
                      could come up with some universally useful notation for morphological
                      processes such that the software could fill those in automatically.
                      But even without that, those slots for each lexical entry could have
                      values of "REGULAR", "DEFECTIVE", or a hand-entered irregular form.
                      The slots could be numbered in terms of preference for the citation
                      form- e.g., for verbs, normally use the infinitive as the citation
                      form, but fall back on some other form for defective verbs without
                      infinitives.
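
                      A minimal sketch of the slot idea, assuming slots are
                      listed in order of citation-form preference and each
                      entry stores "REGULAR", "DEFECTIVE", or a hand-entered
                      form per slot. The function names and the regular rule
                      are hypothetical; no universal morphological notation is
                      attempted here:

```python
REGULAR = "REGULAR"
DEFECTIVE = "DEFECTIVE"

# Hypothetical word-class definition: slots in citation-form preference order.
VERB_SLOTS = ["infinitive", "present", "past"]


def citation_slot(slots, entry_forms):
    """Return the first slot, in preference order, that is not defective."""
    for slot in slots:
        if entry_forms.get(slot, REGULAR) != DEFECTIVE:
            return slot
    return None


def slot_form(slot, entry_forms, regular_rule):
    """Resolve one slot: None if defective, generated if regular, else as stored."""
    value = entry_forms.get(slot, REGULAR)
    if value == DEFECTIVE:
        return None
    if value == REGULAR:
        return regular_rule(slot)
    return value  # hand-entered irregular form


def walk_rule(slot):
    """Stand-in for real morphology: a lookup table for one regular verb."""
    return {"infinitive": "walk", "present": "walk", "past": "walked"}[slot]


# A defective verb with no infinitive falls back to the next slot.
can = {"infinitive": DEFECTIVE, "present": "can", "past": "could"}
```

                      Slots missing from an entry are treated as regular, so a
                      fully regular verb needs no per-slot data at all.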

                      On 1 July 2012 02:24, George Corley <gacorley@...> wrote:
                      > Oh, I meant to say more: One language could need those irregular forms, but
                      > another might need gender information for nouns, or note what case an
                      > adposition takes.  Still another might need a numeral classifier listed for
                      > every noun, or case alignment or argument structure information for every
                      > verb.  It all depends on the language.

                      I was thinking that a language configuration would define a hierarchy
                      of nested word-classes, with each node having a bunch of variables
                      with enumerated values defined that are relevant for that word class.
                      I don't have it totally solid in my head (probably won't until I play
                      with the idea for a while), but something like:

                      NOUN:
                        Variables:
                          Gender: enumerated m, f, n
                          Classifier: CLASSIFIER
                        Forms: sg 1, pl 2

                      PREPOSITION:
                        Variables:
                          case: enumerated from cases
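
                      One way the nested word-class idea might look as data,
                      sketched under assumptions: the key names ("parent",
                      "variables", "enumerations") and the PROPER_NOUN class
                      are invented for illustration, and only enumerated
                      variables are covered (the classifier reference and the
                      forms list are left out):

```python
# Hypothetical language configuration with one level of nesting.
CONFIG = {
    "enumerations": {"cases": ["nom", "acc", "dat"]},
    "classes": {
        "NOUN": {"parent": None, "variables": {"gender": ["m", "f", "n"]}},
        "PROPER_NOUN": {"parent": "NOUN", "variables": {}},
        # A string value names a shared enumeration instead of listing values.
        "PREPOSITION": {"parent": None, "variables": {"case": "cases"}},
    },
}


def variables_for(config, word_class):
    """Collect a class's variables, inheriting from its ancestors."""
    chain = []
    name = word_class
    while name is not None:
        chain.append(name)
        name = config["classes"][name]["parent"]
    merged = {}
    for name in reversed(chain):  # ancestors first, so children can override
        for var, allowed in config["classes"][name]["variables"].items():
            if isinstance(allowed, str):
                allowed = config["enumerations"][allowed]
            merged[var] = list(allowed)
    return merged


def validate(config, word_class, values):
    """Return a list of problems; an empty list means the values fit the schema."""
    allowed = variables_for(config, word_class)
    return [f"{var}={val!r} not allowed"
            for var, val in values.items()
            if var not in allowed or val not in allowed[var]]
```

                      Because every variable has a declared value set, a
                      machine can check entries without knowing what "gender"
                      or "case" means.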

                      > Also, as regards Shoebox: It actually has much of what you'd need for a
                      > good dictionary.  However, I have always found SIL's software to be quite
                      > buggy, and William has mentioned on the show several times that the use of
                      > those slash-codes makes it unnecessarily difficult.  That is an interface
                      > problem, of course, but it's worth noting that ease-of-use is
                      > something desperately needed for dictionary software.

                      I suppose I shall have to look into it for data structure inspiration,
                      then, if not for user-interface design.

                      -l.
                    • Logan Kearsley
                      Message 10 of 13, Jul 2, 2012
                        On 1 July 2012 09:32, Петр Рихардович Кларк <pyotr.klark@...> wrote:
                        > On Sunday, 01 July, 2012 12:08:57 Logan Kearsley wrote:
                        >> So, I've now got both conlanging and serious academic interest in
                        >> trying to solve this dictionary problem, and the programming skills to
                        >> back it up. There's no need to try tricking linguists into thinking
                        >> it's for them to drum up support, because it really genuinely is for
                        >> them, but for conlangers, too.
                        >         Great! I was recently thinking of doing something similar, but I simply
                        > don't have the time to tackle it. I think you will find yourself very popular
                        > if you manage to pull it off!

                        If you've got any time at all, I will probably put my efforts up on
                        GitHub, and allow you (or anyone else who cares) to help out with it.

                        >> And then these features for a program:
                        >> A lexical entry editing system that's configurable for different
                        >> languages (configurable how exactly?)
                        >         The editing system should be customizable, so that it only displays
                        > those fields that are needed for the language in question. For example,
                        > different languages have different parts of speech and different parts of speech
                        > will have different fields. So the user should be able to customize what parts
                        > of speech there are, and then customize the different "forms" or array of fields
                        > that are displayed with the various parts of speech.

                        Right; if you've got configuration files / a configuration editor for
                        defining language-specific word classes / morphological information,
                        then the user interface for editing entries ought to automatically
                        alter itself to conform to the needs of the schema defined in that
                        configuration. That shouldn't be too difficult, although making it
                        look pretty is definitely not my strong point.

                        >         Clearly, you can't predict what languages will need which fields, so
                        > either have a domain-specific language to create the fields, or simply have the
                        > user define the fields; the program doesn't even need to know what "past
                        > participle" means, just that there's a field for it, and that field is displayed
                        > with the form for verbs.

                        Ideally, the program could work out some relevant bits of what "past
                        participle" means, though. E.g., by including POS information for
                        every paradigm field for a particular word class. Not that that's
                        necessary for the conlangers' dictionary, though.

                        >         The default data format should be plain text (not a binary blob or
                        > database), preferably something easy for humans like JSON or YAML. XML is
                        > another option, but the general trend seems to be to avoid it unless
                        > absolutely necessary.

                        I don't like XML. We've been steadily migrating away from XML-based
                        formats at work for the last couple of years. I'm most likely to use
                        JSON, 'cause it's simple and supported by pretty much everything, and
                        the easiest option if it does turn out to be a web-based application.
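
                        A small sketch of what JSON storage could look like,
                        using Python's standard json module. The entry layout
                        and the headword are hypothetical; ensure_ascii=False
                        keeps non-ASCII orthography readable in the file:

```python
import json

# Hypothetical entry; the key names are illustrative, not a proposed standard.
entry = {
    "headword": "kèla",
    "word_class": "NOUN",
    "definitions": ["river", "flow of time"],
    "tags": ["TIME-IS-A-RIVER"],
}

# Serialize as human-readable text, leaving non-ASCII characters unescaped.
text = json.dumps(entry, ensure_ascii=False, indent=2, sort_keys=True)

# Reading it back yields an equal structure.
restored = json.loads(text)
```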

                        >         Should have the ability to include images.

                        Ah! That's something I had not thought of. Data URIs could do that.
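
                        For the data URI idea, a minimal sketch with the
                        standard base64 and mimetypes modules. The function
                        name is hypothetical, and embedding large images this
                        way would inflate a text file considerably:

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a data: URI, guessing the type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # fallback for unknown extensions
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Placeholder bytes standing in for a real image file's contents.
uri = to_data_uri(b"not really a PNG", "glyph.png")
```

                        The resulting string can be stored in an ordinary JSON
                        field and dropped straight into an HTML img src on
                        export.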

                        >         Should be able to create a style sheet for exporting to HTML, ODF, and
                        > PDF, so that we can have good-looking dictionaries without needing to
                        > post-process them.

                        Oh, absolutely. I'd need help writing an exporter for ODF, though.

                        On 1 July 2012 10:43, taliesin the storyteller <taliesin-conlang@...> wrote:
                        > On 2012-07-01 10:08, Logan Kearsley wrote:
                        >>
                        >> A bunch of people I work with recently attended the CALICO conference
                        >> (https://calico.org/) on using technology in language instruction, and
                        >> apparently one of the results of discussions there is that a bunch of
                        >> NLP researchers want to put together a common API for NLP tools and
                        >> resources, and one important type of resource is dictionaries.
                        >
                        >
                        > Let me guess, the only languages these NLPers know well are safely
                        > Indo-European?

                        Not having met all of them, I wouldn't know. I know one guy who knows
                        Maori, and a few who do Japanese, though.

                        > They already know of GOLD I hope? http://linguistics-ontology.org/
                        > TEI? http://www.tei-c.org/index.xml

                        I expect so. GOLD ought to be a useful reference for this sort of
                        project; the problems with TEI, though, are that 1) it is explicitly
                        not a standard, just a set of guidelines, so there's no way to
                        build a machine to reliably parse a document just because it claims to
                        be TEI compliant, and 2) it's a really big meta-format; it doesn't
                        actually define what needs to be stored, just how to store it after
                        you've decided what you need. I was instructed to investigate it for
                        various projects last year, and in the end decided that, for those
                        reasons, it would be more time- and cost-effective to just define our
                        own proprietary formats for specific projects from scratch. Which
                        worked just fine in the short term, but of course continues to
                        contribute to the problem that a standard NLP API is intended to
                        solve.

                        Keep in mind, this is *not* a generic text-document markup problem.
                        For both the research and conlangery purposes, it's first a
                        storage-schema-definition problem. A schema and markup for translating
                        the stored data into a document comes second.

                        On 1 July 2012 11:59, Amanda Babcock Furrow <langs@...> wrote:
                        > On Sun, Jul 01, 2012 at 02:08:57AM -0600, Logan Kearsley wrote:
                        >
                        >> Any ideas for other things that ought to be included, both in terms of
                        >> data that ought to be storeable and features that a program should
                        >> have for manipulating it?
                        >
                        > Do you have an Athabaskan dictionary such as Young and Morgan's _Analytical
                        > Lexicon of Navajo_ or Jetté and Jones' _Koyukon Athabaskan Dictionary_
                        > handy?  (These are pricy!)  The different organizational approach required
                        > by a language where you can't actually list all the possible words would
                        > probably be an eye-opener for any dictionary-formatting effort.

                        Neither of those, but there does seem to be an "Ahtna Athabaskan
                        Dictionary" in the library. I might go check it out. But this gives me
                        an idea: could we put together a collection of example dictionary
                        entries from wildly different languages? Possibly with short
                        explanations of how they're set up and why they're set up that way. A
                        couple of scanned entries per dictionary with added explanatory text
                        ought to fall well within copyright fair use terms. Would that be the
                        kind of community project that the LCS would provide web space for?

                        -l.
                      • David Peterson
                        Message 11 of 13 , Jul 2, 2012
                          Delurking (sorry to note that I mostly lurk now).

                          On Jul 2, 2012, at 1:08 PM, Logan Kearsley wrote:

                          > Neither of those, but there does seem to be an "Ahtna Athabaskan
                          > Dictionary" in the library. I might go check it out. But this gives me
                          > an idea: could we put together a collection of example dictionary
                          > entries from wildly different languages? Possibly with short
                          > explanations of how they're set up and why they're set up that way. A
                          > couple of scanned entries per dictionary with added explanatory text
                          > ought to fall well within copyright fair use terms. Would that be the
                          > kind of community project that the LCS would provide web space for?

                          Yes indeed. Sounds like something I'd like to contribute to—and I bet others would, as well. I'm going to contact you offlist about setting something up.

                          David Peterson
                          LCS President
                          president@...
                          www.conlang.org
                        • Logan Kearsley
                          Message 12 of 13 , Jul 3, 2012
                            I could not sleep last night because of thinking about this, so I got
                            up and wrote it all down. I think I've got a much more solid idea of
                            what might be needed for a conlanger's dictionary tool, though I'm
                            less certain of how well it would fit into NLP research goals, and it can
                            certainly still use a lot more work.

                            Configurations would be of the form:
                            """

                            CLASS
                            SUBCLASS
                            "Features":
                            Name: values
                            "Forms": [(Name, POS Info), etc.]
                            "Rules": [(Name, Rule), etc.]
                            "Lemmas": [list of form names]

                            @list enumeration, of, values

                            """

                            (Not necessarily actually written in that syntax, but with those
                            fields and structure.)

                            The things in quotes are actual literal field names, other things are
                            to-be-filled-in. Subclasses would be optional, and features & forms
                            could be specified at the class and/or subclass level.
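A minimal sketch of that configuration structure as Python dataclasses, assuming one object per word class; the class name `WordClass`, the method names, and the sample noun class are all hypothetical, and only the five field names come from the outline above.

```python
from dataclasses import dataclass, field


@dataclass
class WordClass:
    name: str
    features: dict = field(default_factory=dict)    # "Features": name -> values
    forms: list = field(default_factory=list)       # "Forms": [(name, POS info)]
    rules: list = field(default_factory=list)       # "Rules": [(form name, rule)]
    lemmas: list = field(default_factory=list)      # "Lemmas": form names by priority
    subclasses: dict = field(default_factory=dict)  # optional: name -> WordClass

    def effective_forms(self, subclass=None):
        """Forms defined on the class, plus those added by one subclass."""
        forms = list(self.forms)
        if subclass is not None:
            forms += self.subclasses[subclass].forms
        return forms

    def effective_features(self, subclass=None):
        """Class-level features, overridden or extended by one subclass."""
        merged = dict(self.features)
        if subclass is not None:
            merged.update(self.subclasses[subclass].features)
        return merged


# Hypothetical noun class with one optional subclass.
noun = WordClass(
    name="NOUN",
    features={"number": ["singular", "plural"]},
    forms=[("singular", "N.SG"), ("plural", "N.PL")],
    lemmas=["singular", "plural"],
    subclasses={
        "PROPER": WordClass(
            name="PROPER",
            features={"animate": ["yes", "no"]},
            forms=[("vocative", "N.VOC")],
        ),
    },
)
```

Because every field has a default, a subclass only states what it adds, which matches the idea that features and forms may live at either level.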

                            The values for a feature can take the form of an explicit enumerated
                            list, or the name of some other class or a separately defined list.
                            E.g., nouns in a language that uses classifiers for counting could
                            have a 'classifier' feature whose values are 'any word in class
                            CLASSIFIER'. Or, one might have a class for, say, participles, which
                            have a feature "form of" whose possible values are 'any word in class
                            VERB'. Separately defined lists would be used for features that show
                            up throughout the language, like any kind of agreement. E.g., a
                            language with grammatical gender might define a list @Genders*, and
                            then just refer to that list by name whenever it is needed for
                            enumerating the possible values for gender.

                            There should also be the option of implicitly specifying lists/grids
                            of forms, in addition to explicitly named forms. E.g., verbs might
                            have their forms specified as "@Tenses X @Persons", and the software
                            would then know that it needs to generate form-slots for every
                            combination of tense and person for every verb.

                            Lemmas is a prioritized list of form-names that indicates which is the
                            citation form of the word. Most of the time, the first one in the list
                            would be used, but if that form is missing in a defective paradigm,
                            then the next would be used instead, and so forth.
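The grid expansion and the lemma fallback can be sketched as follows. This assumes named lists live in a plain dict, that grid slots are named by joining the axis values with a dot, and that an absent form is stored as None; the list contents and word forms are made up for illustration.

```python
from itertools import product

# Separately defined lists, i.e. "@Tenses" and "@Persons" (sample values).
lists = {
    "Tenses": ["past", "present", "future"],
    "Persons": ["1", "2", "3"],
}


def expand_grid(spec):
    """Expand a spec such as '@Tenses X @Persons' into one slot per combination."""
    axes = [lists[part.strip().lstrip("@")] for part in spec.split(" X ")]
    return [".".join(combo) for combo in product(*axes)]


def citation_form(lemma_priority, forms):
    """Return the first form in priority order that the lexeme actually has.

    `forms` maps slot names to strings; None marks a form as absent.
    """
    for name in lemma_priority:
        if forms.get(name) is not None:
            return forms[name]
    return None


slots = expand_grid("@Tenses X @Persons")  # 3 tenses x 3 persons = 9 slots

# A defective verb lacking its usual citation form falls back to the next one.
defective_verb = {"present.1": None, "past.1": "taka"}
lemma = citation_form(["present.1", "past.1"], defective_verb)  # "taka"
```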

                            Forms go together with rules. Rules are specified separately, rather
                            than as part of the definition of the form, so that they can be
                            assigned for auto-generated grid forms** and so that different rules
                            can be specified in each subclass for a common set of forms defined on
                            a class. I'm not totally sure what rules would look like yet, but it's
                            some way of writing down how to automatically generate a regular form
                            based on other forms of a lexeme. If a form does not have a rule
                            associated with it, then the software would always ask you to fill it
                            in for lexemes of that particular class; that's how you'd, e.g.,
                            specify "principal parts". You would then have the option to fill in
                            any other form explicitly (thus allowing for irregulars), or to mark a
                            form as absent (thus indicating that the lexeme has a defective
                            paradigm).
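One way the lookup order could work, as a sketch only: since the shape of rules is still undecided, a rule is modeled here as a Python callable that receives a getter for other forms of the same lexeme. That modeling, the `ABSENT` sentinel, and the `NeedsInput` exception are all assumptions, and the English examples are just for illustration.

```python
ABSENT = object()  # sentinel: the lexeme has no such form (defective paradigm)


class NeedsInput(Exception):
    """Raised when a form has neither a stored value nor a rule."""


def resolve_form(name, stored, rules):
    """Resolve one form: an explicit value wins, then a rule, else ask the user."""
    if name in stored:
        value = stored[name]
        return None if value is ABSENT else value
    if name in rules:
        # The rule sees other forms through a getter, so rules can chain.
        return rules[name](lambda other: resolve_form(other, stored, rules))
    raise NeedsInput(name)


# "stem" has no rule, so it acts as a principal part that must be entered.
# A rule whose inputs are absent would need extra handling; omitted here.
rules = {"plural": lambda get: get("stem") + "s"}

regular = {"stem": "cat"}                        # plural generated by rule
irregular = {"stem": "ox", "plural": "oxen"}     # explicit override
defective = {"stem": "music", "plural": ABSENT}  # form marked as absent
```

Keeping the rules in their own mapping, apart from the stored forms, is what lets a subclass swap in different rules for the same set of form names.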

                            There would be one special list for "representations" whose items
                            would be things like "native script", "IPA transcription", etc. Every
                            form would then have multiple possible representations which would
                            have to be filled in separately (unless there're rules defined to
                            automatically generate one from another). This allows for storing
                            spelling vs. pronunciation, for example.

                            The dictionary itself would contain Lexical entries and Note entries.

                            Note entries would just be names and some free-form text, mainly
                            intended for describing things like important cultural points or
                            common metaphors in the language.

                            Lexical entries would have fields determined by the word class
                            configurations and a list of structures for actual definitions /
                            senses. At both the Class and Sense level there would be an optional
                            "Notes" field for free-form text not intended to be
                            machine-interpretable. There would also be an optional "Etymology"
                            field that could be specified either at the Class level or for each
                            individual sense which would give the type of derivation (descent from
                            parent language, borrowing, calquing, etc.) and the original form +
                            source language up to one step back; you don't need more than one step
                            because the whole chain of etymology can then be automatically
                            constructed by looking up the dictionary entries for the next form
                            back up the chain.
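A sketch of the one-step-back idea, assuming entries are keyed by (language, form) and each stores at most a single etymology step; the key layout, field names, and sample words are hypothetical.

```python
# Each entry records only its immediate source, never the full history.
dictionary = {
    ("proto", "kata"): {"etymology": None},
    ("old", "kada"): {
        "etymology": {"type": "descent", "source_lang": "proto", "source_form": "kata"},
    },
    ("modern", "kaa"): {
        "etymology": {"type": "descent", "source_lang": "old", "source_form": "kada"},
    },
}


def etymology_chain(lang, form, dictionary):
    """Follow single-step etymologies back as far as the dictionary allows."""
    chain = []
    seen = set()  # guards against accidental cycles in the data
    key = (lang, form)
    while key in dictionary and key not in seen:
        seen.add(key)
        step = dictionary[key].get("etymology")
        if step is None:
            break
        chain.append((step["type"], step["source_lang"], step["source_form"]))
        key = (step["source_lang"], step["source_form"])
    return chain


chain = etymology_chain("modern", "kaa", dictionary)
# [("descent", "old", "kada"), ("descent", "proto", "kata")]
```

If a source form has no entry of its own (a borrowing from an undocumented language, say), the chain simply ends at that step.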

                            Each sense would contain a list of (definition, language), thus
                            allowing for auto-generation of lang1-lang2 and lang2-lang1 dictionary
                            output, and a list of Examples. Examples would contain either a
                            literal string of text using the word, or else a reference (some sort
                            of URI) to its use in some corpus, an optional list of tags
                            associating the example with Notes, and an optional list of Media
                            resources (images, sound clips of the example being spoken, etc.).

                            Lexical entries that share the same lemma form could be grouped
                            together for output/display.
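A sketch of how (definition, language) pairs could drive both output directions, and how entries sharing a lemma end up grouped; the entry layout, function names, and vocabulary are made up for illustration.

```python
from collections import defaultdict

# Hypothetical entries: two share the lemma "kaa" but differ in word class.
entries = [
    {"lemma": "kaa", "class": "NOUN",
     "senses": [{"definitions": [("house", "en"), ("Haus", "de")]}]},
    {"lemma": "kaa", "class": "VERB",
     "senses": [{"definitions": [("to dwell", "en")]}]},
    {"lemma": "miru", "class": "NOUN",
     "senses": [{"definitions": [("house", "en")]}]},
]


def forward_index(entries, language):
    """Conlang -> `language`: entries sharing a lemma are grouped under it."""
    index = defaultdict(list)
    for entry in entries:
        for sense in entry["senses"]:
            for text, lang in sense["definitions"]:
                if lang == language:
                    index[entry["lemma"]].append((entry["class"], text))
    return dict(index)


def reverse_index(entries, language):
    """`language` -> conlang: each definition points back at its lemmas."""
    index = defaultdict(list)
    for entry in entries:
        for sense in entry["senses"]:
            for text, lang in sense["definitions"]:
                if lang == language:
                    index[text].append(entry["lemma"])
    return dict(index)


english_side = forward_index(entries, "en")
# {"kaa": [("NOUN", "house"), ("VERB", "to dwell")], "miru": [("NOUN", "house")]}
reverse_side = reverse_index(entries, "en")
# {"house": ["kaa", "miru"], "to dwell": ["kaa"]}
```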

                            That's all I can think of right now. Keep in mind this is completely
                            in terms of abstract data structures. It would still remain to design
                            some kind of templates or stylesheets for actually outputting a
                            printable, human-readable dictionary.

                            Anybody got example lexicography problems that cannot be adequately
                            represented by this system? I don't as yet have a good idea of how
                            best to encode syntactic information like argument structure or
                            subcategorizations. Also, anybody got ideas for how to represent rules
                            for morphological processes?

                            -l.

                            *making use of '@' in its Perl-sigil sense to denote the name of a list
                            **although, perhaps there's no real savings if you have to actually
                            type out all of the rules for the entire grid anyway.
                          • Roger Mills
                            Message 13 of 13 , Jul 3, 2012
                              If you want, you could take a look at my two "dictionaries" and see where they would fit in.

                              http://cinduworld.tripod.com/anakrangota.htm (essentially the way I like it :-)))  Compounds are often irregular, and there are quite a few morpho-phonological rules  in http://cinduworld.tripod.com/kashphon.htm

                              http://cinduworld.tripod.com/prevliwordlist.htm (still in progress)


                              ________________________________
                              From: Logan Kearsley <chronosurfer@...>

                              (big snip)
                              Anybody got example lexicography problems that can not be adequately
                              represented by this system? I don't as yet have a good idea of how
                              best to encode syntactic information like argument structure or
                              subcategorizations. Also, anybody got ideas for how to represent rules
                              for morphological processes?