Unicode encoding of Tengwar

  • david_hopwood2002
    Message 1 of 11 , Jan 5, 2002
      Just a note to point out that there is some discussion of encoding
      of Tengwar in Unicode (in particular, how to handle the different
      semantics of tehtar in Quenya, Sindarin and Beleriand-style modes)
      on the Unicode mailing list:

      Subscription:
      <http://www.unicode.org/unicode/consortium/distlist.html>
      Archive: <http://groups.yahoo.com/group/unicode/messages/>

      Here is a slightly modified copy of my proposal in that thread -
      please cc: any replies to <david.hopwood@...>. Note that
      the Unicode list does not allow postings unless you've subscribed.


      Kenneth Whistler wrote:
      > Michael Everson wrote:
      > > http://www.evertype.com/standards/iso10646/pdf/tengwar-vowels.pdf
      > > http://www.evertype.com/standards/iso10646/pdf/tengwar.pdf
      >
      > Maybe I haven't read these carefully enough, but it appears to
      > me that the analysis you provide in tengwar-vowels.pdf (which I
      > find myself in agreement with) doesn't match your statement about
      > the vowels (tehtar) in tengwar.pdf,

      Note that there are three different proposals in the two papers.

      > where you claim the tehtar
      > are not combining marks, that the logical order is the same for
      > both Quenya and Sindarin modes, and that display is a matter of
      > picking out ligatures that ligate the tehtar with preceding base
      > letters in Quenya and following base letters in Sindarin.

      An alternative (fourth) approach encodes the tehtar twice: a set of
      "preceding tehtar", and a set of "following tehtar". Equivalently,
      the vowels can be thought of as being in three sets: Quenya-style,
      Sindarin-style and Beleriand-style (the correspondence between these
      sets should be reflected in the encoding).

      The fact that some tehtar precede the character they are applied to
      is not really a significant problem IMHO; there are already far more
      complicated cases of conjoining characters that don't follow a simple
      "base + diacritic" model in Unicode (e.g. in Hangul and Tamil).
      AFAICS, the main reason for the "combining characters follow the base
      character" rule is to allow for consistent canonicalisation, but that
      is not a problem here, I don't think: the preceding tehtar can be
      given combining class 0.
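
The canonicalisation point can be checked against characters that already exist. The following is a minimal sketch, not part of the proposal: Tengwar has no code points, so COMBINING ACUTE ACCENT, COMBINING DOT BELOW and TAMIL VOWEL SIGN E are used as stand-ins, and the helper name `combining_classes` is made up for illustration. It shows that normalization only reorders marks with non-zero combining classes, and that a class-0 character is never moved.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
TAMIL_E = "\u0bc6"    # TAMIL VOWEL SIGN E (drawn to the left of its consonant)


def combining_classes(text):
    """Return the canonical combining class of each character in text."""
    return [unicodedata.combining(ch) for ch in text]


# Two marks with non-zero classes are sorted into canonical order (220 before 230).
unordered = "a" + ACUTE + DOT_BELOW
reordered = unicodedata.normalize("NFD", unordered)

# A class-0 character between them acts as a barrier, so nothing is reordered.
blocked = "a" + ACUTE + TAMIL_E + DOT_BELOW
unchanged = unicodedata.normalize("NFD", blocked)

print(combining_classes(unordered))
print(reordered == "a" + DOT_BELOW + ACUTE)
print(unchanged == blocked)
```

A preceding tehta given class 0 would behave like the Tamil vowel sign here: normalization would leave it where the author put it.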

      Here are some of Michael Everson's examples encoded using this
      approach:

      +e = following tehta above
      _e = following tehta below
      e+ = preceding tehta above
      e = full vowel

      language/style      word              encoding
      -----------------------------------------------------
      Quenya              nelde             n +e ld +e
      Quenya              neltildi          n +e l t +i ld +i
      Sindarin            neled             n e+ l e+ d
      Sindarin            nelthil           n e+ l th i+ l
      Beleriand           neled             n e l e d
      Beleriand           nelthil           n e l th i l
      English/Quenya      animal            ^ +a n +i m +a l
      English/Sindarin    animal            a+ n i+ m a+ l
      English/Beleriand   animal            a n i m a l
      Old English         mihton            m i+ h ZWJ t o+ n
      Old English         <thorn><ae>re     th ae+ r _e

      [A much nicer .gif illustration of this is at
      <http://www.users.zetnet.co.uk/hopwood/unicode/tengwar.gif>.]
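
The notation above can be read mechanically. The following is a rough sketch of how the tokens might be grouped into display clusters, with each tehta attached to the tengwa it sits on. The grouping rules are an assumption (the detailed grapheme rules are snipped from this post), and the names `classify` and `clusters` are hypothetical.

```python
def classify(token):
    """Classify one token of the ad hoc notation used in the table."""
    if token == "ZWJ":
        return "joiner"
    if len(token) > 1 and token[0] in "+_":
        return "following"   # +e or _e: sits on the previous tengwa
    if len(token) > 1 and token.endswith("+"):
        return "preceding"   # e+: sits on the next tengwa
    return "base"            # consonant, full vowel, or carrier (^)


def clusters(encoding):
    """Group a space-separated encoding into lists of tokens drawn together."""
    out = []
    pending = []        # preceding tehtar still waiting for their base
    join_next = False   # set after ZWJ: the next base joins the previous cluster
    for token in encoding.split():
        kind = classify(token)
        if kind == "preceding":
            pending.append(token)
        elif kind == "base":
            if join_next and out and not pending:
                out[-1].append(token)
            else:
                out.append(pending + [token])
            pending = []
            join_next = False
        elif pending or not out:
            # A following tehta or joiner with nothing to attach to stands alone.
            out.append(pending + [token])
            pending = []
        else:
            out[-1].append(token)
            join_next = kind == "joiner"
    if pending:
        # A preceding tehta at the end of the string is left as its own cluster.
        out.append(pending)
    return out


for sample in ["n +e ld +e", "n e+ l e+ d", "m i+ h ZWJ t o+ n", "th ae+ r _e"]:
    print(sample, "->", clusters(sample))
```

Under these assumed rules the Quenya and Sindarin spellings differ only in which neighbour each tehta joins, and the stored order stays the same as the spoken order in both.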

      A minor modification is needed to the grapheme breaking rules.
      [snip some technical details]

      If a "preceding tehta" is at the end of a string, it is treated as a
      spacing character. COMBINING DOT ABOVE and COMBINING ACUTE ACCENT are
      used for Beleriand as in tengwar.pdf.


      Advantages:
      - preserves the logical structure (including the language style of
        each word, and therefore the pronunciation).

      - no need for two font types, unlike the proposal in tengwar.pdf.

      - each logical element of the script corresponds to exactly one
        Unicode character (except for use of ZWJ for true ligatures).

      [I've changed my mind since posting this - ligature-forming should be
      performed by default, and ZWNJ (zero-width non-joiner) should be used
      to inhibit ligatures. So the correspondence between logical elements
      and Unicode characters is even closer.]

      - straightforward one-to-one transliteration without reordering is
        possible between the Quenya, Sindarin, and Beleriand styles, and
        Latin script (except for adding carriers [*]).

      - no problems for collation: it's easy to sort this encoding
        according to pronunciation. Carriers would be ignorable.

      - completely straightforward input - the language style determines
        which vowel character is produced when the user types a given
        Latin vowel.

      - more natural encoding of vowels following a consonant for Old
        English; use both preceding and following tehtar as appropriate
        (see "<thorn><ae>re" example above, which would have to be encoded
        as "th ae r ZWJ e-below" in the tengwar.pdf proposal).

      - no problems with canonicalisation or grapheme breaking, provided
        preceding tehtar are given the correct properties. Grapheme breaks
        reflect the syllabic language structure.

      [*] Alternatively, two consecutive tehtar, or a following tehta at
          the beginning of a word, or a preceding tehta at the end of a
          word, could be considered to imply a carrier. The advantage of
          that would be closer correspondence to the underlying language,
          but it requires more complex rendering; on the whole I think
          carriers should probably be encoded explicitly.
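
The transliteration, input and collation points in the list above can be sketched together. This is only an illustration under stated assumptions: it reuses the ad hoc notation from the examples (`+e`, `e+`, `^`), handles plain vowels only, inserts an explicit carrier wherever a tehta has no tengwa to sit on, and the names `encode` and `collation_key` are hypothetical.

```python
VOWELS = set("aeiou")
CARRIER = "^"
STYLES = ("quenya", "sindarin", "beleriand")


def encode(letters, style):
    """Map a list of Latin letter groups to the notation for one style."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style}")
    out = []
    for i, letter in enumerate(letters):
        if letter not in VOWELS or style == "beleriand":
            out.append(letter)            # consonants and full vowels as-is
        elif style == "quenya":
            # Following tehta: needs a carrier if no consonant comes before it.
            if i == 0 or letters[i - 1] in VOWELS:
                out.append(CARRIER)
            out.append("+" + letter)
        else:
            # Sindarin, preceding tehta: needs a carrier if no consonant follows.
            out.append(letter + "+")
            if i + 1 == len(letters) or letters[i + 1] in VOWELS:
                out.append(CARRIER)
    return " ".join(out)


def collation_key(encoding):
    """Sort key by pronunciation: drop carriers and joiners, strip tehta marks."""
    return "".join(
        token.strip("+_")
        for token in encoding.split()
        if token not in (CARRIER, "ZWJ")
    )


animal = list("animal")
for style in STYLES:
    text = encode(animal, style)
    print(style, "|", text, "|", collation_key(text))
```

All three styles are produced from the same letter sequence without reordering, and all three reduce to the same sort key once carriers are ignored.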

      Disadvantage:
      - requires fonts to be able to place a mark over the following
        character rather than the preceding one.

      I doubt that the font issue is a serious problem when using OpenType
      or similar. Even a very simple font can treat the preceding tehtar as
      zero-width overlays that extend to the right (as usual this doesn't
      take account of character widths or heights, but it's an acceptable
      fallback).

      I think it's also significant that one of the main original purposes
      of Tengwar was as a way of exploring script and language structure
      (whether of fictional languages or real ones). If the encoding
      doesn't reflect that structure, then what's the point?

      --
      David Hopwood <david.hopwood@...>

      Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
      RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
      Nothing in this message is intended to be legally binding. If I
      revoke a public key but refuse to specify why, it is because the
      private key has been seized under the Regulation of Investigatory
      Powers Act; see www.fipr.org/rip
    • gildir_1
      Message 2 of 11 , Jan 6, 2002
        --- In elfscript@y..., "david_hopwood2002" <david.hopwood@z...> wrote:

        I don't understand why there should be two flavours of tehtar,
        one 'preceding' and one 'following'. They are the same.

        And shouldn't the /ae/ ligature in the example be transcribed
        with the /ae/ tehta, i.e. an upside-down a-tehta?

        Suilaid,
        Gildir, Per Lindberg
      • Anthony J. Bryant
        Message 3 of 11 , Jan 6, 2002
          gildir_1 wrote:

          > --- In elfscript@y..., "david_hopwood2002" <david.hopwood@z...> wrote:
          >
          > I don't understand why there should be two flavours of tehtar,
          > one 'preceding' and one 'following'. They are the same.
          >

          But the *languages* using them *aren't*.


          Tony
        • gildir_1
          Message 4 of 11 , Jan 7, 2002
            --- In elfscript@y..., "Anthony J. Bryant" <ajbryant@i...> wrote:
            > gildir_1 wrote:
            >
            > > --- In elfscript@y..., "david_hopwood2002" <david.hopwood@z...> wrote:
            > >
            > > I don't understand why there should be two flavours of tehtar,
            > > one 'preceding' and one 'following'. They are the same.
            > >
            >
            > But the *languages* using them *aren't*.

            So?

            By the way, Tolkien used the tengwar for more than two
            languages. And what about other (real-world) languages?
            I don't think there should be one flavour of tehtar (or,
            for that matter, tengwar) for each language, nor for any
            (necessarily arbitrary) group of languages.

            Any standardisation work should take into consideration
            *all* attested specimina of tengwar written by Tolkien.
            An attempt to make a list is here:

            http://www.forodrim.org/daeron/mdtci.html

            However, there is probably a large amount of tengwar
            written by Tolkien (such as his diary) that is still
            unpublished, so any attempt at standardization should take
            that into consideration as well.

            Suilaid!
            Gildir, Per Lindberg
          • Måns Björkman
            Message 5 of 11 , Jan 7, 2002
              Gildir wrote (with reference to
              <http://www.evertype.com/standards/iso10646/pdf/tengwar.pdf>,
              where Michael Everson writes: "In Old English, the vowel preceding a
              consonant is written above it, and following a consonant is written
              after it: [...] þære 'of it'."):

              > And shouldn't the /ae/ ligature in the example be transcribed
              > with the /ae/ tehta, i.e. an upside-down a-tehta?


              No, that tehta is not used in the Old English mode described in The
              Notion Club Papers. Michael's description is based on the first version
              of Lowdham's manuscript, where /æ/ is represented by what is normally
              used as the a-tehta.

              In the second version, /æ/ is represented by the double dots instead
              (the Quenya y-tehta), and the reading order is changed so that the
              tengwa is read first, then the superscripted tehta, then the subscripted
              tehta. I recall commenting on this when the Unicode proposal was last
              discussed on this list back in '00. Apparently Michael opted to keep the
              reading of the first version.

              Yours,
              Måns

              --
              Måns Björkman "Mun þu mik!
              Störtloppsvägen 8, III Man þik.
              SE-129 46 Hägersten Un þu mer!
              Sweden http://hem.passagen.se/mansb An þer."
            • John Cowan
              Message 6 of 11 , Jan 7, 2002
                gildir_1 wrote:


                > I don't understand why there should be two flavours of tehtar,
                > one 'preceding' and one 'following'. They are the same.

                They are semantically the same, but differ in rendering. Using the
                preceding versions, fonts will know to place the tehta on the next
                tengwa, whereas using the following versions, the tehta will be placed
                on the previous tengwa. Both of these can be achieved easily
                by ligature tables.

                The standard Unicode order is always following, which means
                that a word like "sindarin" would have to be encoded
                "snidrani" in order to be viewed correctly.

                --
                Not to perambulate || John Cowan <jcowan@...>
                the corridors || http://www.reutershealth.com
                during the hours of repose || http://www.ccil.org/~cowan
                in the boots of ascension. \\ Sign in Austrian ski-resort hotel
              • Måns Björkman
                Message 7 of 11 , Jan 8, 2002
                  John Cowan wrote:

                  >
                  > The standard Unicode order is always following, which means
                  > that a word like "sindarin" would have to be encoded
                  > "snidrani" in order to be viewed correctly.


                  Is that so? Would not "snidrani" place the tehtar over the following
                  tengwa? The standard placement in Quenya is over the *previous* tengwa
                  (and the word "Sindarin" is Quenya).

                  Yours,

                  Måns



                  --
                  Måns Björkman "Mun þu mik!
                  Störtloppsvägen 8, III Man þik.
                  SE-129 46 Hägersten Un þu mer!
                  Sweden http://hem.passagen.se/mansb An þer."
                • John Cowan
                  Message 8 of 11 , Jan 9, 2002
                    Måns Björkman scripsit:

                    > (and the word "Sindarin" is Quenya).

                    Ah. In that case, what is the Sindarin word for "Sindarin"?

                    --
                    John Cowan http://www.ccil.org/~cowan cowan@...
                    Please leave your values | Check your assumptions. In fact,
                    at the front desk. | check your assumptions at the door.
                    --sign in Paris hotel | --Miles Vorkosigan
                  • Måns Björkman
                    Message 9 of 11 , Jan 9, 2002
                      John Cowan wrote:

                      > Måns Björkman scripsit:
                      >
                      >
                      >>(and the word "Sindarin" is Quenya).
                      >>
                      >
                      > Ah. In that case, what is the Sindarin word for "Sindarin"?


                      Debatable. According to _Quendi and Eldar_, "Their own language was the
                      only one that they ever heard; and they needed no word to distinguish
                      it, nor to distinguish themselves." Certainly some designation must have
                      been created after the arrival of the Noldor, but we do not know what it
                      was. It has been suggested that the Sindar referred to their language
                      simply as _Edhellen_, "Elvish".

                      Måns


                      --
                      Måns Björkman "Mun þu mik!
                      Störtloppsvägen 8, III Man þik.
                      SE-129 46 Hägersten Un þu mer!
                      Sweden http://hem.passagen.se/mansb An þer."
                    • Gildor Inglorion
                      Message 10 of 11 , Jan 9, 2002

                        teithant John Cowan

                        > Ah.  In that case, what is the Sindarin word for "Sindarin"

                        * thindren  :))))



                      • Michael Everson
                        Message 11 of 11 , Aug 11 1:37 AM
                          --- In elfscript@yahoogroups.com, "david_hopwood2002" <david.hopwood@z...> wrote:

                          > An alternative (fourth) approach encodes the tehtar twice: a set of
                          > "preceding tehtar", and a set of "following tehtar". Equivalently,
                          > the vowels can be thought of as being in three sets: Quenya-style,
                          > Sindarind-style and Beleriand-style (the correspondance between these
                          > sets should be reflected in the encoding).
                          >
                          > Here are some of Michael Everson's examples encoded using this
                          > approach:
                          >
                          > +e = following tehta above
                          > _e = following tehta below
                          > e+ = preceding tehta above
                          > e = full vowel
                          >
                          > language/style word encoding
                          > -----------------------------------------------------
                          > Quenya nelde n +e ld +e
                          > Quenya neltildi n +e l t +i ld +i
                          > Sindarin neled n e+ l e+ d
                          > Sindarin nelthil n e+ l th i+ l
                          > Beleriand neled n e l e d
                          > Beleriand nelthil n e l th i l
                          > English/Quenya animal ^ +a n +i m +a l
                          > English/Sindarin animal a+ n i+ m a+ l
                          > English/Beleriand animal a n i m a l
                          > Old English mihton m i+ h ZWJ t o+ n
                          > Old English <thorn><ae>re th ae+ r _e
                          >
                          > [A much nicer .gif illustration of this is at
                          > <http://www.users.zetnet.co.uk/hopwood/unicode/tengwar.gif>.]

                          This won't work. If the tehtar are defined as combining marks, they MUST
                          follow a base character. Thus

                          ee-
                          nld = neled i.e. ne-le-d; writing

                          -ee
                          nld = nlede i.e. n-le-de

                          Though of course you can read (pronounce) this any way you like; Unicode
                          does not allow a combining character to attach itself to letters that
                          follow it.

                          Yes, this is a bit of a problem for input and searching given most
                          people's expectations. Yes, things like thære are a problem.
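
The two small diagrams in the message above can be read off mechanically. The following is a small sketch, assuming one mark slot per base letter with `-` meaning "no tehta"; the function name `logical_order` is hypothetical. It gives the stored order when every combining mark has to follow the base it is drawn on.

```python
def logical_order(bases, marks):
    """Interleave base letters with the marks drawn above them.

    bases: the row of tengwar, e.g. "nld"
    marks: the row above it, one slot per base, "-" for an empty slot
    """
    if len(bases) != len(marks):
        raise ValueError("need exactly one mark slot per base letter")
    out = []
    for base, mark in zip(bases, marks):
        out.append(base)          # the base always comes first
        if mark != "-":
            out.append(mark)      # the mark follows the base it sits on
    return "".join(out)


print(logical_order("nld", "ee-"))
print(logical_order("nld", "-ee"))
```

The same three tengwar give two different stored strings depending only on where the marks sit, which is why the reading order has to be fixed by the mode rather than by the encoding under this model.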