
Re: character sets

  • Antoine J. Mechelynck
    Message 1 of 24 , Feb 1, 2004
      ----- Original Message -----
      From: "jonah" <jonahgoldstein@...>
      To: "vim" <vim@...>
      Sent: Sunday, February 01, 2004 8:36 PM
      Subject: character sets


      > Hi,
      >
      > I'm curious about which characters can be displayed in vim, and which
      > settings affect this.
      >
      > For example, I cut and pasted some html code into vim to create a new
      > html page, and the original code contained smart quotes. The two
      > characters (both left and right smart quote) display in vim as thick
      > black boxes. However, when I view the new html page they display
      > properly, as smart quotes. What is preventing them from displaying
      > properly within vim? Is this the character set I'm using? The font
      > I'm using?
      >
      > Thanks for any info,
      > Jonah
      >

      If you can't display a character properly, there may be several reasons:

      o The character must exist in your current 'encoding' (global).

      o The character must exist in your current 'fileencoding' (local to
      buffer).

      o If you're using the GUI, your current 'guifont' (global) must have a
      glyph for it.

      o If you are not using the GUI, your console terminal must be set to the
      proper display codepage.

      o Your current 'fileencoding' must correspond to the character set which
      was used to save the file.

      see, among others,
      :help 'encoding'
      :help 'fileencoding'
      :help 'fileencodings'
      :help 'termencoding'
      :help 'guifont'
      :help :language

      HTH,
      Tony.
    • Jonathan D Johnston
      Message 2 of 24 , Feb 3, 2004
        On Sun, 1 Feb 2004 11:36:37 -0800,
        "jonah" <jonahgoldstein@...> wrote:
        [...]
        > For example, I cut and pasted some html code into vim to create a new
        > html page, and the original code contained smart quotes. The two
        > characters (both left and right smart quote) display in vim as thick
        > black boxes. However, when I view the new html page they display
        > properly, as smart quotes. What is preventing them from displaying
        > properly within vim? Is this the character set I'm using? The font
        > I'm using?

        Hi Jonah,

        What are the hex values for these "smart quotes"? In Vim, place the
        cursor on one of the black boxes and type
        :ascii
        or
        ga

        Are the hex values 0x93 (left quote) & 0x94 (right quote)? If so, these
        are Microsoft specific characters - they're not defined in latin1,
        Unicode, or any other standard character set. They should *not* be used
        on the WWW; only those browsing from a M$ OS will be able to see them.

        Last I knew, there was a good discussion about these & other M$ specific
        characters on John Walker's website. I don't have the URL, but try
        searching for
        John Walker demoroniser fourmilab
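A quick Python sketch of what those bytes are: 0x93 and 0x94 belong to Windows codepage 1252, and decoding them yields the curly quotes U+201C and U+201D (so the characters themselves do exist in Unicode; it is the single-byte codes that are Windows-specific):

```python
# Decode the cp1252 "smart quote" bytes into their Unicode codepoints.
raw = b"\x93Hello\x94"              # bytes as pasted from Windows
text = raw.decode("cp1252")
print(text)                         # “Hello”
print(hex(ord(text[0])))            # 0x201c
print(hex(ord(text[-1])))           # 0x201d
```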

        HTH,
        Jonathan D Johnston

        ________________________________________________________________
        The best thing to hit the Internet in years - Juno SpeedBand!
        Surf the Web up to FIVE TIMES FASTER!
        Only $14.95/ month - visit www.juno.com to sign up today!
      • Antoine J. Mechelynck
        Message 3 of 24 , Feb 3, 2004
          Jonathan D Johnston <jdjohnston2@...> wrote:
          > On Sun, 1 Feb 2004 11:36:37 -0800,
          > "jonah" <jonahgoldstein@...> wrote:
          > [...]
          > > For example, I cut and pasted some html code into vim to create a
          > > new html page, and the original code contained smart quotes. The
          > > two characters (both left and right smart quote) display in vim as
          > > thick black boxes. However, when I view the new html page they
          > > display properly, as smart quotes. What is preventing them from
          > > displaying properly within vim? Is this the character set I'm
          > > using? The font I'm using?
          >
          > Hi Jonah,
          >
          > What are the hex values for these "smart quotes"? In Vim, place the
          > cursor on one of the black boxes and type
          > :ascii
          > or
          > ga
          >
          > Are the hex values 0x93 (left quote) & 0x94 (right quote)? If so,
          > these are Microsoft specific characters - they're not defined in
          > latin1, Unicode, or any other standard character set. They should
          > *not* be used on the WWW; only those browsing from a M$ OS will be
          > able to see them.
          >
          > Last I knew, there was a good discussion about these & other M$
          > specific characters on John Walker's website. I don't have the URL,
          > but try searching for
          > John Walker demoroniser fourmilab
          >
          > HTH,
          > Jonathan D Johnston
          >

          Everything can be represented in Unicode (or will be), i.e., not only
          Cuneiform, Klingon and Fëanorian letters (to name just a few), but also
          glyphs otherwise unknown to anyone other than Microsoft users.

          W98 Notepad's usual font shows characters 148 and 149 decimal as black
          boxes; when entered (cf. ":help i_CTRL-V_digit") in gvim with the
          default (latin1) 'encoding' and then imported via the clipboard into a
          gvim with utf-8 'encoding', I get glyphs that are not symmetrical to
          each other, at codepoints U+201D and U+2022.

          I suspect what the OP calls smart quotes are what are better known as
          French quotes, or double angled brackets (« and »), codepoints U+00AB
          and U+00BB, but of course I can't be sure.

          Regards,
          Tony.
        • François Pinard
          Message 4 of 24 , Feb 3, 2004
            [Antoine J. Mechelynck]

            > Everything can be represented in Unicode (or will be), [...]

            Unicode is some kind of mirage for too many of us. In the quote above,
            "Everything" is much more diversified than one may think, and "will be"
            widely underestimates all the politics involved behind Unicode. Things
            are neither so pure nor so simple in practice.

            One of the problems behind the mirage is all those induced religious
            or fanatic feelings towards Unicode. Do not read me as anti-Unicode:
            there are good things in there, and I'm glad to see that it is acquiring
            better support, slowly, a bit everywhere. I support it in my little
            things whenever reasonable to do so. But the exaggerated hopes conveyed
            with Unicode also do a lot of damage, if only because many people stop
            looking for other or better avenues, and just stand still.

            P.S. - About Klingon :-), I read (but did not check) that it was removed
            in some later version of Unicode. Maybe it has been moved elsewhere?

            --
            François Pinard http://www.iro.umontreal.ca/~pinard
          • Antoine J. Mechelynck
            Message 5 of 24 , Feb 3, 2004
              ----- Original Message -----
              From: "François Pinard" <pinard@...>
              To: "Antoine J. Mechelynck" <antoine.mechelynck@...>
              Cc: <jonahgoldstein@...>; "Jonathan D Johnston"
              <jdjohnston2@...>; <vim@...>
              Sent: Tuesday, February 03, 2004 9:00 PM
              Subject: Re: character sets


              > [Antoine J. Mechelynck]
              >
              > > Everything can be represented in Unicode (or will be), [...]
              >
              > Unicode is some kind of mirage for too many of us. In the quote above,
              > "Everything" is much more diversified than one may think, and "will be"
              > widely underestimate all the politics involved behind Unicode. Things
              > are neither so pure nor so simple in practice.
              >
              > One of the problems behind the mirage is all those induced religious
              > or fanatic feelings towards Unicode. Do not read me as anti-Unicode,
              > there are good things in there, and I'm glad to see that it is acquiring
              > better support, slowly, a bit everywhere. I support it in my little
              > things whenever reasonable to do so. But the exaggerated hopes conveyed
              > with Unicode also do a lot of damage, would it be only because many
              > people stop looking for other or better avenues, and just stand still.
              >
              > P.S. - About Klingon :-), I read (but did not check) that it was removed
              > in some later version of Unicode. Maybe it has been moved elsewhere?
              >
              > --
              > François Pinard http://www.iro.umontreal.ca/~pinard
              >

              Removed? I thought one of the basic tenets of Unicode was that nothing would
              ever be removed? There goes another illusion. Well, replace it by Angerthas
              or Maya, at your choice.

              As for the more basic question -- I use Unicode, but not for _everything_.
              However, for some applications it is irreplaceable. (See my homepage
              http://users.skynet.be/antoine.mechelynck/ to see what I mean). And don't
              tell me I could have used entities instead of "charset=utf-8": it's true,
              but entities are essentially a 7-bit ASCII representation of Unicode, and an
              illegible one when used for any non-Latin writing system.

              Regards,
              Tony.
            • François Pinard
              Message 6 of 24 , Feb 3, 2004
                [Antoine J. Mechelynck]

                > I thought one of the basic tenets of Unicode was that nothing would
                > ever be removed? There goes another illusion.

                Unicode changed a lot since its inception. There has also been the
                influence of ISO 10646, and the practical convergence of both.

                The most widespread Unicode illusion is still, probably, about the
                1-1 correspondence between codes and characters. It requires some
                doing before a program can address the Nth character of an in-memory
                Unicode string in constant time: the representation used is usually
                _not_ pure Unicode. Some characters require combined forms to be
                produced, while others (or the same) exist pre-combined. UTF-16 has
                been integrated into the standard, so, quite apart from combining,
                some characters require two codes. Add overhead codes for
                directionality and other phenomena, and you are far from simplicity.
                Consider the many levels of conformance, and the many editions of
                the Unicode standard over the years, including the shuffling of
                massive code blocks between versions, and you have something really
                complex to implement and support.
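The UTF-16 point (some characters require two codes) can be seen directly: any codepoint beyond U+FFFF is stored as a surrogate pair. A short Python sketch:

```python
# A codepoint beyond U+FFFF needs two UTF-16 code units: a high and
# a low surrogate.  One character, two 16-bit codes.
clef = "\U0001D11E"                 # MUSICAL SYMBOL G CLEF
utf16 = clef.encode("utf-16-be")
print(len(clef))                    # 1 character
print(len(utf16) // 2)              # 2 UTF-16 code units
print(utf16.hex())                  # d834dd1e: surrogates D834, DD1E
```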

                Unicode is really a matter for specialists, and this is shocking,
                given that character handling should be the bread and butter of all
                computer programmers, whatever the country, rich or poor. No doubt
                creating a speciality also creates jobs for specialists, and a
                market if you can force your big standard all around. But you also
                condemn the less technical countries to colonialism and all the
                abuse that goes with it.

                Even in the richer countries, almost every Unicode application today
                is more or less broken, as almost nobody is able to support it as
                they should. No doubt people are all excited when they get some
                working UTF-8, and indeed, it is much fun seeing multilingual Web
                pages. The truth is that a lot of packages boldly state they support
                Unicode as soon as they handle 16-bit characters and have the usual
                UTF-8 and Latin-x conversions, but this is usually still quite far
                from the real thing.

                For one, I'm glad that Vim "supports" Unicode, that GTK supports it,
                that Python supports it, that Pango exists. That is good news, and
                even extremely good news when you have no alternative handy. It
                might require years before German and French speakers really leave
                ISO 8859-1 (or -15), and similarly for others. It will surely
                require a _lot_ of years before Americans really leave ASCII :-).
                Let's face it: without the Microsoft monopoly, Unicode would likely
                never make it in Asian and Eastern countries. Oh, it might become
                strong in countries not already in control of their software
                engineering, but Unicode will keep them captive.

                My point here is not to say that Unicode is bad, but only to stress
                a bit that it is not the wonder that some blindly think it is. And
                also, at the same time, to say that Unicode fanatics are dangerous
                people! :-)

                > However, for some applications it is irreplaceable. (See my homepage
                > http://users.skynet.be/antoine.mechelynck/ to see what I mean).

                Nice indeed, no doubt! Congratulations!

                > And don't tell me I could have used entities [...]

                I agree that entities are much abused in HTML. Even `&nbsp;',
                ubiquitous on the Web, would advantageously be replaced by the real
                thing! The likely reason for &nbsp; is that Vim is still not
                popular enough! :-)

                --
                François Pinard http://www.iro.umontreal.ca/~pinard
              • Antoine J. Mechelynck
                Message 7 of 24 , Feb 3, 2004
                  François Pinard <pinard@...> wrote:
                  > [Antoine J. Mechelynck]
                  >
                  > > I thought one of the basic tenets of Unicode was that nothing would
                  > > ever be removed? There goes another illusion.
                  >
                  > Unicode changed a lot since its inception. There has also been the
                  > influence of ISO 10646, and the practical convergence of both.
                  >
                  > The most widespread Unicode illusion is still, probably, about the
                  > 1-1 correspondence between codes and characters.

                  Unicode merely (!) ranks glyphs (and some control codes) on an integral
                  scale going from 0 to some large number. It can be represented electronically
                  in various ways, such as UTF-8 (from 1 to 6 bytes per codepoint in theory,
                  but no more than 4 "in any foreseeable future"), UTF-16 (1 or sometimes 2
                  16-bit words per codepoint), UTF-32 (32 bits per codepoint). The latter (in
                  either of its endian variants) is fixed-size but horribly wasteful of space.
                  That's 10 different encodings if you take endianness and presence or absence
                  of a BOM into account. (Not including the proposals I've seen for mixed
                  endianness, using a BOM to set endianness at any point in the middle of a
                  UTF-16 text.)
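The size differences are easy to check; a small Python sketch comparing the three encoding forms (using Python's codec names for the big-endian variants):

```python
# One codepoint, three sizes: UTF-8 takes 1-4 bytes, UTF-16 takes
# 2 or 4, and UTF-32 always takes 4 (fixed-size but wasteful).
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}:",
          len(ch.encode("utf-8")),
          len(ch.encode("utf-16-be")),
          len(ch.encode("utf-32-be")))

# The BOM variants simply prepend a byte-order mark (U+FEFF):
print("\ufeffA".encode("utf-16-be").hex())  # feff0041
```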

                  > It requires some
                  > doing before a program can address the N'th character of an in-memory
                  > Unicode string in constant time: the used representation is usually
                  > _not_ pure Unicode.

                  I suppose you mean a UTF-8 string, and I agree. Finding the Nth character of
                  an ASCII string is an addressing matter: add N to the start address of the
                  string, possibly check for out-of-bounds, that's it. Finding the Nth
                  character in a UTF-8 string requires examining the lead byte of each byte
                  sequence in turn to determine where the next character starts. And that's
                  before skipping (or not) combining characters, zero-width characters and/or
                  control characters. I've seen some texts which seem to imply that all
                  Unicode should move toward UTF-32 (where addressing the Nth character of a
                  string _is_ straightforward, at least if combining characters and control
                  characters are counted separately) but somehow I'm not convinced.
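A sketch of that scan in Python: in UTF-8, continuation bytes all match the bit pattern 10xxxxxx, so counting characters means skipping them (the function name here is illustrative):

```python
# Finding the Nth character in UTF-8 means walking the bytes:
# continuation bytes match 10xxxxxx and are skipped; every other
# byte starts a new character.
def nth_char_offset(buf: bytes, n: int) -> int:
    """Byte offset of the n-th (0-based) character in UTF-8 data."""
    count = -1
    for i, b in enumerate(buf):
        if b & 0xC0 != 0x80:        # not a continuation byte
            count += 1
            if count == n:
                return i
    raise IndexError(n)

data = "héllo".encode("utf-8")      # 'é' occupies two bytes
print(nth_char_offset(data, 2))     # 3: 'l' starts at byte 3, not 2
```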

                  > Some characters require combined forms for being
                  > produced, while others (or the same) exist pre-combined. UTF-16 has
                  > been integrated into the standard, so irrelevant to combining, some
                  > characters require two codes. Add overhead codes for directionality,
                  > and other phenomena, you are away from simplicity. Consider many
                  > levels of conformance, and many editions of the Unicode standard over
                  > the years, including shuffling of massive code blocks between
                  > versions, you now have something really complex to implement and
                  > support.

                  Yeah. Just understanding it (or trying to) requires poring through sheaves of
                  nebulous, verbose documentation. (I won't say it isn't precise in its
                  long-winded way; what I'm saying is that it isn't easy to fathom.) Then
                  someone has to implement it (or try to).
                  >
                  > Unicode is really a matter of specialists, and this is shocking, given
                  > that character handling should be bread and butter of all computer
                  > programmers, whatever the country is, rich or poor.

                  I seem to have become a kind of Unicode specialist for Vim at the How-To and
                  scripting level, just because nobody wanted to do it, but I'm not gonna
                  claim I understand the system. Just that I know where to look in Vim's
                  documentation, or what settings to tweak, to make it work (somehow). I'm
                  sure there are other people hereabouts who understand it better than I.

                  It does take some getting used to. (Well, in CJK countries it takes all of
                  grade school before people "know their letters". At least you can read the
                  paper, or even browse the Web, without knowing what Unicode is all about.)
                  Let's say I'm operating at the grade-school teacher level, far from those
                  highbrow types who program patches to get data "the right way" from Vim to
                  the W32 or X11 clipboard and vice-versa in all possible cases of Vim
                  'encoding' and OS locale.

                  > No doubt that
                  > creating a speciality also creates jobs for specialist, and a market
                  > if you can force your big standard all around. But you also condemn
                  > the less technical countries to colonialism and all the abuse going
                  > with it.
                  >
                  > Even in the richer countries, almost every Unicode application today
                  > is more or less broken, as about nobody is able to support it as it
                  > should. No doubt that people are all excited when they get some
                  > working UTF-8, and indeed, it is much fun seeing multi-lingual Web
                  > pages. The truth is that a lot of packages boldly state they support
                  > Unicode, as soon
                  > as they handle 16-bit characters and have the usual UTF-8 and Latin-x
                  > conversions, but this is usually still quite far from the real thing.

                  Vim (well, gvim) seems to me to be handling Unicode pretty well, compared to
                  some other programs I use. It could be considered "broken" in that it
                  rejects neither overlong sequences nor invalid codes, but in an editor that
                  sort of "brokenness" can IMHO be regarded as a quality rather than a
                  blemish.

                  >
                  > For one, I'm glad that Vim "supports" Unicode, that GTK supports it,
                  > that Python supports it, that Pango exists.

                  ...that WordPad supports it (not as well as Vim unless you want proportional
                  fonts and true bidirectionality), that most web browsers understand it,
                  though not always perfectly (just try to display vocalised Arabic text in
                  Netscape 7 and you'll find out that combining characters don't combine)...

                  > Those are good news, and
                  > even extremely good news when you have no alternative handy. It might
                  > require years before German and French really leave ISO 8859-1 (or
                  > -15), and similarly for others. It will surely require a _lot_ of
                  > years before American really leave ASCII :-).

                  Well, 7-bit ASCII is left unchanged under UTF-8 isn't it? And since there
                  are no accented letters in English, except in non-English proper names and
                  in some non-assimilated foreign words like risqué, omertà, garçon, etc. ...

                  > Let's face it: without
                  > Microsoft monopoly, Unicode would likely never make it in Asian and
                  > Eastern countries. Oh, it might become strong in countries not
                  > already in control of their software engineering, Unicode will keep
                  > them captive.

                  Hm. What is better? A plethora of national encodings (sometimes 2 or 3 for a
                  single language in a single country), or a common standard? I am somewhat
                  reminded of all the sorts of leagues, yards, barrels, pounds, etc. that
                  existed before an autocratic act of the French legislative body established
                  the metric system. (And to know if your national metric standard of mass is
                  up to any good you still have to arrange to have it compared, usually not
                  directly, with a certain cylinder of platinum-iridium in Sèvres, France.)
                  >
                  > My point here is not to say that Unicode is bad, but only to stress a
                  > bit that it is not the wonder that some blindly think it is. And
                  > also, on the same blow, to say that Unicode fanatics are dangerous
                  > people! :-)

                  All fanatics are dangerous, and even more so are the power-hungry who feed
                  them lies to keep them ignorant and fanatic. Yet who is going to say
                  nowadays that the metric system, or the Gregorian calendar (established by
                  papal decree) are bad? Or that there are "metric fanatics" and "Gregorian
                  fanatics"? Oh, there are some, and I know where to look...

                  >
                  > > However, for some applications it is irreplaceable. (See my homepage
                  > > http://users.skynet.be/antoine.mechelynck/ to see what I mean).
                  >
                  > Nice indeed, no doubt! Congratulations!
                  [...]

                  Thanks.

                  Best regards,
                  Tony.
                • Matthew Winn
                  Message 8 of 24 , Feb 4, 2004
                    On Wed, Feb 04, 2004 at 06:02:05AM +0100, Antoine J. Mechelynck wrote:
                    > That's 10 different encodings if you take endianness and presence or absence
                    > of a BOM into account. (Not including the proposals I've seen for mixed
                    > endianness, using a BOM to set endianness at any point in the middle of a
                    > UTF-16 text.)

                    What would be the point of that? Endianness is just a feature of the
                    way the hardware stores the bits.

                    > François Pinard <pinard@...> wrote:
                    > > For one, I'm glad that Vim "supports" Unicode, that GTK supports it,
                    > > that Python supports it, that Pango exists.
                    >
                    > ...that WordPad supports it (not as well as Vim unless you want proportional
                    > fonts and true bidirectionality), that most web browsers understand it,
                    > though not always perfectly (just try to display vocalised Arabic text in
                    > Netscape 7 and you'll find out that combining characters don't combine)...

                    There's also Perl and Java. Unicode support is getting reasonably good
                    in software. The biggest problem appears to be the availability of good
                    fonts: too often you find you can handle Unicode with no trouble at all
                    right up to the moment you want someone to be able to read it.

                    > > Those are good news, and
                    > > even extremely good news when you have no alternative handy. It might
                    > > require years before German and French really leave ISO 8859-1 (or
                    > > -15), and similarly for others. It will surely require a _lot_ of
                    > > years before American really leave ASCII :-).
                    >
                    > Well, 7-bit ASCII is left unchanged under UTF-8 isn't it? And since there
                    > are no accented letters in English, except in non-English proper names and
                    > in some non-assimilated foreign words like risqué, omertà, garçon, etc. ...

                    Accents are also used in a few cases to indicate that a sequence of
                    vowels should be pronounced separately, as in words like naïve or names
                    like Zoë. However, some Americans do seem to be resisting the move
                    away from 7-bit, and I've occasionally seen complaints from those whose
                    software still can't decode quoted-printable text.
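Quoted-printable itself is simple enough to decode; a small Python sketch using the standard quopri module:

```python
import quopri

# Quoted-printable escapes non-ASCII bytes as =XX so they survive
# 7-bit mail gateways; decoding restores the raw bytes.
raw = quopri.decodestring(b"na=C3=AFve")   # UTF-8 'naïve', QP-encoded
print(raw.decode("utf-8"))                 # naïve
```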

                    --
                    Matthew Winn (matthew@...)
                  • Bram Moolenaar
                    Message 9 of 24 , Feb 4, 2004
                      François Pinard wrote:

                      > The most widespread Unicode illusion is still, probably, about the
                      > 1-1 correspondence between codes and characters. It requires some
                      > doing before a program can address the N'th character of an in-memory
                      > Unicode string in constant time: the used representation is usually
                      > _not_ pure Unicode. Some characters require combined forms for being
                      > produced, while others (or the same) exist pre-combined. UTF-16 has
                      > been integrated into the standard, so irrelevant to combining, some
                      > characters require two codes.

                      A few words from the implementation side: That there is no direct
                      mapping from "the Nth character" to a byte index has not much to do with
                      Unicode but with the nature of the characters. Most Asian encodings
                      have the same problem, only they compensate for that by making
                      characters twice as wide at the same time, thus at least there is a
                      one-to-one mapping with display space. That method breaks when you run
                      out of space in two-byte codes or have combining characters.
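The "twice as wide" bookkeeping can be probed with Python's unicodedata module, which exposes the East Asian Width property:

```python
import unicodedata

# CJK ideographs are "wide" (two display cells), which is what made
# the old two-bytes-two-columns bookkeeping line up; combining marks
# have no width of their own and break it.
print(unicodedata.east_asian_width("A"))       # Na (narrow)
print(unicodedata.east_asian_width("\u6f22"))  # W  (wide): 漢
```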

                      UTF-16 is generally looked upon as a bad thing that can't be avoided.
                      Some people (OK, let's say MS) started using 16-bit characters
                      everywhere, and later found out they couldn't fit everything in 16 bits
                      and also could not switch to more bits without breaking all existing
                      programs. If they had used 32 bits from the start, people would have
                      complained about a waste of memory, and that would have stopped a lot
                      of people from using it. If only they had invented UTF-8 back then...

                      Vim uses UTF-8, which is the best choice among the Unicode encodings
                      (from a programmer's perspective). It has all the properties you want
                      (e.g., simple recognition of character boundaries), and still handles
                      ASCII in one byte. That's why Vim uses UTF-8 internally and converts
                      all other Unicode encodings to it.

                      > Add overhead codes for directionality, and other phenomena, you are
                      > away from simplicity.
                      [...]
                      > Unicode is really a matter of specialists, and this is shocking, given
                      > that character handling should be bread and butter of all computer
                      > programmers, whatever the country is, rich or poor.

                      Composing (aka combining) characters are already difficult to handle.
                      But they are required for a few languages (Hebrew, Thai); any encoding
                      for those languages would have the same problem.
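The difficulty with combining characters shows up even in equality tests; a short Python sketch with a precombined and a combining form of the same letter:

```python
import unicodedata

# The same accented letter can be one precombined codepoint or a base
# letter plus a combining mark; they render alike but compare unequal
# until normalized.
precombined = "\u00e9"      # é as a single codepoint
combined = "e\u0301"        # 'e' + COMBINING ACUTE ACCENT
print(precombined == combined)                                # False
print(unicodedata.normalize("NFC", combined) == precombined)  # True
```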

                      Bidirectionality is extremely difficult. That's why it has not been
                      implemented in Vim yet. This is something I wish they would have put
                      outside of Unicode: Let the characters be ordered as they are to be
                      displayed, that's much simpler. The complexity is then only in
                      manipulating the text, not in displaying or cursor positioning.

                      As I understand it, the decision mostly frowned upon is the unification
                      of Asian characters. This requires marking text as "Chinese" or
                      "Japanese", otherwise you don't know how to display the text properly.
                      This can be compared to (more or less) reading English text with old
                      German characters. It's possible, but you read it letter by letter.
                      I don't know if this will prevent the use of Unicode in Asian countries,
                      since the situation with the different character sets isn't any better
                      (the two-byte encodings cannot be recognized automatically).

                      > No doubt that creating a speciality also creates jobs for specialist,
                      > and a market if you can force your big standard all around. But you
                      > also condemn the less technical countries to colonialism and all the
                      > abuse going with it.

                      It's the nature of the languages that makes it complicated, not Unicode.
                      It certainly has nothing to do with colonialism; most of the work on
                      Unicode was done by non-western people. Americans and Europeans don't
                      know much about these things :-).

                      --
                      hundred-and-one symptoms of being an internet addict:
                      38. You wake up at 3 a.m. to go to the bathroom and stop and check your e-mail
                      on the way back to bed.

                      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                      /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
                      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                      \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
                    • Antoine J. Mechelynck
                      Message 10 of 24 , Feb 4, 2004
                        Matthew Winn <matthew@...> wrote:
                        > On Wed, Feb 04, 2004 at 06:02:05AM +0100, Antoine J. Mechelynck wrote:
                        > > That's 10 different encodings if you take endianness and presence
                        > > or absence of a BOM into account. (Not including the proposals I've
                        > > seen for mixed endianness, using a BOM to set endianness at any
                        > > point in the middle of a UTF-16 text.)
                        >
                        > What would be the point of that? Endianness is just a feature of the
                        > way the hardware stores the bits.

                        All Unicode encodings except UTF-8 can be either little-endian or
                        big-endian. In the world of the Internet, files will be shared between computers
                        whose hardware may be of different endianness. This sharing does not
                        automagically translate the file, any more than a Russian book becomes a
                        French book when I (a native French speaker) put it on my shelf (There goes
                        the POSIX requirement that encodings be determined computer-by-computer by
                        means of the locale and not file-by-file by means of "magic"). This said, I
                        don't see a purpose (other than allowing thoughtless use of the
                        concatenation program) for mixed-encoding files either. Just mentioned it in
                        passing for the sake of completeness.
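Tony's point that endianness is a property of the shared byte stream, not just of the hardware, can be sketched in Python (the BOM check is illustrative; Python's plain "utf-16" codec writes a BOM in the machine's native order):

```python
text = "Zoë"
be = text.encode("utf-16-be")
le = text.encode("utf-16-le")
print(be.hex(), le.hex())  # same 16-bit code units, bytes swapped pairwise

# A BOM (U+FEFF) at the start of the stream tells the reader which
# byte order the writer used.
data = text.encode("utf-16")  # Python prepends a BOM here
if data.startswith(b"\xfe\xff"):
    print("big-endian")
elif data.startswith(b"\xff\xfe"):
    print("little-endian")
```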

                        >
                        > > François Pinard <pinard@...> wrote:
                        > > > For one, I'm glad that Vim "supports" Unicode, that GTK supports
                        > > > it,
                        > > > that Python supports it, that Pango exists.
                        > >
                        > > ...that WordPad supports it (not as well as Vim unless you want
                        > > proportional fonts and true bidirectionality), that most web
                        > > browsers understand it, though not always perfectly (just try to
                        > > display vocalised Arabic text in Netscape 7 and you'll find out
                        > > that combining characters don't combine)...
                        >
                        > There's also Perl and Java. Unicode support is getting reasonably
                        > good
                        > in software. The biggest problem appears to be the availability of
                        > good fonts: too often you find you can handle Unicode with no trouble
                        > at all
                        > right up to the moment you want someone to be able to read it.
                        >
                        > > > Those are good news, and
                        > > > even extremely good news when you have no alternative handy. It
                        > > > might require years before German and French really leave ISO
                        > > > 8859-1 (or
                        > > > -15), and similarly for others. It will surely require a _lot_ of
                        > > > years before American really leave ASCII :-).
                        > >
                        > > Well, 7-bit ASCII is left unchanged under UTF-8 isn't it? And since
                        > > there are no accented letters in English, except in non-English
                        > > proper names and in some non-assimilated foreign words like risqué,
                        > > omertà, garçon, etc. ...
                        >
                        > Accents are also used in a few cases to indicate that a sequence of
                        > vowels should be pronounced separately, as in words like naïve or
                        > names
                        > like Zoë. However, some Americans do seem to be resisting the move
                        > away from 7-bit, and I've occasionally seen complaints from those
                        > whose software still can't decode quoted-printable text.
                        >
                        > --
                        > Matthew Winn (matthew@...)

                        Well, Zoé falls in the category of what I would call "non-English proper
                        names" even if native English-speakers give that name to their daughters.
                        (Similarly Eönwë, which I have seen as a "Usenet handle".) Among
                        non-assimilated foreign common words I might add mañana (but not canyon,
                        English equivalent of Spanish cañón). Among proper names in English-speaking
                        countries originating with foreign words: Detroit (from French détroit =
                        strait, as in Strait of Dover) has lost its accent; I don't know whether
                        Bâton Rouge (from French, = red stick) still has one or not. Similarly
                        Montreal (English) vs. Montréal (French), also (IIUC) Dvorak with a caron
                        over the r for the musician but not for the computer specialist (or shall
                        we say ergologist?) etc. etc. etc.

                        Regards,
                        Tony.
                      • Tobias C. Rittweiler
                        Message 11 of 24 , Feb 4, 2004
                          On Wednesday, February 4, 2004 at 9:51:06 AM,
                          Matthew Winn <matthew@...> wrote:

                          > Accents are also used in a few cases to indicate that a sequence of
                          > vowels should be pronounced separately, as in words like naïve or names
                          > like Zoë.

                          That thingy is called a Trema. :-) However it may still fall into the
                          category of accents, I'm not sure.


                          -- tcr (tcr@...) ``Ho chresim'eidos uch ho poll'eidos sophos''
                        • Mikolaj Machowski
                          Message 12 of 24 , Feb 4, 2004
                            Dnia Wednesday 04 of February 2004 11:15, Bram Moolenaar napisał:
                            > Composing (aka combining) characters are already difficult to handle.
                            > But they are required for a few languages (Hebrew, Thai), any encoding
                            > for those languages would have the same problem.

                            Also needed for some math characters.

                            m.
                            --
                            LaTeX + Vim = http://vim-latex.sourceforge.net/
                            Vim-list(s) Users Map: (last change 1 Feb)
                            http://skawina.eu.org/mikolaj/vimlist
                            Are You There?
                          • Antoine J. Mechelynck
                            Message 13 of 24 , Feb 4, 2004
                              Tobias C. Rittweiler <tcr@...> wrote:
                              > On Wednesday, February 4, 2004 at 9:51:06 AM,
                              > Matthew Winn <matthew@...> wrote:
                              >
                              > > Accents are also used in a few cases to indicate that a sequence of
                              > > vowels should be pronounced separately, as in words like naïve or
                              > > names like Zoë.
                              >
                              > That thingy is called a Trema. :-) However it may still fall into the
                              > category of accents, I'm not sure.
                              >
                              >
                              > -- tcr (tcr@...) ``Ho chresim'eidos uch ho poll'eidos
                              > sophos''

                              In modern computer parlance it's also called a diaeresis, though more
                              precisely the diaeresis is the phonological fact of not running two vowels
                              together, while the trema is only the typographical sign, also used with a
                              different phonological value, as in the German umlaut (also in other
                              languages like Swedish, Hungarian or Turkish).

                              Anyway, when I said "accents" I actually meant "diacritical signs",
                              including not only the accents in touché, omertà and (I think) Bâton Rouge,
                              Louisiana, but also the cedilla in garçon and the tilde in mañana, so the
                              trema in naïve, Zoë (if that's the right spelling) and coöperate was meant
                              to be included.

                              Regards,
                              Tony.
                            • François Pinard
                              Message 14 of 24 , Feb 4, 2004
                                [Antoine J. Mechelynck]

                                > UTF-32 (32 bits per codepoint).

                                Only 31 bits in theory, yet of course, we allocate full words. :-)

                                > > It requires some doing before a program can address the N'th
                                > > character of an in-memory Unicode string in constant time: the used
                                > > representation is usually _not_ pure Unicode.

                                > I suppose you mean a UTF-8 string [...]

                                No, I really meant UCS-2 or UCS-4. Combining characters,
                                directionality marks, etc. make it difficult to access the N'th
                                character in constant time. If you work with UCS-2 internally, then you
                                also have to account for surrogate characters if using recent standards.
                                You have to invent your own coding for doing so, and while it is natural
                                that you base it on Unicode, it is not Unicode anymore internally,
                                strictly speaking.
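François's point about the N'th character can be sketched as follows; `utf8_char_offset` is a hypothetical helper written for illustration, not an API of Vim or Python:

```python
s = "héllo"
utf8 = s.encode("utf-8")

def utf8_char_offset(data, n):
    """Byte offset of the n-th character: skip over continuation bytes."""
    count, i = 0, 0
    while i < len(data):
        if count == n:
            return i
        i += 1
        # continuation bytes look like 10xxxxxx
        while i < len(data) and 0x80 <= data[i] < 0xC0:
            i += 1
        count += 1
    raise IndexError(n)

# Variable-width UTF-8: finding character N means scanning from the start.
print(utf8_char_offset(utf8, 2))  # 3, because "é" took two bytes
# Fixed-width UCS-4/UTF-32 would be a constant-time multiplication:
print(2 * 4)  # 8: byte offset of character 2 at four bytes per codepoint
```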

                                > I've seen some texts which seem to imply that all Unicode should
                                > move toward UTF-32 (where addressing the Nth character of a string
                                > _is_ straightforward, at least if combining characters and control
                                > characters are counted separately) but somehow I'm not convinced.

                                And you are right. Do not let fanatics convince you! :-) About
                                combining characters, many languages are well served by Unicode, among
                                which German and Vietnamese, say, as characters exist pre-combined, even
                                those needing two diacritical marks. However, a few years ago, Unicode
                                and W3C got together to state that no more pre-combining would get into
                                their standards, implying that all nations not powerful (or rich)
                                enough to get their needs satisfied early by Unicode are now doomed to
                                perpetual complexity in the realm of Unicode and the Web. Of course,
                                you and I, as French speaking people, are nearly fully satisfied with
                                the natural mapping between Latin-1 and Unicode, and have not much to
                                complain about. But not everybody on this planet had the same luck.
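The "natural mapping" François refers to is that Latin-1 byte values coincide with the first 256 Unicode codepoints; a one-line check in Python illustrates it:

```python
# Every Latin-1 byte value decodes to the Unicode codepoint with the
# same number, so converting Latin-1 text to Unicode is a trivial widening.
assert all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
print("Latin-1 bytes 0-255 map directly onto codepoints U+0000-U+00FF")
```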

                                > I seem to have become a kind of Unicode specialist for Vim at the
                                > How-To and scripting level,

                                And thanks for being there, we surely need such sources of information.

                                > Well, in CJK countries it takes all of grade school before people
                                > "know their letters".

                                On the other hand, Asian people are somewhat amused when they hear us
                                pompously label our smallish groups of 100 glyphs "character sets" :-).

                                > Vim (well, gvim) seems to me to be handling Unicode pretty well
                                > [...] [Vim] could be considered "broken" in that it rejects neither
                                > overlong sequences nor invalid codes, but in an editor that sort of
                                > "brokenness" can IMHO be regarded as a quality rather than a blemish.
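The "overlong sequences" mentioned in the quoted text encode a codepoint in more bytes than necessary; a strict decoder refuses them, as this illustrative Python sketch shows:

```python
# 0xC0 0xAF would decode to "/" (U+002F) if overlong forms were allowed,
# but U+002F must be encoded as the single byte 0x2F, so a strict
# decoder rejects the sequence as invalid.
overlong = b"\xc0\xaf"
try:
    overlong.decode("utf-8")
    print("accepted")
except UnicodeDecodeError:
    print("rejected as overlong/invalid")
```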

                                I'm not into it yet with Vim, but I surely have a strong presumption
                                that Vim is very usable in that area! It seems to me (still at a
                                distance) that Vim does its best to take advantage of the libraries and
                                facilities available; we could not reasonably ask for more. What Vim
                                does is surely difficult enough already.

                                > Well, 7-bit ASCII is left unchanged under UTF-8 isn't it?

                                Yes, in theory. Yet, there is a running polemic about whether the quote
                                (' - decimal 39) should obey ASCII, which states that it should be
                                bent to the right like an acute accent, or vertical like a typographical
                                apostrophe. Latin-1 has a proper acute accent in its second half, so
                                many Latin-1 fonts put decimal 39 back to vertical. Although Unicode
                                makes explicit its intent of being coincident with ASCII for the
                                first 128 positions, I see some well-known proponents of standards being
                                dissident on this one particular point. Some people ask that we
                                change our writing habits; others suggest that fonts should rather be
                                corrected. You see, nothing is simple! :-)

                                > Hm. What is better? A plethora of national encodings (sometimes 2 or 3
                                > for a single language in a single country), or a common standard?

                                There is a distance between the dream and practice. What we really have
                                now is a plethora of national encodings, _plus_ many Unicode standards.
                                National encodings are not going away. There is even an European trend,
                                in standards committees, for creating a flurry of new "handy" 8-bit
                                subsets from Unicode, just as a way to tame Unicode into something
                                more tractable. Besides, some countries still resent the forceful push
                                from the Unicode consortium towards Han unification (which was partly
                                justified by the technical limits of UCS-2 before UTF-16), and will
                                likely resist Unicode as a way to protect their culture from technology.
                                There are many smaller blunders that people told me, here and there.
                                Even if Unicode amends and evolves, some political damage will not be
                                easily forgiven. Unicode is more than a set of technological issues.

                                > Yet who is going to say nowadays that the metric system, or the
                                > Gregorian calendar (established by papal decree) are bad?

                                There are still a few non-Gregorian calendars on this planet! And the
                                metric system is seemingly not good enough yet for Americans! :-)

                                --
                                François Pinard http://www.iro.umontreal.ca/~pinard
                              • François Pinard
                                Message 15 of 24 , Feb 4, 2004
                                  [Matthew Winn]

                                  > The biggest problem appears to be the availability of good fonts:

                                  I'm not sure that's the biggest problem, but it is one problem. And
                                  besides fonts, we also lack widespread combining and directional
                                  engines (etc.) at display time.

                                  I think I read that Vim supports Pango, which is good news.

                                  --
                                  François Pinard http://www.iro.umontreal.ca/~pinard
                                • François Pinard
                                  Message 16 of 24 , Feb 4, 2004
                                    [Antoine J. Mechelynck]

                                    > [...] the diaeresis is the phonological fact of not running two vowels
                                    > together, while the trema is only the typographical sign [...]

                                    I thought that "trema" was not an English word, so I used "diaeresis"
                                    instead very systematically. Should I read that "diaeresis" has the
                                    meaning of French "diphtongue"? Interesting!

                                    --
                                    François Pinard http://www.iro.umontreal.ca/~pinard
                                  • Bram Moolenaar
                                    Message 17 of 24 , Feb 4, 2004
                                      François Pinard wrote:

                                      > > The biggest problem appears to be the availability of good fonts:
                                      >
                                      > I'm not sure that's the biggest problem, but it is one problem. And
                                      > besides fonts, we also lack of widespread combining and directional
                                      > engines (etc.) at display time.
                                      >
                                      > I think I read that Vim supports Pango, which is good news.

                                      Yeah, but it's not working very well. Combining characters are not
                                      drawn correctly. Moving the cursor back and forth over a line changes
                                      what it shows. Hopefully someone who knows Pango can look into this.
                                      The original author of this code has vanished.

                                      --
                                      The 50-50-90 rule: Anytime you have a 50-50 chance of getting
                                      something right, there's a 90% probability you'll get it wrong.

                                      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                                      /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
                                      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                                      \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
                                    • François Pinard
                                      Message 18 of 24 , Feb 4, 2004
                                        [Bram Moolenaar]

                                        > A few words from the implementation side: [...] That's why Vim uses
                                        > UTF-8 internally and converts all other Unicode encodings to it.

                                        Interesting, thanks!

                                        > Bidirectionality is extremely difficult.

                                        Moreover, there are scripts which use much more hairy paradigms.

                                        > If [MS] had used 32 bits from the start, people would have
                                        > complained about a waste of memory, and that would have stopped a lot of
                                        > people from using it.

                                        Microsoft easily imposed much worse things, and people did not stop
                                        using Microsoft. :-)

                                        > If only they had invented UTF-8 back then...

                                        I'm no specialist, but the first time I heard of UTF-8 was long ago,
                                        in the AT&T Plan9 project, where it was called FSS-UTF at the time.
                                        There were a few conceptual flaws that were straightened out in an
                                        appendix of an early draft of ISO 10646, and if I remember well, it
                                        was only later that UTF-8 made its way into Unicode. But I may
                                        remember wrongly...

                                        > As I understand it, the decision mostly frowned upon is the
                                        > unification of Asian characters. [...] Don't know if this will prevent
                                        > the use of Unicode in Asian countries [...]

                                        Some of these countries are now divided. Some people predict that
                                        Unicode will prevail in the long run, guessing that Microsoft will
                                        stick behind it long enough. It would be interesting to get statistics
                                        about whether Big5, JIS and all the others are effectively fading, or
                                        not at all! :-)

                                        > It's the nature of the languages that makes it complicated, not Unicode.

                                        I quite understand what you mean. The problem was already complex,
                                        Unicode merely adds its own set of solutions.

                                        > It certainly has nothing to do with colonialism [...]

                                        I worked a few times in African contexts, where many countries do not
                                        have their own standardised character sets. Many westerners come with
                                        their solutions, many of which are neither free nor simple. Unicode
                                        might not be the best route towards developing technical autonomy.

                                        --
                                        François Pinard http://www.iro.umontreal.ca/~pinard
                                      • Matthew Winn
                                        Message 19 of 24 , Feb 5, 2004
                                          On Wed, Feb 04, 2004 at 05:59:05PM +0100, Antoine J. Mechelynck wrote:
                                          > In modern computer parlance it's also called a diaeresis, though more
                                          > precisely the diaeresis is the phonological fact of not running two vowels
                                          > together, while the trema is only the typographical sign, also used with a
                                          > different phonological value as the German umlaut (also in other languages
                                          > like Swedish, Hungarian or Turkish).
                                          >
                                          > Anyway, when I said "accents" I actually meant "diacritical signs",
                                          > including not only the accents in touché, omertà and (I think) Bâton Rouge,
                                          > Louisiana, but also the cedilla in garçon and the tilde in mañana, so the
                                          > trema in naïve, Zoë (if that's the right spelling) and coöperate was meant
                                          > to be included.

                                          Whatever you call it, the reason I mentioned it was to disprove your
                                          claim that 7-bit suffices for English because

                                          > there are no accented letters in English

                                          As far as I'm aware the use of a diaeresis to mark a vowel as having
                                          a separate pronunciation is a standard part of English, not a foreign
                                          import. Other Germanic languages do the same thing. Non-ASCII
                                          characters aren't common in English but they do exist.

                                          And then there's the issue of the pound sign, which certainly can't be
                                          represented in 7-bit. Years ago the workaround was to use an alternate
                                          character set which used £ to replace # but it always caused problems,
                                          and even today I occasionally run into US software which is happy to
                                          accept pound signs but displays them as hashes, or vice versa.
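The pound-sign trouble comes down to U+00A3 having no 7-bit representation at all, while its byte value differs between 8-bit encodings. An illustrative Python sketch:

```python
pound = "\u00a3"  # POUND SIGN
print(pound.encode("latin-1").hex())  # a3: one byte in Latin-1
print(pound.encode("utf-8").hex())    # c2a3: two bytes in UTF-8
try:
    pound.encode("ascii")
except UnicodeEncodeError:
    print("no 7-bit ASCII representation")
```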

                                          --
                                          Matthew Winn (matthew@...)
                                        • Tobias C. Rittweiler
                                          Message 20 of 24 , Feb 5, 2004
                                            On Wednesday, February 4, 2004 at 10:04:08 PM,
                                            François Pinard <pinard@...> wrote:

                                            > [Antoine J. Mechelynck]

                                            > > [...] the diaeresis is the phonological fact of not running two vowels
                                            > > together, while the trema is only the typographical sign [...]
                                            >
                                            > I thought that "trema" was not an English word, so I used "diaeresis"
                                            > instead very systematically.

                                            Mhh, yes, it seems that English lacks "trema" and uses "diaeresis"
                                            instead with the same semantics. Even though diaeresis, as I learned
                                            it in Latin & Old Greek, is *actually* the grammatical phenomenon
                                            while trema is the typographical sign---as stated by Antoine above as
                                            well.


                                            > Should I read that "diaeresis" has the meaning of French "diphtongue"?

                                            God, no! It's the exact opposite (n-aï-ve vs t-oy). :-)


                                            -- tcr (tcr@...) ``Ho chresim'eidos uch ho poll'eidos sophos''
                                          • Antoine J. Mechelynck
                                            Message 21 of 24, Feb 5, 2004
                                              ----- Original Message -----
                                              From: "François Pinard" <pinard@...>
                                              To: "Antoine J. Mechelynck" <antoine.mechelynck@...>
                                              Cc: "Tobias C. Rittweiler" <tcr@...>; "Matthew Winn"
                                              <matthew@...>; <vim@...>
                                              Sent: Wednesday, February 04, 2004 10:04 PM
                                              Subject: Re: character sets


                                              > [Antoine J. Mechelynck]
                                              >
                                              > > [...] the diaeresis is the phonological fact of not running two vowels
                                              > > together, while the trema is only the typographical sign [...]
                                              >
                                              > I thought that "trema" was not an English word, so I used "diaeresis"
                                              > instead very systematically. Should I read that "diaeresis" has the
                                              > meaning of French "diphtongue"? Interesting!
                                              >
                                              > --
                                              > François Pinard http://www.iro.umontreal.ca/~pinard
                                              >

                                              Not the French "diphtongue" (English "diphthong") but French "diérèse".

                                              You made me doubt, so I checked "trema" in my New Oxford's -- and they don't
                                              have it. Here's their "diaeresis" article:

                                              diaeresis (US dieresis) |> noun (pl. diaereses) *1* a mark placed over a
                                              vowel to indicate that it is sounded separately, as in /naïve, Brontë./ [in
                                              French: tréma, Ed.]
                                              * [mass noun] the division of a sound into two syllables, especially by
                                              sounding a diphthong as two vowels [in French: approx. diérèse, Ed.]
                                              *2* _Prosody_ a natural rhythmic break in a line of verse where the end of a
                                              metrical foot coincides with the end of a phrase.
                                              -- ORIGIN late 16th cent. (denoting the division of one syllable into two):
                                              via Latin from Greek /diairesis/ 'separation', from /diairein/ 'take apart',
                                              from /dia/ 'apart' + /hairein/ 'take'.

                                              Regards,
                                              Tony.
                                            • Alejandro Lopez-Valencia
                                              Message 22 of 24, Feb 5, 2004
                                                Tobias C. Rittweiler scribbled on Thursday, February 05, 2004 9:42 AM:

                                                >
                                                > Mhh, yes, it seems English lacks "trema" and uses "diaeresis" instead
                                                > with the same semantics. Even though diaeresis, as I learned it in Latin &
                                                > old Greek, is *actually* the grammatical phenomenon while trema is the
                                                > typographical sign---as stated by Antoine above as well.

                                                Sorry to butt in late in this discussion, but I just couldn't let my gall
                                                bladder strangle me any longer. As per my Cassel's German-English, my
                                                Langenscheidts German-Spanish dictionaries, my sanguine German language
                                                teacher from Salzburg (yup, German passport) and my uniquely eccentric
                                                Great-Uncle from Stuttgart, Trema means diaeresis, no more, no less. ;-)

                                                BTW, German is, among Germanic languages, the one that resisted Latinization
                                                the most during the Middle Ages and therefore took in most Latin words late,
                                                during the Renaissance, Baroque and generally during the Enlightenment and
                                                later, as part of the flowering of the intellectual culture whose
                                                revolution can be appreciated from Martin Luther through Goethe to Heidegger
                                                and Wittgenstein. Thus, most Latin words are used almost as in the original
                                                language. Trema is for the Latin "tremorus"[1]: to be brief or to quiver. As
                                                such, it is equivalent to diaeresis in a twisted sort of sense: to make a
                                                diphthong shorter by breathing less. (Many Tremata are no more under the
                                                "Neue Regelung", if I understand the Duden correctly).

                                                On the other hand, English had an earlier desaxonification (defleaing? :-)
                                                at the hands of Willy the Conqueror and his gang from Normandy, who never
                                                spoke anything but the Normandy version of the "Langue d'Oil", presently
                                                known as French. No, he didn't drink canola, perhaps olive?

                                                And having veered off from off-topic to what my compatriot, the Nobel
                                                laureate, calls "Macondo" and literary critics "magic realism", I'll shut up
                                                now.

                                                Cheers,

                                                Alejo

                                                [1] Do forgive the spelling. To say that my Latin is rusty is an
                                                understatement.
                                              • Antoine J. Mechelynck
                                                Message 23 of 24, Feb 5, 2004
                                                  Alejandro Lopez-Valencia <dradul@...> wrote:
                                                  > Tobias C. Rittweiler scribbled on Thursday, February 05, 2004 9:42 AM:
                                                  >
                                                  > >
                                                  > > Mhh, yes, it seems English lacks "trema" and uses "diaeresis"
                                                  > > instead with the same semantics. Even though diaeresis, as I learned it
                                                  > > in Latin & old Greek, is *actually* the grammatical phenomenon
                                                  > > while trema is the typographical sign---as stated by Antoine above
                                                  > > as well.
                                                  >
                                                  > Sorry to butt in late in this discussion, but I just couldn't let my
                                                  > gall bladder strangle no more. As per my Cassel's German-English, my
                                                  > Langenscheidts German-Spanish dictionaries, my sanguine German
                                                  > language teacher from Salzburg (yup, German passport) and my uniquely
                                                  > eccentric Great-Uncle from Stuttgart, Trema means diaeresis, no more,
                                                  > no less. ;-)
                                                  >
                                                  > BTW, German is, among Germanic languages, the one that resisted
                                                  > Latinization the most during the Middle Ages and therefore took in
                                                  > most Latin words late, during the Renaissance, Baroque and generally
                                                  > during the Enlightenment and later, as part of the flowering of the
                                                  > intellectual culture whose revolution can be appreciated from Martin
                                                  > Luther through Goethe to Heidegger and Wittgenstein. Thus, most Latin
                                                  > words are used almost as in the original language. Trema is for the
                                                  > Latin "tremorus"[1]: to be brief or to quiver. As such, it is
                                                  > equivalent to diaeresis in a twisted sort of sense: to make a
                                                  > diphthong shorter by breathing less. (Many Tremata are no more under
                                                  > the "Neue Regelung", if I understand the Duden correctly).
                                                  >
                                                  > On the other hand, English had an earlier desaxonification
                                                  > (defleaing? :-) at the hands of Willy the Conqueror and his gang from
                                                  > Normandy, who never spoke anything but the Normandy version of the
                                                  > "Langue d'Oil", presently known as French. No, he didn't drink
                                                  > canola, perhaps olive?
                                                  >
                                                  > And having veered off from off-topic to what my compatriot, the Nobel
                                                  > laureate, calls "Macondo" and literary critics "magic realism", I'll
                                                  > shut up now.
                                                  >
                                                  > Cheers,
                                                  >
                                                  > Alejo
                                                  >
                                                  > [1] Do forgive the spelling. To say that my Latin is rusty is an
                                                  > understatement.

                                                  According to my "Petit Robert", tréma comes not from Latin tremor (quiver)
                                                  but from Greek trêma (hole, markings on dice) (which explains why you use
                                                  "tremata" as its plural, a typical Greek form). As for the diaeresis being
                                                  used, as one earlier poster wrote, "in various Germanic languages" to mark a
                                                  vowel that must be pronounced separately, AFAIK the only Germanic languages
                                                  using it that way are English and Dutch. In German and in some Scandinavian
                                                  languages (Swedish, at least), as well as in some non-Indo-European
                                                  languages like e.g. Finnish, Hungarian and Turkish, the same sign is used to
                                                  mean that a vowel's sound must change: the Germans call it Umlaut (literal
                                                  meaning IIUC: "by-sound").

                                                  I've read somewhere (I don't know who said it, but his mother language was
                                                  English) that the English language is one of the products of the Norman
                                                  men-at-arms' efforts to make dates with Saxon farmers' daughters, and no
                                                  more legitimate than the other offspring of those same efforts.

                                                  As for German, even now it still hesitates between borrowing and
                                                  translating: see e.g. Telefon vs. Fernsprecher, Grammatik vs. Sprachlehre,
                                                  etc.

                                                  As I wrote in an earlier post, F. tréma means E. diaeresis, but E. diaeresis
                                                  can mean either F. tréma (Typogr.), or F. diérèse (Phon.).

                                                  Regards,
                                                  Tony.