
character sets

  • jonah
    Message 1 of 24, Feb 1, 2004
      Hi,

      I'm curious about which characters can be displayed in vim, and which settings
      affect this.

      For example, I cut and pasted some html code into vim to create a new html page,
      and the original code contained smart quotes. The two characters (both left and
      right smart quote) display in vim as thick black boxes. However, when I view
      the new html page they display properly, as smart quotes. What is preventing
      them from displaying properly within vim? Is this the character set I'm using?
      The font I'm using?

      Thanks for any info,
      Jonah
    • Antoine J. Mechelynck
      Message 2 of 24, Feb 1, 2004
        ----- Original Message -----
        From: "jonah" <jonahgoldstein@...>
        To: "vim" <vim@...>
        Sent: Sunday, February 01, 2004 8:36 PM
        Subject: character sets


        > Hi,
        >
        > I'm curious about which characters can be displayed in vim, and which
        > settings affect this.
        >
        > For example, I cut and pasted some html code into vim to create a new
        > html page, and the original code contained smart quotes. The two
        > characters (both left and right smart quote) display in vim as thick
        > black boxes. However, when I view the new html page they display
        > properly, as smart quotes. What is preventing them from displaying
        > properly within vim? Is this the character set I'm using? The font
        > I'm using?
        >
        > Thanks for any info,
        > Jonah
        >

        If you can't display a character properly, there may be several reasons:

        o The character must exist in your current 'encoding' (global).

        o The character must exist in your current 'fileencoding' (local to
        buffer).

        o If you're using the GUI, your current 'guifont' (global) must have a
        glyph for it.

        o If you are not using the GUI, your console terminal must be set to the
        proper display codepage.

        o Your current 'fileencoding' must correspond to the character set which
        was used to save the file.

        see, among others,
        :help 'encoding'
        :help 'fileencoding'
        :help 'fileencodings'
        :help 'termencoding'
        :help 'guifont'
        :help :language
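
        For instance, a quick way to check them all at once (drop 'guifont' if
        you are not running the GUI):

        :set encoding? fileencoding? fileencodings? termencoding? guifont?
        :language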

        HTH,
        Tony.
      • Jonathan D Johnston
        Message 3 of 24, Feb 3, 2004
          On Sun, 1 Feb 2004 11:36:37 -0800,
          "jonah" <jonahgoldstein@...> wrote:
          [...]
          > For example, I cut and pasted some html code into vim to create a new
          > html page, and the original code contained smart quotes. The two
          > characters (both left and right smart quote) display in vim as thick
          > black boxes. However, when I view the new html page they display
          > properly, as smart quotes. What is preventing them from displaying
          > properly within vim? Is this the character set I'm using? The font
          > I'm using?

          Hi Jonah,

          What are the hex values for these "smart quotes"? In Vim, place the
          cursor on one of the black boxes and type
          :ascii
          or
          ga

          Are the hex values 0x93 (left quote) & 0x94 (right quote)? If so, these
          are Microsoft-specific (Windows codepage 1252) characters - they're not
          defined in latin1, Unicode, or any other standard character set. They
          should *not* be used on the WWW; only those browsing from a M$ OS will
          be able to see them.
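
          (Untested, off the top of my head: if they do turn out to be 0x93 and
          0x94, you can convert them to standard HTML entities from within Vim
          with something like

          :%s/\%x93/\&ldquo;/g
          :%s/\%x94/\&rdquo;/g

          where \%xNN matches the byte with that hex value and \& gives a
          literal ampersand in the replacement.)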

          Last I knew, there was a good discussion about these & other M$ specific
          characters on John Walker's website. I don't have the URL, but try
          searching for
          John Walker demoroniser fourmilab

          HTH,
          Jonathan D Johnston

        • Antoine J. Mechelynck
          Message 4 of 24, Feb 3, 2004
            Jonathan D Johnston <jdjohnston2@...> wrote:
            > On Sun, 1 Feb 2004 11:36:37 -0800,
            > "jonah" <jonahgoldstein@...> wrote:
            > [...]
            > > For example, I cut and pasted some html code into vim to create a
            > > new html page, and the original code contained smart quotes. The
            > > two characters (both left and right smart quote) display in vim as
            > > thick black boxes. However, when I view the new html page they
            > > display properly, as smart quotes. What is preventing them from
            > > displaying properly within vim? Is this the character set I'm
            > > using? The font I'm using?
            >
            > Hi Jonah,
            >
            > What are the hex values for these "smart quotes"? In Vim, place the
            > cursor on one of the black boxes and type
            > :ascii
            > or
            > ga
            >
            > Are the hex values 0x93 (left quote) & 0x94 (right quote)? If so,
            > these are Microsoft specific characters - they're not defined in
            > latin1, Unicode, or any other standard character set. They should
            > *not* be used on the WWW; only those browsing from a M$ OS will be
            > able to see them.
            >
            > Last I knew, there was a good discussion about these & other M$
            > specific characters on John Walker's website. I don't have the URL,
            > but try searching for
            > John Walker demoroniser fourmilab
            >
            > HTH,
            > Jonathan D Johnston
            >

            Everything can be represented in Unicode (or will be), i.e., not only
            Cuneiform, Klingon and Fëanorian letters (to name just a few), but also
            glyphs otherwise unknown to anyone other than Microsoft users.

            W98 Notepad's usual font shows characters 148 and 149 decimal as black
            boxes; when they are entered in gvim (cf. ":help i_CTRL-V_digit") with
            the default (latin1) 'encoding' and then imported via the clipboard
            into a gvim with utf-8 'encoding', I get two glyphs that are not
            symmetrical to each other, at codepoints U+201D and U+2022.

            I suspect what the OP calls smart quotes are what is better known as French
            quotes, or double angled brackets (« and »), codepoints U+00AB and U+00BB,
            but of course I can't be sure.
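
            (Jonah: "ga" on one of the black boxes will settle it. If they really
            are the French quotes, you can type them in Vim with the digraphs
            CTRL-K << and CTRL-K >> -- see ":help digraphs" -- or, with a Unicode
            'encoding', as CTRL-V u00ab and CTRL-V u00bb in Insert mode; I'm
            quoting those key sequences from memory, so check the help first.)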

            Regards,
            Tony.
          • François Pinard
            Message 5 of 24, Feb 3, 2004
              [Antoine J. Mechelynck]

              > Everything can be represented in Unicode (or will be), [...]

              Unicode is some kind of mirage for too many of us. In the quote above,
              "Everything" is much more diversified than one may think, and "will be"
              widely underestimates all the politics involved behind Unicode. Things
              are neither so pure nor so simple in practice.

              One of the problems behind the mirage is all those induced religious
              or fanatic feelings towards Unicode. Do not read me as anti-Unicode:
              there are good things in there, and I'm glad to see that it is acquiring
              better support, slowly, a bit everywhere. I support it in my little
              things whenever reasonable to do so. But the exaggerated hopes conveyed
              with Unicode also do a lot of damage, if only because many people stop
              looking for other or better avenues, and just stand still.

              P.S. - About Klingon :-), I read (but did not check) that it was removed
              in some later version of Unicode. Maybe it has been moved elsewhere?

              --
              François Pinard http://www.iro.umontreal.ca/~pinard
            • Antoine J. Mechelynck
              Message 6 of 24, Feb 3, 2004
                ----- Original Message -----
                From: "François Pinard" <pinard@...>
                To: "Antoine J. Mechelynck" <antoine.mechelynck@...>
                Cc: <jonahgoldstein@...>; "Jonathan D Johnston"
                <jdjohnston2@...>; <vim@...>
                Sent: Tuesday, February 03, 2004 9:00 PM
                Subject: Re: character sets


                > [Antoine J. Mechelynck]
                >
                > > Everything can be represented in Unicode (or will be), [...]
                >
                > Unicode is some kind of mirage for too many of us. In the quote above,
                > "Everything" is much more diversified than one may think, and "will be"
                > widely underestimate all the politics involved behind Unicode. Things
                > are neither so pure nor so simple in practice.
                >
                > One of the problems behind the mirage is all those induced religious
                > or fanatic feelings towards Unicode. Do not read me as anti-Unicode,
                > there are good things in there, and I'm glad to see that it is acquiring
                > better support, slowly, a bit everywhere. I support it in my little
                > things whenever reasonable to do so. But the exaggerated hopes conveyed
                > with Unicode also do a lot of damage, would it be only because many
                > people stop looking for other or better avenues, and just stand still.
                >
                > P.S. - About Klingon :-), I read (but did not check) that it was removed
                > in some later version of Unicode. Maybe it has been moved elsewhere?
                >
                > --
                > François Pinard http://www.iro.umontreal.ca/~pinard
                >

                Removed? I thought one of the basic tenets of Unicode was that nothing would
                ever be removed? There goes another illusion. Well, replace it by Angerthas
                or Maya, at your choice.

                As for the more basic question -- I use Unicode, but not for _everything_.
                However, for some applications it is irreplaceable. (See my homepage
                http://users.skynet.be/antoine.mechelynck/ to see what I mean). And don't
                tell me I could have used entities instead of "charset=utf-8": it's true,
                but entities are essentially a 7-bit ASCII representation of Unicode, and an
                illegible one when used for any non-Latin writing system.

                Regards,
                Tony.
              • François Pinard
                Message 7 of 24, Feb 3, 2004
                  [Antoine J. Mechelynck]

                  > I thought one of the basic tenets of Unicode was that nothing would
                  > ever be removed? There goes another illusion.

                  Unicode has changed a lot since its inception. There has also been the
                  influence of ISO 10646, and the practical convergence of the two.

                  The most widespread Unicode illusion is still, probably, about the
                  1-1 correspondence between codes and characters. It requires some
                  doing before a program can address the N'th character of an in-memory
                  Unicode string in constant time: the representation used is usually
                  _not_ pure Unicode. Some characters require combined forms to be
                  produced, while others (or the same ones) also exist pre-combined.
                  UTF-16 has been integrated into the standard, so, quite apart from
                  combining, some characters require two code units. Add overhead codes
                  for directionality and other phenomena, and you are far from
                  simplicity. Consider the many levels of conformance, and the many
                  editions of the Unicode standard over the years, including the
                  shuffling of massive code blocks between versions, and you now have
                  something really complex to implement and support.

                  Unicode is really a matter of specialists, and this is shocking, given
                  that character handling should be bread and butter of all computer
                  programmers, whatever the country is, rich or poor. No doubt that
                  creating a speciality also creates jobs for specialists, and a market if
                  you can force your big standard all around. But you also condemn the
                  less technical countries to colonialism and all the abuse going with it.

                  Even in the richer countries, almost every Unicode application today is
                  more or less broken, as almost nobody is able to support it as it
                  should be.
                  No doubt that people are all excited when they get some working UTF-8,
                  and indeed, it is much fun seeing multi-lingual Web pages. The truth
                  is that a lot of packages boldly state they support Unicode, as soon
                  as they handle 16-bit characters and have the usual UTF-8 and Latin-x
                  conversions, but this is usually still quite far from the real thing.

                  For one, I'm glad that Vim "supports" Unicode, that GTK supports it,
                  that Python supports it, that Pango exists. That is good news, and
                  even extremely good news when you have no alternative handy. It might
                  require years before German and French speakers really leave ISO
                  8859-1 (or -15), and similarly for others. It will surely require a
                  _lot_ of years before Americans really leave ASCII :-). Let's face it:
                  without the Microsoft monopoly, Unicode would likely never make it in
                  Asian and Eastern countries. Oh, it might become strong in countries
                  not already in control of their software engineering; Unicode will
                  keep them captive.

                  My point here is not to say that Unicode is bad, but only to stress a
                  bit that it is not the wonder that some blindly think it is. And also,
                  in the same breath, to say that Unicode fanatics are dangerous people! :-)

                  > However, for some applications it is irreplaceable. (See my homepage
                  > http://users.skynet.be/antoine.mechelynck/ to see what I mean).

                  Nice indeed, no doubt! Congratulations!

                  > And don't tell me I could have used entities [...]

                  I agree that entities are much abused in HTML. Even `&nbsp;', ubiquitous
                  on the Web, would be advantageously replaced by the real thing! The
                  likely reason for &nbsp; is that Vim is still not popular enough! :-)

                  --
                  François Pinard http://www.iro.umontreal.ca/~pinard
                • Antoine J. Mechelynck
                  Message 8 of 24, Feb 3, 2004
                    François Pinard <pinard@...> wrote:
                    > [Antoine J. Mechelynck]
                    >
                    > > I thought one of the basic tenets of Unicode was that nothing would
                    > > ever be removed? There goes another illusion.
                    >
                    > Unicode changed a lot since its inception. There has also been the
                    > influence of ISO 10646, and the practical convergence of both.
                    >
                    > The most widespread Unicode illusion is still, probably, about the
                    > 1-1 correspondence between codes and characters.

                    Unicode merely (!) ranks glyphs (and some control codes) on an integral
                    scale going from 0 to some large number. It can be represented electronically
                    in various ways, such as UTF-8 (from 1 to 6 bytes per codepoint in theory,
                    but no more than 4 "in any foreseeable future"), UTF-16 (1 or sometimes 2
                    16-bit words per codepoint), UTF-32 (32 bits per codepoint). The latter (in
                    either of its endian variants) is fixed-size but horribly wasteful of space.
                    That's 10 different encodings if you take endianness and presence or absence
                    of a BOM into account. (Not including the proposals I've seen for mixed
                    endianness, using a BOM to set endianness at any point in the middle of a
                    UTF-16 text.)
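
                    (A worked example, done by hand so take it with a grain of salt: the
                    right double quotation mark U+201D becomes the three bytes e2 80 9d
                    in UTF-8, the single 16-bit unit 201d in UTF-16 (stored as 1d 20
                    little-endian or 20 1d big-endian), and 00 00 20 1d in big-endian
                    UTF-32. In a gvim with utf-8 'encoding' you can check the byte
                    sequence of the character under the cursor with the "g8" command.)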

                    > It requires some
                    > doing before a program can address the N'th character of an in-memory
                    > Unicode string in constant time: the used representation is usually
                    > _not_ pure Unicode.

                    I suppose you mean a UTF-8 string, and I agree. Finding the Nth character of
                    an ASCII string is an addressing matter: add N to the start address of the
                    string, possibly check for out-of-bounds, that's it. Finding the Nth
                    character in a UTF-8 string requires examining the lead byte of each byte
                    sequence in turn to determine where the next character starts. And that's
                    before skipping (or not) combining characters, zero-width characters and/or
                    control characters. I've seen some texts which seem to imply that all
                    Unicode should move toward UTF-32 (where addressing the Nth character of a
                    string _is_ straightforward, at least if combining characters and control
                    characters are counted separately) but somehow I'm not convinced.
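
                    (A small illustration of that byte/character mismatch, assuming a Vim
                    recent enough to have byteidx() and 'encoding' set to utf-8:

                    :echo strlen("naïve")
                    :echo byteidx("naïve", 3)

                    The first prints 6 -- six bytes for five characters -- and the second
                    prints 4: the "v", character number 3 counting from zero, starts at
                    byte 4 because the "ï" before it takes two bytes. Character index and
                    byte index part company as soon as one multi-byte character appears.)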

                    > Some characters require combined forms for being
                    > produced, while others (or the same) exist pre-combined. UTF-16 has
                    > been integrated into the standard, so irrelevant to combining, some
                    > characters require two codes. Add overhead codes for directionality,
                    > and other phenomena, you are away from simplicity. Consider many
                    > levels of conformance, and many editions of the Unicode standard over
                    > the years, including shuffling of massive code blocks between
                    > versions, you now have something really complex to implement and
                    > support.

                    Yeah. Just understanding it (or trying to) requires poring through sheaves of
                    nebulous, verbose documentation. (I won't say it isn't precise in its
                    long-winded way; what I'm saying is that it isn't easy to fathom.) Then
                    someone has to implement it (or try to).
                    >
                    > Unicode is really a matter of specialists, and this is shocking, given
                    > that character handling should be bread and butter of all computer
                    > programmers, whatever the country is, rich or poor.

                    I seem to have become a kind of Unicode specialist for Vim at the How-To and
                    scripting level, just because nobody wanted to do it, but I'm not gonna
                    claim I understand the system. Just that I know where to look in Vim's
                    documentation, or what settings to tweak, to make it work (somehow). I'm
                    sure there are other people hereabouts who understand it better than I.

                    It does take some getting used to. (Well, in CJK countries it takes all of
                    grade school before people "know their letters". At least you can read the
                    paper, or even browse the Web, without knowing what Unicode is all about.)
                    Let's say I'm operating at the grade-school teacher level, far from those
                    highbrow types who program patches to get data "the right way" from Vim to
                    the W32 or X11 clipboard and vice-versa in all possible cases of Vim
                    'encoding' and OS locale.

                    > No doubt that
                    > creating a speciality also creates jobs for specialist, and a market
                    > if you can force your big standard all around. But you also condemn
                    > the less technical countries to colonialism and all the abuse going
                    > with it.
                    >
                    > Even in the richer countries, almost every Unicode application today
                    > is more or less broken, as about nobody is able to support it as it
                    > should. No doubt that people are all excited when they get some
                    > working UTF-8, and indeed, it is much fun seeing multi-lingual Web
                    > pages. The truth is that a lot of packages boldly state they support
                    > Unicode, as soon
                    > as they handle 16-bit characters and have the usual UTF-8 and Latin-x
                    > conversions, but this is usually still quite far from the real thing.

                    Vim (well, gvim) seems to me to be handling Unicode pretty well, compared to
                    some other programs I use. It could be considered "broken" in that it
                    rejects neither overlong sequences nor invalid codes, but in an editor that
                    sort of "brokenness" can IMHO be regarded as a quality rather than a
                    blemish.

                    >
                    > For one, I'm glad that Vim "supports" Unicode, that GTK supports it,
                    > that Python supports it, that Pango exists.

                    ...that WordPad supports it (not as well as Vim unless you want proportional
                    fonts and true bidirectionality), that most web browsers understand it,
                    though not always perfectly (just try to display vocalised Arabic text in
                    Netscape 7 and you'll find out that combining characters don't combine)...

                    > Those are good news, and
                    > even extremely good news when you have no alternative handy. It might
                    > require years before German and French really leave ISO 8859-1 (or
                    > -15), and similarly for others. It will surely require a _lot_ of
                    > years before American really leave ASCII :-).

                    Well, 7-bit ASCII is left unchanged under UTF-8, isn't it? And since there
                    are no accented letters in English, except in non-English proper names and
                    in some non-assimilated foreign words like risqué, omertà, garçon, etc. ...

                    > Let's face it: without
                    > Microsoft monopoly, Unicode would likely never make it in Asian and
                    > Eastern countries. Oh, it might become strong in countries not
                    > already in control of their software engineering, Unicode will keep
                    > them captive.

                    Hm. What is better? A plethora of national encodings (sometimes 2 or 3 for a
                    single language in a single country), or a common standard? I am somewhat
                    reminded of all the sorts of leagues, yards, barrels, pounds, etc. that
                    existed before an autocratic act of the French legislative body established
                    the metric system. (And to know if your national metric standard of mass is
                    up to any good you still have to arrange to have it compared, usually not
                    directly, with a certain cylinder of platinum-iridium in Sèvres, France.)
                    >
                    > My point here is not to say that Unicode is bad, but only to stress a
                    > bit that it is not the wonder that some blindly think it is. And
                    > also, on the same blow, to say that Unicode fanatics are dangerous
                    > people! :-)

                    All fanatics are dangerous, and even more so are the power-hungry who feed
                    them lies to keep them ignorant and fanatic. Yet who is going to say
                    nowadays that the metric system, or the Gregorian calendar (established by
                    papal decree) are bad? Or that there are "metric fanatics" and "Gregorian
                    fanatics"? Oh, there are some, and I know where to look...

                    >
                    > > However, for some applications it is irreplaceable. (See my homepage
                    > > http://users.skynet.be/antoine.mechelynck/ to see what I mean).
                    >
                    > Nice indeed, no doubt! Congratulations!
                    [...]

                    Thanks.

                    Best regards,
                    Tony.
                  • Matthew Winn
                    Message 9 of 24, Feb 4, 2004
                      On Wed, Feb 04, 2004 at 06:02:05AM +0100, Antoine J. Mechelynck wrote:
                      > That's 10 different encodings if you take endianness and presence or absence
                      > of a BOM into account. (Not including the proposals I've seen for mixed
                      > endianness, using a BOM to set endianness at any point in the middle of a
                      > UTF-16 text.)

                      What would be the point of that? Endianness is just a feature of the
                      way the hardware stores the bits.

                      > François Pinard <pinard@...> wrote:
                      > > For one, I'm glad that Vim "supports" Unicode, that GTK supports it,
                      > > that Python supports it, that Pango exists.
                      >
                      > ...that WordPad supports it (not as well as Vim unless you want proportional
                      > fonts and true bidirectionality), that most web browsers understand it,
                      > though not always perfectly (just try to display vocalised Arabic text in
                      > Netscape 7 and you'll find out that combining characters don't combine)...

                      There's also Perl and Java. Unicode support is getting reasonably good
                      in software. The biggest problem appears to be the availability of good
                      fonts: too often you find you can handle Unicode with no trouble at all
                      right up to the moment you want someone to be able to read it.

                      > > Those are good news, and
                      > > even extremely good news when you have no alternative handy. It might
                      > > require years before German and French really leave ISO 8859-1 (or
                      > > -15), and similarly for others. It will surely require a _lot_ of
                      > > years before American really leave ASCII :-).
                      >
                      > Well, 7-bit ASCII is left unchanged under UTF-8 isn't it? And since there
                      > are no accented letters in English, except in non-English proper names and
                      > in some non-assimilated foreign words like risqué, omertà, garçon, etc. ...

                      Accents are also used in a few cases to indicate that a sequence of
                      vowels should be pronounced separately, as in words like naïve or names
                      like Zoë. However, some Americans do seem to be resisting the move
                      away from 7-bit, and I've occasionally seen complaints from those whose
                      software still can't decode quoted-printable text.

                      --
                      Matthew Winn (matthew@...)
                    • Bram Moolenaar
                      Message 10 of 24, Feb 4, 2004
                        François Pinard wrote:

                        > The most widespread Unicode illusion is still, probably, about the
                        > 1-1 correspondence between codes and characters. It requires some
                        > doing before a program can address the N'th character of an in-memory
                        > Unicode string in constant time: the used representation is usually
                        > _not_ pure Unicode. Some characters require combined forms for being
                        > produced, while others (or the same) exist pre-combined. UTF-16 has
                        > been integrated into the standard, so irrelevant to combining, some
                        > characters require two codes.

                        A few words from the implementation side: That there is no direct
                        mapping from "the Nth character" to a byte index has not much to do with
                        Unicode but with the nature of the characters. Most Asian encodings
                        have the same problem, only they compensate for that by making
                        characters twice as wide at the same time, thus at least there is a
                        one-to-one mapping with display space. That method breaks when you run
                        out of space in two-byte codes or have combining characters.

                        UTF-16 is generally looked upon as a bad thing that can't be avoided.
                        Some people (OK, let's say MS) started using 16-bit characters
                        everywhere, and later found out they couldn't fit everything in 16
                        bits and also could not switch to more bits without breaking all
                        existing programs. If they had used 32 bits from the start, people
                        would have complained about the waste of memory, and that would have
                        stopped a lot of people from using it. If only they had invented
                        UTF-8 back then...

                        Vim uses UTF-8, which is the best choice for Unicode encodings (looking
                        from a programmers perspective). It has all the properties you want
                        (e.g., simple recognition of character boundaries), and still handles
                        ASCII in one byte. That's why Vim uses UTF-8 internally and converts
                        all other Unicode encodings to it.
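
                        (From the user side this shows up as 'fileencoding' and the ++enc
                        argument: something like

                        :e ++enc=utf-16le somefile.txt
                        :w ++enc=ucs-4

                        reads or writes the file in another Unicode encoding while the
                        buffer itself stays UTF-8 internally -- ":help ++enc" has the
                        exact syntax.)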

                        > Add overhead codes for directionality, and other phenomena, you are
                        > away from simplicity.
                        [...]
                        > Unicode is really a matter of specialists, and this is shocking, given
                        > that character handling should be bread and butter of all computer
                        > programmers, whatever the country is, rich or poor.

                        Composing (aka combining) characters are already difficult to handle.
                        But they are required for a few languages (Hebrew, Thai); any encoding
                        for those languages would have the same problem.
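
                        (To see one in Vim: with a utf-8 'encoding', insert an "e" followed
                        by CTRL-V u0301, the combining acute accent, and you should get a
                        single screen cell showing an accented e built from two codepoints;
                        "ga" on it reports both numbers. See ":help i_CTRL-V_digit" and
                        ":help ga".)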

                        Bidirectionality is extremely difficult. That's why it has not been
                        implemented in Vim yet. This is something I wish they would have put
                        outside of Unicode: Let the characters be ordered as they are to be
                        displayed, that's much simpler. The complexity is then only in
                        manipulating the text, not in displaying or cursor positioning.

                        As I understand it, the decision mostly frowned upon is the unification
                        of Asian characters. This requires marking text as "Chinese" or
                        "Japanese", otherwise you don't know how to display the text properly.
                        This can be compared to (more or less) reading English text with old
                        German characters. It's possible, but you read it letter by letter.
                        Don't know if this will prevent the use of Unicode in Asian countries,
                        since the situation with different character sets isn't better (the
                        two-byte encodings cannot be recognized automatically).

                        > No doubt that creating a speciality also creates jobs for specialist,
                        > and a market if you can force your big standard all around. But you
                        > also condemn the less technical countries to colonialism and all the
                        > abuse going with it.

                        It's the nature of the languages that makes it complicated, not Unicode.
                        It certainly has nothing to do with colonialism; most of the work on
                        Unicode was done by non-western people. Americans and Europeans don't
                        know much about these things :-).

                        --
                        hundred-and-one symptoms of being an internet addict:
                        38. You wake up at 3 a.m. to go to the bathroom and stop and check your e-mail
                        on the way back to bed.

                        /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                        /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
                        \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                        \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
                      • Antoine J. Mechelynck
                        Message 11 of 24, Feb 4, 2004
                          Matthew Winn <matthew@...> wrote:
                          > On Wed, Feb 04, 2004 at 06:02:05AM +0100, Antoine J. Mechelynck wrote:
                          > > That's 10 different encodings if you take endianness and presence
                          > > or absence of a BOM into account. (Not including the proposals I've
                          > > seen for mixed endianness, using a BOM to set endianness at any
                          > > point in the middle of a UTF-16 text.)
                          >
                          > What would be the point of that? Endianness is just a feature of the
                          > way the hardware stores the bits.

                          All Unicode encodings except UTF-8 can be either little-endian or
                          big-endian. In the world of Internet, files will be shared between computers
                          whose hardware may be of different endianness. This sharing does not
                          automagically translate the file, any more than a Russian book becomes a
                          French book when I (a native French speaker) put it on my shelf (There goes
                          the POSIX requirement that encodings be determined computer-by-computer by
                          means of the locale and not file-by-file by means of "magic"). This said, I
                          don't see a purpose (other than allowing thoughtless use of the
                          concatenation program) for mixed-endianness files either. I just
                          mentioned it in passing for the sake of completeness.
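
                          (For reference, from memory: the BOM is just U+FEFF written in the
                          file's own encoding, so it appears as ef bb bf in UTF-8, ff fe in
                          little-endian UTF-16, fe ff in big-endian UTF-16, and ff fe 00 00
                          or 00 00 fe ff in the two UTF-32 variants. Vim sniffs it when
                          'fileencodings' starts with "ucs-bom", and writes one when the
                          buffer-local 'bomb' option is set; see ":help bomb" and
                          ":help fileencodings" for details.)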

                          >
                          > > François Pinard <pinard@...> wrote:
                          > > > For one, I'm glad that Vim "supports" Unicode, that GTK supports
                          > > > it,
                          > > > that Python supports it, that Pango exists.
                          > >
                          > > ...that WordPad supports it (not as well as Vim unless you want
                          > > proportional fonts and true bidirectionality), that most web
                          > > browsers understand it, though not always perfectly (just try to
                          > > display vocalised Arabic text in Netscape 7 and you'll find out
                          > > that combining characters don't combine)...
                          >
                          > There's also Perl and Java. Unicode support is getting reasonably
                          > good
                          > in software. The biggest problem appears to be the availability of
                          > good fonts: too often you find you can handle Unicode with no trouble
                          > at all
                          > right up to the moment you want someone to be able to read it.
                          >
                          > > > Those are good news, and
                          > > > even extremely good news when you have no alternative handy. It
                          > > > might require years before German and French really leave ISO
                          > > > 8859-1 (or
                          > > > -15), and similarly for others. It will surely require a _lot_ of
                          > > > years before American really leave ASCII :-).
                          > >
                          > > Well, 7-bit ASCII is left unchanged under UTF-8 isn't it? And since
                          > > there are no accented letters in English, except in non-English
                          > > proper names and in some non-assimilated foreign words like risqué,
                          > > omertà, garçon, etc. ...
                          >
                          > Accents are also used in a few cases to indicate that a sequence of
                          > vowels should be pronounced separately, as in words like naïve or
                          > names
                          > like Zoë. However, some Americans do seem to be resisting the move
                          > away from 7-bit, and I've occasionally seen complaints from those
                          > whose software still can't decode quoted-printable text.
                          >
                          > --
                          > Matthew Winn (matthew@...)

                          Well, Zoë falls in the category of what I would call "non-English proper
                          names" even if native English-speakers give that name to their daughters.
                          (Similarly Eönwë, which I have seen as a "Usenet handle".) Among
                          non-assimilated foreign common words I might add mañana (but not canyon,
                          the English equivalent of Spanish cañón). Among proper names in
                          English-speaking countries originating in foreign words: Detroit (from
                          French détroit = strait, as in the Strait of Dover) has lost its accent;
                          I don't know whether Bâton Rouge (from French, = red stick) still has one
                          or not. Similarly Montreal (English) vs. Montréal (French), also (IIUC)
                          Dvorak with a caron over the r for the musician but not for the computer
                          specialist (or shall we say ergologist?) etc. etc. etc.

                          Regards,
                          Tony.
                        • Tobias C. Rittweiler
                          Message 12 of 24, Feb 4, 2004
                            On Wednesday, February 4, 2004 at 9:51:06 AM,
                            Matthew Winn <matthew@...> wrote:

                            > Accents are also used in a few cases to indicate that a sequence of
                            > vowels should be pronounced separately, as in words like naïve or names
                            > like Zoë.

                            That thingy is called a Trema. :-) However it may still fall into the
                            category of accents, I'm not sure.


                            -- tcr (tcr@...) ``Ho chresim'eidos uch ho poll'eidos sophos''
                          • Mikolaj Machowski
                            Message 13 of 24, Feb 4, 2004
                              On Wednesday, 04 February 2004 at 11:15, Bram Moolenaar wrote:
                              > Composing (aka combining) characters are already difficult to handle.
                              > But they are required for a few languages (Hebrew, Thai), any encoding
                              > for those languages would have the same problem.

                              Also needed for some math characters.

                              m.
                              --
                              LaTeX + Vim = http://vim-latex.sourceforge.net/
                              Vim-list(s) Users Map: (last change 1 Feb)
                              http://skawina.eu.org/mikolaj/vimlist
                              Are You There?
                            • Antoine J. Mechelynck
                              Message 14 of 24, Feb 4, 2004
                                Tobias C. Rittweiler <tcr@...> wrote:
                                > On Wednesday, February 4, 2004 at 9:51:06 AM,
                                > Matthew Winn <matthew@...> wrote:
                                >
                                > > Accents are also used in a few cases to indicate that a sequence of
                                > > vowels should be pronounced separately, as in words like naïve or
                                > > names like Zoë.
                                >
                                > That thingy is called a Trema. :-) However it may still fall into the
                                > category of accents, I'm not sure.
                                >
                                >
                                > -- tcr (tcr@...) ``Ho chresim'eidos uch ho poll'eidos
                                > sophos''

                                In modern computer parlance it's also called a diaeresis, though more
                                precisely the diaeresis is the phonological fact of not running two
                                vowels together, while the trema is only the typographical sign, also
                                used, with a different phonological value, as the German umlaut (and
                                likewise in other languages such as Swedish, Hungarian or Turkish).

                                Anyway, when I said "accents" I actually meant "diacritical signs",
                                including not only the accents in touché, omertà and (I think) Bâton Rouge,
                                Louisiana, but also the cedilla in garçon and the tilde in mañana, so the
                                trema in naïve, Zoë (if that's the right spelling) and coöperate was meant
                                to be included.
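
                                (Incidentally, all of these are easy to type in Vim with digraphs,
                                if I remember them right: CTRL-K e' for é, CTRL-K a! for à,
                                CTRL-K a> for â, CTRL-K c, for ç, CTRL-K n? for ñ, and CTRL-K e:
                                or CTRL-K i: for the trema -- ":help digraph-table" lists them all.)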

                                Regards,
                                Tony.
                              • François Pinard
                                Message 15 of 24, Feb 4, 2004
                                  [Antoine J. Mechelynck]

                                  > UTF-32 (32 bits per codepoint).

                                  Only 31 bits in theory, yet of course, we allocate full words. :-)

                                  > > It requires some doing before a program can address the N'th
                                  > > character of an in-memory Unicode string in constant time: the used
                                  > > representation is usually _not_ pure Unicode.

                                  > I suppose you mean a UTF-8 string [...]

                                  No, I really meant UCS-2 or UCS-4. Combining characters,
                                  directionality marks, etc. make it difficult to access the N'th
                                  character in constant time. If you work with UCS-2 internally, then
                                  you also have to account for surrogate pairs if you follow recent
                                  versions of the standard. You have to invent your own coding for
                                  doing so, and while it is natural that you base it on Unicode, it
                                  is not Unicode anymore internally, strictly speaking.
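
                                  (For the curious, the surrogate arithmetic goes roughly like this:
                                  for a codepoint C above U+FFFF, subtract 0x10000; the high
                                  surrogate is then 0xD800 plus the top 10 bits of the result and
                                  the low surrogate is 0xDC00 plus the bottom 10 bits. For U+1D11E,
                                  the musical G clef: 0x1D11E - 0x10000 = 0xD11E, so 0xD800 + 0x34 =
                                  0xD834 and 0xDC00 + 0x11E = 0xDD1E, giving the UTF-16 pair
                                  D834 DD1E.)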

                                  > I've seen some texts which seem to imply that all Unicode should
                                  > move toward UTF-32 (where addressing the Nth character of a string
                                  > _is_ straightforward, at least if combining characters and control
                                  > characters are counted separately) but somehow I'm not convinced.

                                  And you are right. Do not let fanatics convince you! :-) About
                                  combining characters, many languages are well served by Unicode, among
                                  which German and Vietnamese, say, as characters exist pre-combined, even
                                  those needing two diacritical marks. However, a few years ago, Unicode
                                  and W3C got together to state that no more pre-combined characters
                                  would get into their standards, implying that all nations not powerful
                                  (or rich) enough to get their needs satisfied early by Unicode are now
                                  doomed to perpetual complexity in the realm of Unicode and the Web.
                                  Of course, you and I, as French-speaking people, are nearly fully
                                  satisfied with the natural mapping between Latin-1 and Unicode, and
                                  have not much to complain about. But not everybody on this planet had
                                  the same luck.

                                  > I seem to have become a kind of Unicode specialist for Vim at the
                                  > How-To and scripting level,

                                  And thanks for being there, we surely need such sources of information.

                                  > Well, in CJK countries it takes all of grade school before people
                                  > "know their letters".

                                  On the other hand, Asian people are somewhat amused when they hear us
                                  pompously label "character sets" our smallish groups of 100 glyphs :-).

                                  > Vim (well, gvim) seems to me to be handling Unicode pretty well
                                  > [...] [Vim] could be considered "broken" in that it rejects neither
                                  > overlong sequences nor invalid codes, but in an editor that sort of
                                  > "brokenness" can IMHO be regarded as a quality rather than a blemish.

                                  I'm not into it yet with Vim, but I fully expect Vim to be very
                                  usable in that area! It seems to me (still at a distance) that Vim
                                  does its best to take advantage of the libraries and facilities
                                  available; we could not reasonably ask for more. What Vim does is
                                  surely difficult enough already.

                                  > Well, 7-bit ASCII is left unchanged under UTF-8 isn't it?

                                  Yes, in theory. Yet there is a running polemic about whether the
                                  quote (' - decimal 39) should obey ASCII, which states that it should
                                  be bent to the right like an acute accent, or be vertical like a
                                  typographical apostrophe. Latin-1 has a proper acute accent in its
                                  second half, so many Latin-1 fonts put decimal 39 back to vertical.
                                  Although Unicode makes explicit its intent of being coincident with
                                  ASCII for its first 128 positions, I see some well-known proponents
                                  of standards being dissident on this one particular point. Some
                                  people ask that we change our writing habits; others suggest that
                                  fonts should rather be corrected. You see, nothing is simple! :-)

                                  > Hm. What is better? A plethora of national encodings (sometimes 2 or 3
                                  > for a single language in a single country), or a common standard?

                                  There is a distance between the dream and practice. What we really have
                                  now is a plethora of national encodings, _plus_ many Unicode standards.
                                  National encodings are not going away. There is even a European trend,
                                  in standards committees, for creating a flurry of new "handy" 8-bit
                                  subsets of Unicode, just as a way to tame Unicode into something
                                  more tractable. Besides, some countries still resent the heavy push
                                  from the Unicode consortium towards Han unification (which was partly
                                  justified by the technical limits of UCS-2 before UTF-16), and will
                                  likely resist Unicode as a way to protect their culture from technology.
                                  There are many smaller blunders that people told me, here and there.
                                  Even if Unicode amends and evolves, some political damage will not be
                                  easily forgiven. Unicode is more than a set of technological issues.

                                  > Yet who is going to say nowadays that the metric system, or the
                                  > Gregorian calendar (established by papal decree) are bad?

                                  There are still a few non-Gregorian calendars on this planet! And the
                                  metric system is seemingly not good enough yet for Americans! :-)

                                  --
                                  François Pinard http://www.iro.umontreal.ca/~pinard
                                • François Pinard
                                  Message 16 of 24, Feb 4, 2004
                                    [Matthew Winn]

                                    > The biggest problem appears to be the availability of good fonts:

                                    I'm not sure that's the biggest problem, but it is one problem. And
                                    besides fonts, we also lack widespread combining and directional
                                    engines (etc.) at display time.

                                    I think I read that Vim supports Pango, which is good news.

                                    --
                                    François Pinard http://www.iro.umontreal.ca/~pinard
                                  • François Pinard
                                    Message 17 of 24, Feb 4, 2004
                                      [Antoine J. Mechelynck]

                                      > [...] the diaeresis is the phonological fact of not running two vowels
                                      > together, while the trema is only the typographical sign [...]

                                      I thought that "trema" was not an English word, so I used "diaeresis"
                                      instead very systematically. Should I read that "diaeresis" has the
                                      meaning of French "diphtongue"? Interesting!

                                      --
                                      François Pinard http://www.iro.umontreal.ca/~pinard
                                    • Bram Moolenaar
                                      Message 18 of 24, Feb 4, 2004
                                        François Pinard wrote:

                                        > > The biggest problem appears to be the availability of good fonts:
                                        >
                                        > I'm not sure that's the biggest problem, but it is one problem. And
> besides fonts, we also lack widespread combining and directional
                                        > engines (etc.) at display time.
                                        >
                                        > I think I read that Vim supports Pango, which is good news.

                                        Yeah, but it's not working very well. Combining characters are not
                                        drawn correctly. Moving the cursor back and forth over a line changes
                                        what it shows. Hopefully someone who knows Pango can look into this.
                                        The original author of this code has vanished.
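For readers wondering what a "combining character" is in practice: the same accented letter can be stored either precomposed or as a base letter plus a combining mark, and the display engine has to stack the two onto a single cell. A minimal Vim script sketch, assuming 'encoding' is set to utf-8 and a Vim built with multi-byte support -- an illustration only, not Vim's actual drawing code:

    " 'é' stored precomposed (U+00E9) versus 'e' followed by U+0301
    " COMBINING ACUTE ACCENT: one glyph on screen, different byte strings.
    let precomposed = "\u00e9"
    let combining   = "e\u0301"
    " strlen() reports 2 bytes for the first and 3 for the second;
    " the ==# comparison reports 0 because Vim compares bytes, not glyphs.
    echo strlen(precomposed)
    echo strlen(combining)
    echo precomposed ==# combining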

                                        --
                                        The 50-50-90 rule: Anytime you have a 50-50 chance of getting
                                        something right, there's a 90% probability you'll get it wrong.

                                        /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                                        /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
                                        \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                                        \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
                                      • François Pinard
                                        Message 19 of 24 , Feb 4, 2004
                                          [Bram Moolenaar]

                                          > A few words from the implementation side: [...] That's why Vim uses
                                          > UTF-8 internally and converts all other Unicode encodings to it.

                                          Interesting, thanks!
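That conversion can also be observed at the script level through Vim's built-in iconv() function. A minimal sketch, assuming a Vim built with the +iconv feature -- it only illustrates the idea and says nothing about Vim's internal code paths:

    " The EURO SIGN (U+20AC) as two UTF-16 big-endian bytes, converted to
    " Vim's internal UTF-8 form, where it takes three bytes (E2 82 AC).
    if has('iconv')
      let utf16 = "\x20\xac"
      echo strlen(iconv(utf16, 'utf-16', 'utf-8'))
    endif

When a conversion is not supported, iconv() simply returns an empty string.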

                                          > Bidirectionality is extremely difficult.

                                          Moreover, there are scripts which use much more hairy paradigms.

                                          > If [MS] would have used 32 bits from the start people would have
                                          > complained about a waste of memory, that would have stopped a lot of
                                          > people from using it.

                                          Microsoft easily imposed much worse things, and people did not stop
                                          using Microsoft. :-)

                                          > If only they would have invented UTF-8 back then...

I'm no specialist, but the first time I heard of UTF-8 was long ago, in
the AT&T Plan9 project, where it was called UTF-FSS at the time. There
were a few conceptual flaws that were straightened out in an appendix of
an early draft of ISO 10646, and if I remember correctly, it was only
later that UTF-8 made its way into Unicode. But I may remember wrongly...

                                          > As I understand it, the decision mostly frowned upon is the
                                          > unification of Asian characters. [...] Don't know if this will prevent
                                          > the use of Unicode in Asian countries [...]

Some of these countries are now divided on the issue. Some people
predict that Unicode will prevail in the long run, guessing that
Microsoft will stick behind it long enough. It would be interesting to
get statistics on whether Big5, JIS and all the others are effectively
fading, or not at all! :-)

                                          > It's the nature of the languages that make it complicated, not Unicode.

                                          I quite understand what you mean. The problem was already complex,
                                          Unicode merely adds its own set of solutions.

                                          > It certainly has nothing to do with colonialism [...]

                                          I worked a few times in African contexts, where many countries do not
                                          have their own standardised character sets. Many westerners come with
                                          their solutions, many of which are neither free nor simple. Unicode
                                          might not be the best route towards developing technical autonomy.

                                          --
                                          François Pinard http://www.iro.umontreal.ca/~pinard
                                        • Matthew Winn
                                          Message 20 of 24 , Feb 5, 2004
                                            On Wed, Feb 04, 2004 at 05:59:05PM +0100, Antoine J. Mechelynck wrote:
                                            > In modern computer parlance it's also called a diaeresis, though more
> precisely the diaeresis is the phonological fact of not running two vowels
> together, while the trema is only the typographical sign, also used with a
> different phonological value as the German umlaut (also in other languages
                                            > like Swedish, Hungarian or Turkish).
                                            >
                                            > Anyway, when I said "accents" I actually meant "diacritical signs",
                                            > including not only the accents in touché, omertà and (I think) Bâton Rouge,
                                            > Louisiana, but also the cedilla in garçon and the tilde in mañana, so the
                                            > trema in naïve, Zoë (if that's the right spelling) and coöperate was meant
                                            > to be included.

                                            Whatever you call it, the reason I mentioned it was to disprove your
                                            claim that 7-bit suffices for English because

                                            > there are no accented letters in English

                                            As far as I'm aware the use of a diaeresis to mark a vowel as having
                                            a separate pronunciation is a standard part of English, not a foreign
import. Other Germanic languages do the same thing. Non-ASCII
                                            characters aren't common in English but they do exist.

                                            And then there's the issue of the pound sign, which certainly can't be
                                            represented in 7-bit. Years ago the workaround was to use an alternate
                                            character set which used £ to replace # but it always caused problems,
                                            and even today I occasionally run into US software which is happy to
                                            accept pound signs but displays them as hashes, or vice versa.
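The pound-sign point is easy to check from inside Vim itself. A minimal sketch, assuming 'encoding' is utf-8 or latin1 so that char2nr() reports the character's code value:

    " '#' has code 35 and fits in 7 bits; '£' has code 163 (0xA3) and does
    " not, which is why the old UK variant of ISO 646 reused the '#' slot.
    echo char2nr('#')
    echo char2nr('£')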

                                            --
                                            Matthew Winn (matthew@...)
                                          • Tobias C. Rittweiler
                                            Message 21 of 24 , Feb 5, 2004
                                              On Wednesday, February 4, 2004 at 10:04:08 PM,
                                              François Pinard <pinard@...> wrote:

                                              > [Antoine J. Mechelynck]

                                              > > [...] the diaeresis is the phonological fact of not running two vowels
                                              > > together, while the trema is only the typographical sign [...]
                                              >
                                              > I thought that "trema" was not an English word, so I used "diaeresis"
                                              > instead very systematically.

Mhh, yes it seems like English misses trema and uses diaeresis instead
with the same semantics. Even though diaeresis, as I learned it in Latin &
                                              old Greek, is *actually* the grammatical phenomenon while trema is the
                                              typographical sign---as stated by Antoine above as well.


                                              > Should I read that "diaresis" has the meaning of French "diphtongue"?

God, no! It's quite the opposite (n-aï-ve vs t-oy). :-)


                                              -- tcr (tcr@...) ``Ho chresim'eidos uch ho poll'eidos sophos''
                                            • Antoine J. Mechelynck
                                              Message 22 of 24 , Feb 5, 2004
                                                ----- Original Message -----
                                                From: "François Pinard" <pinard@...>
                                                To: "Antoine J. Mechelynck" <antoine.mechelynck@...>
                                                Cc: "Tobias C. Rittweiler" <tcr@...>; "Matthew Winn"
                                                <matthew@...>; <vim@...>
                                                Sent: Wednesday, February 04, 2004 10:04 PM
                                                Subject: Re: character sets


                                                > [Antoine J. Mechelynck]
                                                >
                                                > > [...] the diaeresis is the phonological fact of not running two vowels
                                                > > together, while the trema is only the typographical sign [...]
                                                >
                                                > I thought that "trema" was not an English word, so I used "diaeresis"
> instead very systematically. Should I read that "diaeresis" has the
                                                > meaning of French "diphtongue"? Interesting!
                                                >
                                                > --
                                                > François Pinard http://www.iro.umontreal.ca/~pinard
                                                >

                                                Not the French "diphtongue" (English "diphtong") but French "diérèse"

                                                You made me doubt, so I checked "trema" in my New Oxford's -- and they don't
                                                have it. Here's their "diaeresis" article:

diaeresis (US dieresis) |> noun (pl. diaereses) *1* a mark placed over a
vowel to indicate that it is sounded separately, as in /naïve, Brontë./ [in
French: tréma -- ed. note]
* [mass noun] the division of a sound into two syllables, especially by
sounding a diphthong as two vowels [in French: approx. diérèse -- ed. note]
*2* _Prosody_ a natural rhythmic break in a line of verse where the end of a
metrical foot coincides with the end of a phrase.
                                                -- ORIGIN late 16th cent. (denoting the division of one syllable into two):
                                                via Latin from Greek /diairesis/ 'separation', from /diairein/ 'take apart',
                                                from /dia/ 'apart' + /hairein/ 'take'.

                                                Regards,
                                                Tony.
                                              • Alejandro Lopez-Valencia
                                                Message 23 of 24 , Feb 5, 2004
                                                  Tobias C. Rittweiler scribbled on Thursday, February 05, 2004 9:42 AM:

                                                  >
> Mhh, yes it seems like English misses trema and uses diaeresis instead
> with the same semantics. Even though diaeresis, as I learned it in Latin &
                                                  > old Greek, is *actually* the grammatical phenomenon while trema is the
                                                  > typographical sign---as stated by Antoine above as well.

                                                  Sorry to butt in late in this discussion, but I just couldn't let my gall
                                                  bladder strangle no more. As per my Cassel's German-English, my
                                                  Langenscheidts German-Spanish dictionaries, my sanguine German language
                                                  teacher from Salzburg (yup, German passport) and my uniquely eccentric
                                                  Great-Uncle from Stuttgart, Trema means diaeresis, no more, no less. ;-)

                                                  BTW, German is, among Germanic languages, the one that resisted Latinization
the most during the Middle Ages and therefore took in most Latin words late,
during the Renaissance, Baroque and generally during the Enlightenment and
                                                  later, as part of the flowering of the intellectual culture whose
                                                  revolution can be appreciated from Martin Luther through Goethe to Heidegger
                                                  and Wittgenstein. Thus, most Latin words are used almost as in the original
                                                  language. Trema is for the Latin "tremorus"[1]: to be brief or to quiver. As
                                                  such, it is equivalent to diaeresis in a twisted sort of sense: to make a
diphthong shorter by breathing less. (Many Tremata are no more under the
                                                  "Neue Regelung", if I understand the Duden correctly).

                                                  On the other hand, English had an earlier desaxonification (defleaing? :-)
                                                  at the hands of Willy the Conqueror and his gang from Normandy, who never
                                                  spoke anything but the Normandy version of the "Langue d'Oil", presently
                                                  known as French. No, he didn't drink canola, perhaps olive?

                                                  And having veered off from off-topic to what my compatriot, the Nobel
laureate, calls "Macondo" and literary critics "magic realism", I'll shut up
                                                  now.

                                                  Cheers,

                                                  Alejo

                                                  [1] Do forgive the spelling. To say that my Latin is rusty is an
                                                  understatement.
                                                • Antoine J. Mechelynck
                                                  Message 24 of 24 , Feb 5, 2004
                                                    Alejandro Lopez-Valencia <dradul@...> wrote:
                                                    > Tobias C. Rittweiler scribbled on Thursday, February 05, 2004 9:42 AM:
                                                    >
                                                    > >
> > Mhh, yes it seems like English misses trema and uses diaeresis
> > instead with the same semantics. Even though diaeresis, as I learned it
                                                    > > in Latin & old Greek, is *actually* the grammatical phenomenon
                                                    > > while trema is the typographical sign---as stated by Antoine above
                                                    > > as well.
                                                    >
                                                    > Sorry to butt in late in this discussion, but I just couldn't let my
                                                    > gall bladder strangle no more. As per my Cassel's German-English, my
                                                    > Langenscheidts German-Spanish dictionaries, my sanguine German
                                                    > language teacher from Salzburg (yup, German passport) and my uniquely
                                                    > eccentric Great-Uncle from Stuttgart, Trema means diaeresis, no more,
                                                    > no less. ;-)
                                                    >
                                                    > BTW, German is, among Germanic languages, the one that resisted
> Latinization the most during the Middle Ages and therefore took in
> most Latin words late, during the Renaissance, Baroque and generally
> during the Enlightenment and later, as part of the flowering of the
                                                    > intellectual culture whose revolution can be appreciated from Martin
                                                    > Luther through Goethe to Heidegger and Wittgenstein. Thus, most Latin
                                                    > words are used almost as in the original language. Trema is for the
                                                    > Latin "tremorus"[1]: to be brief or to quiver. As such, it is
                                                    > equivalent to diaeresis in a twisted sort of sense: to make a
> diphthong shorter by breathing less. (Many Tremata are no more under
                                                    > the "Neue Regelung", if I understand the Duden correctly).
                                                    >
                                                    > On the other hand, English had an earlier desaxonification
                                                    > (defleaing? :-) at the hands of Willy the Conqueror and his gang from
                                                    > Normandy, who never spoke anything but the Normandy version of the
                                                    > "Langue d'Oil", presently known as French. No, he didn't drink
                                                    > canola, perhaps olive?
                                                    >
                                                    > And having veered off from off-topic to what my compatriot, the Nobel
> laureate, calls "Macondo" and literary critics "magic realism", I'll
                                                    > shut up now.
                                                    >
                                                    > Cheers,
                                                    >
                                                    > Alejo
                                                    >
                                                    > [1] Do forgive the spelling. To say that my Latin is rusty is an
                                                    > understatement.

                                                    According to my "Petit Robert", tréma comes not from Latin tremor (quiver)
                                                    but from Greek trêma (hole, markings on dice) (which explains why you use
                                                    "tremata" as its plural, a typical Greek form). As for the diaeresis being
                                                    used, as one earlier poster wrote, "in various Germanic languages" to mark a
                                                    vowel that must be pronounced separately, AFAIK the only Germanic languages
                                                    using it that way are English and Dutch. In German and in some Scandinavian
                                                    languages (Swedish, at least), as well as in some non-Indo-European
                                                    languages like e.g. Finnish, Hungarian and Turkish, the same sign is used to
                                                    mean that a vowel's sound must change: the Germans call it Umlaut (literal
                                                    meaning IIUC: "by-sound").

                                                    I've read somewhere (I don't know who said it, but his mother language was
                                                    English) that the English language is one of the products of the Norman
men-at-arms' efforts to make dates with Saxon farmers' daughters, and no
                                                    more legitimate than the other offspring of those same efforts.

                                                    As for German, even now it still hesitates between borrowing and
                                                    translating: see e.g. Telefon vs. Fernsprecher, Grammatik vs. Sprachlehre,
                                                    etc.

                                                    As I wrote in an earlier post, F. tréma means E. diaeresis, but E. diaeresis
                                                    can mean either F. tréma (Typogr.), or F. diérèse (Phon.).

                                                    Regards,
                                                    Tony.