unicode conversions

  • seer26@telocity.com
    Message 1 of 6 , Aug 16, 2002
      thanks for the response bram,

      > > I was thinking _VIM_TEXT could be a 1-byte motion-type,
      > > then utf-8 text. There is no backwards compatibility to
      > > break, because there is no standard atm, afaik.
      >
      > Conversion between 'encoding' and utf-8 will not always be possible.
      > It's not backwards compatible either, since an older Vim expects no
      > conversion (assuming you run two versions of Vim for some reason).

      Hrm, I was under the impression that converting from
      non-unicode to unicode was always possible. And the reverse is possible
      to the extent that the desired characters are representable in the
      destination encoding: if they are not, then pasting what's left is
      still valid; more so than just dumping in the text arbitrarily (as
      basically binary).

      Unfortunately, while experimenting with my system iconv, it appears
      to instead stop when there is no destination encoding for a character,
      rather than allowing a fallback to a default character. This can still
      work: you just get the first portion of the string which is valid in the
      destination encoding, and none after that, which is at least as valid
      as ramming whatever's there in as binary imo.

      Other converters are configurable in this sense: you could set them up
      to convert between two encodings and to use a "?" for example when
      there is no destination representation.
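
A minimal sketch of the two behaviours described above, using Python's built-in codecs rather than iconv (an assumption made purely for illustration; the function names are hypothetical). `convert_prefix` keeps only the part of the string before the first unrepresentable character, while `convert_with_fallback` substitutes "?" and carries on.

```python
# Sample text: e-acute exists in ISO-8859-1, Greek capital eta (U+0397) does not.
text = "caf\u00e9 \u0397 end"


def convert_prefix(s, encoding):
    """Stop at the first unconvertible character and keep the valid prefix."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that failed.
        return s[:exc.start].encode(encoding)


def convert_with_fallback(s, encoding):
    """Replace every unconvertible character with '?' and keep going."""
    return s.encode(encoding, errors="replace")


prefix = convert_prefix(text, "iso-8859-1")
fallback = convert_with_fallback(text, "iso-8859-1")
print(prefix)
print(fallback)
```

With the prefix approach everything after the eta is lost; with the fallback approach only the eta itself is lost.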

      > This would require a new atom to be used for the clipboard, which
      > includes the name of the encoding. The receiver of the clipboard can
      > then attempt conversion. This isn't very difficult.

      I was thinking that such machinery would be redundant, in that Unicode
      was supposed to be canonical...

      Does anyone else agree/disagree?
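
The two clipboard layouts under discussion in this message could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the motion-type value, and the exact byte layout (in particular, the NUL separator after the encoding name) are hypothetical and not taken from Vim's source.

```python
MOTION_CHARWISE = 0  # hypothetical placeholder value for a motion type


def pack_utf8(motion_type, text):
    """Layout 1: one motion-type byte followed by the text as UTF-8."""
    return bytes([motion_type]) + text.encode("utf-8")


def unpack_utf8(payload):
    return payload[0], payload[1:].decode("utf-8")


def pack_tagged(motion_type, text, encoding):
    """Layout 2: motion-type byte, encoding name, NUL, then text in that encoding."""
    return bytes([motion_type]) + encoding.encode("ascii") + b"\0" + text.encode(encoding)


def unpack_tagged(payload):
    motion_type = payload[0]
    # Split at the first NUL: the encoding name comes before it, the raw text after.
    name, _, raw = payload[1:].partition(b"\0")
    # The receiver attempts the conversion using the announced encoding name.
    return motion_type, raw.decode(name.decode("ascii"))


sample = "\u0397\u03bb\u03b9\u03bf\u03c2"  # Greek text
roundtrip_utf8 = unpack_utf8(pack_utf8(MOTION_CHARWISE, sample))
roundtrip_tagged = unpack_tagged(pack_tagged(MOTION_CHARWISE, sample, "iso-8859-7"))
```

In the first layout the sender must be able to convert to UTF-8; in the second the sender passes its bytes through untouched and the receiver carries the conversion burden.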
    • Glenn Maynard
      Message 2 of 6 , Aug 16, 2002
        On Fri, Aug 16, 2002 at 12:56:35PM -0400, seer26@... wrote:
        > Hrm, I was under the impression that converting from
        > non-unicode to unicode was always possible. And the reverse is possible
        > to the extent that the desired characters are representable in the
        > destination encoding: if they are not, then pasting what's left is
        > still valid; more so than just dumping in the text arbitrarily (as
        > basically binary).

        Not all platforms Vim runs on have iconv.

        Of course, it's fine to use it as much as needed when it's available,
        but you can't depend on it.

        > Unfortunately, while experimenting with my system iconv, it appears
        > to instead stop when there is no destination encoding for a character,
        > rather than allowing a fallback to a default character. This can still
        > work: you just get the first portion of the string which is valid in the
        > destination encoding, and none after that, which is at least as valid
        > as ramming whatever's there in as binary imo.

        You can call iconv per-character if you want; I don't think that's
        *too* slow.
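
As a rough sketch of the per-character idea, here is an analogue using Python's codecs instead of iconv (an assumption for illustration; `convert_per_char` is a hypothetical name). Each character is converted on its own, so one failure never affects its neighbours.

```python
def convert_per_char(s, encoding, substitute=b"?"):
    """Convert one character at a time, substituting on each failure."""
    out = bytearray()
    for ch in s:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # This character has no representation in the target encoding.
            out += substitute
    return bytes(out)


per_char = convert_per_char("a\u0397b\u0397", "iso-8859-1")
```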

        You can probably also call iconv on a block of data, and if it returns
        errno==EILSEQ, skip a character in the input, drop a '?' in the output
        and keep going.
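
The block-wise loop described above can be sketched as follows. This is a Python analogue, not iconv itself: `UnicodeEncodeError` stands in for the `EILSEQ` case, and `convert_block` is a hypothetical name.

```python
def convert_block(s, encoding, substitute=b"?"):
    """Convert in blocks; on failure skip one input character, emit '?', resume."""
    out = bytearray()
    pos = 0
    while pos < len(s):
        try:
            # Try to convert everything that is left in one call.
            out += s[pos:].encode(encoding)
            break
        except UnicodeEncodeError as exc:
            # exc.start is relative to the slice s[pos:].
            bad = pos + exc.start
            out += s[pos:bad].encode(encoding)  # the part that did convert
            out += substitute                   # stand-in for the bad character
            pos = bad + 1                       # skip exactly one input character
    return bytes(out)


block = convert_block("caf\u00e9 \u0397 end \u0397", "iso-8859-1")
```

Text with few unrepresentable characters needs only a handful of conversion calls this way, instead of one call per character.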

        > I was thinking that such machinery would be redundant, in that Unicode
        > was supposed to be canonical...
        >
        > Does anyone else agree/disagree?

        Sure, it's canonical. It's not always available.

        (Well, it's available everywhere I care about, but it's Bram's program,
        not mine. Luckily for people running on obsolete systems. :)

        --
        Glenn Maynard
      • Tony Mechelynck
        Message 3 of 6 , Aug 16, 2002
          I suppose you guys don't mind but sometimes Unicode-to-other conversion will
          give ambiguous, non-uniform or dubious results such as Greek eta becoming
          Latin y (while in many cases Latin e would be more appropriate) or
          circumflex accents stripped from Esperanto consonants instead of being
          replaced by postfixed h as the standard mandates. I guess manual
          intervention at runtime would be unfeasible anyway... Maybe 15 years from
          now gvim will have its own iconv (or whatever) and not rely on
          sometimes-broken software coming with disparate OSs...
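
To make the Esperanto point concrete, the sketch below contrasts naive accent stripping with the postfixed-h convention mentioned above. The mapping table and function names are assumptions for illustration; in particular, mapping u-breve to a plain "u" follows the usual h-system practice but is not stated in this thread.

```python
import unicodedata

# Hypothetical h-system table: circumflexed consonants gain a postfixed h.
H_SYSTEM = str.maketrans({
    "\u0109": "ch", "\u011d": "gh", "\u0125": "hh", "\u0135": "jh", "\u015d": "sh",
    "\u0108": "Ch", "\u011c": "Gh", "\u0124": "Hh", "\u0134": "Jh", "\u015c": "Sh",
    "\u016d": "u", "\u016c": "U",  # assumption: u-breve becomes plain u
})


def strip_accents(s):
    """Naive fallback: decompose and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def to_h_system(s):
    """Replace Esperanto accented letters using the postfixed-h convention."""
    return s.translate(H_SYSTEM)


word = "\u0109irka\u016d \u015dan\u011do"
stripped = strip_accents(word)
h_form = to_h_system(word)
```

Stripping loses the distinction between the accented and plain consonants, while the h-form keeps it in plain ASCII.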

          Just a thought
          Tony.

        • Glenn Maynard
          Message 4 of 6 , Aug 16, 2002
            On Sat, Aug 17, 2002 at 12:49:25AM +0200, Tony Mechelynck wrote:
            > I suppose you guys don't mind but sometimes Unicode-to-other conversion will
            > give ambiguous, non-uniform or dubious results such as Greek eta becoming
            > Latin y (while in many cases Latin e would be more appropriate) or

            Converting from Unicode to ISO-8859-7 converts Greek eta (U+0397) to
            Greek eta (0xC7).

            Converting from Unicode to ISO-8859-1 does the same thing to eta as
            converting from ISO-8859-7 to ISO-8859-1 does: it fails, since the
            character isn't available:

            07:12 PM glenn@.../2 [~] iconv -t ISO-8859-7 | iconv -f ISO-8859-7 -t ISO-8859-1//TRANSLIT
            Η
            ?

            If iconv provided an inaccurate translit for UTF8->8859-1, it'd probably
            be inaccurate for 8859-7, too.

            So, if this is a problem, I don't see how it's any worse with Unicode.
            Could you expound a bit?
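
The same comparison can be reproduced with Python's built-in codecs, as a rough sketch (this swaps iconv for Python's converters, so it illustrates the argument rather than iconv's behaviour). Eta converts cleanly to ISO-8859-7, and the fallback to ISO-8859-1 gives the same "?" whether the text starts out as Unicode or as ISO-8859-7.

```python
eta = "\u0397"  # GREEK CAPITAL LETTER ETA

# Unicode to ISO-8859-7: the character exists there, at byte 0xC7.
greek_bytes = eta.encode("iso-8859-7")

# Unicode to ISO-8859-1: the character does not exist, so strict conversion fails.
try:
    eta.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# With a '?' fallback, both starting points end up with the same result.
direct = eta.encode("iso-8859-1", errors="replace")
via_greek = greek_bytes.decode("iso-8859-7").encode("iso-8859-1", errors="replace")
```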

            --
            Glenn Maynard
          • Tony Mechelynck
            Message 5 of 6 , Aug 16, 2002
              Well, maybe not iconv then. (Guess I'm a little out of my depth in this ml.)
              Or is it some browsers trying to "improve" on non-available from-unicode
              conversions? I know Lynx converts Unicode eta to cp437 y and I think
              Konqueror does the same when its default display is iso-8859-1 but I'm less
              sure of that. Yesterday I sent a Unicode mail with a sentence in Esperanto
              and got it back quoted in some non-unicode reply mail from the US with the
              circumflexes stripped but the letters otherwise OK. Just haphazard symptoms,
              as you see. Maybe I shoulda kept silent.

              Tony.

            • Glenn Maynard
              Message 6 of 6 , Aug 16, 2002
                On Sat, Aug 17, 2002 at 01:25:26AM +0200, Tony Mechelynck wrote:
                > Well, maybe not iconv then. (Guess I'm a little out of my depth in this ml.)
                > Or is it some browsers trying to "improve" on non-available from-unicode
                > conversions? I know Lynx converts Unicode eta to cp437 y and I think

                Lynx has its own transliteration. It's poor to the point of uselessness,
                in my experience. (I've only tried it with Japanese hiragana; that's an
                easy translit, though it does need some context.)

                > Konqueror does the same when its default display is iso-8859-1 but I'm less
                > sure of that. Yesterday I sent a Unicode mail with a sentence in Esperanto
                > and got it back quoted in some non-unicode reply mail from the US with the
                > circumflexes stripped but the letters otherwise OK. Just haphazard symptoms,

                Lots of mailers break encodings in quoting; it's the mailer's fault.
                It happens with all encodings.

                > as you see. Maybe I shoulda kept silent.

                Chime in all you want; it's a public list. :) There's just enough fear
                of incompatibility on this list that I'm trying to keep any new false
                ones from forming ...

                --
                Glenn Maynard