RE: [jasspa] UTF-8

  • Sabin, Bruno
    Message 1 of 7 , Dec 5, 2005
      Add me on the waiting list for this one too ;)

      --
      bruno

      _________________________________________________________________
      Bruno Sabin Parametric Technology Corp.

      "Memory is like gasoline. You use it up when you are running. Of
      course you get it all back when you reboot...";
      -- Actual explanation obtained from the Micro$oft help desk.



      -----Original Message-----
      From: jasspa@yahoogroups.com [mailto:jasspa@yahoogroups.com] On Behalf Of Christof Boeckler
      Sent: Monday, December 05, 2005 10:41 AM
      To: Jasspa ML
      Subject: [jasspa] UTF-8

      Hi,

      are there any plans to support the UTF-8 encoding? (Not in NE, but at least in ME.)

      Gruß / Regards
      Christof

      --
      http://home.in.tum.de/~boeckler/ http://www.spiegel.de/zwiebelfisch
      Claiming to be wise, they became fools. Rom 1:22



      __________________________________________________________________________

      This is an unmoderated list. JASSPA is not responsible for the content of
      any material posted to this list.

      To unsubscribe, send a mail message to

      mailto:jasspa-unsubscribe@yahoogroups.com

      or visit http://groups.yahoo.com/group/jasspa and
      modify your account settings manually.



      Yahoo! Groups Links
    • Christof Boeckler
      Message 2 of 7 , Dec 5, 2005
        Christof Boeckler wrote:
        > are there any plans to support the UTF-8 encoding? (Not in NE, but at
        > least in ME.)

        Maybe it is of some interest: QEmacs
        http://fabrice.bellard.free.fr/qemacs/
        seems to have "Full UTF8 support" and it is LGPL. So maybe you can get
        some inspiration on how to do it.

        Gruß / Regards
        Christof

        --
        http://home.in.tum.de/~boeckler/ http://www.spiegel.de/zwiebelfisch
        Claiming to be wise, they became fools. Rom 1:22
      • Phillips, Steven
        Message 3 of 7 , Dec 5, 2005
          This is really difficult to support well. Ideally ME would need to go wide char (i.e. use a short rather than a byte to store a single character) to support UTF-8 in any meaningful way, and this would have a huge impact on the ME kernel. Using a short instead of a byte as a character means that for a ' ' (a space char) ME would internally store '\x00\x20'. The initial 0 is a killer: any string manipulation/comparison functionality would require fixing. This is too much work for me.
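
To illustrate the embedded-zero problem described above, here is a minimal sketch in Python (used purely for illustration; it is not ME code). It assumes the 16-bit units are stored big-endian, which is what gives '\x00\x20' for a space. The helper `c_strlen` is hypothetical and only mimics how a NUL-terminated C string routine measures a buffer.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C strlen: count bytes up to (not including) the first NUL."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

# A space stored as one 16-bit big-endian unit.
wide_space = " ".encode("utf-16-be")

# "ME" stored the same way: every ASCII character gains a leading zero byte.
wide_text = "ME".encode("utf-16-be")

print(wide_space)           # b'\x00 '
print(c_strlen(wide_text))  # 0: a byte-oriented routine stops at the first byte
```

Any routine that treats a zero byte as end-of-string sees an empty string here, which is why every string function would need reworking.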

          Alternatively ME could support multi-byte code pages. This solves the '\x00' problem but still has a major impact. For example, what does \l in regex (lower case letter) mean in the context of UTF-8? This type of grouping affects many parts of the ME kernel (e.g. hilighting, spelling, regex, word-based cursor movement). Also, as doing a strlen on a UTF-8 string may not return the number of 'characters', many assumptions in ME start falling over. Again, I think this is far too much work.
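
A small Python sketch of the two issues raised above, byte length versus character count and what "lower case letter" means once text is UTF-8. The sample word and the variable names are chosen only for illustration; the ASCII-only class `[a-z]` stands in for a byte-oriented character class and is not ME's actual regex implementation.

```python
import re

text = "Gruß"                      # 4 characters
utf8 = text.encode("utf-8")        # 'ß' becomes the two bytes 0xC3 0x9F

byte_count = len(utf8)             # what a strlen-style count sees
char_count = len(text)             # what the user sees

# A byte-oriented, ASCII-only "lower case" class misses 'ß' entirely.
ascii_lower = re.findall(rb"[a-z]", utf8)
# Unicode-aware classification works on decoded characters instead.
unicode_lower = [ch for ch in text if ch.islower()]

print(byte_count, char_count)      # 5 4
print(ascii_lower)                 # [b'r', b'u']
print(unicode_lower)               # ['r', 'u', 'ß']
```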

          So, the dirty solution: conversion of UTF-8 (and potentially Unicode) to the system's current single-byte code page. This is feasible, but the big issue here is how to handle errors. If my current code page does not support a particular character from the file, what happens? I'm guessing that, at the very least, the user must be made aware that information has been lost. But is this good enough? Or should ME play it safe and abort the load?
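
The two error policies being weighed here (warn about loss, or abort the load) can be sketched in Python as follows. This is only an illustration under assumptions: the function names `load_as_codepage` and `_fits` are hypothetical, Latin-1 is picked arbitrarily as the "current code page", and unmappable characters are replaced by '?'.

```python
def _fits(ch: str, codepage: str) -> bool:
    """True if the character has an equivalent in the target code page."""
    try:
        ch.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

def load_as_codepage(raw: bytes, codepage: str = "latin-1", strict: bool = False):
    """Decode UTF-8 and re-encode to a single-byte code page.

    Returns (converted_bytes, lost), where lost is the number of characters
    that had no equivalent in the code page. With strict=True the load is
    aborted with UnicodeEncodeError instead of losing information.
    """
    text = raw.decode("utf-8")
    if strict:
        return text.encode(codepage), 0
    converted = text.encode(codepage, errors="replace")  # unmappable -> b'?'
    lost = sum(1 for ch in text if not _fits(ch, codepage))
    return converted, lost

print(load_as_codepage("Gruß".encode("utf-8")))    # (b'Gru\xdf', 0)
print(load_as_codepage("日本語".encode("utf-8")))  # (b'???', 3)
```

With `strict=True` the second call raises `UnicodeEncodeError`, which corresponds to the "play it safe and abort" option; the lossy variant returns a count the editor could use to warn the user.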

          Given this, would this approach be of any use to anyone? With the restriction of having to convert to a single-byte code page, languages like Chinese or Japanese simply could not be supported. Could this be of any use?

          Steve

        • Thomas Hundt
          Message 4 of 7 , Dec 5, 2005
            I'm curious, what are people editing that they need UTF-8 for?
            I haven't encountered it yet. (Maybe I'm just behind the times?)

            -Th

            >> are there any plans to support the UTF-8 encoding? (Not in NE, but at
            >> least in ME.)


            --
            Thomas Hundt <tom@...> +1-415-867-6698
          • Sabin, Bruno
            Message 5 of 7 , Dec 5, 2005
              >> I'm curious, what are people editing that they need UTF-8 for?
              >> I haven't encountered it yet.  (Maybe I'm just behind the times?)

              UTF-8, Unicode, multibyte chars ... they're all over the place.
              I have to use GNU Emacs when dealing with multibyte localized header files (Software I18N jobby), InstallShield ASCII exports, or simply regedit exported .reg files (when not exported as win9x/NT4 format), MSI generated log files, etc ...  
              Do we have any Asian language users in ME yet? ;) Coz' that could be a show stopper.
              Bruno
              --
              _________________________________________________________________
              Bruno Sabin                           Parametric Technology Corp.
              
                  To err lies in the nature of humans,
                      but to really fool things up you need a computer
              
              
               


            • Thomas Hundt
              Message 6 of 7 , Dec 10, 2005
                I think simply using a short instead of a char will NOT do the trick.
                It would really need up to four bytes per char.
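
As a quick check of the "up to four bytes" point, this Python sketch lists a few sample characters (picked only for illustration) with their code point, their UTF-8 length, and whether the code point would fit in a 16-bit short.

```python
# U+0041, U+00DF, U+20AC, U+1D11E (a character outside the 16-bit range)
samples = ["A", "ß", "€", "\U0001D11E"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
fits_in_short = [ord(ch) <= 0xFFFF for ch in samples]

for ch, n, fits in zip(samples, utf8_lengths, fits_in_short):
    print(f"U+{ord(ch):04X} utf8_bytes={n} fits_in_16_bits={fits}")
```

The last sample needs four UTF-8 bytes and its code point does not fit in 16 bits, so a fixed-width short per character cannot represent it directly.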

                http://en.wikipedia.org/wiki/UTF-8

                Pretty interesting design. I think you're wrong about the embedded
                zeroes -- in UTF-16 you'd have that problem, but not in UTF-8. (Cf. the
                comparison chart.) On the other hand, you'd still need to rewrite the
                string routines, to handle the 1-4 byte length. A NUL is a valid char
                (0x00), but you could just say you don't support it (who puts those into
                a file, anyway) :-) The rest of the bytes are guaranteed non-zero.
                Even better, one character's code is guaranteed not to appear inside
                another's, making searches and comparison easy.
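
A small Python sketch of these properties, using an arbitrary sample string. The helpers `is_continuation` and `char_starts` are hypothetical names; they rely on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def char_starts(buf: bytes) -> list:
    """Byte offsets at which a character begins."""
    return [i for i, b in enumerate(buf) if not is_continuation(b)]

text = "Gruß für Rö"
buf = text.encode("utf-8")

has_zero = 0 in buf                     # no zero bytes unless the text contains NUL
hit = buf.find("ü".encode("utf-8"))     # plain byte search for one character

print(has_zero)                             # False
print(hit in char_starts(buf))              # True: the match begins on a character boundary
print(len(char_starts(buf)) == len(text))   # True: one start byte per character
```

A plain byte search therefore cannot land in the middle of another character, although counting or stepping over characters still means skipping continuation bytes rather than advancing one byte at a time.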

                -Th

