UTF-8
  • Christof Boeckler
    Message 1 of 7 , Dec 5, 2005
      Hi,

      are there any plans to support the UTF-8 encoding? (Not in NE, but at
      least in ME.)

      Gruß / Regards
      Christof

      --
      http://home.in.tum.de/~boeckler/ http://www.spiegel.de/zwiebelfisch
      Da sie sich für weise hielten, sind sie zu Narren geworden. Rö 1,22
    • Sabin, Bruno
      Message 2 of 7 , Dec 5, 2005
        Add me on the waiting list for this one too ;)

        --
        bruno

        _________________________________________________________________
        Bruno Sabin Parametric Technology Corp.

        "Memory is like gasoline. You use it up when you are running. Of
        course you get it all back when you reboot...";
        -- Actual explanation obtained from the Micro$oft help desk.



        __________________________________________________________________________

        This is an unmoderated list. JASSPA is not responsible for the content of
        any material posted to this list.

        To unsubscribe, send a mail message to

        mailto:jasspa-unsubscribe@yahoogroups.com

        or visit http://groups.yahoo.com/group/jasspa and
        modify your account settings manually.
      • Christof Boeckler
        Message 3 of 7 , Dec 5, 2005
          Christof Boeckler schrieb:
          > are there any plans to support the UTF-8 encoding? (Not in NE, but at
          > least in ME.)

          Maybe it is of some interest: QEmacs
          http://fabrice.bellard.free.fr/qemacs/
          seems to have "Full UTF8 support" and it is LGPL. So maybe you can get
          some inspiration on how to do it.

          Gruß / Regards
          Christof

          --
          http://home.in.tum.de/~boeckler/ http://www.spiegel.de/zwiebelfisch
          Da sie sich für weise hielten, sind sie zu Narren geworden. Rö 1,22
        • Phillips, Steven
          Message 4 of 7 , Dec 5, 2005
            This is really difficult to support well. Ideally ME would need to go wide char (i.e. use a short rather than a byte to store a single character) to support UTF-8 in any meaningful way, and this would have a huge impact on the ME kernel. Using a short instead of a byte as a character means that for a ' ' (a space char) ME would internally store '\x00\x20'. The initial 0 is a killer: any string manipulation/comparison functionality would require fixing. This is too much work for me.
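To make the embedded-zero problem concrete, here is a minimal sketch in C. It assumes the layout described above (one 16-bit unit per character, high byte first); the function name is hypothetical and is not taken from the ME sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Widen an ASCII string so that every character occupies two bytes,
 * high byte first (a space becomes 0x00 0x20), then return the length
 * that a byte-oriented strlen() reports for the widened buffer.
 * `out` must hold at least 2 * strlen(ascii) + 2 bytes. */
size_t visible_length_after_widening(const char *ascii,
                                     unsigned char *out, size_t out_size)
{
    size_t n = strlen(ascii);
    assert(out_size >= 2 * n + 2);

    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = 0x00;                    /* high byte of the unit */
        out[2 * i + 1] = (unsigned char)ascii[i]; /* low byte of the unit  */
    }
    out[2 * n] = 0x00;                            /* 16-bit terminator */
    out[2 * n + 1] = 0x00;

    /* strlen() stops at the first zero byte, which is the high byte
     * of the very first character. */
    return strlen((const char *)out);
}
```

For any non-empty ASCII input the function returns 0, because the first byte of the widened buffer is already a zero. Every routine built on NUL-terminated byte strings would misbehave in the same way.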

            Alternatively ME could support multi-byte code pages. This solves the '\x00' problem but still has a major impact. For example, what does \l in a regex (lower case letter) mean in the context of UTF-8? This type of grouping affects many parts of the ME kernel (e.g. hilighting, spelling, regex, word-based cursor movement). Also, as doing a strlen on a UTF-8 string may not return the number of 'characters', many assumptions in ME start falling over. Again, I think this is far too much work.
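The strlen point can be illustrated with a short C sketch. It assumes the input is valid UTF-8 and counts code points by skipping continuation bytes (those of the form 10xxxxxx); the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the code points in a NUL-terminated string that is assumed
 * to be valid UTF-8. Every code point contributes exactly one byte
 * that is NOT a continuation byte (10xxxxxx), so counting those
 * bytes gives the number of code points. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if ((*p & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "Gruß" encoded as UTF-8 ("Gru\xc3\x9f"), strlen() returns 5 while utf8_count() returns 4. Note that this counts code points, which still may not match what a user perceives as characters (for example with combining accents).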

            So the dirty solution: conversion of UTF-8 (and potentially other Unicode encodings) to the system's current single-byte code page. This is feasible, but the big issue here is how to handle errors. If my current code page does not support a particular character from the file, what happens? I'm guessing that the least that should happen is that the user is made aware that information has been lost. But is this good enough? Or should it play it safe and abort the load?

            Given this, would this approach be of any use to anyone? With the restriction of having to convert to a single-byte code page, languages like Chinese or Japanese simply could not be supported. Could this be of any use?

            Steve

          • Thomas Hundt
            Message 5 of 7 , Dec 5, 2005
              I'm curious, what are people editing that they need UTF-8 for?
              I haven't encountered it yet. (Maybe I'm just behind the times?)

              -Th

              >> are there any plans to support the UTF-8 encoding? (Not in NE, but at
              >> least in ME.)


              --
              Thomas Hundt <tom@...> +1-415-867-6698
            • Sabin, Bruno
              Message 6 of 7 , Dec 5, 2005
                >> I'm curious, what are people editing that they need UTF-8 for?
                >> I haven't encountered it yet.  (Maybe I'm just behind the times?)

                UTF-8, Unicode, multibyte chars ... they're all over the place.
                I have to use GNU Emacs when dealing with multibyte localized header files (software I18N jobby), InstallShield ASCII exports, or simply regedit-exported .reg files (when not exported in win9x/NT4 format), MSI-generated log files, etc ...
                Do we have any Asian language users in ME yet? ;) 'Cause that could be a show stopper.
                Bruno
                --
                _________________________________________________________________
                Bruno Sabin                           Parametric Technology Corp.
                
                    To err lies in the nature of humans,
                        but to really fool things up you need a computer
                
                
                 


              • Thomas Hundt
                Message 7 of 7 , Dec 10, 2005
                  I think simply using a short instead of a char will NOT do the trick.
                  It would really need up to four bytes per char.

                  http://en.wikipedia.org/wiki/UTF-8

                  Pretty interesting design. I think you're wrong about the embedded
                  zeroes -- in UTF-16 you'd have that problem, but not in UTF-8. (Cf. the
                  comparison chart.) On the other hand, you'd still need to rewrite the
                  string routines, to handle the 1-4 byte length. A NUL is a valid char
                  (0x00), but you could just say you don't support it (who puts those into
                  a file, anyway) :-) The rest of the bytes are guaranteed non-zero.
                  Even better, one character's code is guaranteed not to appear inside
                  another's, making searches and comparison easy.
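As a small illustration of the search property, here is a C sketch with a hypothetical helper name. It assumes both strings are valid, complete UTF-8; under that assumption the plain byte-oriented strstr() can only match on character boundaries, so no UTF-8-aware search routine is needed for exact matching.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the byte offset of the first occurrence of `needle` in
 * `haystack`, or -1 if it does not occur. Both arguments are assumed
 * to be valid UTF-8. Lead bytes and continuation bytes use disjoint
 * value ranges, so a byte-wise match cannot start in the middle of
 * another character's encoding. */
long utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);

    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1L;
}
```

Searching "Gruß / Regards" for "ß" yields byte offset 3. The result is a byte offset rather than a character index, so code that needs columns would still have to count characters separately.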

                  -Th

