RE: [jasspa] UTF-8

  • Sabin, Bruno
    Message 1 of 7 , Dec 5, 2005
      >> I'm curious, what are people editing that they need UTF-8 for?
      >> I haven't encountered it yet.  (Maybe I'm just behind the times?)

      UTF-8, Unicode, multibyte chars ... they're all over the place.
      I have to use GNU Emacs when dealing with multibyte localized header files (Software I18N jobby), InstallShield ASCII exports, or simply regedit-exported .reg files (when not exported in win9x/NT4 format), MSI-generated log files, etc.
      Do we have any Asian language users in ME yet? ;) 'Cause that could be a show stopper.
      Bruno
      --
      _________________________________________________________________
      Bruno Sabin                           Parametric Technology Corp.
      
          To err lies in the nature of humans,
              but to really fool things up you need a computer
      
      
       


      From: jasspa@yahoogroups.com [mailto:jasspa@yahoogroups.com] On Behalf Of Thomas Hundt
      Sent: Monday, December 05, 2005 4:39 PM
      To: jasspa@yahoogroups.com
      Subject: Re: [jasspa] UTF-8

      I'm curious, what are people editing that they need UTF-8 for?
      I haven't encountered it yet.  (Maybe I'm just behind the times?)

      -Th

      >> are there any plans to support the UTF-8 encoding? (Not in NE, but at
      >> least in ME.)


      --
      Thomas Hundt <tom@...> +1-415-867-6698
    • Thomas Hundt
      Message 2 of 7 , Dec 10, 2005
        I think simply using a short instead of a char will NOT do the trick.
        It would really need up to four bytes per char.
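A quick sketch of the size argument, in Python purely for illustration (ME itself is written in C). The sample characters and the helper names `utf8_len` and `fits_in_16_bits` are assumptions made up for this example, not anything from ME. It prints how many bytes each character takes in UTF-8 and whether its code point would fit in a 16-bit short.

```python
# Sample characters are arbitrary illustrations, not taken from the thread:
# U+0041 (A), U+00DF (sharp s), U+20AC (euro sign), U+1D11E (musical G clef).
samples = ["A", "\u00df", "\u20ac", "\U0001d11e"]

def utf8_len(ch: str) -> int:
    """Number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))

def fits_in_16_bits(ch: str) -> bool:
    """True if the code point fits in a single unsigned 16-bit unit."""
    return ord(ch) <= 0xFFFF

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s) in UTF-8, "
          f"fits in a short: {fits_in_16_bits(ch)}")
```

The last sample is the interesting one: its code point is above 0xFFFF, so a 16-bit unit cannot hold it, and its UTF-8 form is four bytes long.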

        http://en.wikipedia.org/wiki/UTF-8

        Pretty interesting design. I think you're wrong about the embedded
        zeroes -- in UTF-16 you'd have that problem, but not in UTF-8. (Cf. the
        comparison chart.) On the other hand, you'd still need to rewrite the
        string routines, to handle the 1-4 byte length. A NUL is a valid char
        (0x00), but you could just say you don't support it (who puts those into
        a file, anyway) :-) The rest of the bytes are guaranteed non-zero.
        Even better, one character's code is guaranteed not to appear inside
        another's, making searches and comparison easy.
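The points above can be checked with a short Python sketch (again only an illustration, since ME is C code). The helper names `utf8_has_zero_byte` and `find_via_bytes` and the sample string are hypothetical. It shows the zero byte in the UTF-16 form of a space, the absence of zero bytes in UTF-8 for anything other than NUL, the gap between byte length and character count, and that a plain byte-level search lands on a character boundary.

```python
def utf8_has_zero_byte(code_point: int) -> bool:
    """True if the UTF-8 encoding of this code point contains a 0x00 byte."""
    return 0 in chr(code_point).encode("utf-8")

def find_via_bytes(haystack: str, needle: str) -> int:
    """Search on the raw UTF-8 bytes, then map the byte offset back to a
    character index. Returns -1 if the needle is absent."""
    h = haystack.encode("utf-8")
    pos = h.find(needle.encode("utf-8"))
    if pos < 0:
        return -1
    # A match always starts on a character boundary, so this decode succeeds.
    return len(h[:pos].decode("utf-8"))

text = "Gru\u00df / Regards"      # sample string, chosen for illustration

print(" ".encode("utf-16-be"))    # UTF-16: the space carries a zero byte
print(" ".encode("utf-8"))        # UTF-8: one byte, 0x20, no zero byte
# Character count vs. byte count (the latter is what a C strlen would report).
print(len(text), len(text.encode("utf-8")))
# Byte-level search agrees with character-level search.
print(find_via_bytes(text, "Regards"), text.find("Regards"))
```

The byte count exceeds the character count by one here because the sharp s takes two bytes, which is exactly the kind of mismatch that string routines assuming one byte per character would trip over.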

        -Th


        Phillips, Steven wrote:
        > This is really difficult to support well, ideally ME would need to go
        > wide char (i.e. use a short rather than a byte to store a single
        > character) to support utf8 in any meaningful way - this would have a
        > huge impact on the ME kernel. Using a short instead of a byte as a
        > character means that for a ' ' (a space char) internally ME would
        > store '\x00\x20'; the initial 0 is a killer, any string
        > manipulation/comparison functionality would require fixing - this is
        > too much work for me.
        >
        > Alternatively ME could support multi-byte code pages, this solves the
        > '\x00' problem but still has a major impact. For example what does \l
        > in regex (lower case letter) mean in the context of utf8? This type
        > of grouping affects many parts of the ME kernel (e.g. hilighting,
        > spelling, regex, word based cursor movement). Also, as doing a strlen
        > on a utf8 string may not return the number of 'characters', many
        > assumptions in ME start falling over - again I think this is far too
        > much work.
        >
        > So the dirty solution - conversion of utf8 (and potentially unicode)
        > to the system's current single byte code page. This is feasible but
        > the big issue here is how do you handle errors? If my current code
        > page does not support a particular character from the file what
        > happens? I'm guessing that the least that should happen is the user
        > must be made aware that information has been lost. But is this good
        > enough? Or should it play it safe and abort the load?
        >
        > Given this, would this approach be of any use to anyone? With a
        > restriction of it having to convert to a single byte code-page,
        > languages like Chinese or Japanese simply could not be supported.
        > Could this be of any use?
        >
        > Steve
        >
        >> -----Original Message-----
        >> From: jasspa@yahoogroups.com [mailto:jasspa@yahoogroups.com] On Behalf Of Christof Boeckler
        >> Sent: Monday, December 05, 2005 10:41 AM
        >> To: Jasspa ML
        >> Subject: [jasspa] UTF-8
        >>
        >> Hi,
        >>
        >> are there any plans to support the UTF-8 encoding? (Not in NE, but
        >> at least in ME.)
        >>
        >> Gruß / Regards Christof
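
The "dirty solution" in the quoted message (convert UTF-8 to the current single-byte code page, and decide what to do about characters that do not fit) could be sketched roughly as follows. This is an illustration in Python under stated assumptions, not how ME does or would do it: the function name `convert_to_code_page`, the `abort_on_loss` flag, the `?` substitute, and the choice of Latin-1 as the code page are all hypothetical.

```python
from typing import List, Tuple

def convert_to_code_page(utf8_data: bytes,
                         code_page: str = "latin-1",
                         abort_on_loss: bool = False) -> Tuple[bytes, List[str]]:
    """Convert UTF-8 bytes to a single-byte code page.

    Returns the converted bytes plus the list of characters that could not
    be represented, so the caller can warn the user. With abort_on_loss=True
    the conversion fails instead of silently dropping information.
    """
    text = utf8_data.decode("utf-8")
    out = bytearray()
    lost: List[str] = []
    for ch in text:
        try:
            out += ch.encode(code_page)
        except UnicodeEncodeError:
            lost.append(ch)
            out += b"?"          # hypothetical placeholder for a lost character
    if lost and abort_on_loss:
        raise ValueError(
            f"{len(lost)} character(s) not representable in {code_page}")
    return bytes(out), lost

# Sample input: a Latin-1 friendly word followed by two Japanese characters.
data = "Gru\u00df \u65e5\u672c".encode("utf-8")
converted, lost = convert_to_code_page(data)
print(converted, len(lost))
```

Returning the list of lost characters covers the "user must be made aware" option, while `abort_on_loss=True` is the "play it safe and abort the load" option. Neither helps with text that is mostly outside the code page, which is the Chinese/Japanese limitation raised in the message.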