
Re: [jasspa] UTF-8

  • Thomas Hundt
    Message 1 of 7 , Dec 5, 2005
      I'm curious, what are people editing that they need UTF-8 for?
      I haven't encountered it yet. (Maybe I'm just behind the times?)

      -Th

      >> are there any plans to support the UTF-8 encoding? (Not in NE, but at
      >> least in ME.)


      --
      Thomas Hundt <tom@...> +1-415-867-6698
    • Sabin, Bruno
      Message 2 of 7 , Dec 5, 2005
        >> I'm curious, what are people editing that they need UTF-8 for?
        >> I haven't encountered it yet.  (Maybe I'm just behind the times?)

        UTF-8, Unicode, multibyte chars ... they're all over the place.
        I have to use GNU Emacs when dealing with multibyte localized header files (Software I18N jobby), InstallShield ASCII exports, or simply regedit exported .reg files (when not exported as win9x/NT4 format), MSI generated log files, etc ...  
        Do we have any Asian language users in ME yet? ;) Coz' that could be a show stopper.
        Bruno
        --
        _________________________________________________________________
        Bruno Sabin                           Parametric Technology Corp.
        
            To err lies in the nature of humans,
                but to really fool things up you need a computer
        
        
         


      • Thomas Hundt
        Message 3 of 7 , Dec 10, 2005
          I think simply using a short instead of a char will NOT do the trick.
          It would really need up to four bytes per char.

          http://en.wikipedia.org/wiki/UTF-8

          Pretty interesting design. I think you're wrong about the embedded
          zeroes -- in UTF-16 you'd have that problem, but not in UTF-8. (Cf. the
          comparison chart.) On the other hand, you'd still need to rewrite the
          string routines, to handle the 1-4 byte length. A NUL is a valid char
          (0x00), but you could just say you don't support it (who puts those into
          a file, anyway) :-) The rest of the bytes are guaranteed non-zero.
          Even better, one character's code is guaranteed not to appear inside
          another's, making searches and comparison easy.
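
The properties described above can be checked with a small sketch. Python is used here purely for illustration, and the helper names (`utf8_lengths`, `has_zero_byte`, `is_continuation`, `lead_and_tail_ok`) are hypothetical, not anything from ME. It assumes only the standard `utf-8` and `utf-16-le` codecs.

```python
# Sample characters that need 1, 2, 3 and 4 bytes in UTF-8:
# "A", U+00DF (sharp s), U+20AC (euro sign), U+1D11E (musical G clef).
SAMPLES = ["A", "\u00df", "\u20ac", "\U0001d11e"]


def utf8_lengths(chars):
    """Return the number of UTF-8 bytes used by each character."""
    return [len(c.encode("utf-8")) for c in chars]


def has_zero_byte(text, encoding):
    """True if encoding the text produces at least one 0x00 byte."""
    return 0 in text.encode(encoding)


def is_continuation(byte):
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def lead_and_tail_ok(ch):
    """The first byte is never a continuation byte; all later bytes are."""
    data = ch.encode("utf-8")
    return not is_continuation(data[0]) and all(
        is_continuation(b) for b in data[1:]
    )


if __name__ == "__main__":
    print(utf8_lengths(SAMPLES))
    # Ordinary text has no zero bytes in UTF-8, but does in UTF-16.
    print(has_zero_byte("Gru\u00df", "utf-8"))
    print(has_zero_byte("Gru\u00df", "utf-16-le"))
    # Lead and continuation bytes use disjoint ranges, so a plain byte
    # search cannot match in the middle of another character.
    haystack = "Gru\u00df / Regards".encode("utf-8")
    print(haystack.find("\u00df".encode("utf-8")))
```

Because lead bytes and continuation bytes never overlap, an existing byte-oriented search would still find the right offsets; what changes is that a byte offset is no longer a character offset.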

          -Th


          Phillips, Steven wrote:
          > This is really difficult to support well, ideally ME would need to go
          > wide char (i.e. use a short rather than a byte to store a single
          > character) to support utf8 in any meaningful way - this would have a
          > huge impact on the ME kernel. Using a short instead of a byte as a
          > character means that for a ' ' (a space char) internally ME would
          > store '\x00\x20'; the initial 0 is a killer, and any string
          > manipulation/comparison functionality would require fixing - this is
          > too much work for me.
          >
          > Alternatively ME could support multi-byte code pages, this solves the
          > '\x00' problem but still has a major impact. For example what does \l
          > in regex (lower case letter) mean in the context of utf8? This type
          > of grouping affects many parts of the ME kernel (e.g. hilighting,
          > spelling, regex, word based cursor movement). Also, as doing a strlen
          > on a utf8 string may not return the number of 'characters', many
          > assumptions in ME start falling over - again I think this is far too
          > much work.
          >
          > So the dirty solution - conversion of utf8 (and potentially unicode)
          > to the system's current single byte code page. This is feasible but
          > the big issue here is how do you handle errors? If my current code
          > page does not support a particular character from the file what
          > happens? I'm guessing that the least that should happen is the user
          > must be made aware that information has been lost. But is this good
          > enough? Or should it play it safe and abort the load?
          >
          > Given this, would this approach be of any use to anyone? With a
          > restriction of it having to convert to a single byte code-page
          > languages like Chinese or Japanese simply could not be supported.
          > Could this be of any use?
          >
          > Steve
          >
          >> -----Original Message----- From: jasspa@yahoogroups.com
          >> [mailto:jasspa@yahoogroups.com] On Behalf Of Christof Boeckler
          >> Sent: Monday, December 05, 2005 10:41 AM To: Jasspa ML Subject:
          >> [jasspa] UTF-8
          >>
          >> Hi,
          >>
          >> are there any plans to support the UTF-8 encoding? (Not in NE, but
          >> at least in ME.)
          >>
          >> Gruß / Regards Christof
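
Two of Steve's points can be made concrete with a short sketch: the byte length of a UTF-8 string differs from its character count, and converting to a single-byte code page either loses information or has to abort. Python is used for illustration only. The function names `count_chars` and `convert`, the `abort_on_loss` flag, and the choice of `latin-1` as the "current code page" are all assumptions, not a description of how ME does or would behave.

```python
def count_chars(data: bytes) -> int:
    """Count characters in UTF-8 data by skipping continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def convert(data: bytes, code_page: str = "latin-1", abort_on_loss: bool = False):
    """Convert UTF-8 bytes to a single-byte code page (assumed: latin-1).

    Returns (converted_bytes, lost_count). Characters the code page cannot
    represent become "?" and are counted, so the caller can warn the user.
    With abort_on_loss=True the conversion raises UnicodeEncodeError instead.
    """
    text = data.decode("utf-8")
    if abort_on_loss:
        # Play it safe: refuse to load rather than silently lose data.
        return text.encode(code_page), 0
    lost = 0
    for ch in text:
        try:
            ch.encode(code_page)
        except UnicodeEncodeError:
            lost += 1
    return text.encode(code_page, errors="replace"), lost


if __name__ == "__main__":
    german = "Gru\u00df".encode("utf-8")          # 4 characters, 5 bytes
    japanese = "\u65e5\u672c\u8a9e".encode("utf-8")  # 3 characters, 9 bytes
    print(len(german), count_chars(german))
    print(convert(german))
    print(convert(japanese))
```

The returned loss count is one way to make the user "aware that information has been lost"; whether to warn or abort is then a policy decision left to the caller.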