RE: [jasspa] UTF-8
- >> I'm curious, what are people editing that they need UTF-8 for?
>> I haven't encountered it yet. (Maybe I'm just behind the times?)
UTF-8, Unicode, multibyte chars ... they're all over the place.
I have to use GNU Emacs when dealing with multibyte localized header files (Software I18N jobby), InstallShield ASCII exports, or simply regedit-exported .reg files (when not exported in Win9x/NT4 format), MSI-generated log files, etc.
Do we have any Asian-language users of ME yet? ;) Coz that could be a showstopper.
Bruno
--
Bruno Sabin, Parametric Technology Corp.
To err lies in the nature of humans, but to really fool things up you need a computer.
From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of Thomas Hundt
Sent: Monday, December 05, 2005 4:39 PM
Subject: Re: [jasspa] UTF-8
I'm curious, what are people editing that they need UTF-8 for?
I haven't encountered it yet. (Maybe I'm just behind the times?)
>> are there any plans to support the UTF-8 encoding? (Not in NE, but at
>> least in ME.)
--
Thomas Hundt <tom@...> +1-415-867-6698
- I think simply using a short instead of a char will NOT do the trick.
It would really need up to four bytes per char.
Pretty interesting design. I think you're wrong about the embedded
zeroes -- in UTF-16 you'd have that problem, but not in UTF-8. (Cf. the
comparison chart.) On the other hand, you'd still need to rewrite the
string routines, to handle the 1-4 byte length. A NUL is a valid char
(0x00), but you could just say you don't support it (who puts those into
a file, anyway) :-) The rest of the bytes are guaranteed non-zero.
Even better, one character's code is guaranteed not to appear inside
another's, making searches and comparison easy.
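These properties are easy to check with a few lines of Python. This is only an illustrative sketch: the sample characters and the test string are arbitrary choices, and nothing here reflects how ME handles strings internally.

```python
# Sample characters chosen to need 1, 2, 3 and 4 bytes in UTF-8:
# U+0041 (A), U+00E9 (e acute), U+20AC (euro sign), U+1D11E (musical G clef).
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

encoded = [ch.encode("utf-8") for ch in samples]
lengths = [len(b) for b in encoded]  # one entry per sample character

# None of the encoded (non-NUL) characters contains a zero byte.
has_zero_byte = any(0 in b for b in encoded)

# Only U+0000 itself encodes to a zero byte.
nul_encoding = "\x00".encode("utf-8")

# A plain byte-wise search is safe: the bytes of one character never
# match in the middle of another character's encoding.
haystack = "prix: 10 \u20ac (\u00e9)".encode("utf-8")
needle = "\u00e9".encode("utf-8")
byte_offset = haystack.find(needle)

# The last sample lies above U+FFFF, so a 16-bit short cannot hold it.
fits_in_short = ord(samples[-1]) <= 0xFFFF
```

Here `lengths` comes out as `[1, 2, 3, 4]`, and `byte_offset` is a byte position (14), not a character position, which is exactly the kind of distinction the string routines would have to learn.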
Phillips, Steven wrote:
> This is really difficult to support well. Ideally ME would need to go
> wide char (i.e. use a short rather than a byte to store a single
> character) to support utf8 in any meaningful way; this would have a
> huge impact on the ME kernel. Using a short instead of a byte as a
> character means that for a ' ' (a space char) ME would internally
> store '\x00\x20'. The initial 0 is a killer: any string
> manipulation/comparison functionality would require fixing. This is
> too much work for me.
> Alternatively ME could support multi-byte code pages, this solves the
> '\x00' problem but still has a major impact. For example what does \l
> in regex (lower case letter) mean in the context of utf8? This type
> of grouping affects many parts of the ME kernel (e.g. hilighting,
> spelling, regex, word based cursor movement). Also, as doing a strlen
> on a utf8 string may not return the number of 'characters', many
> assumptions in ME start falling over - again I think this is far too
> much work.
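To make the strlen point concrete, here is a small Python sketch. The helper name `count_code_points` is made up for this illustration, and it assumes well-formed UTF-8 input. The byte length is what a C `strlen` would report; the character count can be recovered by skipping continuation bytes, which all have the bit pattern 10xxxxxx.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in well-formed UTF-8 (hypothetical helper)."""
    # Continuation bytes look like 10xxxxxx; every other byte starts
    # a new character, so counting the non-continuation bytes is enough.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


text = "Gru\u00df"            # "Gruß": four characters
data = text.encode("utf-8")   # the sharp s takes two bytes in UTF-8

byte_length = len(data)       # what a byte-oriented strlen would see
char_count = count_code_points(data)
```

For this string `byte_length` is 5 while `char_count` is 4, so any code that treats the two as interchangeable (column positions, cursor movement, padding) would be off by one after the ß.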
> So the dirty solution - conversion of utf8 (and potentially unicode)
> to the system's current single byte code page. This is feasible but
> the big issue here is how do you handle errors? If my current code
> page does not support a particular character from the file what
> happens? I'm guessing that the least that should happen is the user
> must be made aware that information has been lost. But is this good
> enough? Or should it play it safe and abort the load?
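The two error-handling choices raised here (warn about lost information, or abort the load) can be sketched in Python as follows. This is an assumption-laden illustration, not a proposal for ME's actual loader: the function name `to_code_page`, the `strict` flag, the `?` placeholder, and the choice of Latin-1 as the single-byte code page are all invented for the example.

```python
def to_code_page(raw: bytes, code_page: str = "latin-1", strict: bool = False):
    """Convert UTF-8 bytes to a single-byte code page (hypothetical helper).

    Returns (converted_bytes, lost_count). With strict=True the first
    unrepresentable character raises UnicodeEncodeError instead,
    which corresponds to aborting the load.
    """
    text = raw.decode("utf-8")
    if strict:
        return text.encode(code_page), 0  # raises if anything is unrepresentable

    out = bytearray()
    lost = 0
    for ch in text:
        try:
            out += ch.encode(code_page)
        except UnicodeEncodeError:
            out += b"?"   # placeholder for a character the code page lacks
            lost += 1     # counted so the user can be warned
    return bytes(out), lost


raw = "Gru\u00df \u65e5\u672c".encode("utf-8")  # "Gruß" plus two kanji
converted, lost = to_code_page(raw)
```

With Latin-1 as the target, the ß survives as the single byte 0xDF, the two kanji become `?`, and `lost` is 2. Calling `to_code_page(raw, strict=True)` raises instead, which is the "play it safe" behaviour.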
> Given this, would this approach be of any use to anyone? With the
> restriction of having to convert to a single-byte code page,
> languages like Chinese or Japanese simply could not be supported.
> Could this be of any use?
>> -----Original Message-----
>> From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of Christof Boeckler
>> Sent: Monday, December 05, 2005 10:41 AM
>> To: Jasspa ML
>> Subject: [jasspa] UTF-8
>> are there any plans to support the UTF-8 encoding? (Not in NE, but
>> at least in ME.)
>> Gruß / Regards Christof