#938 Re: Wrong characters count when using utf-8
- May 28, 2003
>Since 'strlen(expr)' returns bytes, not characters - the result is
>wrong when we use utf-8. If there is a way to count _characters_ (not
>[...] perhaps a function called "character_count()" could be implemented...

>You could iterate over the line and only count bytes which are not
>[continuation bytes], which are those between 0x80 and 0xBF inclusive.
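
For illustration only, the byte-skipping suggestion can be sketched in Python (which is not the language under discussion in this thread). The function name `count_utf8_chars` is hypothetical, and the sketch assumes the input is already valid UTF-8:

```python
def count_utf8_chars(data: bytes) -> int:
    """Count code points in UTF-8 data (hypothetical helper).

    Every code point starts with exactly one byte outside the range
    0x80..0xBF; bytes inside that range are continuation bytes, so
    skipping them counts each code point once.
    Assumption: `data` is valid UTF-8; invalid sequences are not detected.
    """
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)


text = "na\u00efve caf\u00e9"      # "naive cafe" with two accented letters
encoded = text.encode("utf-8")

print(len(encoded))               # 12 bytes
print(count_utf8_chars(encoded))  # 10 code points
```

Note that this counts code points, not what a reader would necessarily perceive as characters.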
However, this will be misleading because it will not detect composing
non-printing characters, zero width characters, invalid code points, the
obnoxious formatting hint SOFT HYPHEN (which may or may not be visible
depending upon the context), context sensitive glyphs which may change from
whole characters to composed partial characters, etc, etc, etc...
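
A small sketch of why a plain count misleads, again in Python purely for illustration. The helper name `count_without_combining` is hypothetical; it only skips code points with a non-zero canonical combining class, so it handles one of the cases listed here and none of the others:

```python
import unicodedata


def count_without_combining(text: str) -> int:
    """Count code points whose canonical combining class is 0 (hypothetical helper).

    This skips combining marks such as U+0301, but it still counts
    soft hyphens, zero width characters and other format characters.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


composed = "\u00e9"          # e with acute as one precomposed code point
decomposed = "e\u0301"       # "e" followed by COMBINING ACUTE ACCENT
soft_hyphen = "co\u00adop"   # contains SOFT HYPHEN (U+00AD)

print(len(composed), len(decomposed))        # 1 2  (same visible text)
print(count_without_combining(decomposed))   # 1
print(count_without_combining(soft_hyphen))  # 5, although U+00AD may be invisible
print(unicodedata.category("\u00ad"))        # Cf (format character)
```

Even with the combining marks skipped, the soft hyphen is still counted, which is the kind of case the list above is warning about.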
A function which supported all of that would be very complicated, would be
dependent upon a specific version of the Unicode standard, and may require
support for various different language-specific contexts (giving a different
result depending upon which language the text is considered to be in).
Anecdotally, most people say that an actual character count is not that [...]
when you consider the range of languages supported by Unicode.