Re: Wrong characters count when using utf-8

  • jmaiorana@idirect.net
    May 28, 2003
      >Since 'strlen(expr)' returns bytes, not characters - the result is
      >wrong when we use utf-8. If there is a way to count _characters_ (not
      >Thank you.

      You could iterate over the line and count only the bytes that are not
      continuation bytes, which are those between 0x80 and 0xBF inclusive.
      Perhaps a function called "character_count()" could be implemented...
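
      As a rough illustration of the byte-skipping idea, here is a minimal
      sketch in Python. The name character_count() is only the hypothetical
      name suggested above, not an existing function, and the sketch assumes
      the input is already valid UTF-8.

```python
def character_count(data: bytes) -> int:
    """Count UTF-8 code points by skipping continuation bytes.

    Hypothetical helper; assumes `data` is valid UTF-8.
    """
    # Continuation bytes lie in 0x80..0xBF; every other byte starts a code point.
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)


line = "na\u00efve caf\u00e9".encode("utf-8")
print(len(line))              # byte length: 12
print(character_count(line))  # code points: 10
```

      Note that this counts code points only, which is not necessarily what
      a reader would see as characters.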

      However, this will be misleading because it will not detect composing
      characters, non-printing characters, zero-width characters, invalid code
      points, the new and obnoxious formatting hint SOFT HYPHEN (which may or
      may not be visible depending upon the context), context-sensitive glyphs
      which may change from whole characters to composed partial characters,
      etc, etc, etc...
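
      To make a couple of these caveats concrete, the sketch below compares
      a plain code point count with a rough estimate that skips composing
      (combining) marks and format characters such as SOFT HYPHEN. The
      function name visible_estimate() is made up for this example, and the
      result depends on the Unicode tables shipped with the Python version
      in use. It does not handle invalid code points, context-sensitive
      glyphs, or language-specific rules.

```python
import unicodedata


def visible_estimate(text: str) -> int:
    """Rough count that ignores combining marks and format characters.

    Hypothetical helper for illustration only; not a full solution.
    """
    return sum(
        1
        for ch in text
        if unicodedata.combining(ch) == 0        # skip composing marks
        and unicodedata.category(ch) != "Cf"     # skip SOFT HYPHEN, zero-width format chars
    )


accented = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT
hyphened = "co\u00adop"  # contains SOFT HYPHEN

print(len(accented), visible_estimate(accented))  # 2 1
print(len(hyphened), visible_estimate(hyphened))  # 5 4
```

      Even this estimate is only a heuristic: whether the soft hyphen should
      count depends on whether it ends up being displayed.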

      A function which supported all of that would be very complicated, would be
      dependent upon a specific version of the Unicode standard, and may require
      support for various language-specific contexts (giving a different
      result depending upon which language the text is considered to be in).

      Anecdotally, most people say that an actual character count is not that
      useful when you consider the range of languages supported by Unicode.
