Wrong characters count when using utf-8

  • Valery Kondakoff
    Message 1 of 6 , May 28 6:49 AM
      Hello, Bram!

      Tuesday, May 27, 2003, you wrote to me:

      >> > As far as I understand the character count is _not_ changed when
      >> > converting, but wrong character count is displayed.

      BM> The byte count changes. Using "chars" in the file message is an old Vi
      BM> habit. Look at this as the C language "char", which is actually a byte.

      Aha. Now I understand this. Thank you for the explanation. But I still
      don't understand how I can count _characters_, not bytes, when using
      'utf-8'.

      I was using this function to count characters selected in Visual mode
      before I switched to 'utf-8':

      " If the argument is 0, newlines are not counted in blockwise visual mode
      fun! VisualCount(newlines)
        let mode = mode()
        let ret = ""
        if mode ==? 'v' || mode ==# nr2char(22)
          let opt_report = &report
          let &report = 2147483647 " 2^31 - 1
          norm "-ygv
          let &report = opt_report
          let len = strlen(@-)
          if mode ==# nr2char(22) && a:newlines == 0
            let len = len - (line("'>") - line("'<"))
          endif
        else
          let len = 0
        endif
        if len > 0
          let ret = '[' . len . ']'
        endif
        return ret
      endfun

      Since 'strlen(expr)' returns bytes, not characters, the result is
      wrong when we use utf-8. Is there a way to count _characters_ (not
      bytes)?

      Thank you.


      --
      Best regards,
      Valery Kondakoff
      http://www.nbk.orc.ru (Ne Bey Kopytom)
      http://www.nbk.orc.ru/mtb (MTB riding in Moscow)

      PGP key: mailto:pgp-public-keys@...?subject=GET%20strauss@...
    • jmaiorana@idirect.net
      Message 2 of 6 , May 28 11:43 AM
        > Since 'strlen(expr)' returns bytes, not characters, the result is
        > wrong when we use utf-8. Is there a way to count _characters_ (not
        > bytes)?
        >
        > Thank you.

        You could iterate over the line and count only the bytes that are not
        continuation bytes (continuation bytes are those between 0x80 and 0xBF
        inclusive). Perhaps a function called "character_count()" could be
        implemented...
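
The byte-skipping idea above might look roughly like this in Vim script. This is only a sketch: the name Utf8CharCount is made up for illustration (it is not a Vim built-in), and it assumes 'encoding' is utf-8 and that the string holds valid UTF-8.

```vim
" Sketch only: count UTF-8 characters by skipping continuation bytes.
" Utf8CharCount is a hypothetical name, not part of Vim.
" Assumes 'encoding' is utf-8 and a:str is valid UTF-8.
function! Utf8CharCount(str)
  let l:chars = 0
  let l:i = 0
  " strlen() returns the length in bytes
  let l:n = strlen(a:str)
  while l:i < l:n
    " strpart() indexes by byte, so this picks out exactly one byte
    let l:byte = char2nr(strpart(a:str, l:i, 1))
    " Continuation bytes are 0x80..0xBF; any other byte starts a character
    if l:byte < 0x80 || l:byte > 0xBF
      let l:chars = l:chars + 1
    endif
    let l:i = l:i + 1
  endwhile
  return l:chars
endfunction

echo Utf8CharCount("h\u00e9llo")
```

With 'encoding' set to utf-8, strlen() reports 6 for that string because the accented letter takes two bytes, while the byte-skipping count is 5. The caveats that follow still apply: composing characters, for instance, are each counted as a separate character here.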

        However, this will be misleading because it will not detect composing
        sequences, non-printing characters, zero-width characters, invalid code
        points, the new and obnoxious formatting hint SOFT HYPHEN (which may or
        may not be visible depending upon the context), context-sensitive glyphs
        which may change from whole characters to composed partial characters,
        etc, etc, etc...

        A function which supported all of that would be very complicated, would be
        dependent upon a specific version of the Unicode standard, and may require
        support for various language-specific contexts (giving a different
        result depending upon which language the text is considered to be in).

        Anecdotally, most people say that an actual character count is not that
        interesting when you consider the range of languages supported by Unicode.
      • Bram Moolenaar
        Message 3 of 6 , May 28 2:20 PM
          Valery Kondakoff wrote:

          > Tuesday, May 27, 2003, you wrote to me:
          >
          > >> > As far as I understand the character count is _not_ changed when
          > >> > converting, but wrong character count is displayed.
          >
          > BM> The byte count changes. Using "chars" in the file message is an old Vi
          > BM> habit. Look at this as the C language "char", which is actually a byte.
          >
          > Aha. Now I understand this. Thank you for the explanation. But I still
          > don't understand how I can count _characters_, not bytes, when using
          > 'utf-8'.

          Hmm, there are several methods, but I suppose the good old ":s" method
          works:

          :s/./&/g

          This ignores composing characters.

          You can also use virtcol(), although it gets confused by Tabs and wide
          characters.

          If you want a function you could use something like:

          strlen(substitute(getline("."), ".", "x", "g"))

          Not tested!

          --
          Often you're less important than your furniture. If you think about it, you
          can get fired but your furniture stays behind, gainfully employed at the
          company that didn't need _you_ anymore.
          (Scott Adams - The Dilbert principle)

          /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
          /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
          \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
          \\\ Help AIDS victims, buy at Amazon -- http://ICCF.nl/click1.html ///
        • Valery Kondakoff
          Message 4 of 6 , May 29 6:02 AM
            Hello, Bram!

            Thursday, May 29, 2003, you wrote to me:


            BM> If you want a function you could use something like:
            BM> strlen(substitute(getline("."), ".", "x", "g"))
            BM> Not tested!

            Thank you. This example does exactly what I wanted:

            strlen(substitute(getline("."), ".", "x", "g"))
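
That expression can be wrapped in a small helper so the same count is available for any string, not only the current line. A minimal sketch, assuming 'encoding' is utf-8; CharCount is a hypothetical name chosen for illustration.

```vim
" Sketch only: CharCount is a hypothetical helper, not a Vim built-in.
" Every character is replaced by a single-byte "x", so the byte length
" of the result equals the number of characters in the input.
function! CharCount(str)
  return strlen(substitute(a:str, ".", "x", "g"))
endfunction

" Characters on the current line:
echo CharCount(getline("."))
```

As with the ":s" method, a base character together with its composing characters matches "." as one unit, so composing characters are not counted separately.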

            Here is a modified function, which counts selected characters in
            Visual mode:

            " If the argument is 0, newlines are not counted in blockwise visual mode
            fun! VisualCount(newlines)
              let mode = mode()
              let ret = ""
              if mode ==? 'v'
                let opt_report = &report
                let &report = 2147483647 " 2^31 - 1
                norm "-ygv
                let &report = opt_report
                let len = strlen(substitute(@-, ".", "x", "g"))
                if mode ==# 'V' && a:newlines == 0
                  let len = len - (line("'>") - line("'<")) - 1
                elseif mode ==# 'v' && a:newlines == 0
                  let len = len - (line("'>") - line("'<"))
                endif
              else
                let len = 0
              endif
              if len > 0
                let ret = '[' . len . ']'
              endif
              return ret
            endfun


            --
            Best regards,
            Valery Kondakoff
            http://www.nbk.orc.ru (Ne Bey Kopytom)
            http://www.nbk.orc.ru/mtb (MTB riding in Moscow)

            PGP key: mailto:pgp-public-keys@...?subject=GET%20strauss@...

            np: Anthony Rother - Red Light District (Unknown) [stopped]