Re: Wrong characters count when using utf-8

  • Bram Moolenaar
    Message 1 of 6 , May 27, 2003
      Valery Kondakoff wrote:

      > Excuse me for bothering you again, but 'wrong character count when
      > using utf-8' issue is still reproducible in all the latest Gvim
      > (including 6.2f). Any comments? Do you think this issue is not
      > important?

      As far as I can see there is no problem, it works as intended.

      > Here is my initial message:
      >
      > > It seems setting encoding to utf-8 breaks Gvim character counting
      > > (Gvim 6.2c, WinXP Pro). Here is an example: I created a simple text
      > > file 'count.txt' (ff=dos). This file is 434 bytes long.
      > >
      > > When I load this file with 'encoding=utf-8' commented out in my
      > > 'vimrc' I see the following text on the command line panel: "count.txt"
      > > 1L, 434C. But if I uncomment the 'encoding=utf-8' line the output will
      > > be: "count.txt" [converted] 1L, 441C.

      434 is the byte count of the original file, 441 is the byte count after
      conversion. In UTF-8 some latin1 characters in your file take two bytes.
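
      The two-byte claim can be checked from Vim script. The lines below are
      a minimal sketch, not part of the original exchange: they assume the
      file was read as latin1 and use the byte 0xE9 as a stand-in for
      whichever non-ASCII characters count.txt really contains. Under that
      assumption, the 7-byte difference (441 - 434) would correspond to seven
      such characters.

      ```vim
      " One latin1 byte in the range 0x80-0xFF (0xE9 is an illustrative choice).
      let latin1 = "\xe9"
      " Converting between latin1 and UTF-8 works even without the +iconv feature.
      let utf8 = iconv(latin1, "latin1", "utf-8")
      " strlen() counts bytes: this echoes 1
      echo strlen(latin1)
      " The same character occupies two bytes in UTF-8: this echoes 2
      echo strlen(utf8)
      ```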

      > > As far as I understand the character count is _not_ changed when
      > > converting, but wrong character count is displayed.

      The byte count changes. Using "chars" in the file message is an old Vi
      habit. Look at this as the C language "char", which is actually a byte.

      > > And if I perform this action: '0v$' I will see the number '439' in
      > > the lower right Gvim corner. Another wrong value?

      It's 440 for me. The line break isn't counted as two characters here
      (internally a line break is just one NUL byte).

      --
      Scientists decoded the first message from an alien civilization:
      SIMPLY SEND 6 TIMES 10 TO THE 50 ATOMS OF HYDROGEN TO THE STAR
      SYSTEM AT THE TOP OF THE LIST, CROSS OFF THAT STAR SYSTEM, THEN PUT
      YOUR STAR SYSTEM AT THE BOTTOM OF THE LIST AND SEND IT TO 100 OTHER
      STAR SYSTEMS. WITHIN ONE TENTH GALACTIC ROTATION YOU WILL RECEIVE
      ENOUGH HYDROGEN TO POWER YOUR CIVILIZATION UNTIL ENTROPY REACHES ITS
      MAXIMUM! IT REALLY WORKS!

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
      \\\ Help AIDS victims, buy at Amazon -- http://ICCF.nl/click1.html ///
    • Valery Kondakoff
      Message 2 of 6 , May 28, 2003
        Hello, Bram!

        Tuesday, May 27, 2003, you wrote to me:

        >> > As far as I understand the character count is _not_ changed when
        >> > converting, but wrong character count is displayed.

        BM> The byte count changes. Using "chars" in the file message is an old Vi
        BM> habit. Look at this as the C language "char", which is actually a byte.

        Aha. Now I understand this. Thank you for the explanation. But I still
        don't understand how I can count _characters_, not bytes, when using
        'utf-8'.

        I was using this function to count characters selected in Visual mode
        before I switched to 'utf-8':

        " If the argument is 0, newlines are not counted in blockwise visual mode
        fun! VisualCount(newlines)
          let mode = mode()
          let ret = ""
          if mode ==? 'v' || mode ==# nr2char(22)
            let opt_report = &report
            let &report = 2147483647 " 2^31 - 1
            norm "-ygv
            let &report = opt_report
            let len = strlen(@-)
            if mode ==# nr2char(22) && a:newlines == 0
              let len = len - (line("'>") - line("'<"))
            endif
          else
            let len = 0
          endif
          if len > 0
            let ret = '[' . len . ']'
          endif
          return ret
        endfun

        Since 'strlen(expr)' returns bytes, not characters, the result is
        wrong when we use utf-8. Is there a way to count _characters_ (not
        bytes)?

        Thank you.


        --
        Best regards,
        Valery Kondakoff
        http://www.nbk.orc.ru (Ne Bey Kopytom)
        http://www.nbk.orc.ru/mtb (MTB riding in Moscow)

        PGP key: mailto:pgp-public-keys@...?subject=GET%20strauss@...
      • jmaiorana@idirect.net
        Message 3 of 6 , May 28, 2003
          >Since 'strlen(expr)' returns bytes, not characters, the result is
          >wrong when we use utf-8. Is there a way to count _characters_ (not
          >bytes)?
          >
          >Thank you.

          You could iterate over the line and count only the bytes which are not
          continuation bytes (continuation bytes are those between 0x80 and 0xBF
          inclusive). Perhaps a function called "character_count()" could be
          implemented...

          However, this will be misleading because it will not detect composing
          sequences, non-printing characters, zero-width characters, invalid code
          points, the new and obnoxious formatting hint SOFT HYPHEN (which may or
          may not be visible depending upon the context), context-sensitive glyphs
          which may change from whole characters to composed partial characters,
          etc, etc, etc...

          A function which supported all of that would be very complicated, would be
          dependent upon a specific version of the Unicode standard, and may require
          support for various language-specific contexts (giving a different
          result depending upon which language the text is considered to be in).

          Anecdotally, most people say that an actual character count is not that
          interesting when you consider the range of languages supported by Unicode.
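
          The byte-skipping idea from the start of this message could be
          sketched roughly as follows. This is an illustration only, assuming
          valid UTF-8 input; CharacterCount() is a hypothetical name, not a
          Vim built-in, and all the caveats listed above (composing sequences,
          zero-width characters and so on) still apply to it.

          ```vim
          " Count characters in a UTF-8 string by skipping continuation bytes.
          " CharacterCount() is a hypothetical helper, not part of Vim.
          function! CharacterCount(str)
            let total = 0
            let i = 0
            let nbytes = strlen(a:str)
            while i < nbytes
              " a:str[i] is the single byte at index i; char2nr() gives its value
              let byte = char2nr(a:str[i])
              " Bytes 0x80..0xBF continue a character; any other byte starts one
              if byte < 0x80 || byte > 0xBF
                let total = total + 1
              endif
              let i = i + 1
            endwhile
            return total
          endfunction
          ```

          For example, CharacterCount("a\xc3\xa9") returns 2, while strlen()
          on the same string returns 3.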
        • Bram Moolenaar
          Message 4 of 6 , May 28, 2003
            Valery Kondakoff wrote:

            > Tuesday, May 27, 2003, you wrote to me:
            >
            > >> > As far as I understand the character count is _not_ changed when
            > >> > converting, but wrong character count is displayed.
            >
            > BM> The byte count changes. Using "chars" in the file message is an old Vi
            > BM> habit. Look at this as the C language "char", which is actually a byte.
            >
            > Aha. Now I understand this. Thank you for the explanation. But I still
            > don't understand how I can count _characters_, not bytes, when using
            > 'utf-8'.

            Hmm, there are several methods, but I suppose the good old ":s" method
            works:

            :s/./&/g

            This ignores composing characters.

            You can also use virtcol(), although it gets confused by Tabs and wide
            characters.

            If you want a function you could use something like:

            strlen(substitute(getline("."), ".", "x", "g"))

            Not tested!
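
            As a rough sketch, the substitute() idea can also be wrapped in a
            function that takes a string instead of reading the current line.
            CharCount() is a hypothetical name chosen for illustration. It
            relies on 'encoding' being utf-8, so that "." matches one whole
            character; "." does not match a line break, so a newline in the
            string stays as it is and counts as one.

            ```vim
            " Replace every character with a one-byte "x", then count the bytes.
            " CharCount() is a hypothetical helper, not part of Vim.
            function! CharCount(str)
              return strlen(substitute(a:str, ".", "x", "g"))
            endfunction
            ```

            With 'encoding' set to utf-8, CharCount("a\xc3\xa9b") gives 3 where
            strlen() gives 4. Later Vim releases added a built-in strchars(),
            which postdates this thread.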

            --
            Often you're less important than your furniture. If you think about it, you
            can get fired but your furniture stays behind, gainfully employed at the
            company that didn't need _you_ anymore.
            (Scott Adams - The Dilbert principle)

          • Valery Kondakoff
            Message 5 of 6 , May 29, 2003
              Hello, Bram!

              Thursday, May 29, 2003, you wrote to me:


              BM> If you want a function you could use something like:
              BM> strlen(substitute(getline("."), ".", "x", "g"))
              BM> Not tested!

              Thank you. This example does exactly what I wanted:

              strlen(substitute(getline("."), ".", "x", "g"))

              Here is a modified function, which counts selected characters in
              Visual mode:

              " If the argument is 0, newlines are not counted
              " (handles characterwise and linewise Visual mode)
              fun! VisualCount(newlines)
                let mode = mode()
                let ret = ""
                if mode ==? 'v'
                  let opt_report = &report
                  let &report = 2147483647 " 2^31 - 1
                  norm "-ygv
                  let &report = opt_report
                  let len = strlen(substitute(@-, ".", "x", "g"))
                  if mode ==# 'V' && a:newlines == 0
                    let len = len - (line("'>") - line("'<")) - 1
                  elseif mode ==# 'v' && a:newlines == 0
                    let len = len - (line("'>") - line("'<"))
                  endif
                else
                  let len = 0
                endif
                if len > 0
                  let ret = '[' . len . ']'
                endif
                return ret
              endfun


              --
              Best regards,
              Valery Kondakoff
              http://www.nbk.orc.ru (Ne Bey Kopytom)
              http://www.nbk.orc.ru/mtb (MTB riding in Moscow)

              PGP key: mailto:pgp-public-keys@...?subject=GET%20strauss@...

              np: Anthony Rother - Red Light District (Unknown) [stopped]