Wrong characters count when using utf-8

  • Valery Kondakoff
    Message 1 of 6 , May 27 6:38 AM
      Hello, VIM-developers!

      Excuse me for bothering you again, but 'wrong character count when
      using utf-8' issue is still reproducible in all the latest Gvim
      (including 6.2f). Any comments? Do you think this issue is not
      important?

      Here is my initial message:

      > It seems setting encoding to utf-8 breaks Gvim character counting
      > (Gvim 6.2c, WinXP Pro). Here is an example: I created a simple text
      > file 'count.txt' (ff=dos). This file is 434 bytes long.
      >
      > When I load this file with 'encoding=utf-8' commented out in my
      > 'vimrc' I see the following text on the command line panel: "count.txt"
      > 1L, 434C. But if I uncomment the 'encoding=utf-8' line the output will
      > be: "count.txt" [converted] 1L, 441C.
      >
      > As far as I understand the character count is _not_ changed when
      > converting, but wrong character count is displayed. And if I perform
      > this action: '0v$' I will see the number '439' in the lower right Gvim
      > corner. Another wrong value?
      >
      > I'm receiving the same result if I use 'WC.vim' script from vim.org
      > ('441 bytes' - two line ending bytes are added):
      >
      > function WC() range
      > echo line2byte(a:lastline+1)-line2byte(a:firstline) . " bytes"
      > endfunction
      >
      > Can you help me understand what's wrong there?
      >
      > Thank you. You can download the file in question to test the issue
      > ( http://www.nbk.orc.ru/temp/count.txt ). This file is in latin1
      > encoding.


      Thank you.

      --
      Best regards,
      Valery Kondakoff
      http://www.nbk.orc.ru (Ne Bey Kopytom)
      http://www.nbk.orc.ru/mtb (MTB riding in Moscow)

      PGP key: mailto:pgp-public-keys@...?subject=GET%20strauss@...

      np: Anthony Rother - Dude On The Street (Unknown) [stopped]
    • Bram Moolenaar
      Message 2 of 6 , May 27 7:55 AM
        Valery Kondakoff wrote:

        > Excuse me for bothering you again, but 'wrong character count when
        > using utf-8' issue is still reproducible in all the latest Gvim
        > (including 6.2f). Any comments? Do you think this issue is not
        > important?

        As far as I can see there is no problem, it works as intended.

        > Here is my initial message:
        >
        > > It seems setting encoding to utf-8 breaks Gvim character counting
        > > (Gvim 6.2c, WinXP Pro). Here is an example: I created a simple text
        > > file 'count.txt' (ff=dos). This file is 434 bytes long.
        > >
        > > When I load this file with 'encoding=utf-8' commented out in my
        > > 'vimrc' I see the following text on the command line panel: "count.txt"
        > > 1L, 434C. But if I uncomment the 'encoding=utf-8' line the output will
        > > be: "count.txt" [converted] 1L, 441C.

        434 is the byte count of the original file, 441 is the byte count after
        conversion. In UTF-8 some latin1 characters in your file take two bytes.

        > > As far as I understand the character count is _not_ changed when
        > > converting, but wrong character count is displayed.

        The byte count changes. Using "chars" in the file message is an old Vi
        habit. Look at this as the C language "char", which is actually a byte.

        > > And if I perform this action: '0v$' I will see the number '439' in
        > > the lower right Gvim corner. Another wrong value?

        It's 440 for me. The line break isn't counted as two characters here
        (internally a line break is just one NUL byte).
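
A minimal sketch of the byte growth described above, assuming a Vim build where iconv() can convert latin1 to UTF-8. The sample string is made up for illustration and is not the contents of count.txt. Every latin1 character above 0x7F becomes two bytes in UTF-8, so the difference of 7 between 434 and 441 suggests the file holds seven such characters.

```vim
" "\xe9" is the single latin1 byte for e-acute (illustrative sample only).
let latin1 = "caf\xe9"

" Convert the byte string from latin1 to UTF-8.
let utf8 = iconv(latin1, "latin1", "utf-8")

" strlen() counts bytes, so the converted string is one byte longer.
echo strlen(latin1) . " bytes as latin1"
echo strlen(utf8) . " bytes as utf-8"
```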

        --
        Scientists decoded the first message from an alien civilization:
        SIMPLY SEND 6 TIMES 10 TO THE 50 ATOMS OF HYDROGEN TO THE STAR
        SYSTEM AT THE TOP OF THE LIST, CROSS OFF THAT STAR SYSTEM, THEN PUT
        YOUR STAR SYSTEM AT THE BOTTOM OF THE LIST AND SEND IT TO 100 OTHER
        STAR SYSTEMS. WITHIN ONE TENTH GALACTIC ROTATION YOU WILL RECEIVE
        ENOUGH HYDROGEN TO POWER YOUR CIVILIZATION UNTIL ENTROPY REACHES ITS
        MAXIMUM! IT REALLY WORKS!

        /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
        /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
        \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
        \\\ Help AIDS victims, buy at Amazon -- http://ICCF.nl/click1.html ///
      • Valery Kondakoff
        Message 3 of 6 , May 28 6:49 AM
          Hello, Bram!

          Tuesday, May 27, 2003, you wrote to me:

          >> > As far as I understand the character count is _not_ changed when
          >> > converting, but wrong character count is displayed.

          BM> The byte count changes. Using "chars" in the file message is an old Vi
          BM> habit. Look at this as the C language "char", which is actually a byte.

          Aha. Now I understand this. Thank you for the explanation. But I still
          don't understand how I can count _characters_, not bytes, when using
          'utf-8'.

          I was using this function to count characters selected in Visual mode
          before I switched to 'utf-8':

          " If the argument is 0, newlines are not counted in blockwise visual mode
          fun! VisualCount(newlines)
            let mode = mode()
            let ret = ""
            if mode ==? 'v' || mode ==# nr2char(22)
              let opt_report = &report
              let &report = 2147483647 " 2^31 - 1
              norm "-ygv
              let &report = opt_report
              let len = strlen(@-)
              if mode ==# nr2char(22) && a:newlines == 0
                let len = len - (line("'>") - line("'<"))
              endif
            else
              let len = 0
            endif
            if len > 0
              let ret = '[' . len . ']'
            endif
            return ret
          endfun

          Since 'strlen(expr)' returns bytes, not characters - the result is
          wrong when we use utf-8. Is there a way to count _characters_ (not
          bytes)?

          Thank you.


          --
          Best regards,
          Valery Kondakoff
          http://www.nbk.orc.ru (Ne Bey Kopytom)
          http://www.nbk.orc.ru/mtb (MTB riding in Moscow)

          PGP key: mailto:pgp-public-keys@...?subject=GET%20strauss@...
        • jmaiorana@idirect.net
          Message 4 of 6 , May 28 11:43 AM
            >Since 'strlen(expr)' returns bytes, not characters - the result is
            >wrong when we use utf-8. Is there a way to count _characters_ (not
            >bytes)?
            >
            >Thank you.

            You could iterate over the line and count only the bytes that are not
            continuation bytes, i.e. not in the range 0x80 to 0xBF inclusive.
            Perhaps a function called "character_count()" could be implemented...

            However, this will be misleading because it will not detect composing
            sequences, non-printing characters, zero-width characters, invalid code
            points, the new and obnoxious formatting hint SOFT HYPHEN (which may or
            may not be visible depending upon the context), context-sensitive glyphs
            which may change from whole characters to composed partial characters,
            etc, etc, etc...

            A function which supported all of that would be very complicated, would be
            dependent upon a specific version of the Unicode standard, and may require
            support for various language-specific contexts (giving a different
            result depending upon which language the text is considered to be in).

            Anecdotally, most people say that an actual character count is not that
            interesting when you consider the range of languages supported by Unicode.
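
The byte-skipping idea from this message could be sketched roughly as follows. Utf8Chars is a hypothetical name, the input is assumed to be valid UTF-8, and the loop relies on char2nr() returning the raw byte value when handed a single byte, which is an assumption about Vim's handling of incomplete sequences. It has all the limitations listed above (composing sequences, zero-width characters and so on are counted naively).

```vim
" Count UTF-8 characters by skipping continuation bytes (0x80-0xBF).
" Utf8Chars is a hypothetical helper; a:str is assumed to be valid UTF-8.
function! Utf8Chars(str)
  let n = 0
  let i = 0
  let total = strlen(a:str)
  while i < total
    " a:str[i] is the single byte at index i.
    let byte = char2nr(a:str[i])
    " Bytes 128..191 (0x80..0xBF) continue a multi-byte character.
    if byte < 128 || byte > 191
      let n = n + 1
    endif
    let i = i + 1
  endwhile
  return n
endfunction

" "na\xc3\xafve" is "naive" with a diaeresis, written as UTF-8 bytes.
echo Utf8Chars("na\xc3\xafve")
```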
          • Bram Moolenaar
            Message 5 of 6 , May 28 2:20 PM
              Valery Kondakoff wrote:

              > Tuesday, May 27, 2003, you wrote to me:
              >
              > >> > As far as I understand the character count is _not_ changed when
              > >> > converting, but wrong character count is displayed.
              >
              > BM> The byte count changes. Using "chars" in the file message is an old Vi
              > BM> habit. Look at this as the C language "char", which is actually a byte.
              >
              > Aha. Now I understand this. Thank you for the explanation. But I still
              > don't understand how I can count _characters_, not bytes, when using
              > 'utf-8'.

              Hmm, there are several methods, but I suppose the good old ":s" method
              works:

              :s/./&/g

              This ignores composing characters.

              You can also use virtcol(), although it gets confused by Tabs and wide
              characters.

              If you want a function you could use something like:

              strlen(substitute(getline("."), ".", "x", "g"))

              Not tested!
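
As a rough illustration of the substitute() approach, assuming 'encoding' is set to utf-8 so that "." matches a whole multi-byte character: replacing every character with a one-byte "x" makes strlen() report characters instead of bytes. The sample string is made up for the example.

```vim
" The pattern "." only matches whole multi-byte characters under utf-8.
set encoding=utf-8

" "na\xc3\xafve" is "naive" with a diaeresis, written as UTF-8 bytes.
let sample = "na\xc3\xafve"

" Plain strlen() counts bytes.
let bytes = strlen(sample)

" Replace each character with "x", then count the resulting bytes.
let chars = strlen(substitute(sample, ".", "x", "g"))

echo bytes . " bytes, " . chars . " characters"
```

Note that "." does not match a line break, so in text spanning several lines each newline is left in place and still counts as one byte.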

              --
              Often you're less important than your furniture. If you think about it, you
              can get fired but your furniture stays behind, gainfully employed at the
              company that didn't need _you_ anymore.
              (Scott Adams - The Dilbert principle)

              /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
              /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
              \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
              \\\ Help AIDS victims, buy at Amazon -- http://ICCF.nl/click1.html ///
            • Valery Kondakoff
              Message 6 of 6 , May 29 6:02 AM
                Hello, Bram!

                Thursday, May 29, 2003, you wrote to me:


                BM> If you want a function you could use something like:
                BM> strlen(substitute(getline("."), ".", "x", "g"))
                BM> Not tested!

                Thank you. This example does exactly what I wanted:

                strlen(substitute(getline("."), ".", "x", "g"))

                Here is a modified function, which counts selected characters in
                Visual mode:

                " If the argument is 0, newlines are not counted in blockwise visual mode
                fun! VisualCount(newlines)
                  let mode = mode()
                  let ret = ""
                  if mode ==? 'v'
                    let opt_report = &report
                    let &report = 2147483647 " 2^31 - 1
                    norm "-ygv
                    let &report = opt_report
                    let len = strlen(substitute(@-, ".", "x", "g"))
                    if mode ==# 'V' && a:newlines == 0
                      let len = len - (line("'>") - line("'<")) - 1
                    elseif mode ==# 'v' && a:newlines == 0
                      let len = len - (line("'>") - line("'<"))
                    endif
                  else
                    let len = 0
                  endif
                  if len > 0
                    let ret = '[' . len . ']'
                  endif
                  return ret
                endfun
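
For readers on a later Vim than the 6.2 builds discussed in this thread: newer releases provide a built-in strchars() function that counts characters directly. The sketch below is one possible wrapper, not something proposed in the thread. CountChars is a hypothetical name, and 'encoding' is assumed to be utf-8. The two branches can disagree on text with composing characters, since strchars() counts them by default while the "." pattern does not.

```vim
" Both branches need utf-8 to treat multi-byte sequences as one character.
set encoding=utf-8

" CountChars is a hypothetical wrapper, not part of Vim itself.
function! CountChars(text)
  if exists('*strchars')
    " Built-in character count in newer Vim versions.
    return strchars(a:text)
  endif
  " Fallback: the substitute() approach from this thread.
  return strlen(substitute(a:text, ".", "x", "g"))
endfunction

" "caf\xc3\xa9" is "cafe" with an acute accent, written as UTF-8 bytes.
echo CountChars("caf\xc3\xa9")
```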


                --
                Best regards,
                Valery Kondakoff
                http://www.nbk.orc.ru (Ne Bey Kopytom)
                http://www.nbk.orc.ru/mtb (MTB riding in Moscow)

                PGP key: mailto:pgp-public-keys@...?subject=GET%20strauss@...

                np: Anthony Rother - Red Light District (Unknown) [stopped]