Vim 6.3 bug: incorrect handling of utf-8 in files

  • vassily ragosin
    Message 1 of 10, Feb 14, 2005
      Hello all,

      Problem: If 'encoding' is set to an 8-bit encoding, it may not be possible
      to edit a utf-8 file ('fileencoding' is utf-8) if the file contains symbols
      not found in the 'encoding' charset. This makes it impossible to display and
      edit multilingual utf-8 files, such as mixed Hebrew and Russian, in such a
      way that the user can at least see the text in the language that 'encoding'
      supports.

      Solution: From iconv_open(3):

      iconv_t iconv_open (const char* tocode, const char* fromcode);
      ...
      When the string "//IGNORE" is appended to tocode, characters that cannot be
      represented in the target character set will be silently discarded.


      resourcefully yours,
      vassily

      mailto:vr[at]vrgraphics.ru
      pgp key id 0x92B4A97C
    • Antoine J. Mechelynck
      Message 2 of 10, Feb 14, 2005
        vassily ragosin wrote:
        > Hello all,
        >
        > Problem: If 'encoding' is set to an 8-bit encoding, it may not be possible
        > to edit a utf-8 file ('fileencoding' is utf-8) if the file contains symbols
        > not found in the 'encoding' charset. This makes it impossible to display and
        > edit multilingual utf-8 files, such as mixed Hebrew and Russian, in such a
        > way that the user can at least see the text in the language that 'encoding'
        > supports.
        >
        > Solution: From iconv_open(3):
        >
        > iconv_t iconv_open (const char* tocode, const char* fromcode);
        > ...
        > When the string "//IGNORE" is appended to tocode, characters that cannot be
        > represented in the target character set will be silently discarded.
        >
        >
        > resourcefully yours,
        > vassily
        >
        > mailto:vr[at]vrgraphics.ru
        > pgp key id 0x92B4A97C
        >
        This is normal. 'encoding' defines how characters are represented in
        memory. If 'encoding' is set to an 8-bit encoding, only the 256
        characters present in that particular encoding can be represented in Vim
        memory; any other character cannot be represented.

        Whatever the file you want to edit, you have to set both 'fileencoding'
        (buffer-local option defining how the file's data is represented on
        disc) and 'encoding' (global option defining how the data is represented
        in memory) to some value compatible with what the file contains.

        If you want to edit multilingual text such as Russian mixed with Hebrew,
        Greek and Arabic, then you need a character set which includes all
        Cyrillic, Hebrew, Greek and Arabic letters. The only encodings which do
        are, AFAIK, the Unicode encodings.

        If you set 'encoding' to UTF-8 (and 'termencoding' to your keyboard's
        charset and _not_ to empty) you will be able to read Unicode files for
        editing. Depending on your 'guifont', you may or may not see all
        characters, because only the characters defined in your font can be
        shown; but irrespective of whether or not they are correctly displayed,
        all Unicode codepoints can be input (using ^Vuxxxx or, if a digraph is
        defined, ^Kaa) and ga in Normal mode will show the character under the
        cursor in alpha, decimal, octal and hex.

        See my tips and scripts on vim-online:
        Working with Unicode:
        http://vim.sourceforge.net/tips/tip.php?tip_id=246
        Orderly switching to Unicode:
        http://vim.sourceforge.net/scripts/script.php?script_id=789
        Setting the font in the GUI:
        http://vim.sourceforge.net/tips/tip.php?tip_id=632

        HTH,
        Tony.
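To make the 256-character limit described above concrete, here is a minimal Python sketch. Python is used purely for illustration and says nothing about how Vim stores text internally; the helper name `representable` and the sample strings are assumptions made for this example. It checks whether a text survives a strict conversion to a given charset.

```python
def representable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(encoding)  # strict conversion: fails on the first missing character
        return True
    except UnicodeEncodeError:
        return False


russian = "Привет"
hebrew = "שלום"

# KOI8-R is an 8-bit Cyrillic charset: it holds Russian but not Hebrew.
print(representable(russian, "koi8-r"))
print(representable(hebrew, "koi8-r"))

# Latin-1 holds neither, while a Unicode encoding holds both at once.
print(representable(russian, "latin-1"))
print(representable(russian + hebrew, "utf-8"))
```

Under these assumptions only a Unicode encoding can hold the mixed text, which is the point made in the message above.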
      • Bram Moolenaar
        Message 3 of 10, Feb 15, 2005
          Vassily Ragosin wrote:

          > Problem: If 'encoding' is set to an 8-bit encoding, it may not be
          > possible to edit a utf-8 file ('fileencoding' is utf-8) if the file
          > contains symbols not found in the 'encoding' charset. This makes it
          > impossible to display and edit multilingual utf-8 files, such as
          > mixed Hebrew and Russian, in such a way that the user can at least
          > see the text in the language that 'encoding' supports.

          If you have an 8 bit encoding then it cannot represent more than 256
          characters. Thus you can only use the first 256 symbols of an utf-8
          file. That's the latin1 characters.

          > Solution: From iconv_open(3):
          >
          > iconv_t iconv_open (const char* tocode, const char* fromcode);
          > ...
          > When the string "//IGNORE" is appended to tocode, characters that cannot be
          > represented in the target character set will be silently discarded.

          What version of the iconv library is this with? It doesn't look very
          standard to me. Does it work on HPUX and Solaris? Might as well break
          conversion completely.

          --
          hundred-and-one symptoms of being an internet addict:
          98. The Alta Vista administrators ask you what sites are missing
          in their index files.

          /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
          /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
          \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
          \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///
        • Alejandro López-Valencia
          Message 4 of 10, Feb 15, 2005
            Bram Moolenaar wrote:

            >Vassily Ragosin wrote:
            >
            >
            >
            >>Solution: From iconv_open(3):
            >>
            >>iconv_t iconv_open (const char* tocode, const char* fromcode);
            >>...
            >>When the string "//IGNORE" is appended to tocode, characters that cannot be
            >>represented in the target character set will be silently discarded.
            >>
            >>
            >
            >What version of the iconv library is this with? It doesn't look very
            >standard to me. Does it work on HPUX and Solaris? Might as well break
            >conversion completely.
            >
            >
            >
            That would be GNU libiconv 1.19.x. I can't say anything about previous
            versions though. Yet, I wonder if glibc (as seen in Linux distros) and
            the APR iconv library, which AFAIK is a fork of FreeBSD's iconv library
            and basically the same, would do such trick. Solaris and HP-UX, very
            doubtful, as far as my stroll through memory lane goes (too many years
            to confess... :-). On the embedded side, I know the iconv functions
            included in RedHat's newlib don't.
          • vassily ragosin
            Message 5 of 10, Feb 15, 2005
              Hello Bram,

              > If you have an 8 bit encoding then it cannot represent more
              > than 256 characters. Thus you can only use the first 256
              > symbols of an utf-8 file. That's the latin1 characters.

              I understand that all I get in an 8-bit encoding is 256 symbols (well,
              minus controls). What IS needed, however, is to make Vim display as much
              as it can. You see, I have translated the utf-8 file mbyte.rux, and it
              simply cannot be used in an 8-bit environment unless I take out all the
              Hebrew symbols in it. The same goes for version6.rux, which uses Western
              symbols (certain glyphs in proper names) along with Russian. Not good.
              Vim should be able to render Russian without problems and use spaces,
              '?' or whatever is appropriate in place of the glyphs it can't draw.

              Should I send you the offending file?

              > > Solution: From iconv_open(3):
              > >
              > > iconv_t iconv_open (const char* tocode, const char* fromcode); ...
              > > When the string "//IGNORE" is appended to tocode, characters that
              > > cannot be represented in the target character set will be
              > > silently discarded.

              > What version of the iconv library is this with? It doesn't
              > look very standard to me. Does it work on HPUX and Solaris?
              > Might as well break conversion completely.

              According to the iconv ChangeLog, the changes that allow //IGNORE were
              introduced on 2002-01-13; I am not sure which version that was. I have no
              idea whether it works on other systems, but since Vim's code already does
              some checks for iconv(), you could first call iconv_open() with //IGNORE,
              check for EINVAL and, if that fails, retry without //IGNORE. Or there may
              be some smarter way.

              resourcefully yours,
              vassily

              mailto:vr[at]vrgraphics.ru
              pgp key id 0x92B4A97C
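The two lossy behaviours discussed in the message above (silently discarding unconvertible characters, as //IGNORE is documented to do, versus showing '?' in their place) can be illustrated with Python's built-in codec error handlers. This is only an analogy, not iconv's or Vim's code, and the sample string and variable names are made up for the example.

```python
# Mixed text: Russian, Hebrew, and one accented Western letter.
mixed = "Русский текст, עברית, café"

# Analogous to iconv's //IGNORE: characters missing from KOI8-R are dropped.
dropped = mixed.encode("koi8-r", errors="ignore")

# Alternative: every character missing from KOI8-R becomes '?'.
marked = mixed.encode("koi8-r", errors="replace")

print(dropped.decode("koi8-r"))
print(marked.decode("koi8-r"))
```

With "replace" the reader can still see where text is missing, which is closer to the "spaces, '?' or whatever is appropriate" request than silently dropping characters.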
            • vassily ragosin
              Message 6 of 10, Feb 15, 2005
                Hello Alejandro,

                > That would be GNU libiconv 1.19.x. I can't say anything about
                > previous versions though. Yet, I wonder if glibc (as seen in
                > Linux distros) and the APR iconv library, which AFAIK is a
                > fork of FreeBSD's iconv library and basically the same, would
                > do such trick.

                I use FreeBSD libiconv from ports.

                resourcefully yours,
                vassily

                mailto:vr[at]vrgraphics.ru
                pgp key id 0x92B4A97C
              • vassily ragosin
                Message 7 of 10, Feb 15, 2005
                  Hello Antoine,

                  > If you set 'encoding' to UTF-8 (and 'termencoding' to your
                  > keyboard's charset and _not_ to empty)

                  Wow! :) What is the reason for having an empty 'termencoding' as a default?

                  resourcefully yours,
                  vassily

                  mailto:vr[at]vrgraphics.ru
                  pgp key id 0x92B4A97C
                • Antoine J. Mechelynck
                  Message 8 of 10, Feb 15, 2005
                    vassily ragosin wrote:
                    > Hello Antoine,
                    >
                    >
                    >>If you set 'encoding' to UTF-8 (and 'termencoding' to your
                    >>keyboard's charset and _not_ to empty)
                    >
                    >
                    > Wow! :) What is the reason for having an empty 'termencoding' as a default?
                    >
                    > resourcefully yours,
                    > vassily
                    >
                    > mailto:vr[at]vrgraphics.ru
                    > pgp key id 0x92B4A97C

                    'termencoding' empty means use the same value as 'encoding'. At program
                    startup, before sourcing your vimrc, 'encoding' is set according to your
                    locale; IOW, its default value is not hardcoded but depends on your OS's
                    language settings. Defaulting 'termencoding' to empty is OK as long as
                    you don't modify 'encoding'. If you do, setting 'termencoding' to
                    whatever 'encoding' was originally set to is usually enough, as follows:

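                    " Remember the locale's encoding for the terminal/keyboard before
                    " switching Vim's internal encoding to UTF-8.
                    if &tenc == ""
                      let &tenc = &enc
                    endif
                    set enc=utf-8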

                    Best regards,
                    Tony.
                  • Bram Moolenaar
                    Message 9 of 10, Feb 16, 2005
                      Vassily Ragosin wrote:

                      > > If you have an 8 bit encoding then it cannot represent more
                      > > than 256 characters. Thus you can only use the first 256
                      > > symbols of an utf-8 file. That's the latin1 characters.
                      >
                      > I understand that all I get in an 8-bit encoding is 256 symbols (well,
                      > minus controls). What IS needed, however, is to make Vim display as much
                      > as it can. You see, I have translated the utf-8 file mbyte.rux, and it
                      > simply cannot be used in an 8-bit environment unless I take out all the
                      > Hebrew symbols in it. The same goes for version6.rux, which uses Western
                      > symbols (certain glyphs in proper names) along with Russian. Not good.
                      > Vim should be able to render Russian without problems and use spaces,
                      > '?' or whatever is appropriate in place of the glyphs it can't draw.

                      You simply can't expect non-latin1 characters to be displayed when you
                      set 'encoding' to latin1. Vim can display the latin1 characters, but
                      from your remark it sounds like that doesn't happen? It works much
                      better when you set 'encoding' to some Russian encoding, of course.

                      > Should I send you the offending file?

                      That could be helpful in finding out what the actual problem is. But
                      first of all I need to know your setup, such as $LANG, 'encoding',
                      'termencoding' and 'helplang'.

                      Note that the translated help was set up to be used in an utf-8
                      environment. That is the ultimate solution for all these encoding
                      problems. But we can try to support a few other environments.

                      > > > Solution: From iconv_open(3):
                      > > >
                      > > > iconv_t iconv_open (const char* tocode, const char* fromcode); ...
                      > > > When the string "//IGNORE" is appended to tocode, characters that
                      > > > cannot be represented in the target character set will be
                      > > > silently discarded.
                      >
                      > > What version of the iconv library is this with? It doesn't
                      > > look very standard to me. Does it work on HPUX and Solaris?
                      > > Might as well break conversion completely.
                      >
                      > According to the iconv ChangeLog, the changes that allow //IGNORE were
                      > introduced on 2002-01-13; I am not sure which version that was. I have no
                      > idea whether it works on other systems, but since Vim's code already does
                      > some checks for iconv(), you could first call iconv_open() with //IGNORE,
                      > check for EINVAL and, if that fails, retry without //IGNORE. Or there may
                      > be some smarter way.

                      That would help if it works. But when it doesn't work we still want to
                      show the text. Thus we need to make an alternate solution anyway. But
                      I thought this already happens: iconv_string() handles the situation
                      that a character can't be converted.

                      --
                      hundred-and-one symptoms of being an internet addict:
                      108. While reading a magazine, you look for the Zoom icon for a better
                      look at a photograph.

                      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                      /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
                      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                      \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///
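The kind of fallback mentioned above (keep converting when a character cannot be converted, rather than giving up) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not Vim's actual iconv_string() code: the function name `convert_lossy` and the '?' placeholder are hypothetical choices for the example.

```python
def convert_lossy(text: str, encoding: str, substitute: bytes = b"?") -> bytes:
    """Convert text to an 8-bit encoding, substituting what cannot be converted."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        try:
            # Try to convert everything that is left in one go.
            out += text[pos:].encode(encoding)
            break
        except UnicodeEncodeError as err:
            # err.start and err.end are offsets within text[pos:].
            # Keep the part that converted cleanly ...
            out += text[pos:pos + err.start].encode(encoding)
            # ... emit one placeholder per unconvertible character ...
            out += substitute * (err.end - err.start)
            # ... and resume right after the offending characters.
            pos += err.end
    return bytes(out)


sample = "Да שלום"
print(convert_lossy(sample, "koi8-r").decode("koi8-r"))
```

The same loop works when no lossy mode is available in the underlying converter, since it relies only on strict conversion reporting where it failed.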
                    • vassily ragosin
                      Message 10 of 10, Feb 28, 2005
                        Hello, Bram,

                        > > Should I send you the offending file?
                        >
                        > That could be helpful in finding out what the actual problem
                        > is. But first of all I need to know your setup, such as
                        > $LANG, 'encoding', 'termencoding' and 'helplang'.
                        >
                        > Note that the translated help was set up to be used in an
                        > utf-8 environment. That is the ultimate solution for all
                        > these encoding problems. But we can try to support a few
                        > other environments.

                        Well, I did send it some time ago, but haven't received any reply. Can
                        you see whether we can work around the utf-8 -> 8-bit conversion? For the
                        time being it blocks the next release of the Russian documentation, as
                        some files will be unusable in non-Unicode environments.

                        resourcefully yours,
                        vassily

                        mailto:vr[at]vrgraphics.ru
                        pgp key id 0x92B4A97C