
multibyte and 'encoding' and 'fileencoding'

  • Benji Fisher
    Message 1 of 12, Dec 6, 2002
      I have recently started using vim on RH Linux 8.0, where $LANG is
      set to en_US.UTF-8. I think this leads sometimes to a mis-match between
      the 'encoding' and 'fileencoding' options: one is utf-8 and the other
      is latin1.

      The specific problem I have is viewing the character with ASCII
      code (if that is the right term) 228 (or Hex e4). When 'encoding' is
      set to utf-8 and 'fileencoding' is set to latin1, this appears as a
      double-wide character, and the display is screwed up. As I bang on the
      h and l keys, characters move over to fill a phantom space, and <C-L>
      does not help.

      In terminal vim, both options are set to utf-8, and all is well.
      In gvim, I have added "set encoding=latin1" to my gvimrc file, and this
      seems to work, although I fear it will cause other problems. Is there a
      better solution?

      --Benji Fisher
    • Antoine J. Mechelynck
      Message 2 of 12, Dec 6, 2002
        Dear Benji,

        You may have to fiddle with the various encoding settings in vim:
        'encoding', 'fileencoding', 'fileencodings', 'termencoding', etc. See (on
        Vim-online) the last section of the Vim FAQ, and my tip
        http://vim.sourceforge.net/tip_view.php?tip_id=246 . Basically, if the file
        is recognized as UTF-8, a 0xE4 byte should never appear by itself (i.e.,
        between bytes < 128). (Codepoints above U+007F are represented in UTF-8 by a
        sequence of two or more bytes. See the Unicode site http://www.unicode.org/
        for technical info about Unicode).
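
To make the byte-level point concrete, here is a small Python sketch (Python is used only for illustration; it says nothing about how Vim itself is implemented). It shows the same character, U+00E4, as one byte in latin1 and two bytes in UTF-8, and that a lone 0xE4 between ASCII bytes fails strict UTF-8 decoding. The helper name is_valid_utf8 is made up for this example.

```python
# Illustration only: U+00E4 (small a with umlaut) as bytes in each encoding.
char = "\u00e4"

latin1_bytes = char.encode("latin-1")  # one byte: 0xE4
utf8_bytes = char.encode("utf-8")      # two bytes: 0xC3 0xA4


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0xE4 between ASCII bytes is fine as latin1 but is not valid UTF-8.
lone = b"a\xe4b"

print(latin1_bytes.hex(), utf8_bytes.hex())
print(is_valid_utf8(lone), is_valid_utf8(b"a\xc3\xa4b"))
```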

        There are basically two ways to go about this, and they are not exclusive of
        each other:

        - Set up the 'fileencodings' heuristics so that all the encodings that
        you commonly use be correctly recognised.
        - Add a modeline (:help modeline) to the file, with a 'fileencoding'
        setting in it, so that strange cases be nevertheless treated properly. For
        best results and maximum compatibility, the modeline itself should be in
        us-ascii (text characters between 0x20 and 0x7E inclusive).
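
The idea behind an ordered list of candidate encodings can be sketched in Python as follows. This is a loose analogy to 'fileencodings', not Vim's actual algorithm: try each candidate in turn and keep the first one that decodes without error. The function name guess_encoding and the default candidate list are assumptions made for this example.

```python
def guess_encoding(data: bytes, candidates=("utf-8", "latin-1")) -> str:
    """Return the first candidate encoding that decodes data without error.

    Order matters: latin-1 accepts every possible byte, so it only makes
    sense as the last entry, after the stricter encodings have been tried.
    """
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits the data")


print(guess_encoding(b"M\xe4dchen"))               # lone 0xE4: not UTF-8
print(guess_encoding("M\u00e4dchen".encode("utf-8")))
```

Plain ASCII text decodes under the first candidate, so it is reported as utf-8; that is harmless because ASCII is a subset of both encodings.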

        HTH,
        Tony.

        Benji Fisher <benji@...> wrote:
        > I have recently started using vim on RH Linux 8.0, where $LANG is
        > set to en_US.UTF-8. I think this leads sometimes to a mis-match between
        > the 'encoding' and 'fileencoding' options: one is utf-8 and the other
        > is latin1.
        >
        > The specific problem I have is viewing the character with ASCII
        > code (if that is the right term) 228 (or Hex e4). When 'encoding' is
        > set to utf-8 and 'fileencoding' is set to latin1, this appears as a
        > double-wide character, and the display is screwed up. As I bang on the
        > h and l keys, characters move over to fill a phantom space, and <C-L>
        > does not help.
        >
        > In terminal vim, both options are set to utf-8, and all is well.
        > In gvim, I have added "set encoding=latin1" to my gvimrc file, and this
        > seems to work, although I fear it will cause other problems. Is there a
        > better solution?
        >
        > --Benji Fisher
      • Benji Fisher
        Message 3 of 12, Dec 6, 2002
          Antoine J. Mechelynck wrote:
          > Dear Benji,
          >
          > You may have to fiddle with the various encoding settings in vim:
          > 'encoding', 'fileencoding', 'fileencodings', 'termencoding', etc. See (on
          > Vim-online) the last section of the Vim FAQ, and my tip
          > http://vim.sourceforge.net/tip_view.php?tip_id=246 . Basically, if the file
          > is recognized as UTF-8, a 0xE4 byte should never appear by itself (i.e.,
          > between bytes < 128). (Codepoints above U+007F are represented in UTF-8 by a
          > sequence of two or more bytes. See the Unicode site http://www.unicode.org/
          > for technical info about Unicode).

          Tony:

          Thanks for the help. I tried following the tip. If 'encoding' is
          set to utf-8 and 'fileencoding' gets set to latin1 or to one of the iso-
          settings, the problem is still there. Is this a bug? I know that the
          file was originally written in latin1. OTOH, if I leave 'fenc' empty,
          or set it to utf-8 (along with 'enc') then the character displays as
          "<e4>"--not pretty, but it works.

          > There are basically two ways to go about this, and they are not exclusive of
          > each other:
          >
          > - Set up the 'fileencodings' heuristics so that all the encodings that
          > you commonly use be correctly recognised.
          > - Add a modeline (:help modeline) to the file, with a 'fileencoding'
          > setting in it, so that strange cases be nevertheless treated properly. For
          > best results and maximum compatibility, the modeline itself should be in
          > us-ascii (text characters between 0x20 and 0x7E inclusive).
          >
          > HTH,
          > Tony.
          >
          > Benji Fisher <benji@...> wrote:
          >
          >> I have recently started using vim on RH Linux 8.0, where $LANG is
          >>set to en_US.UTF-8. I think this leads sometimes to a mis-match between
          >>the 'encoding' and 'fileencoding' options: one is utf-8 and the other
          >>is latin1.
          >>
          >> The specific problem I have is viewing the character with ASCII
          >>code (if that is the right term) 228 (or Hex e4). When 'encoding' is
          >>set to utf-8 and 'fileencoding' is set to latin1, this appears as a
          >>double-wide character, and the display is screwed up. As I bang on the
          >>h and l keys, characters move over to fill a phantom space, and <C-L>
          >>does not help.
          >>
          >> In terminal vim, both options are set to utf-8, and all is well.
          >>In gvim, I have added "set encoding=latin1" to my gvimrc file, and this
          >>seems to work, although I fear it will cause other problems. Is there a
          >>better solution?
          >>
          >>--Benji Fisher
          >
          >
          >
          >
        • Antoine J. Mechelynck
          Message 4 of 12, Dec 6, 2002
            Benji Fisher <benji@...> wrote:
            [...]
            > Tony:
            >
            > Thanks for the help. I tried following the tip. If 'encoding' is
            > set to utf-8 and 'fileencoding' gets set to latin1 or to one of the iso-
            > settings, the problem is still there. Is this a bug? I know that the
            > file was originally written in latin1. OTOH, if I leave 'fenc' empty,
            > or set it to utf-8 (along with 'enc') then the character displays as
            > "<e4>"--not pretty, but it works.
            [...]

            When did you set enc=latin1 ? If you do it after reading the file, (g)vim
            will try to translate it from one encoding to the other. OTOH, if you do it
            before reading the file, then when you open the file (g)vim will try to
            guess its encoding according to 'fileencodings' (with s) and it will alter
            'fileencoding' (without s) as a result. Damned if we do, damned if we don't.

            Since you know the file is latin1, have you tried loading it with ":edit
            filename ++enc=latin1"? (no quotes of course) -- that ought to read the file
            as latin1 with no translation (++enc overrides 'fenc' for the file in
            question). Or you might set 'encoding' to latin1 just before editing that
            file if you know that you will never need to put non-latin1 characters in
            it. (Remember that if you change 'encoding' you might, or might not, need to
            do :let &termencoding = &encoding just before the change.)
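
The effect of reading a file with an explicitly stated encoding, holding it as UTF-8 internally, and converting back on write can be mimicked in Python. This is only an analogy for what ++enc=latin1 with 'encoding' set to utf-8 is meant to achieve; the file name sample.txt and the sample text are made up for the example.

```python
import os
import tempfile

raw = b"M\xe4dchen\n"  # latin1 bytes; 0xE4 is small a with umlaut

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(raw)

    # Read: state the encoding explicitly instead of letting it be guessed.
    with open(path, encoding="latin-1", newline="") as f:
        text = f.read()

    # Stand-in for a UTF-8 internal buffer: 0xE4 becomes 0xC3 0xA4.
    internal = text.encode("utf-8")

    # Write: convert back to latin1 on the way out.
    with open(path, "w", encoding="latin-1", newline="") as f:
        f.write(text)
    with open(path, "rb") as f:
        written = f.read()

print(internal.hex(), written == raw)
```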

            Tony.
          • Benji Fisher
            Message 5 of 12, Dec 6, 2002
              Antoine J. Mechelynck wrote:
              >
              > When did you set enc=latin1 ? If you do it after reading the file, (g)vim
              > will try to translate it from one encoding to the other. OTOH, if you do it
              > before reading the file, then when you open the file (g)vim will try to
              > guess its encoding according to 'fileencodings' (with s) and it will alter
              > 'fileencoding' (without s) as a result. Damned if we do, damned if we don't.
              >
              > Since you know the file is latin1, have you tried loading it with ":edit
              > filename ++enc=latin1"? (no quotes of course) -- that ought to read the file
              > as latin1 with no translation (++enc overrides 'fenc' for the file in
              > question). Or you might set 'encoding' to latin1 just before editing that
              > file if you know that you will never need to put non-latin1 characters in
              > it. (Remember that if you change 'encoding' you might, or might not, need to
              > do :let &termencoding = &encoding just before the change.)

              I have tried setting 'enc' and 'fenc' at various times, before and
              after loading the file, and with

              :e ++enc=latin1 file

              I can get various combinations; that is not the problem. However I do
              it, the display is messed up if 'encoding' is set to utf-8 and
              'fileencoding' is set to latin1. The most satisfactory solution seems
              to be to set 'encoding' to latin1 (before or after loading the file) and
              either leave 'fenc' empty or set it to latin1.

              I still think there is a bug here. With the default values

              enc=utf-8 " set from $LANG
              fenc=
              fencs=
              tenc=

              the display of the entire line after the <e4> is messed up. Same thing
              if I set 'fenc' to latin1. If I set 'fencs' (say to utf-8 or something
              more complicated) then only a single character after the <e4> is
              affected--even though 'fenc' gets set to the same thing!

              --Benji Fisher
            • Antoine J. Mechelynck
              Message 6 of 12, Dec 8, 2002
                Benji Fisher <benji@...> wrote:
                [...]
                >
                > I have tried setting 'enc' and 'fenc' at various times, before and
                > after loading the file, and with
                >
                > :e ++enc=latin1 file
                >
                > I can get various combinations; that is not the problem. However I do
                > it, the display is messed up if 'encoding' is set to utf-8 and
                > 'fileencoding' is set to latin1. The most satisfactory solution seems
                > to be to set 'encoding' to latin1 (before or after loading the file) and
                > either leave 'fenc' empty or set it to latin1.
                >
                > I still think there is a bug here. With the default values
                >
                > enc=utf-8 " set from $LANG
                > fenc=
                > fencs=
                > tenc=
                >
                > the display of the entire line after the <e4> is messed up. Same thing
                > if I set 'fenc' to latin1. If I set 'fencs' (say to utf-8 or something
                > more complicated) then only a single character after the <e4> is
                > affected--even though 'fenc' gets set to the same thing!
                >
                > --Benji Fisher

                Well, I told you all I thought of. If we don't get Bram's attention, then
                maybe you should post your problem in vim-multibyte.

                Tony.
              • Bram Moolenaar
                Message 7 of 12, Dec 31, 2002
                  Benji Fisher wrote:

                  > I have tried setting 'enc' and 'fenc' at various times, before and
                  > after loading the file, and with
                  >
                  > :e ++enc=latin1 file
                  >
                  > I can get various combinations; that is not the problem. However I do
                  > it, the display is messed up if 'encoding' is set to utf-8 and
                  > 'fileencoding' is set to latin1. The most satisfactory solution seems
                  > to be to set 'encoding' to latin1 (before or after loading the file) and
                  > either leave 'fenc' empty or set it to latin1.
                  >
                  > I still think there is a bug here. With the default values
                  >
                  > enc=utf-8 " set from $LANG
                  > fenc=
                  > fencs=
                  > tenc=
                  >
                  > the display of the entire line after the <e4> is messed up. Same thing
                  > if I set 'fenc' to latin1. If I set 'fencs' (say to utf-8 or something
                  > more complicated) then only a single character after the <e4> is
                  > affected--even though 'fenc' gets set to the same thing!

                  A UTF-8 file should never contain a <e4> byte that isn't followed by
                  another byte in the range 0x80 - 0xbf.

                  Having 'encoding' set to "utf-8" should be just fine. Vim should be
                  able to detect if an input file is latin1 or utf-8 and convert it
                  automatically.

                  Note that setting 'fenc' only makes sense when you are about to write
                  the file, it doesn't change anything for the next edit command. Change
                  'fileencodings' for that.

                  You can force a file to be edited as latin1 with:

                  :e ++enc=latin1 filename

                  Does this still result in a messed up display? Then send me a copy of
                  the file and I'll have a look.

                  --
                  hundred-and-one symptoms of being an internet addict:
                  250. You've given up the search for the "perfect woman" and instead,
                  sit in front of the PC until you're just too tired to care.

                  /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
                  /// Creator of Vim - Vi IMproved -- http://www.vim.org \\\
                  \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
                  \\\ Lord Of The Rings helps Uganda - http://iccf-holland.org/lotr.html ///
                • Benji Fisher
                  Message 8 of 12, Dec 31, 2002
                    Bram Moolenaar wrote:
                    > Benji Fisher wrote:
                    >
                    >
                    >> I have tried setting 'enc' and 'fenc' at various times, before and
                    >>after loading the file, and with
                    >>
                    >>:e ++enc=latin1 file
                    >>
                    >>I can get various combinations; that is not the problem. However I do
                    >>it, the display is messed up if 'encoding' is set to utf-8 and
                    >>'fileencoding' is set to latin1. The most satisfactory solution seems
                    >>to be to set 'encoding' to latin1 (before or after loading the file) and
                    >>either leave 'fenc' empty or set it to latin1.
                    >>
                    >> I still think there is a bug here. With the default values
                    >>
                    >>enc=utf-8 " set from $LANG
                    >>fenc=
                    >>fencs=
                    >>tenc=
                    >>
                    >>the display of the entire line after the <e4> is messed up. Same thing
                    >>if I set 'fenc' to latin1. If I set 'fencs' (say to utf-8 or something
                    >>more complicated) then only a single character after the <e4> is
                    >>affected--even though 'fenc' gets set to the same thing!
                    >
                    >
                    > A UTF-8 file should never contain a <e4> byte that isn't followed by
                    > another byte in the range 0x80 - 0xbf.
                    >
                    > Having 'encoding' set to "utf-8" should be just fine. Vim should be
                    > able to detect if an input file is latin1 or utf-8 and convert it
                    > automatically.
                    >
                    > Note that setting 'fenc' only makes sense when you are about to write
                    > the file, it doesn't change anything for the next edit command. Change
                    > 'fileencodings' for that.
                    >
                    > You can force a file to be edited as latin1 with:
                    >
                    > :e ++enc=latin1 filename
                    >
                    > Does this still result in a messed up display? Then send me a copy of
                    > the file and I'll have a look.

                    I finally decided that the display problem is caused by a font
                    problem. I assume you will get to my note on this in due course.

                    I am surprised that a file is not supposed to contain a raw
                    "\xe4". What is to stop me from doing

                    :put="\xe4"

                    --Benji Fisher
                  • Bram Moolenaar
                    Message 9 of 12, Dec 31, 2002
                      Benji Fisher wrote:

                      > I finally decided that the display problem is caused by a font
                      > problem. I assume you will get to my note on this in due course.

                      Glad it's not a Vim problem.

                      > I am surprised that a file is not supposed to contain a raw
                      > "\xe4". What is to stop me from doing
                      >
                      > :put="\xe4"

                      You can, but the result is an invalid byte. Vim uses invalid bytes to
                      recognize that a file is NOT utf-8, thus you get yourself into trouble.

                      --
                      hundred-and-one symptoms of being an internet addict:
                      263. You have more e-mail addresses than shorts.

                      /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
                      /// Creator of Vim - Vi IMproved -- http://www.vim.org \\\
                      \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
                      \\\ Lord Of The Rings helps Uganda - http://iccf-holland.org/lotr.html ///
                    • Antoine J. Mechelynck
                      Message 10 of 12, Jan 1, 2003
                        Benji Fisher <benji@...> wrote:
                        [...]
                        > I am surprised that a file is not supposed to contain a raw
                        > "\xe4". What is to stop me from doing
                        >
                        > :put="\xe4"
                        >
                        > --Benji Fisher

                        If your file is in UTF-8, then obviously it must obey UTF-8 encoding rules;
                        and these rules say (among other things) that:

                        - Codepoints from 0000 to 007F are compatible with us-ascii and are
                        encoded as one byte, with high bit off
                        - Codepoints from 0080 upwards are encoded as a string of 2 or more
                        bytes; the first of those is 0xC0 or above, the other(s) lie in the
                        range 0x80-0xBF. The number of leading 1 bits in the first byte
                        determines the number of following bytes

                        So there is a strict separation between single-bytes (0x00-0x7F),
                        first-bytes (0xC0-0xFF, and not all values in that range are legal) and
                        following-bytes (0x80-0xBF) to avoid context ambiguity.

                        Details can be found somewhere on the Unicode site, whose entry page is at
                        http://www.unicode.org/ . And don't forget that if 'encoding' is set to
                        utf-8, then all files will be internally represented as UTF-8 while editing,
                        with translation when reading or writing non-UTF-8 files. So typing (in
                        Insert mode) Ctrl-V followed by xE4 will enter the 00E4 codepoint into
                        memory as two bytes, 0xC3 0xA4, but show it as one character, small a with
                        umlaut; and pressing x once in Normal mode with the cursor on that
                        character deletes both bytes.
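
The separation into single, first and following bytes can be checked with a few lines of Python. This is a sketch for illustration only; the helper names classify and leading_ones are made up here. It classifies each byte by range, counts the leading 1 bits of a first byte, and confirms that U+00E4 is one character stored as the two bytes 0xC3 0xA4.

```python
def classify(byte: int) -> str:
    """Classify a byte by the UTF-8 ranges described above."""
    if byte <= 0x7F:
        return "single"        # 0x00-0x7F: us-ascii, high bit off
    if byte <= 0xBF:
        return "continuation"  # 0x80-0xBF: following byte
    return "lead"              # 0xC0-0xFF: first byte (not all values legal)


def leading_ones(byte: int) -> int:
    """Count the 1 bits at the top of the byte before the first 0 bit."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


char = "\u00e4"                # small a with umlaut
encoded = char.encode("utf-8")
kinds = [classify(b) for b in encoded]

# For a first byte, the count of leading 1 bits is the total length of the
# sequence: 0xC3 starts with 110, so the sequence is two bytes long.
print(len(char), encoded.hex(), kinds, leading_ones(encoded[0]))
```

By the same rule, 0xE4 seen as a first byte starts with 1110 and so announces a three-byte sequence, which is why it cannot be followed directly by an ASCII byte in well-formed UTF-8.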

                        HTH,
                        Tony.
                      • Benji Fisher
                        Message 11 of 12, Jan 1, 2003
                          Antoine J. Mechelynck wrote:
                          > Benji Fisher <benji@...> wrote:
                          > [...]
                          >
                          >> I am surprised that a file is not supposed to contain a raw
                          >>"\xe4". What is to stop me from doing
                          >>
                          >>
                          >>:put="\xe4"
                          >>
                          >>--Benji Fisher
                          >
                          >
                          > If your file is in UTF-8, then obviously it must obey UTF-8 encoding rules;
                          > and these rules say (among other things) that:
                          >
                          > - Codepoints from 0000 to 007F are compatible with us-ascii and are
                          > encoded as one byte, with high bit off
                          > - Codepoints from 0080 upwards are encoded as a string of 2 or more
                          > bytes; the first of those is 0xC0 or above, the other(s) lie in the
                          > range 0x80-0xBF. The number of leading 1 bits in the first byte
                          > determines the number of following bytes
                          >
                          > So there is a strict separation between single-bytes (0x00-0x7F),
                          > first-bytes (0xC0-0xFF, and not all values in that range are legal) and
                          > following-bytes (0x80-0xBF) to avoid context ambiguity.
                          >
                          > Details can be found somewhere on the Unicode site, whose entry page is at
                          > http://www.unicode.org/ . And don't forget that if 'encoding' is set to
                          > utf-8, then all files will be internally represented as UTF-8 while editing,
                          > with translation when reading or writing non-UTF-8 files. So typing (in
                          > Insert mode) Ctrl-V followed by xE4 will enter the 00E4 codepoint into
                          > memory as two bytes, 0xC3 0xA4, but show it as one character, small a with
                          > umlaut; and pressing x once in Normal mode with the cursor on that
                          > character deletes both bytes.

                          Thanks for the details. That's already more than I think I need
                          to know (for now) so I am not going to follow the link.

                          Perhaps my :put command should also insert the 00E4 codepoint, the
                          same as <C-V>xE4 in Insert mode.
                          <later>
                          On another thread (multibyte in patterns) Bram suggests a new "\uab"
                          instead of "\xab". Maybe that is the way to go...

                          --Benji Fisher
                        • Antoine J. Mechelynck
                          Message 12 of 12, Jan 1, 2003
                            Benji Fisher <benji@...> wrote:
                            [...]
                            > Perhaps my :put command should also insert the 00E4 codepoint, the
                            > same as <C-V>xE4 in Insert mode.

                            That would, if done correctly, avoid putting invalid byte-sequences into
                            UTF-8 files.

                            > <later>
                            > On another thread (multibyte in patterns) Bram suggests a new "\uab"
                            > instead of "\xab". Maybe that is the way to go...

                            I saw that message from Bram, and noticed a patch that went with it. I think
                            it's a good idea; but since I lack a vim-compile facility, I shall wait
                            until it is incorporated into a (supposedly stable) binary distribution. (At
                            the moment I am using gvim 6.1.243 +win32 +ole.)

                            >
                            > --Benji Fisher

                            Tony.