Unexpected behavior loading cp1252 file as latin1

  • Ben Fritz
    Message 1 of 13, Jan 20, 2011
      I have a file which if read with the Windows-1252 encoding (cp1252 in
      Vim) has an en dash character (encoded as byte 150). When I load this
      file in a Vim with enc=latin1, and leave fenc blank, I would expect to
      see a "no character" block in place of the en dash. However, I see the
      en dash as if I loaded with enc/fenc set to cp1252.

      If I set encoding to utf-8, and load the same file with default
      fileencodings, it detects as latin1 and I see the "no character" glyph
      as expected. If I do :e ++enc=cp1252, or if I modify my fileencodings
      option to include cp1252 instead of latin1, I see the en dash, again
      as expected.

      Is this behavior intentional? It certainly could be considered
      helpful, but it was very unexpected.
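For reference, the two interpretations of byte 150 can be compared outside Vim. The following is a minimal Python sketch; Python and its codec names `cp1252` and `latin-1` are not part of the original report and are used here only as an independent check of the two code pages.

```python
import unicodedata

raw = bytes([150])  # the single byte in question (hex 0x96)

as_cp1252 = raw.decode("cp1252")   # Windows-1252 interpretation
as_latin1 = raw.decode("latin-1")  # ISO-8859-1 interpretation

# cp1252 maps the byte to a printable punctuation character.
print(hex(ord(as_cp1252)), unicodedata.name(as_cp1252))

# latin-1 maps it to a C1 control code, which has no glyph.
print(hex(ord(as_latin1)), unicodedata.category(as_latin1))
```

The first line names the en dash (U+2013); the second reports category `Cc` (control), which is why a strict latin1 display would be expected to show a placeholder block.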

      --
      You received this message from the "vim_dev" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php
    • Ben Fritz
      Message 2 of 13, Jan 28, 2011
        On Jan 20, 10:03 pm, Ben Fritz <fritzophre...@...> wrote:
        > I have a file which if read with the Windows-1252 encoding (cp1252 in
        > Vim) has an en dash character (encoded as byte 150). When I load this
        > file in a Vim with enc=latin1, and leave fenc blank, I would expect to
        > see a "no character" block in place of the en dash. However, I see the
        > en dash as if I loaded with enc/fenc set to cp1252.
        >
        > If I set encoding to utf-8, and load the same file with default
        > fileencodings, it detects as latin1 and I see the "no character" glyph
        > as expected. If I do :e ++enc=cp1252, or if I modify my fileencodings
        > option to include cp1252 instead of latin1, I see the en dash, again
        > as expected.
        >
        > Is this behavior intentional? It certainly could be considered
        > helpful, but it was very unexpected.

        So, is this expected behavior? Are there special rules when Vim's
        encoding is an 8-bit one?

      • James Vega
        Message 3 of 13, Jan 28, 2011
          On Fri, Jan 28, 2011 at 3:03 PM, Ben Fritz <fritzophrenic@...> wrote:
          > On Jan 20, 10:03 pm, Ben Fritz <fritzophre...@...> wrote:
          >> I have a file which if read with the Windows-1252 encoding (cp1252 in
          >> Vim) has an en dash character (encoded as byte 150). When I load this
          >> file in a Vim with enc=latin1, and leave fenc blank, I would expect to
          >> see a "no character" block in place of the en dash. However, I see the
          >> en dash as if I loaded with enc/fenc set to cp1252.
          >>
          >> If I set encoding to utf-8, and load the same file with default
          >> fileencodings, it detects as latin1 and I see the "no character" glyph
          >> as expected. If I do :e ++enc=cp1252, or if I modify my fileencodings
          >> option to include cp1252 instead of latin1, I see the en dash, again
          >> as expected.
          >>
          >> Is this behavior intentional? It certainly could be considered
          >> helpful, but it was very unexpected.
          >
          > So, is this expected behavior? Are there special rules when Vim's
          > encoding is an 8-bit one?

          I remember trying to reproduce your described behavior when I first saw
          your mail but wasn't able to. Could you give a minimal set of steps
          along with the output that you're seeing for ":set enc? fenc? fencs?" in
          each of the different cases?

          --
          James
          GPG Key: 1024D/61326D40 2003-09-02 James Vega <jamessan@...>

        • Benjamin Fritz
          Message 4 of 13, Jan 28, 2011
            On Fri, Jan 28, 2011 at 2:14 PM, James Vega <jamessan@...> wrote:
            >
            > I remember trying to reproduce your described behavior when I first saw
            > your mail but wasn't able to.  Could you give a minimal set of steps
            > along with the output that you're seeing for ":set enc? fenc? fencs?" in
            > each of the different cases?
            >

            Running on Windows XP with the latest "cream without Vim" build (7.3.107 Huge).

            gvim -N -u NONE -i NONE
            :set enc=cp1252
            :set enc? fenc? fencs?
            encoding=cp1252
            fileencoding=
            fileencodings=ucs-bom
            :set guifont=* (and select a font with a glyph for an en dash, in my
            case Deja Vu Sans Mono)
            i<C-K>-N<Esc>
            :saveas test.txt | q

            gvim -N -u NONE -i NONE
            :e test.txt
            :set enc? fenc? fencs?
            encoding=latin1
            fileencoding=
            fileencodings=ucs-bom
            :set guifont=* (and select a font with a glyph for an en dash, in my
            case Deja Vu Sans Mono)
            (here I see a single en dash character, even though latin1 does not
            have this character)
            :q

            gvim -N -u NONE -i NONE
            :set enc=utf-8
            :set enc? fenc? fencs?
            encoding=utf-8
            fileencoding=
            fileencodings=ucs-bom,utf-8,default,latin1
            :e test.txt
            :set guifont=* (and select a font with a glyph for an en dash, in my
            case Deja Vu Sans Mono)
            ("converted" message displayed, buffer displays a "bad char" blank
            box as I would expect)
            :set enc? fenc? fencs?
            encoding=utf-8
            fileencoding=latin1
            fileencodings=ucs-bom,utf-8,default,latin1
            :e ++enc=cp1252
            ("converted" message displayed, buffer displays an en dash, as expected)
            :set enc? fenc? fencs?
            encoding=utf-8
            fileencoding=cp1252
            fileencodings=ucs-bom,utf-8,default,latin1
            :q

          • Vlad Irnov
            Message 5 of 13, Jan 28, 2011
              On Jan 20, 11:03 pm, Ben Fritz <fritzophre...@...> wrote:
              > I have a file which if read with the Windows-1252 encoding (cp1252 in
              > Vim) has an en dash character (encoded as byte 150). When I load this
              > file in a Vim with enc=latin1, and leave fenc blank, I would expect to
              > see a "no character" block in place of the en dash. However, I see the
              > en dash as if I loaded with enc/fenc set to cp1252.
              >
              > If I set encoding to utf-8, and load the same file with default
              > fileencodings, it detects as latin1 and I see the "no character" glyph
              > as expected. If I do :e ++enc=cp1252, or if I modify my fileencodings
              > option to include cp1252 instead of latin1, I see the en dash, again
              > as expected.
              >
              > Is this behavior intentional? It certainly could be considered
              > helpful, but it was very unexpected.

              It's not just en-dash. It also happens with adjacent cp1252
              characters: fat middle dot (decimal 149), fancy quotes.

              Vim apparently uses cp1252 instead of latin-1 for &enc. My
              understanding is that the only difference between them is that
              cp1252 has characters for bytes 128-159 while latin-1 uses them
              as control characters.

              According to http://en.wikipedia.org/wiki/Windows-1252
              "It is very common to mislabel Windows-1252 text data with the charset
              label ISO-8859-1."
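As a rough cross-check of that difference, the whole range can be walked in a few lines of Python. This is only a sketch that relies on Python's strict `cp1252` and `latin-1` codecs (an assumption of convenience, not something Vim itself uses); the names `printable`, `unassigned` and `latin1_controls` are made up for the example.

```python
C1_RANGE = range(128, 160)  # bytes 128-159 (hex 80-9F)

printable = {}   # byte value -> character assigned by cp1252
unassigned = []  # byte values that cp1252 leaves undefined

for b in C1_RANGE:
    try:
        printable[b] = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        unassigned.append(b)

# Under latin-1 every byte in the range decodes to the control code
# with the same number, so none of them is printable.
latin1_controls = [ord(bytes([b]).decode("latin-1")) for b in C1_RANGE]

print(len(printable), "bytes have cp1252 characters")
print("unassigned in cp1252:", [hex(b) for b in unassigned])
```

With Python's tables, 27 of the 32 bytes get a cp1252 character and 5 are left unassigned, while latin-1 treats all 32 as control codes.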

              If you need these chars, why not use cp1252 or Unicode and forget
              about latin1?

            • Christian Brabandt
              Message 6 of 13, Jan 29, 2011
                Hi Vlad!

                On Fr, 28 Jan 2011, Vlad Irnov wrote:

                > Vim apparently uses cp1252 instead of latin-1 for &enc. My
                > understanding is
                > that the only difference between them is that cp1252 has characters
                > for bytes
                > 128-159 while latin-1 uses them as control characters.

                Yes, if I read mbyte.c correctly, Vim assumes cp1252 and uses this
                encoding for latin1 and iso8859-1.

                regards,
                Christian

              • Ben Fritz
                Message 7 of 13, Jan 31, 2011
                  On Jan 29, 6:46 am, Christian Brabandt <cbli...@...> wrote:
                  > Hi Vlad!
                  >
                  > On Fr, 28 Jan 2011, Vlad Irnov wrote:
                  >
                  > > Vim apparently uses cp1252 instead of latin-1 for &enc. My
                  > > understanding is
                  > > that the only difference between them is that cp1252 has characters
                  > > for bytes
                  > > 128-159 while latin-1 uses them as control characters.
                  >
                  > Yes, if I read mbyte.c correctly, vim assumes cp1252 and uses this
                  > encoding for latin1 and iso8859-1
                  >

                  If this is true, I think the :help should mention it. As Vlad
                  mentions, this practice is fairly common so I don't think it will be
                  surprising to anyone as long as it is documented. I wonder though, if
                  pretending to be in latin1 but really being in cp1252 might cause more
                  headaches than it solves.

                  Note that this is the case for 'encoding' but not for 'fileencoding'.
                  'fileencoding' seems to really use latin1.

                  @Vlad, yes I noticed this behavior also for em dash and another
                  character (I forget which one). The project I am working on uses
                  latin1 for code and such, and probably needs to continue doing so. The
                  only place I'm using these characters is in personal TODO or notes
                  files. I was surprised when I saw it was working when all indications
                  were that it should not.

                  I have rules in my .vimrc to use utf-8 as my encoding but latin1 for
                  the fenc of most files, unless the file contains any of the bytes
                  which are characters in cp1252 but not in latin1.
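The .vimrc rules themselves were not posted, so the following is not that code. It is only a small Python sketch of the same decision, with a hypothetical helper name `guess_fenc`, assuming the rule is simply "any byte in the range 128-159 means cp1252".

```python
def guess_fenc(data: bytes) -> str:
    """Return "cp1252" if data uses bytes 128-159, else "latin1".

    Hypothetical helper: it only looks at the byte range and does not
    try to detect UTF-8 or any other encoding.
    """
    if any(0x80 <= b <= 0x9F for b in data):
        return "cp1252"
    return "latin1"


print(guess_fenc(b"caf\xe9"))        # e-acute is in both encodings
print(guess_fenc(b"pages 1\x9610"))  # byte 150 is the cp1252 en dash
```

A check like this cannot tell a deliberate C1 control code from a cp1252 character, so it is a heuristic at best.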

                • Bram Moolenaar
                  Message 8 of 13, Feb 1, 2011
                    Ben Fritz wrote:

                    > On Jan 29, 6:46 am, Christian Brabandt <cbli...@...> wrote:
                    > > Hi Vlad!
                    > >
                    > > On Fr, 28 Jan 2011, Vlad Irnov wrote:
                    > >
                    > > > Vim apparently uses cp1252 instead of latin-1 for &enc. My
                    > > > understanding is
                    > > > that the only difference between them is that cp1252 has characters
                    > > > for bytes
                    > > > 128-159 while latin-1 uses them as control characters.
                    > >
                    > > Yes, if I read mbyte.c correctly, vim assumes cp1252 and uses this
                    > > encoding for latin1 and iso8859-1
                    > >
                    >
                    > If this is true, I think the :help should mention it. As Vlad
                    > mentions, this practice is fairly common so I don't think it will be
                    > surprising to anyone as long as it is documented. I wonder though, if
                    > pretending to be in latin1 but really being in cp1252 might cause more
                    > headaches than it solves.

                    I have added a remark in the help file. The main reason to do this is
                    that conversion between cp1252 and latin1 doesn't really work, thus
                    naming them differently would cause lots of conversion errors.

                    > Note that this is the case for 'encoding' but not for 'fileencoding'.
                    > 'fileencoding' seems to really use latin1.
                    >
                    > @Vlad, yes I noticed this behavior also for em dash and another
                    > character (I forget which one). The project I am working on uses
                    > latin1 for code and such, and probably needs to continue doing so. The
                    > only place I'm using these characters are in personal TODO or notes
                    > files. I was surprised when I saw it was working when all indications
                    > were that it should not.
                    >
                    > I have rules in my .vimrc to use utf-8 as my encoding but latin1 for
                    > the fenc of most files, unless the file contains any of the bytes
                    > which are characters in cp1252 but not in latin1.

                    --
                    If Microsoft would build a car...
                    ... The airbag system would ask "are you SURE?" before deploying.

                    /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                    /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
                    \\\ an exciting new programming language -- http://www.Zimbu.org ///
                    \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

                  • Ben Fritz
                    Message 9 of 13, Feb 1, 2011
                      On Feb 1, 9:50 am, Bram Moolenaar <B...@...> wrote:
                      > Ben Fritz wrote:
                      > > On Jan 29, 6:46 am, Christian Brabandt <cbli...@...> wrote:
                      > > > Hi Vlad!
                      >
                      > > > On Fr, 28 Jan 2011, Vlad Irnov wrote:
                      >
                      > > > > Vim apparently uses cp1252 instead of latin-1 for &enc. My
                      > > > > understanding is
                      > > > > that the only difference between them is that cp1252 has characters
                      > > > > for bytes
                      > > > > 128-159 while latin-1 uses them as control characters.
                      >
                      > > > Yes, if I read mbyte.c correctly, vim assumes cp1252 and uses this
                      > > > encoding for latin1 and iso8859-1
                      >
                      > > If this is true, I think the :help should mention it. As Vlad
                      > > mentions, this practice is fairly common so I don't think it will be
                      > > surprising to anyone as long as it is documented. I wonder though, if
                      > > pretending to be in latin1 but really being in cp1252 might cause more
                      > > headaches than it solves.
                      >
                      > I have added a remark in the help file.  The main reason to do this is
                      > that conversion between cp1252 and latin1 doesn't really work, thus
                      > naming them differently would cause lots of conversion errors.
                      >

                      Converting from cp1252 to latin1 should fail depending on the
                      characters in the file, but latin1 to cp1252 should always work,
                      shouldn't it? I understand cp1252 to be a superset of latin1. Is it
                      because the system mis-represents its encoding to Vim as latin1 when
                      really it is cp1252 or something?
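One way to probe the superset question is with a strict converter. The sketch below uses Python's codecs, which is an assumption for illustration only: Vim, iconv and Windows may each handle these cases differently, and nothing here is confirmed to be the conversion problem mentioned above.

```python
# Printable latin1 text (bytes 160-255) has the same bytes in cp1252.
high = bytes(range(160, 256))
assert high.decode("latin-1").encode("cp1252") == high

# A latin1 C1 control (here byte 150, U+0096) has no cp1252 equivalent.
control = bytes([150]).decode("latin-1")
try:
    control.encode("cp1252")
    c1_convertible = True
except UnicodeEncodeError:
    c1_convertible = False

# The reverse direction fails for any cp1252-only character.
try:
    "\u2013".encode("latin-1")  # en dash
    dash_convertible = True
except UnicodeEncodeError:
    dash_convertible = False

print(c1_convertible, dash_convertible)
```

So under a strict reading, cp1252 covers the printable part of latin1 but not its C1 control codes, and neither direction converts cleanly for bytes 128-159.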

                    • Rhialto
                      Message 10 of 13, Feb 1, 2011
                        On Tue 01 Feb 2011 at 09:30:48 -0800, Ben Fritz wrote:
                        > Converting from cp1252 to latin1 should fail depending on the
                        > characters in the file, but latin1 to cp1252 should always work,
                        > shouldn't it? I understand cp1252 to be a superset of latin1. Is it
                        > because the system mis-represents its encoding to Vim as latin1 when
                        > really it is cp1252 or something?

                        If this means that I get cp1252 characters in my file which I tried to
                        keep pure Latin 1, this is very wrong... my system doesn't display those
                        obnoxious microsoft "extensions".

                        -Olaf.
                        --
                        ___ Olaf 'Rhialto' Seibert -- There's no point being grown-up if you
                        \X/ rhialto/at/xs4all.nl -- can't be childish sometimes. -The 4th Doctor

                      • Benjamin Fritz
                        Message 11 of 13, Feb 2, 2011
                          On Tue, Feb 1, 2011 at 7:11 PM, Rhialto <rhialto@...> wrote:
                          > On Tue 01 Feb 2011 at 09:30:48 -0800, Ben Fritz wrote:
                          >> Converting from cp1252 to latin1 should fail depending on the
                          >> characters in the file, but latin1 to cp1252 should always work,
                          >> shouldn't it? I understand cp1252 to be a superset of latin1. Is it
                          >> because the system mis-represents its encoding to Vim as latin1 when
                          >> really it is cp1252 or something?
                          >
                          > If this means that I get cp1252 characters in my file which I tried to
                          > keep pure Latin 1, this is very wrong... my system doesn't display those
                          > obnoxious microsoft "extensions".
                          >

                          For now, if this bothers you, you can set your encoding to something
                          other than latin1 (like utf-8) and do a setglobal fenc=latin1. Also
                          update your fileencodings option so that latin1 actually gets
                          detected.

                          Now you will get a warning if you try to save a file and there are
                          non-latin1 characters in it.

                          I think it is a problem that with encoding=latin1, Vim acts
                          differently and you will not get a warning for non-latin1 characters.
                          But it is apparently a very common (and probably not very serious)
                          problem.

                          cp1252 is basically the same as latin1, with a few extras thrown in
                          where latin1 doesn't have anything useful. So as long as you don't
                          intentionally include any non-latin1 characters, your file will be
                          identical to one saved as a strict latin1 file.

                        • Benjamin Fritz
                          Message 12 of 13, Feb 3, 2011
                            On Wed, Feb 2, 2011 at 9:59 AM, Benjamin Fritz <fritzophrenic@...> wrote:
                            > On Tue, Feb 1, 2011 at 7:11 PM, Rhialto <rhialto@...> wrote:
                            >> On Tue 01 Feb 2011 at 09:30:48 -0800, Ben Fritz wrote:
                            >>> Converting from cp1252 to latin1 should fail depending on the
                            >>> characters in the file, but latin1 to cp1252 should always work,
                            >>> shouldn't it? I understand cp1252 to be a superset of latin1. Is it
                            >>> because the system mis-represents its encoding to Vim as latin1 when
                            >>> really it is cp1252 or something?
                            >>
                            >> If this means that I get cp1252 characters in my file which I tried to
                            >> keep pure Latin 1, this is very wrong... my system doesn't display those
                            >> obnoxious microsoft "extensions".
                            >>
                            >
                            > For now, if this bothers you, you can set your encoding to something
                            > other than latin1 (like utf-8) and do a setglobal fenc=latin1. Also
                            > update your fileencodings option so that latin1 actually gets
                            > detected.
                            >
                            > Now you will get a warning if you try to save a file and there are
                            > non-latin1 characters in it.
                            >

                            I see this in :help version7.txt (line 2470):

                            Win32: Set the default for 'isprint' back to the wrong default "@,~-255",
                            because many people use Windows-1252 while 'encoding' is "latin1".

                            Maybe this is related?

                          • Vlad Irnov
                            Message 13 of 13, Feb 3, 2011
                              On Feb 3, 5:03 pm, Benjamin Fritz <fritzophre...@...> wrote:
                              > On Wed, Feb 2, 2011 at 9:59 AM, Benjamin Fritz <fritzophre...@...> wrote:
                              > > On Tue, Feb 1, 2011 at 7:11 PM, Rhialto <rhia...@...> wrote:
                              > >> On Tue 01 Feb 2011 at 09:30:48 -0800, Ben Fritz wrote:
                              > >>> Converting from cp1252 to latin1 should fail depending on the
                              > >>> characters in the file, but latin1 to cp1252 should always work,
                              > >>> shouldn't it? I understand cp1252 to be a superset of latin1. Is it
                              > >>> because the system mis-represents its encoding to Vim as latin1 when
                              > >>> really it is cp1252 or something?

                              > I see this in :help version7.txt (line 2470):
                              >
                              > Win32: Set the default for 'isprint' back to the wrong default "@,~-255",
                              > because many people use Windows-1252 while 'encoding' is "latin1".
                              >
                              > Maybe this is related?

                              After
                              :set isprint=@,161-255
                              cp1252-specific characters are no longer displayed when encoding is
                              cp1252, so this is not a solution.


                              This is what I think happens: when encoding is set to latin1, Vim
                              **displays** characters in the range 128 to 159 (hex 80 to 9F) as if
                              encoding is set to cp1252.

                              How to reproduce: (Windows 2000, gvim 7.3, Normal version)

                              Start GUI Vim with a new empty buffer. Any decent font like DejaVu or
                              Lucida Console should do. Execute the following code (copy into
                              clipboard and execute with :@+ or :@*).

                              :set enc=latin1
                              :set fenc=utf-8
                              :set isprint&
                              :for i in range(128,159)
                              : call setline(".", getline(".").nr2char(i))
                              :endfor

                              You should end up with 5 "no character" blocks plus 27 printable chars
                              (don't know if they survive posting, the first one is Euro sign):

                              € ‚ƒ„…†‡ˆ‰Š‹Œ Ž ‘’“”•–—˜™š›œ žŸ

                              This is wrong. The Latin1 character set has no printable chars in
                              this range, so all chars should be displayed as "no character" blocks.
                              From http://en.wikipedia.org/wiki/ISO/IEC_8859-1 :
                              "The Windows-1252 codepage [cp1252 in Vim] coincides with ISO-8859-1
                              [latin1 in Vim] for all codes except the range 128 to 159 (hex 80 to
                              9F), where the little-used C1 controls are replaced with additional
                              characters."

                              This is not standard behavior -- other text editors do not display
                              these chars when encoding is Latin1.

                              When the buffer is saved, Vim converts from latin1 to Unicode. These
                              chars become Unicode code points 0x0080 to 0x009F (decimal 128-159,
                              each encoded in 2 bytes in utf-8). They are non-printable characters.
                              This behavior is correct, but probably not what the user expects. To
                              preserve the cp1252-specific characters as they are displayed by Vim,
                              the encoding must be set to cp1252. The bullet character in Unicode is
                              decimal 8226, en dash is 8211, em dash is 8212, each encoded in 3
                              bytes in utf-8.
                              Conversion tables:
                              http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
                              http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
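The code points and byte counts quoted above can be checked with a short Python sketch. Python's codecs stand in for the mapping tables here, which is an assumption of convenience; the variable names are made up for the example.

```python
raw = bytes([149, 150, 151])  # bullet, en dash, em dash in cp1252

for label in ("latin-1", "cp1252"):
    text = raw.decode(label)
    for ch in text:
        utf8 = ch.encode("utf-8")
        # encoding name, code point, UTF-8 length, UTF-8 bytes in hex
        print(label, ord(ch), len(utf8), utf8.hex())

latin1_lengths = [len(c.encode("utf-8")) for c in raw.decode("latin-1")]
cp1252_points = [ord(c) for c in raw.decode("cp1252")]
cp1252_lengths = [len(c.encode("utf-8")) for c in raw.decode("cp1252")]
```

Read as latin1, each byte becomes a 2-byte UTF-8 sequence for a control code; read as cp1252, the same bytes become code points 8226, 8211 and 8212, each taking 3 bytes in UTF-8.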
