
opening a Unicode file

  • msorens
    Message 1 of 6 , Sep 13, 2007
      I have been using gvim for several years and vi before that for a
      couple decades. I thought I understood how to wade through the rather
      terse documentation, though there are still quite a few features I
      have not touched.

      Recently I wanted to try to read a Unicode file in vim. What could be
      simpler, I thought? Well, I was unsuccessful with vim help. So I
      searched the web and came across variations of the same settings to
      add to _vimrc, as shown below. But I still was unsuccessful in just
      opening an existing Unicode file.

      I guess I should make sure that my file is Unicode: I opened it in vim
      in binary mode, then ran xxd, and observed a null byte after every
      character. (In non-binary mode I see an up-arrow followed by an @-sign
      after each character.) This *is* Unicode, right?

      The file opens just fine in Notepad; what is the secret about having
      it "just open right" in vim?

      ===============================================
      if has("multi_byte")        " if not, we need to recompile
        if &enc !~? '^u'          " if the locale 'encoding' starts with u or U
                                  " then Unicode is already set
          if &tenc == ''
            let &tenc = &enc      " save the keyboard charset
          endif
          set enc=utf-8           " to support Unicode fully, we need to be able
                                  " to represent all Unicode codepoints in memory
        endif
        set fencs=ucs-bom,utf-8,latin1
        setg bomb                 " default for new Unicode files
        setg fenc=latin1          " default for files created from scratch
      else
        echomsg 'Warning: Multibyte support is not compiled-in.'
      endif
      ===============================================
      if has("multi_byte")
        if &termencoding == ""
          let &termencoding = &encoding
        endif
        set encoding=utf-8
        setglobal fileencoding=utf-8 bomb
        set fileencodings=ucs-bom,utf-8,latin1
      endif
      ===============================================


      --~--~---------~--~----~------------~-------~--~----~
      You received this message from the "vim_multibyte" maillist.
      For more information, visit http://www.vim.org/maillist.php
      -~----------~----~----~----~------~----~------~--~---
    • Tony Mechelynck
      Message 2 of 6 , Sep 13, 2007
        msorens wrote:
        > I have been using gvim for several years and vi before that for a
        > couple decades. I thought I understood how to wade through the rather
        > terse documentation, though there are still quite a few features I
        > have not touched.
        >
        > Recently I wanted to try to read a Unicode file in vim. What could be
        > simpler, I thought? Well, I was unsuccessful with vim help. So I
        > searched the web and came across variations of the same settings to
        > add to _vimrc, as shown below. But I still was unsuccessful in just
        > opening an existing Unicode file.
        >
        > I guess I should make sure that my file is Unicode: I opened it in vim
        > in binary mode, then ran xxd, and observed a null byte after every
        > character. (In non-binary mode I see an up-arrow followed by an @-sign
        > after each character.) This *is* Unicode, right?

        This is just one of the possible "transformation formats" of Unicode. From what
        you describe, it could be either ucs-2 (which is not a UTF: it represents U+0000
        to U+FFFF as one 16-bit word each but cannot represent anything above that),
        or UTF-16 (which represents U+0000 to U+FFFF as one 16-bit word each, and
        U+10000 to U+10FFFF by means of pairs of "surrogate" codepoints below
        U+FFFF).
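
        One way to narrow this down is to look at the first bytes of the file, as
        the original poster did with xxd. A minimal sketch, assuming a file called
        notes.txt (a placeholder name, not from this thread) and that the xxd
        program can be found on the PATH:

        ```vim
        " Open the file without any encoding conversion (notes.txt is a placeholder)
        :edit ++bin notes.txt
        " Replace the buffer contents with a hex dump
        :%!xxd
        " A dump starting with "fffe" suggests a little-endian UTF-16 BOM,
        " one starting with "feff" a big-endian UTF-16 BOM
        " Convert the hex dump back into the raw bytes
        :%!xxd -r
        ```

        If neither pattern shows up, the file has no UTF-16 BOM and the byte order
        has to be inferred from where the null bytes fall.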

        There are other Unicode Transformation Formats: UTF-32 (which represents each
        Unicode codepoint by one 32-bit doubleword), UTF-8 (which represents Unicode
        codepoints by a variable number of 8-bit bytes each) and even GB18030 (which
        can represent all Unicode codepoints, but is optimized in favour of Chinese,
        while UTF-8 is optimized in favour of West-European Latin scripts, especially
        English).

        >
        > The file opens just fine in Notepad; what is the secret about having
        > it "just open right" in vim?

        :e ++enc=utf-16 filename

        This does the equivalent of ":setlocal fileencoding=utf-16" at the same time
        as reading the file (after reading the file would be too late). If you still
        see garbled gobbledygook, it may mean that the file's endianness (i.e., which
        byte comes first in a 16-bit word) is not the same as whatever Vim uses as
        default. In that case, replace "utf-16" above by either "utf-16be" (big
        endian: high byte first) or "utf-16le" (little endian: low byte first).
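
        A short sketch of the retry described above, for a file that is already
        loaded in the current buffer. With no file name given, :e re-reads the
        current file; the order in which the two encodings are tried here is
        arbitrary:

        ```vim
        " Show which encoding Vim assumed for this buffer and whether it saw a BOM
        :setlocal fileencoding? bomb?
        " Re-read the current file as little-endian UTF-16
        :e ++enc=utf-16le
        " If the text still looks wrong, try big-endian instead
        :e ++enc=utf-16be
        ```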

        See
        :help ++opt
        :help 'fileencoding'
        :help mbyte-encoding


        Best regards,
        Tony.
        --
        It is too bad that the speed of light hasn't kept pace with the
        changes in CPU speed and network bandwidth. -- <wietse@...>

      • John (Eljay) Love-Jensen
        Message 3 of 6 , Sep 14, 2007
          Hi Tony,

          > :e ++enc=utf-16 filename

          Thanks Tony! I've been wondering how to do that!

          Note: if the utf-16 file contains a BOM (which, often, it should/will), then it should not be necessary to specify utf-16le or utf-16be explicitly (and, indeed, would be incorrect according to Unicode standards to do so -- Vim probably does the friendly thing anyway).

          I say this not for Tony's edification, because I'm sure that he already knows this, but for everyone else who may be in msorens's situation.

          Also if you need to make sure the file is written with BOM you can use:

          :set bomb

          Or without the BOM:

          :set nobomb

          For some light reading on Unicode 5.0:

          http://www.amazon.com/dp/0321480910/

          HTH,
          --Eljay

        • Tony Mechelynck
          Message 4 of 6 , Sep 14, 2007
            John (Eljay) Love-Jensen wrote:
            > Hi Tony,
            >
            >> :e ++enc=utf-16 filename
            >
            > Thanks Tony! I've been wondering how to do that!
            >
            > Note: if the utf-16 file contains a BOM (which, often, it should/will), then it should not be necessary to specify utf-16le or utf-16be explicitly (and, indeed, would be incorrect according to Unicode standards to do so -- Vim probably does the friendly thing anyway).

            If any Unicode file (here I mean UTF-8, UTF-16le, UTF-16be, UTF-32le or
            UTF-32be -- I'll leave out GB18030 for the moment) starts with a BOM, Vim will
            recognise it _provided_ that your 'fileencodings' (plural) starts with
            "ucs-bom". In order for it to work properly, though, 'encoding' should already
            be UTF-8 (or UTF-16 or UTF-32, which Vim handles internally as UTF-8 to avoid
            problems with null bytes terminating C strings).
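
            To check whether both conditions hold in a running Vim, the option
            values can simply be queried. A small sketch; :verbose also reports
            where an option was last set, which may help when a vimrc does not
            seem to take effect:

            ```vim
            " Should report utf-8 (or another Unicode encoding)
            :set encoding?
            " Should list ucs-bom first; :verbose adds where it was last set
            :verbose set fileencodings?
            " After loading a file: the encoding Vim settled on, and whether it has a BOM
            :setlocal fileencoding? bomb?
            ```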

            Specifying explicitly that a file is, for instance, UTF-16le is IMHO not
            "wrong" (unless the file is actually in some other encoding, of course); it is
            just "unnecessary" if the file starts with a BOM.

            >
            > I say this not for Tony's edification, because I'm sure that he already knows this, but for everyone else who may be in msorens's situation.

            :-)

            >
            > Also if you need to make sure the file is written with BOM you can use:
            >
            > :set bomb
            >
            > Or without the BOM:
            >
            > :set nobomb

            ...and if you want to make sure that "newly created" Unicode files will (or
            won't) have a BOM by default you can write

            setglobal bomb
            or
            setglobal nobomb

            in your vimrc. (I use ":setglobal bomb" but YMMV.) This setting has no
            influence on non-Unicode files such as those in Latin1.

            >
            > For some light reading on Unicode 5.0:
            >
            > http://www.amazon.com/dp/0321480910/

            For serious reading, see also http://www.unicode.org/ -- and others.

            >
            > HTH,
            > --Eljay

            Best regards,
            Tony.
            --
            99 blocks of crud on the disk,
            99 blocks of crud!
            You patch a bug, and dump it again:
            100 blocks of crud on the disk!

            100 blocks of crud on the disk,
            100 blocks of crud!
            You patch a bug, and dump it again:
            101 blocks of crud on the disk! ...

          • mbbill
            Message 5 of 6 , Sep 14, 2007
            • Camillo Särs
              Message 6 of 6 , Sep 15, 2007
                Tony Mechelynck wrote:
                > ...and if you want to make sure that "newly created" Unicode files will (or
                > won't) have a BOM by default you can write
                >
                > setglobal bomb
                > or
                > setglobal nobomb
                >
                > in your vimrc. (I use ":setglobal bomb" but YMMV.) This setting has no
                > influence on non-Unicode files such as those in Latin1.

                Beware, though, that if your environment defaults to utf-8 file
                encoding, then setting "bomb" will cause the BOM to be written to all
                new files. This can become a problem when dealing with some legacy
                applications that don't expect to see those extra bytes at the
                beginning. Examples range from *nix shells and hashbang (#!) processing
                to Windows .ini file headings [...].
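
                One possible workaround, sketched on the assumption that the
                affected files can be recognised by name: keep the global default
                but switch the BOM off for those buffers. The group name and the
                patterns *.sh and *.ini are only examples.

                ```vim
                augroup NoBomForLegacy
                  " Clear the group so that re-sourcing the vimrc does not duplicate entries
                  autocmd!
                  " No BOM for new or existing shell scripts and .ini files
                  autocmd BufNewFile,BufRead *.sh,*.ini setlocal nobomb
                augroup END
                ```

                Note that for an existing file that already starts with a BOM,
                this drops the BOM the next time the file is written.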

                So this setting may indeed cause some legacy apps to "bomb" on you.
                Pardon the pun, but I thought it was hilarious once I got over the "duh"
                factor after debugging.

                Regards,
                Camillo
                --
                Camillo Särs <ged@...>
                http://www.ged.fi
                Aim for the impossible and you will achieve the improbable
