
Re: opening a Unicode file

  • Tony Mechelynck
    Message 1 of 6 , Sep 13, 2007
      msorens wrote:
      > I have been using gvim for several years and vi before that for a
      > couple decades. I thought I understood how to wade through the rather
      > terse documentation, though there are still quite a few features I
      > have not touched.
      >
      > Recently I wanted to try to read a Unicode file in vim. What could be
      > simpler, I thought? Well, I was unsuccessful with vim help. So I
      > searched the web and came across variations of the same settings to
      > add to _vimrc, as shown below. But I still was unsuccessful in just
      > opening an existing Unicode file.
      >
      > I guess I should make sure that my file is Unicode: I opened it in vim
      > in binary mode, then ran xxd, and observed a null byte after every
      > character. (In non-binary mode I see an up-arrow followed by an @-sign
      > after each character.) This *is* Unicode, right?

      This is just one of the possible "transformation formats" of Unicode. From
      what you describe, it could be either UCS-2 (which is not a UTF: it represents
      U+0000 to U+FFFF as one 16-bit word each but cannot represent anything above
      that), or UTF-16 (which represents U+0000 to U+FFFF as one 16-bit word each,
      and U+10000 to U+10FFFF by means of pairs of "surrogate" codepoints taken from
      the range U+D800 to U+DFFF).

      There are other Unicode Transformation Formats: UTF-32 (which represents each
      Unicode codepoint by one 32-bit doubleword), UTF-8 (which represents Unicode
      codepoints by a variable number of 8-bit bytes each) and even GB18030 (which
      can represent all Unicode codepoints, but is optimized in favour of Chinese,
      while UTF-8 is optimized in favour of West-European Latin scripts, especially
      English).

      >
      > The file opens just fine in Notepad; what is the secret about having
      > it "just open right" in vim?

      :e ++enc=utf-16 filename

      This does the equivalent of ":setlocal fileencoding=utf-16" at the same time
      as reading the file (after reading the file would be too late). If you still
      see garbled gobbledygook, it may mean that the file's endianness (i.e., which
      byte comes first in a 16-bit word) is not the same as whatever Vim uses as
      default. In that case, replace "utf-16" above by either "utf-16be" (big
      endian: high byte first) or "utf-16le" (little endian: low byte first).
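
      As a minimal sketch of that sequence, run from inside Vim on the file that
      is already loaded (with no file name, ":e" re-edits the current file). A
      null byte after each ASCII character means the low byte comes first, so
      for the file described above "utf-16le" is the likely candidate; treat
      that as an inference from the xxd output, not a certainty.

      ```vim
      " Re-read the current file as UTF-16 with Vim's default byte order
      :e ++enc=utf-16
      " If the text is still garbled, try each byte order explicitly
      :e ++enc=utf-16le
      :e ++enc=utf-16be
      " Show which encoding Vim now associates with this buffer
      :setlocal fileencoding?
      ```

      Each ":e" discards nothing on disk, but it will refuse to run if the
      buffer has unsaved changes, so do this before editing.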

      See
      :help ++opt
      :help 'fileencoding'
      :help mbyte-encoding


      Best regards,
      Tony.
      --
      It is too bad that the speed of light hasn't kept pace with the
      changes in CPU speed and network bandwidth. -- <wietse@...>

      --~--~---------~--~----~------------~-------~--~----~
      You received this message from the "vim_multibyte" maillist.
      For more information, visit http://www.vim.org/maillist.php
      -~----------~----~----~----~------~----~------~--~---
    • John (Eljay) Love-Jensen
      Message 2 of 6 , Sep 14, 2007
        Hi Tony,

        > :e ++enc=utf-16 filename

        Thanks Tony! I've been wondering how to do that!

        Note: if the utf-16 file contains a BOM (which, often, it should/will), then it should not be necessary to specify utf-16le or utf-16be explicitly (and, indeed, would be incorrect according to Unicode standards to do so -- Vim probably does the friendly thing anyway).

        I say this not for Tony's edification, because I'm sure that he already knows this, but for everyone else who may be in msorens's situation.

        Also if you need to make sure the file is written with BOM you can use:

        :set bomb

        Or without the BOM:

        :set nobomb

        For some light reading on Unicode 5.0:

        http://www.amazon.com/dp/0321480910/

        HTH,
        --Eljay

      • Tony Mechelynck
        Message 3 of 6 , Sep 14, 2007
          John (Eljay) Love-Jensen wrote:
          > Hi Tony,
          >
          >> :e ++enc=utf-16 filename
          >
          > Thanks Tony! I've been wondering how to do that!
          >
          > Note: if the utf-16 file contains a BOM (which, often, it should/will), then it should not be necessary to specify utf-16le or utf-16be explicitly (and, indeed, would be incorrect according to Unicode standards to do so -- Vim probably does the friendly thing anyway).

          If any Unicode file (here I mean UTF-8, UTF-16le, UTF-16be, UTF-32le or
          UTF-32be -- I'll leave out GB18030 for the moment) starts with a BOM, Vim will
          recognise it _provided_ that your 'fileencodings' (plural) starts with
          "ucs-bom". In order for it to work properly, though, 'encoding' should already
          be UTF-8 (or UTF-16 or UTF-32, which Vim handles internally as UTF-8 to avoid
          problems with null bytes terminating C strings).

          Specifying explicitly that a file is, for instance, UTF-16le is IMHO not
          "wrong" (unless the file is actually in some other encoding, of course); it is
          just "unnecessary" if the file starts with a BOM.
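
          A minimal vimrc sketch of the setup described above might look like
          this. Only the leading "ucs-bom" comes from the explanation; the
          fallbacks after it (utf-8, then latin1) are an illustrative choice and
          should be adapted to the files you actually handle.

          ```vim
          " Use UTF-8 internally so any Unicode codepoint can be held in memory
          set encoding=utf-8
          " When reading a file: check for a BOM first, then try UTF-8,
          " and fall back to Latin1 if neither matches (example fallbacks)
          set fileencodings=ucs-bom,utf-8,latin1
          ```

          With this in place a file that starts with a BOM should open correctly
          without any "++enc" argument; files without a BOM still need one if
          they are not UTF-8 or Latin1.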

          >
          > I say this not for Tony's edification, because I'm sure that he already knows this, but for everyone else who may be in msorens's situation.

          :-)

          >
          > Also if you need to make sure the file is written with BOM you can use:
          >
          > :set bomb
          >
          > Or without the BOM:
          >
          > :set nobomb

          ...and if you want to make sure that "newly created" Unicode files will (or
          won't) have a BOM by default you can write

          setglobal bomb
          or
          setglobal nobomb

          in your vimrc. (I use ":setglobal bomb" but YMMV.) This setting has no
          influence on non-Unicode files such as those in Latin1.
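
          The global default can still be overridden for one buffer. As a small
          sketch, assuming the file being edited is a Unicode file that should
          be written without a BOM this once:

          ```vim
          " Show whether the current buffer will be written with a BOM
          :setlocal bomb?
          " Turn the BOM off for this buffer only, then write the file
          :setlocal nobomb
          :w
          ```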

          >
          > For some light reading on Unicode 5.0:
          >
          > http://www.amazon.com/dp/0321480910/

          For serious reading, see also http://www.unicode.org/ -- and others.

          >
          > HTH,
          > --Eljay

          Best regards,
          Tony.
          --
          99 blocks of crud on the disk,
          99 blocks of crud!
          You patch a bug, and dump it again:
          100 blocks of crud on the disk!

          100 blocks of crud on the disk,
          100 blocks of crud!
          You patch a bug, and dump it again:
          101 blocks of crud on the disk! ...

        • mbbill
          Message 4 of 6 , Sep 14, 2007
          • Camillo Särs
            Message 5 of 6 , Sep 15, 2007
              Tony Mechelynck wrote:
              > ...and if you want to make sure that "newly created" Unicode files will (or
              > won't) have a BOM by default you can write
              >
              > setglobal bomb
              > or
              > setglobal nobomb
              >
              > in your vimrc. (I use ":setglobal bomb" but YMMV.) This setting has no
              > influence on non-Unicode files such as those in Latin1.

              Beware, though, that if your environment defaults to utf-8 file
              encoding, then setting "bomb" will cause the BOM to be written to all
              new files. This can become a problem when dealing with some legacy
              applications that don't expect to see those extra bytes at the
              beginning. Examples range from *nix shells and hashbang (#!) processing
              to Windows .ini file headings [...].

              So this setting may indeed cause some legacy apps to "bomb" on you.
              Pardon the pun, but I thought it was hilarious once I got over the "duh"
              factor after debugging.
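
              One way to keep the global default while sparing such files is an
              autocommand in the vimrc. This is only a sketch: the group name
              and the file patterns are made up for illustration, so substitute
              whatever file types your own legacy tools read.

              ```vim
              " Write a BOM to new Unicode files by default...
              setglobal bomb
              " ...except for new files whose consumers choke on the extra bytes.
              " Group name and patterns are examples only.
              augroup NoBomForLegacy
                autocmd!
                autocmd BufNewFile *.sh,*.ini setlocal nobomb
              augroup END
              ```

              The autocommand only covers newly created files; an existing
              script that already carries a BOM still needs ":setlocal nobomb"
              followed by ":w" to get rid of it.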

              Regards,
              Camillo
              --
              Camillo Särs <ged@...>
              http://www.ged.fi
              Aim for the impossible and you will achieve the improbable
