Re: "flexwiki" ftplugin causing problems ('bomb') (message 57481)

  • Tony Mechelynck
    Jun 27, 2010
      On 27/06/10 21:21, Benjamin R. Haskell wrote:
      > On Sun, 27 Jun 2010, Tony Mechelynck wrote:
      >
      >> On 03/05/10 23:45, Lech Lorens wrote:
      >> [...]
      >>> I might be totally wrong basing my understanding of BOM and
      >>> character sets mainly on Wikipedia, but I thought that setting
      >>> 'bomb' for utf-8 encoded files (which does not pose a risk of
      >>> misinterpreting the contents due to endianness difference) didn't
      >>> make much sense. For utf-16 that would be another thing.
      >>>
      >>> http://en.wikipedia.org/wiki/Byte-order_mark
      >>>
      >>
      >> Notwithstanding its name, the BOM provides more than just endianness
      >> detection. Actually, it is an "encoding signal" which allows detecting
      >> all five of the following encodings, assuming a UTF-16le file won't
      >> start with a NULL:
      >>
      >> utf-16be FE FF
      >> utf-16le FF FE
      >> utf-8 EF BB BF
      >> utf-32be 00 00 FE FF
      >> utf-32le FF FE 00 00
      >>
      >> For instance, when I was still on XP, I noticed that WordPad could
      >> read UTF-8 files but only if they started with a BOM. When writing
      >> what it called "Unicode", what it produced was UTF-16le with BOM.
      >>
      >> Any file starting 0xEF 0xBB 0xBF can be assumed to be in UTF-8.
      >> Distinguishing UTF-8 from Latin1 or Windows-1252 would otherwise
      >> require scanning the whole file, checking for invalid UTF-8 byte
      >> sequences.
      >
      > Quoting the same Wikipedia article Lech mentioned:
      >
      > "While [the] Unicode standard allows BOM in UTF-8, it does not require
      > or recommend it."
      >
      > and paraphrasing the rest of that paragraph:
      >
      > Using a BOM as the first character of a UTF-8-encoded file can cause
      > problems with the shebang line[1] in Unix-like systems. And
      > UTF-8-capable software is often written to assume UTF-8 unless otherwise
      > directed, so the U+FEFF character at the start of the stream is often
      > interpreted incorrectly.
      >
      > The Unicode UTF-{8,16,32} & BOM FAQ probably worded it better than
      > Wikipedia or I[2].
      >

      Yes, a UTF-8 BOM will interfere with any software that has no knowledge
      of Unicode and expects some particular "magic bytes" at the start, or
      simply won't accept 0xEF 0xBB 0xBF at the start of a document. The #!
      shebang is just one example.
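
      For what it's worth, the signature table quoted above could be turned
      into a detection routine roughly as follows. This is only a sketch in
      Python, not what Vim does internally; the names detect_bom and
      looks_like_utf8 are made up for illustration. Note that UTF-32le has
      to be tested before UTF-16le, since FF FE 00 00 also starts with FF FE.

```python
import codecs

# BOM signatures from the table quoted above. Longer signatures come first
# so that UTF-32le (FF FE 00 00) is not mistaken for UTF-16le (FF FE).
BOMS = [
    ("utf-32be", codecs.BOM_UTF32_BE),  # 00 00 FE FF
    ("utf-32le", codecs.BOM_UTF32_LE),  # FF FE 00 00
    ("utf-8", codecs.BOM_UTF8),         # EF BB BF
    ("utf-16be", codecs.BOM_UTF16_BE),  # FE FF
    ("utf-16le", codecs.BOM_UTF16_LE),  # FF FE
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for name, bom in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def looks_like_utf8(data: bytes) -> bool:
    """Without a BOM, the whole buffer must be scanned for invalid sequences."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

      The same constants show the shebang problem: a script saved with a
      UTF-8 BOM starts with EF BB BF, so its first two bytes are no longer
      "#!".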

      OTOH, in filetypes where UTF-8 is but one possibility among many, the
      BOM is useful to specify the encoding or to confirm what was set
      otherwise. Examples:

      - HTML charset can be set by the HTTP "Content-Type" header (in an HTTP
      or HTTPS transaction external to the file), in a <meta
      http-equiv="Content-Type" content="text/html; charset=something"> tag
      (replacing "something" by the charset) within the <head> section, or by
      a BOM. There are even official priority rules that tell browsers what to
      do when two or three of the above are present (and they are necessary,
      because -I'm told- some braindead hosts will send "Content-Type:
      text/html; charset=iso-8859-1" for any *.htm or *.html file regardless
      of BOM or <meta> tags).

      - CSS charset can be set by a BOM.

      - XML charset can be set (IIRC) by a <?xml ... ?> header line or by a BOM.

      - XHTML is both HTML and XML so the methods of both apply to it.
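
      To make the HTML case concrete, here is a rough Python sketch of how a
      client might choose among the three sources. The priority order used
      here (BOM first, then the HTTP header, then <meta>) is an assumption
      based on my reading of the current HTML standard, so check the spec
      before relying on it. The function names, the meta_charset parameter
      and the windows-1252 fallback are all hypothetical.

```python
import codecs
from email.message import Message


def charset_from_header(content_type: str):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def pick_html_charset(body: bytes, content_type=None, meta_charset=None,
                      default="windows-1252"):
    """Pick a charset; the order BOM > HTTP header > <meta> is an assumption."""
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if content_type:
        charset = charset_from_header(content_type)
        if charset:
            return charset
    if meta_charset:
        return meta_charset.lower()
    return default
```

      Under that assumed order, a UTF-8 BOM wins even when the host insists
      on "charset=iso-8859-1" in the header.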

      Personally I use the following rules of thumb:

      - Add a BOM to Unicode files meant for use by a browser.
      - Don't add it to UTF-8 files mostly in US-ASCII (possibly with
      codepoints above 0x7F in literals and comments) if they're meant for use
      by a shell, the 'make' utility, or a compiler.
      - Some Windows programs won't read UTF-8 correctly unless a BOM is present.
      - On Windows, when a system file is said to be in 'Unicode' that usually
      means UTF-16le with BOM.
      - Vim helpfiles in a single directory must either all have a BOM, or
      (recommended) all lack a BOM. If some have one and others not, the
      ":helptags" command will abort with an error.

      This does not explicitly cover all cases; when it doesn't (or in the
      cases where some of the above rules conflict), I proceed by analogy and
      by trial and error.
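
      In Vim itself the BOM is controlled by the 'bomb' option (":setlocal
      bomb" or ":setlocal nobomb" before writing). Outside Vim, the rules of
      thumb above could be applied with something like the following Python
      sketch. The helper names are hypothetical, and the consistency check
      only mimics the helpfile rule on in-memory data; it does not call
      ":helptags".

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend EF BB BF unless the data already starts with it."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading EF BB BF if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def bom_usage_is_consistent(files: dict) -> bool:
    """True if every file has a UTF-8 BOM or none does.

    `files` maps a file name to its raw bytes.
    """
    flags = {data.startswith(codecs.BOM_UTF8) for data in files.values()}
    return len(flags) <= 1
```

      Python's "utf-8-sig" codec does much the same when decoding and
      encoding text: it drops a leading BOM on input and writes one on
      output.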


      Best regards,
      Tony.
      --
      One man's brain plus one other will produce one half as many ideas as
      one man would have produced alone. These two plus two more will
      produce half again as many ideas. These four plus four more begin to
      represent a creative meeting, and the ratio changes to one quarter as
      many ...
      -- Anthony Chevins

      --
      You received this message from the "vim_dev" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php