Re: [caplet] Am I paranoid enough?

  • David-Sarah Hopwood
    Message 1 of 7 , Feb 17, 2009
      Mike Samuel wrote:
      > 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
      >> Suppose that S is a Unicode string in which each character matches
      >> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
      >> not containing ("&" followed by a character not matching AmpFollower).
      >> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
      >> an attacker.
      >>
      >> ValidChar :: one of
      >> '\u0009' '\u000A' '\u000D' // TAB, LF, CR
      >> [\u0020-\u007E]
      >> [\u00A0-\u00AC]
      >> [\u00AE-\u05FF]
      >> [\u0604-\u06DC]
      >> [\u06DE-\u070E]
      >> [\u0710-\u17B3]
      >> [\u17B6-\u200A]
      >> [\u2010-\u2027]
      >> [\u202F-\u205F]
      >> [\u2070-\uD7FF]
      >
      > So no surrogates?

      Correct. They're not characters (or even "noncharacters").

      >> [\uE000-\uFDCF]
      >> [\uFDF0-\uFEFE]
      >> [\uFF00-\uFFEF]
      >
      > Why include FFEF?

      It's unassigned, and there's no particular reason to exclude it.
      (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
      for "special" characters.)

      >> AmpFollower :: one of
      >> '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
      >> '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
      >> // single quote, backslash, space, TAB, LF, CR
      >>
      >> (ValidChar excludes format control characters, and some other
      >> characters known to be mishandled by browsers. AmpFollower is
      >> intended to exclude characters that can start an entity reference.)
      >>
      >> S is inserted between "<script>" and "</script>" in a place where a
      >> <script> tag is allowed in an otherwise valid HTML document, or
      >> between "<script><![CDATA[" and "]]></script>" in a place where a
      >> <script> tag is allowed in an otherwise valid XHTML document.
      >> The HTML or XHTML document starts with a correct <!DOCTYPE or
      >> <?xml declaration respectively, and is encoded as well-formed
      >> UTF-8.
      >>
      >> Are these restrictions sufficient to ensure that the embedded
      >> script is interpreted as it would have been if referenced from
      >> an external file, foiling any attempts of browsers to collude
      >> with the attacker in misparsing it?
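A minimal Python sketch of the restrictions on S quoted above, for anyone who wants to experiment with them. The names `VALID_RANGES`, `AMP_FOLLOWERS`, `FORBIDDEN`, and `check_embeddable` are hypothetical and do not come from the thread; the ranges are transcribed from the ValidChar production. The sketch checks only the character and subsequence rules, not that S is syntactically correct ES3 or ES3.1. Python iterates over code points rather than UTF-16 code units, so any non-BMP character is simply rejected.

```python
# Ranges transcribed from the quoted ValidChar production (inclusive).
VALID_RANGES = [
    (0x0009, 0x000A), (0x000D, 0x000D),          # TAB, LF, CR
    (0x0020, 0x007E), (0x00A0, 0x00AC), (0x00AE, 0x05FF),
    (0x0604, 0x06DC), (0x06DE, 0x070E), (0x0710, 0x17B3),
    (0x17B6, 0x200A), (0x2010, 0x2027), (0x202F, 0x205F),
    (0x2070, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFEFE),
    (0xFF00, 0xFFEF),
]

# Characters allowed immediately after "&" (the quoted AmpFollower set).
AMP_FOLLOWERS = set("=(+-!~\"/0123456789'\\ \t\n\r")

# Subsequences that S must not contain.
FORBIDDEN = ("<!", "</", "]]>")


def is_valid_char(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VALID_RANGES)


def check_embeddable(s):
    """Return a list of rule violations; an empty list means S passes."""
    problems = []
    for i, ch in enumerate(s):
        if not is_valid_char(ch):
            problems.append("U+%04X at index %d is not a ValidChar" % (ord(ch), i))
        # A trailing "&" has no follower inside S, so it is not flagged here.
        if ch == "&" and i + 1 < len(s) and s[i + 1] not in AMP_FOLLOWERS:
            problems.append("'&' at index %d is followed by %r" % (i, s[i + 1]))
    for seq in FORBIDDEN:
        if seq in s:
            problems.append("contains forbidden subsequence %r" % seq)
    return problems
```

Read literally, the rule also rejects `a && b`, because the first "&" is followed by "&", which is not an AmpFollower.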
      >
      > You may still be subject to encoding attacks. I'm sure there are
      > valid scripts that look like UTF-7, so if the script appears in the
      > first 1024B, you might need to make sure it's preceded by a <meta>
      > element specifying an encoding, and/or use the XML prologue form that
      > specifies an encoding.
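To illustrate the UTF-7 concern, here is a small Python sketch using the standard `utf-7` codec. The script text is a made-up example, not one from the thread. The same ASCII bytes contain no "</" when read as UTF-8, but yield a closing script tag if something sniffs the document as UTF-7.

```python
# A harmless-looking script: only ASCII, no "<!", "</" or "]]>".
script = 'var s = "+ADw-/script+AD4-";'
raw = script.encode("ascii")

as_utf8 = raw.decode("utf-8")   # what the author intended
as_utf7 = raw.decode("utf-7")   # what a UTF-7 sniffing consumer would see

print(as_utf8)
print(as_utf7)
```

In UTF-7, "+ADw-" and "+AD4-" are the encoded forms of "<" and ">", which is why a filter applied to the UTF-8 reading does not help if the consumer picks a different encoding.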

      Right; I covered that in a follow-up. Is including a UTF-8 BOM at the
      start sufficient for all browsers (that is, are there any browsers
      in which a <meta> tag or other content sniffing can override an
      explicit initial UTF-8 BOM, in either HTML or XHTML)?

      HTML5 section 8.2.2.1 seems to indicate that "if the transport layer
      specifies an encoding" (i.e. presumably the charset specified in
      a Content-Type header), then that should override a BOM. That's
      irritating, because it means that you have to assume that the server
      gets the Content-Type right, *as well as* including a BOM for the
      browsers in which Content-Type doesn't override sniffing
      (Internet Explorer, at least), and for the case where the document
      is read from a local file.
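For concreteness, a hedged Python sketch of emitting a document with an explicit UTF-8 BOM alongside a matching Content-Type value. The helper name `with_utf8_bom` and the `CONTENT_TYPE` constant are illustrative assumptions, and the sketch says nothing about how any particular browser resolves a conflict between the two.

```python
import codecs

# Header value the server would need to send so that it agrees with the BOM.
CONTENT_TYPE = "text/html; charset=utf-8"


def with_utf8_bom(document):
    """Encode a document as UTF-8 with a leading BOM (bytes EF BB BF)."""
    return codecs.BOM_UTF8 + document.encode("utf-8")


page = with_utf8_bom("<!DOCTYPE html><title>t</title>")

# The "utf-8-sig" codec strips a leading BOM when decoding.
text = page.decode("utf-8-sig")
```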

      --
      David-Sarah Hopwood ⚥
    • Mike Samuel
      Message 2 of 7 , Feb 17, 2009
        2009/2/17 David-Sarah Hopwood <david.hopwood@...>:
        > Mike Samuel wrote:
        >> 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
        >>> Suppose that S is a Unicode string in which each character matches
        >>> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
        >>> not containing ("&" followed by a character not matching AmpFollower).
        >>> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
        >>> an attacker.
        >>>
        >>> ValidChar :: one of
        >>> '\u0009' '\u000A' '\u000D' // TAB, LF, CR
        >>> [\u0020-\u007E]
        >>> [\u00A0-\u00AC]
        >>> [\u00AE-\u05FF]
        >>> [\u0604-\u06DC]
        >>> [\u06DE-\u070E]
        >>> [\u0710-\u17B3]
        >>> [\u17B6-\u200A]
        >>> [\u2010-\u2027]
        >>> [\u202F-\u205F]
        >>> [\u2070-\uD7FF]
        >>
        >> So no surrogates?
        >
        > Correct. They're not characters (or even "noncharacters").
        >
        >>> [\uE000-\uFDCF]
        >>> [\uFDF0-\uFEFE]
        >>> [\uFF00-\uFFEF]
        >>
        >> Why include FFEF?
        >
        > It's unassigned, and there's no particular reason to exclude it.
        > (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
        > for "special" characters.)

        Isn't it the reflection of fffe, the byte-order-marker?
        This is probably a very minor issue, but if one part of a parser
        naively delegates to another parser that mistakenly treats its content
        as a byte string instead of code units, the presence of a BOM might
        cause the delegatee to misinterpret content when something that looks
        like a BOM appears at the beginning of a chunk of embedded language.


        >>> AmpFollower :: one of
        >>> '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
        >>> '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
        >>> // single quote, backslash, space, TAB, LF, CR
        >>>
        >>> (ValidChar excludes format control characters, and some other
        >>> characters known to be mishandled by browsers. AmpFollower is
        >>> intended to exclude characters that can start an entity reference.)
        >>>
        >>> S is inserted between "<script>" and "</script>" in a place where a
        >>> <script> tag is allowed in an otherwise valid HTML document, or
        >>> between "<script><![CDATA[" and "]]></script>" in a place where a
        >>> <script> tag is allowed in an otherwise valid XHTML document.
        >>> The HTML or XHTML document starts with a correct <!DOCTYPE or
        >>> <?xml declaration respectively, and is encoded as well-formed
        >>> UTF-8.
        >>>
        >>> Are these restrictions sufficient to ensure that the embedded
        >>> script is interpreted as it would have been if referenced from
        >>> an external file, foiling any attempts of browsers to collude
        >>> with the attacker in misparsing it?
        >>
        >> You may still be subject to encoding attacks. I'm sure there are
        >> valid scripts that look like UTF-7, so if the script appears in the
        >> first 1024B, you might need to make sure it's preceded by a <meta>
        >> element specifying an encoding, and/or use the XML prologue form that
        >> specifies an encoding.
        >
        > Right; I covered that in a follow-up. Is including a UTF-8 BOM at the
        > start sufficient for all browsers (that is, are there any browsers
        > in which a <meta> tag or other content sniffing can override an
        > explicit initial UTF-8 BOM, in either HTML or XHTML)?

        Ah cool. I don't know the answer to that question.


        > HTML5 section 8.2.2.1 seems to indicate that "if the transport layer
        > specifies an encoding" (i.e. presumably the charset specified in
        > a Content-Type header), then that should override a BOM. That's
        > irritating, because it means that you have to assume that the server
        > gets the Content-Type right, *as well as* including a BOM for the
        > browsers in which Content-Type doesn't override sniffing
        > (Internet Explorer, at least), and for the case where the document
        > is read from a local file.

        Yeah. I think the best thing to do is to use a fairly standard
        encoding like UTF-8, and make sure the XML prologue, <meta
        http-equiv="content-type">, and headers all agree.

        I don't think that you can do much about file hosting services that go
        out of their way to specify a wacky encoding. Putting a BOM at the
        front will help hosting services that make a genuine effort.
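One way to check that the declarations agree is sketched below in Python. The function names `declared_encodings` and `all_agree` are hypothetical, and the regular expressions are a rough approximation for well-behaved documents, not a model of how browsers actually sniff encodings.

```python
import codecs
import re


def declared_encodings(content_type_header, body):
    """Collect every encoding declaration found in the header and body."""
    found = {}
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type_header, re.I)
    if m:
        found["header"] = m.group(1).lower()
    if body.startswith(codecs.BOM_UTF8):
        found["bom"] = "utf-8"
        body = body[len(codecs.BOM_UTF8):]
    # Only the start of the document is inspected, as ASCII.
    head = body[:1024].decode("ascii", errors="replace")
    m = re.search(r'<\?xml[^>]*\bencoding\s*=\s*["\']([\w.:-]+)', head, re.I)
    if m:
        found["xml"] = m.group(1).lower()
    m = re.search(r'<meta[^>]*\bcharset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    if m:
        found["meta"] = m.group(1).lower()
    return found


def all_agree(found, expected="utf-8"):
    """True if at least one declaration exists and all match `expected`."""
    return bool(found) and all(v == expected for v in found.values())
```

A usage example: `all_agree(declared_encodings("text/html; charset=UTF-8", body))` is true only when the header, the BOM (if any), the XML prologue (if any), and the `<meta>` element (if any) all say UTF-8.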


        > --
        > David-Sarah Hopwood ⚥
        >
        >
      • David-Sarah Hopwood
        Message 3 of 7 , Feb 18, 2009
          Mike Samuel wrote:
          > 2009/2/17 David-Sarah Hopwood <david.hopwood@...>:
          >> Mike Samuel wrote:
          >>> 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
          >>>> ValidChar :: one of
          [...]
          >>>> [\uFF00-\uFFEF]
          >>> Why include FFEF?
          >> It's unassigned, and there's no particular reason to exclude it.
          >> (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
          >> for "special" characters.)
          >
          > Isn't it the reflection of fffe, the byte-order-marker.

          No, \uFEFF is the BOM, and its byte-reflection \uFFFE is a noncharacter,
          so already excluded from ValidChar.

          (Thought you'd spotted something I'd missed for a second, there.)
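The distinction between the three code points can be checked with a few lines of Python. The helper `byte_reflection` is a hypothetical name introduced only for this illustration.

```python
def byte_reflection(ch):
    """Swap the two bytes of a BMP code point."""
    cp = ord(ch)
    return chr(((cp & 0xFF) << 8) | (cp >> 8))


BOM = "\uFEFF"

# The BOM serializes as FE FF (big-endian) or FF FE (little-endian).
bom_be = BOM.encode("utf-16-be")
bom_le = BOM.encode("utf-16-le")

# Reading the big-endian BOM bytes with the wrong byte order gives U+FFFE,
# the noncharacter, which lies outside [\uFF00-\uFFEF].
reflected = byte_reflection(BOM)

# U+FFEF is a different code point; its reflection is U+EFFF.
ffef_reflected = byte_reflection("\uFFEF")
```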

          --
          David-Sarah Hopwood ⚥
        • Mike Samuel
          Message 4 of 7 , Feb 18, 2009
            2009/2/18 David-Sarah Hopwood <david.hopwood@...>:
            > Mike Samuel wrote:
            >> 2009/2/17 David-Sarah Hopwood <david.hopwood@...>:
            >>> Mike Samuel wrote:
            >>>> 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
            >>>>> ValidChar :: one of
            > [...]
            >>>>> [\uFF00-\uFFEF]
            >>>> Why include FFEF?
            >>> It's unassigned, and there's no particular reason to exclude it.
            >>> (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
            >>> for "special" characters.)
            >>
            >> Isn't it the reflection of fffe, the byte-order-marker.
            >
            > No, \uFEFF is the BOM, and its byte-reflection \uFFFE is a noncharacter,
            > so already excluded from ValidChar.

            Ah, quite right.

            > (Thought you'd spotted something I'd missed for a second, there.)
            >
            > --
            > David-Sarah Hopwood ⚥