Re: [caplet] Am I paranoid enough?

  • Mike Samuel
    Message 1 of 7, Feb 16, 2009
      2009/2/16 David-Sarah Hopwood <david.hopwood@...>
      >
      > Suppose that S is a Unicode string in which each character matches
      > ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
      > not containing ("&" followed by a character not matching AmpFollower).
      > S encodes a syntactically correct ES3 or ES3.1 source text chosen by
      > an attacker.
      >
      > ValidChar :: one of
      > '\u0009' '\u000A' '\u000D' // TAB, LF, CR
      > [\u0020-\u007E]
      > [\u00A0-\u00AC]
      > [\u00AE-\u05FF]
      > [\u0604-\u06DC]
      > [\u06DE-\u070E]
      > [\u0710-\u17B3]
      > [\u17B6-\u200A]
      > [\u2010-\u2027]
      > [\u202F-\u205F]
      > [\u2070-\uD7FF]

      So no surrogates?

      > [\uE000-\uFDCF]
      > [\uFDF0-\uFEFE]
      > [\uFF00-\uFFEF]

      Why include FFEF?

      > AmpFollower :: one of
      > '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
      > '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
      > // single quote, backslash, space, TAB, LF, CR
      >
      > (ValidChar excludes format control characters, and some other
      > characters known to be mishandled by browsers. AmpFollower is
      > intended to exclude characters that can start an entity reference.)
      >
      > S is inserted between "<script>" and "</script>" in a place where a
      > <script> tag is allowed in an otherwise valid HTML document, or
      > between "<script><![CDATA[" and "]]></script>" in a place where a
      > <script> tag is allowed in an otherwise valid XHTML document.
      > The HTML or XHTML document starts with a correct <!DOCTYPE or
      > <?xml declaration respectively, and is encoded as well-formed
      > UTF-8.
      >
      > Are these restrictions sufficient to ensure that the embedded
      > script is interpreted as it would have been if referenced from
      > an external file, foiling any attempts of browsers to collude
      > with the attacker in misparsing it?
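
For readers who want to experiment with the restrictions quoted above, here is a minimal Python sketch of the three checks (ValidChar, the forbidden subsequences, and the AmpFollower rule). The function name `is_acceptable` is hypothetical, the ranges are transcribed from the message, and the sketch does not check that S is syntactically correct ES3 or ES3.1. It reads the "&" rule literally, so a trailing "&" with nothing after it is not rejected; that reading is an assumption.

```python
import re

# ValidChar ranges transcribed from the message above (BMP code points only).
VALID_CHARS = re.compile(
    r"[\u0009\u000A\u000D"
    r"\u0020-\u007E\u00A0-\u00AC\u00AE-\u05FF"
    r"\u0604-\u06DC\u06DE-\u070E\u0710-\u17B3"
    r"\u17B6-\u200A\u2010-\u2027\u202F-\u205F"
    r"\u2070-\uD7FF\uE000-\uFDCF\uFDF0-\uFEFE"
    r"\uFF00-\uFFEF]*"
)

# Subsequences that S must not contain.
FORBIDDEN = ("<!", "</", "]]>")

# "&" followed by a character that is NOT an AmpFollower:
# = ( + - ! ~ " / 0-9 ' \ space TAB LF CR
BAD_AMP = re.compile(
    r"&[^=(+\-!~\u0022/0-9\u0027\u005C\u0020\u0009\u000A\u000D]"
)


def is_acceptable(s: str) -> bool:
    """Return True if s passes the three lexical restrictions."""
    if VALID_CHARS.fullmatch(s) is None:
        return False
    if any(seq in s for seq in FORBIDDEN):
        return False
    return BAD_AMP.search(s) is None
```

Under this literal reading, `a & b` and `a &= b` pass, while `a && b` is rejected because the second "&" is not an AmpFollower.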

      You may still be subject to encoding attacks. I'm sure there are
      valid scripts that look like UTF-7, so if the script appears in the
      first 1024B, you might need to make sure it's preceded by a <meta>
      element specifying an encoding, and/or use the XML prologue form that
      specifies an encoding.
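
A small Python sketch of why this matters, assuming nothing beyond the standard `utf-8` and `utf-7` codecs: the bytes below are plain ASCII and read as an ordinary expression in UTF-8, but a consumer that guesses UTF-7 sees a different program. The variable names are made up for the illustration.

```python
# Plain ASCII bytes: in UTF-8 this is the expression  a + ADw - b
script = b"x = a +ADw- b;"


def as_seen_by(data: bytes, encoding: str) -> str:
    """Decode the same bytes the way a consumer using `encoding` would."""
    return data.decode(encoding)


utf8_view = as_seen_by(script, "utf-8")  # "x = a +ADw- b;"
utf7_view = as_seen_by(script, "utf-7")  # "x = a < b;"
```

The "+ADw-" run is the UTF-7 spelling of "<", so character-level filtering done on the UTF-8 reading says nothing about what a UTF-7 reader will see.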

      > Are some of the restrictions unnecessary?
      >
      > --
      > David-Sarah Hopwood ⚥
    • David-Sarah Hopwood
      Message 2 of 7, Feb 17, 2009
        Mike Samuel wrote:
        > 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
        >> Suppose that S is a Unicode string in which each character matches
        >> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
        >> not containing ("&" followed by a character not matching AmpFollower).
        >> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
        >> an attacker.
        >>
        >> ValidChar :: one of
        >> '\u0009' '\u000A' '\u000D' // TAB, LF, CR
        >> [\u0020-\u007E]
        >> [\u00A0-\u00AC]
        >> [\u00AE-\u05FF]
        >> [\u0604-\u06DC]
        >> [\u06DE-\u070E]
        >> [\u0710-\u17B3]
        >> [\u17B6-\u200A]
        >> [\u2010-\u2027]
        >> [\u202F-\u205F]
        >> [\u2070-\uD7FF]
        >
        > So no surrogates?

        Correct. They're not characters (or even "noncharacters").
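
One way to see this from Python, as a rough sketch: surrogate code points have the general category "Cs", and the built-in UTF-8 encoder refuses them, which fits the requirement that the document be well-formed UTF-8. The helper names are hypothetical.

```python
import unicodedata


def is_surrogate(ch: str) -> bool:
    """True for code points U+D800..U+DFFF (general category Cs)."""
    return unicodedata.category(ch) == "Cs"


def encodable_as_utf8(s: str) -> bool:
    """True if s can be written out as well-formed UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

For example, `is_surrogate("\ud800")` is true and `encodable_as_utf8("\ud800")` is false, while the neighbouring U+D7FF and U+E000 are neither surrogates nor rejected by the encoder.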

        >> [\uE000-\uFDCF]
        >> [\uFDF0-\uFEFE]
        >> [\uFF00-\uFFEF]
        >
        > Why include FFEF?

        It's unassigned, and there's no particular reason to exclude it.
        (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
        for "special" characters.)

        >> AmpFollower :: one of
        >> '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
        >> '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
        >> // single quote, backslash, space, TAB, LF, CR
        >>
        >> (ValidChar excludes format control characters, and some other
        >> characters known to be mishandled by browsers. AmpFollower is
        >> intended to exclude characters that can start an entity reference.)
        >>
        >> S is inserted between "<script>" and "</script>" in a place where a
        >> <script> tag is allowed in an otherwise valid HTML document, or
        >> between "<script><![CDATA[" and "]]></script>" in a place where a
        >> <script> tag is allowed in an otherwise valid XHTML document.
        >> The HTML or XHTML document starts with a correct <!DOCTYPE or
        >> <?xml declaration respectively, and is encoded as well-formed
        >> UTF-8.
        >>
        >> Are these restrictions sufficient to ensure that the embedded
        >> script is interpreted as it would have been if referenced from
        >> an external file, foiling any attempts of browsers to collude
        >> with the attacker in misparsing it?
        >
        > You may still be subject to encoding attacks. I'm sure there are
        > valid scripts that look like UTF-7, so if the script appears in the
        > first 1024B, you might need to make sure it's preceded by a <meta>
        > element specifying an encoding, and/or use the XML prologue form that
        > specifies an encoding.

        Right; I covered that in a follow-up. Is including a UTF-8 BOM at the
        start sufficient for all browsers (that is, are there any browsers
        in which a <meta> tag or other content sniffing can override an
        explicit initial UTF-8 BOM, in either HTML or XHTML)?

        HTML5 section 8.2.2.1 seems to indicate that "if the transport layer
        specifies an encoding" (i.e. presumably the charset specified in
        a Content-Type header), then that should override a BOM. That's
        irritating, because it means that you have to assume that the server
        gets the Content-Type right, *as well as* including a BOM for the
        browsers in which Content-Type doesn't override sniffing
        (Internet Explorer, at least), and for the case where the document
        is read from a local file.
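
A minimal sketch of the "both, not either" requirement described above, assuming the served bytes and the Content-Type header value are both available to a checking tool. The function names are hypothetical; the header is parsed with the standard library's `email.message.Message`, which lower-cases the charset it returns.

```python
import codecs
from email.message import Message


def declared_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def bom_and_header_agree(body: bytes, content_type: str) -> bool:
    """True only if the body starts with a UTF-8 BOM AND the
    transport-layer charset is also UTF-8."""
    has_bom = body.startswith(codecs.BOM_UTF8)
    return has_bom and declared_charset(content_type) == "utf-8"
```

A document served without a charset parameter, or with a BOM but a conflicting charset, fails this check, which is the situation the paragraph above is worried about.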

        --
        David-Sarah Hopwood ⚥
      • Mike Samuel
        Message 3 of 7, Feb 17, 2009
          2009/2/17 David-Sarah Hopwood <david.hopwood@...>:
          > Mike Samuel wrote:
          >> 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
          >>> Suppose that S is a Unicode string in which each character matches
          >>> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
          >>> not containing ("&" followed by a character not matching AmpFollower).
          >>> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
          >>> an attacker.
          >>>
          >>> ValidChar :: one of
          >>> '\u0009' '\u000A' '\u000D' // TAB, LF, CR
          >>> [\u0020-\u007E]
          >>> [\u00A0-\u00AC]
          >>> [\u00AE-\u05FF]
          >>> [\u0604-\u06DC]
          >>> [\u06DE-\u070E]
          >>> [\u0710-\u17B3]
          >>> [\u17B6-\u200A]
          >>> [\u2010-\u2027]
          >>> [\u202F-\u205F]
          >>> [\u2070-\uD7FF]
          >>
          >> So no surrogates?
          >
          > Correct. They're not characters (or even "noncharacters").
          >
          >>> [\uE000-\uFDCF]
          >>> [\uFDF0-\uFEFE]
          >>> [\uFF00-\uFFEF]
          >>
          >> Why include FFEF?
          >
          > It's unassigned, and there's no particular reason to exclude it.
          > (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
          > for "special" characters.)

          Isn't it the reflection of fffe, the byte-order-marker?
          This is probably a very minor issue, but if one part of a parser
          naively delegates to another parser that mistakenly treats its content
          as a byte string instead of code units, the presence of a BOM might
          cause the delegatee to misinterpret content when something that looks
          like a BOM appears at the beginning of a chunk of embedded language.


          >>> AmpFollower :: one of
          >>> '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
          >>> '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
          >>> // single quote, backslash, space, TAB, LF, CR
          >>>
          >>> (ValidChar excludes format control characters, and some other
          >>> characters known to be mishandled by browsers. AmpFollower is
          >>> intended to exclude characters that can start an entity reference.)
          >>>
          >>> S is inserted between "<script>" and "</script>" in a place where a
          >>> <script> tag is allowed in an otherwise valid HTML document, or
          >>> between "<script><![CDATA[" and "]]></script>" in a place where a
          >>> <script> tag is allowed in an otherwise valid XHTML document.
          >>> The HTML or XHTML document starts with a correct <!DOCTYPE or
          >>> <?xml declaration respectively, and is encoded as well-formed
          >>> UTF-8.
          >>>
          >>> Are these restrictions sufficient to ensure that the embedded
          >>> script is interpreted as it would have been if referenced from
          >>> an external file, foiling any attempts of browsers to collude
          >>> with the attacker in misparsing it?
          >>
          >> You may still be subject to encoding attacks. I'm sure there are
          >> valid scripts that look like UTF-7, so if the script appears in the
          >> first 1024B, you might need to make sure it's preceded by a <meta>
          >> element specifying an encoding, and/or use the XML prologue form that
          >> specifies an encoding.
          >
          > Right; I covered that in a follow-up. Is including a UTF-8 BOM at the
          > start sufficient for all browsers (that is, are there any browsers
          > in which a <meta> tag or other content sniffing can override an
          > explicit initial UTF-8 BOM, in either HTML or XHTML)?

          Ah cool. I don't know the answer to that question.


          > HTML5 section 8.2.2.1 seems to indicate that "if the transport layer
          > specifies an encoding" (i.e. presumably the charset specified in
          > a Content-Type header), then that should override a BOM. That's
          > irritating, because it means that you have to assume that the server
          > gets the Content-Type right, *as well as* including a BOM for the
          > browsers in which Content-Type doesn't override sniffing
          > (Internet Explorer, at least), and for the case where the document
          > is read from a local file.

          Yeah. I think the best thing to do is to use a fairly standard
          encoding like UTF-8, and make sure the XML prologue, <meta
          http-equiv="content-type">, and headers all agree.
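
A rough Python sketch of such an agreement check, assuming the header charset has already been extracted and the document bytes are at hand. The regular expressions are deliberately simplistic and are not a real HTML or XML parser; `encodings_agree` is a hypothetical helper. Names are normalized with `codecs.lookup`, so "UTF-8" and "utf8" compare equal.

```python
import codecs
import re

# Simplistic patterns for the two in-document declarations.
XML_DECL = re.compile(rb'^<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']')
META = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9._-]+)', re.I)


def encodings_agree(header_charset: str, document: bytes) -> bool:
    """True if the header charset and every in-document declaration
    that is present name the same encoding."""
    if document.startswith(codecs.BOM_UTF8):
        document = document[len(codecs.BOM_UTF8):]
    found = [header_charset]
    for pattern in (XML_DECL, META):
        m = pattern.search(document)
        if m:
            found.append(m.group(1).decode("ascii"))
    try:
        # Normalize aliases before comparing.
        return len({codecs.lookup(name).name for name in found}) == 1
    except LookupError:
        # An unknown encoding name counts as disagreement.
        return False
```

Declarations that are absent are simply skipped, so this only detects outright conflicts, not missing declarations.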

          I don't think that you can do much about file hosting services that go
          out of their way to specify a wacky encoding. Putting a BOM at the
          front will help hosting services that make a genuine effort.


          > --
          > David-Sarah Hopwood ⚥
          >
          >
        • David-Sarah Hopwood
          Message 4 of 7, Feb 18, 2009
            Mike Samuel wrote:
            > 2009/2/17 David-Sarah Hopwood <david.hopwood@...>:
            >> Mike Samuel wrote:
            >>> 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
            >>>> ValidChar :: one of
            [...]
            >>>> [\uFF00-\uFFEF]
            >>> Why include FFEF?
            >> It's unassigned, and there's no particular reason to exclude it.
            >> (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
            >> for "special" characters.)
            >
            > Isn't it the reflection of fffe, the byte-order-marker.

            No, \uFEFF is the BOM, and its byte-reflection \uFFFE is a noncharacter,
            so already excluded from ValidChar.
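
The arithmetic can be checked with a short Python sketch. The range table is transcribed from the ValidChar definition earlier in the thread, and `swap_bytes` and `in_valid_char` are hypothetical helper names.

```python
# ValidChar as inclusive (low, high) code point ranges, from the thread.
VALID_RANGES = [
    (0x0009, 0x000A), (0x000D, 0x000D),
    (0x0020, 0x007E), (0x00A0, 0x00AC), (0x00AE, 0x05FF),
    (0x0604, 0x06DC), (0x06DE, 0x070E), (0x0710, 0x17B3),
    (0x17B6, 0x200A), (0x2010, 0x2027), (0x202F, 0x205F),
    (0x2070, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFEFE),
    (0xFF00, 0xFFEF),
]


def in_valid_char(cp: int) -> bool:
    """True if code point cp falls in one of the ValidChar ranges."""
    return any(lo <= cp <= hi for lo, hi in VALID_RANGES)


def swap_bytes(cp: int) -> int:
    """Byte-reverse a 16-bit code unit, e.g. 0xFEFF -> 0xFFFE."""
    return ((cp & 0xFF) << 8) | (cp >> 8)


bom = 0xFEFF
reflected_bom = swap_bytes(bom)       # 0xFFFE, a noncharacter
reflected_ffef = swap_bytes(0xFFEF)   # 0xEFFF, in the private use area
```

Both U+FEFF and U+FFFE fall outside the table, while U+FFEF is inside it and its byte-swapped form U+EFFF is an ordinary private-use code point rather than a BOM.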

            (Thought you'd spotted something I'd missed for a second, there.)

            --
            David-Sarah Hopwood ⚥
          • Mike Samuel
            Message 5 of 7, Feb 18, 2009
              2009/2/18 David-Sarah Hopwood <david.hopwood@...>:
              > Mike Samuel wrote:
              >> 2009/2/17 David-Sarah Hopwood <david.hopwood@...>:
              >>> Mike Samuel wrote:
              >>>> 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
              >>>>> ValidChar :: one of
              > [...]
              >>>>> [\uFF00-\uFFEF]
              >>>> Why include FFEF?
              >>> It's unassigned, and there's no particular reason to exclude it.
              >>> (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
              >>> for "special" characters.)
              >>
              >> Isn't it the reflection of fffe, the byte-order-marker.
              >
              > No, \uFEFF is the BOM, and its byte-reflection \uFFFE is a noncharacter,
              > so already excluded from ValidChar.

              Ah, quite right.

              > (Thought you'd spotted something I'd missed for a second, there.)
              >
              > --
              > David-Sarah Hopwood ⚥