Am I paranoid enough?

  • David-Sarah Hopwood
    Message 1 of 7 , Feb 16 7:16 AM
      Suppose that S is a Unicode string in which each character matches
      ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
      not containing ("&" followed by a character not matching AmpFollower).
      S encodes a syntactically correct ES3 or ES3.1 source text chosen by
      an attacker.

      ValidChar :: one of
      '\u0009' '\u000A' '\u000D' // TAB, LF, CR
      [\u0020-\u007E]
      [\u00A0-\u00AC]
      [\u00AE-\u05FF]
      [\u0604-\u06DC]
      [\u06DE-\u070E]
      [\u0710-\u17B3]
      [\u17B6-\u200A]
      [\u2010-\u2027]
      [\u202F-\u205F]
      [\u2070-\uD7FF]
      [\uE000-\uFDCF]
      [\uFDF0-\uFEFE]
      [\uFF00-\uFFEF]

      AmpFollower :: one of
      '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
      '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
      // single quote, backslash, space, TAB, LF, CR

      (ValidChar excludes format control characters, and some other
      characters known to be mishandled by browsers. AmpFollower is
      intended to exclude characters that can start an entity reference.)
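
The restrictions on S above can be sketched as a small checker. This is only an illustration, assuming Python; the names `VALID_RANGES`, `AMP_FOLLOWERS`, `is_valid_char` and `check_script_text` are hypothetical, and the sketch does not check that S is syntactically correct ES3 or ES3.1.

```python
# ValidChar ranges transcribed from the message above (inclusive code points).
VALID_RANGES = [
    (0x0009, 0x000A), (0x000D, 0x000D),  # TAB, LF, CR
    (0x0020, 0x007E), (0x00A0, 0x00AC), (0x00AE, 0x05FF),
    (0x0604, 0x06DC), (0x06DE, 0x070E), (0x0710, 0x17B3),
    (0x17B6, 0x200A), (0x2010, 0x2027), (0x202F, 0x205F),
    (0x2070, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFEFE),
    (0xFF00, 0xFFEF),
]

# AmpFollower: the only characters allowed directly after "&".
AMP_FOLLOWERS = set("=(+-!~\"/0123456789'\\ \t\n\r")

FORBIDDEN_SUBSEQUENCES = ("<!", "</", "]]>")


def is_valid_char(ch):
    """True if the single character ch falls in one of the ValidChar ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VALID_RANGES)


def check_script_text(s):
    """Return a list of rule violations found in s (empty if none)."""
    problems = []
    for i, ch in enumerate(s):
        if not is_valid_char(ch):
            problems.append("U+%04X at index %d is not a ValidChar" % (ord(ch), i))
        if ch == "&":
            # Assumption: a trailing "&" is rejected too, because the next
            # character would come from the closing wrapper.
            follower = s[i + 1] if i + 1 < len(s) else None
            if follower not in AMP_FOLLOWERS:
                problems.append("'&' at index %d is not followed by an AmpFollower" % i)
    for seq in FORBIDDEN_SUBSEQUENCES:
        if seq in s:
            problems.append("contains forbidden subsequence %r" % seq)
    return problems
```

For example, `check_script_text("var x = a & 1;")` returns an empty list, while a string containing `</` or a zero-width space (U+200B) is reported.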

      S is inserted between "<script>" and "</script>" in a place where a
      <script> tag is allowed in an otherwise valid HTML document, or
      between "<script><![CDATA[" and "]]></script>" in a place where a
      <script> tag is allowed in an otherwise valid XHTML document.
      The HTML or XHTML document starts with a correct <!DOCTYPE or
      <?xml declaration respectively, and is encoded as well-formed
      UTF-8.


      Are these restrictions sufficient to ensure that the embedded
      script is interpreted as it would have been if referenced from
      an external file, foiling any attempts of browsers to collude
      with the attacker in misparsing it?

      Are some of the restrictions unnecessary?

      --
      David-Sarah Hopwood ⚥
    • David-Sarah Hopwood
      Message 2 of 7 , Feb 16 8:29 AM
        No, I'm not paranoid enough yet. It's not sufficient only to say
        that the HTML is encoded as UTF-8 (see below).

        David-Sarah Hopwood wrote:
        [...]
        > The HTML or XHTML document starts with a correct <!DOCTYPE or
        > <?xml declaration respectively,

        I meant, the document starts with <!DOCTYPE HTML> in the case
        of HTML, or <?xml version="1.0"?><!DOCTYPE HTML> in the case of
        XHTML.

        (This will also put the parser into sane^H^H^H^Hstandards mode.)

        > and is encoded as well-formed UTF-8.

        The document must also start with a UTF-8 BOM, *and* must not
        contain a META directive that changes the charset, *and* in the
        case of HTML, must either be retrieved from a local file or over
        HTTP with the header "Content-Type: text/html; charset=UTF-8".
        This is because the method of determining the encoding is chosen
        based on the phase of the moon.
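
A minimal sketch of producing such a document for the HTML case, assuming Python. The function name `build_html_document` and the page skeleton are hypothetical; the BOM, the `<!DOCTYPE HTML>` prefix, the absence of a META charset directive and the header value are the conditions stated above.

```python
import codecs

# Header value quoted in the message above; the server must send it as-is.
CONTENT_TYPE = "text/html; charset=UTF-8"


def build_html_document(script_text):
    """Return the bytes of a minimal HTML page embedding script_text."""
    page = (
        "<!DOCTYPE HTML>\n"
        "<html><head><title>demo</title>\n"
        "<script>" + script_text + "</script>\n"
        "</head><body></body></html>\n"
    )
    # Strict encoding raises UnicodeEncodeError on lone surrogates,
    # so whatever is returned is well-formed UTF-8.
    return codecs.BOM_UTF8 + page.encode("utf-8")
```

The sketch does not validate `script_text` itself; it only covers the framing around it.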

        Any other problems?

        --
        David-Sarah Hopwood ⚥
      • Mike Samuel
        Message 3 of 7 , Feb 16 3:38 PM
          2009/2/16 David-Sarah Hopwood <david.hopwood@...>
          >
          > Suppose that S is a Unicode string in which each character matches
          > ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
          > not containing ("&" followed by a character not matching AmpFollower).
          > S encodes a syntactically correct ES3 or ES3.1 source text chosen by
          > an attacker.
          >
          > ValidChar :: one of
          > '\u0009' '\u000A' '\u000D' // TAB, LF, CR
          > [\u0020-\u007E]
          > [\u00A0-\u00AC]
          > [\u00AE-\u05FF]
          > [\u0604-\u06DC]
          > [\u06DE-\u070E]
          > [\u0710-\u17B3]
          > [\u17B6-\u200A]
          > [\u2010-\u2027]
          > [\u202F-\u205F]
          > [\u2070-\uD7FF]

          So no surrogates?

          > [\uE000-\uFDCF]
          > [\uFDF0-\uFEFE]
          > [\uFF00-\uFFEF]

          Why include FFEF?

          > AmpFollower :: one of
          > '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
          > '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
          > // single quote, backslash, space, TAB, LF, CR
          >
          > (ValidChar excludes format control characters, and some other
          > characters known to be mishandled by browsers. AmpFollower is
          > intended to exclude characters that can start an entity reference.)
          >
          > S is inserted between "<script>" and "</script>" in a place where a
          > <script> tag is allowed in an otherwise valid HTML document, or
          > between "<script><![CDATA[" and "]]></script>" in a place where a
          > <script> tag is allowed in an otherwise valid XHTML document.
          > The HTML or XHTML document starts with a correct <!DOCTYPE or
          > <?xml declaration respectively, and is encoded as well-formed
          > UTF-8.
          >
          > Are these restrictions sufficient to ensure that the embedded
          > script is interpreted as it would have been if referenced from
          > an external file, foiling any attempts of browsers to collude
          > with the attacker in misparsing it?

          You may still be subject to encoding attacks. I'm sure there are
          valid scripts that look like UTF-7, so if the script appears in the
          first 1024B, you might need to make sure it's preceded by a <meta>
          element specifying an encoding, and/or use the XML prologue form that
          specifies an encoding.
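
To illustrate the UTF-7 concern, here is a small sketch assuming Python and its built-in `utf-7` codec. The sample script is a made-up example: it uses only ASCII characters that match ValidChar and reads as an ordinary arithmetic expression over identifiers, yet the same bytes decode to markup if a browser guesses UTF-7.

```python
# Hypothetical attacker-chosen script: a + ADw - script + AD4 - b
script = "a+ADw-script+AD4-b"

# The bytes are identical in ASCII and UTF-8.
raw = script.encode("utf-8")

# What the author intended versus what a UTF-7 decoder sees.
as_utf8 = raw.decode("utf-8")
as_utf7 = raw.decode("utf-7")

print(as_utf8)
print(as_utf7)
```

Decoded as UTF-7, `+ADw-` becomes `<` and `+AD4-` becomes `>`, so the character-level checks on S say nothing about what a mis-sniffing browser will parse.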

          > Are some of the restrictions unnecessary?
          >
          > --
          > David-Sarah Hopwood ⚥
        • David-Sarah Hopwood
          Message 4 of 7 , Feb 17 3:13 AM
            Mike Samuel wrote:
            > 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
            >> Suppose that S is a Unicode string in which each character matches
            >> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
            >> not containing ("&" followed by a character not matching AmpFollower).
            >> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
            >> an attacker.
            >>
            >> ValidChar :: one of
            >> '\u0009' '\u000A' '\u000D' // TAB, LF, CR
            >> [\u0020-\u007E]
            >> [\u00A0-\u00AC]
            >> [\u00AE-\u05FF]
            >> [\u0604-\u06DC]
            >> [\u06DE-\u070E]
            >> [\u0710-\u17B3]
            >> [\u17B6-\u200A]
            >> [\u2010-\u2027]
            >> [\u202F-\u205F]
            >> [\u2070-\uD7FF]
            >
            > So no surrogates?

            Correct. They're not characters (or even "noncharacters").

            >> [\uE000-\uFDCF]
            >> [\uFDF0-\uFEFE]
            >> [\uFF00-\uFFEF]
            >
            > Why include FFEF?

            It's unassigned, and there's no particular reason to exclude it.
            (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
            for "special" characters.)
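
One way to look at the code points discussed here is through their Unicode general categories. This is a sketch assuming Python's standard `unicodedata` module; the helper name `general_category` is hypothetical, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def general_category(code_point):
    """Return the two-letter Unicode general category of a code point."""
    return unicodedata.category(chr(code_point))


# Cs = surrogate, Cn = unassigned or noncharacter, Cf = format control.
for cp in (0xD800, 0xFFEF, 0xFFF0, 0xFFF8, 0xFFFE, 0xFEFF):
    print("U+%04X %s" % (cp, general_category(cp)))
```

Surrogates report `Cs`, while U+FFEF and U+FFF0 through U+FFF8 report `Cn`, the same category the database gives to noncharacters such as U+FFFE.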

            >> AmpFollower :: one of
            >> '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
            >> '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
            >> // single quote, backslash, space, TAB, LF, CR
            >>
            >> (ValidChar excludes format control characters, and some other
            >> characters known to be mishandled by browsers. AmpFollower is
            >> intended to exclude characters that can start an entity reference.)
            >>
            >> S is inserted between "<script>" and "</script>" in a place where a
            >> <script> tag is allowed in an otherwise valid HTML document, or
            >> between "<script><![CDATA[" and "]]></script>" in a place where a
            >> <script> tag is allowed in an otherwise valid XHTML document.
            >> The HTML or XHTML document starts with a correct <!DOCTYPE or
            >> <?xml declaration respectively, and is encoded as well-formed
            >> UTF-8.
            >>
            >> Are these restrictions sufficient to ensure that the embedded
            >> script is interpreted as it would have been if referenced from
            >> an external file, foiling any attempts of browsers to collude
            >> with the attacker in misparsing it?
            >
            > You may still be subject to encoding attacks. I'm sure there are
            > valid scripts that look like UTF-7, so if the script appears in the
            > first 1024B, you might need to make sure it's preceded by a <meta>
            > element specifying an encoding, and/or use the XML prologue form that
            > specifies an encoding.

            Right; I covered that in a follow-up. Is including a UTF-8 BOM at the
            start sufficient for all browsers (that is, are there any browsers
            in which a <meta> tag or other content sniffing can override an
            explicit initial UTF-8 BOM, in either HTML or XHTML)?

            HTML5 section 8.2.2.1 seems to indicate that "if the transport layer
            specifies an encoding" (i.e. presumably the charset specified in
            a Content-Type header), then that should override a BOM. That's
            irritating, because it means that you have to assume that the server
            gets the Content-Type right, *as well as* including a BOM for the
            browsers in which Content-Type doesn't override sniffing
            (Internet Explorer, at least), and for the case where the document
            is read from a local file.

            --
            David-Sarah Hopwood ⚥
          • Mike Samuel
            Message 5 of 7 , Feb 17 10:50 AM
              2009/2/17 David-Sarah Hopwood <david.hopwood@...>:
              > Mike Samuel wrote:
              >> 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
              >>> Suppose that S is a Unicode string in which each character matches
              >>> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
              >>> not containing ("&" followed by a character not matching AmpFollower).
              >>> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
              >>> an attacker.
              >>>
              >>> ValidChar :: one of
              >>> '\u0009' '\u000A' '\u000D' // TAB, LF, CR
              >>> [\u0020-\u007E]
              >>> [\u00A0-\u00AC]
              >>> [\u00AE-\u05FF]
              >>> [\u0604-\u06DC]
              >>> [\u06DE-\u070E]
              >>> [\u0710-\u17B3]
              >>> [\u17B6-\u200A]
              >>> [\u2010-\u2027]
              >>> [\u202F-\u205F]
              >>> [\u2070-\uD7FF]
              >>
              >> So no surrogates?
              >
              > Correct. They're not characters (or even "noncharacters").
              >
              >>> [\uE000-\uFDCF]
              >>> [\uFDF0-\uFEFE]
              >>> [\uFF00-\uFFEF]
              >>
              >> Why include FFEF?
              >
              > It's unassigned, and there's no particular reason to exclude it.
              > (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
              > for "special" characters.)

              Isn't it the reflection of fffe, the byte-order-marker?
              This is probably a very minor issue, but if one part of a parser
              naively delegates to another parser that mistakenly treats its content
              as a byte string instead of code units, the presence of a BOM might
              cause the delegatee to misinterpret content when something that looks
              like a BOM appears at the beginning of a chunk of embedded language.


              >>> AmpFollower :: one of
              >>> '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
              >>> '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
              >>> // single quote, backslash, space, TAB, LF, CR
              >>>
              >>> (ValidChar excludes format control characters, and some other
              >>> characters known to be mishandled by browsers. AmpFollower is
              >>> intended to exclude characters that can start an entity reference.)
              >>>
              >>> S is inserted between "<script>" and "</script>" in a place where a
              >>> <script> tag is allowed in an otherwise valid HTML document, or
              >>> between "<script><![CDATA[" and "]]></script>" in a place where a
              >>> <script> tag is allowed in an otherwise valid XHTML document.
              >>> The HTML or XHTML document starts with a correct <!DOCTYPE or
              >>> <?xml declaration respectively, and is encoded as well-formed
              >>> UTF-8.
              >>>
              >>> Are these restrictions sufficient to ensure that the embedded
              >>> script is interpreted as it would have been if referenced from
              >>> an external file, foiling any attempts of browsers to collude
              >>> with the attacker in misparsing it?
              >>
              >> You may still be subject to encoding attacks. I'm sure there are
              >> valid scripts that look like UTF-7, so if the script appears in the
              >> first 1024B, you might need to make sure it's preceded by a <meta>
              >> element specifying an encoding, and/or use the XML prologue form that
              >> specifies an encoding.
              >
              > Right; I covered that in a follow-up. Is including a UTF-8 BOM at the
              > start sufficient for all browsers (that is, are there any browsers
              > in which a <meta> tag or other content sniffing can override an
              > explicit initial UTF-8 BOM, in either HTML or XHTML)?

              Ah cool. I don't know the answer to that question.


              > HTML5 section 8.2.2.1 seems to indicate that "if the transport layer
              > specifies an encoding" (i.e. presumably the charset specified in
              > a Content-Type header), then that should override a BOM. That's
              > irritating, because it means that you have to assume that the server
              > gets the Content-Type right, *as well as* including a BOM for the
              > browsers in which Content-Type doesn't override sniffing
              > (Internet Explorer, at least), and for the case where the document
              > is read from a local file.

              Yeah. I think the best thing to do is to use a fairly standard
              encoding like UTF-8, and make sure the XML prologue, <meta
              http-equiv="content-type">, and headers all agree.

              I don't think that you can do much about file hosting services that go
              out of their way to specify a whacky encoding. Putting a BOM at the
              front will help hosting services that make a genuine effort.
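
A rough sketch of checking that the three declarations agree, assuming Python. The function names are hypothetical and the regular expressions are deliberately crude; they are not a substitute for a real HTML or XML parser. The sketch also assumes the Content-Type header must name the charset explicitly, while the XML prologue and `<meta>` declarations are optional but must match when present.

```python
import re


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value."""
    m = re.search(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)"?', header_value, re.I)
    return m.group(1).lower() if m else None


def charset_from_xml_prologue(doc_text):
    """Extract the encoding from a leading <?xml ...?> declaration, if any."""
    m = re.match(
        r'\ufeff?<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
        doc_text,
    )
    return m.group(1).lower() if m else None


def charset_from_meta(doc_text):
    """Extract the charset from the first <meta> that mentions one, if any."""
    m = re.search(r'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', doc_text, re.I)
    return m.group(1).lower() if m else None


def declared_encodings_agree(header_value, doc_text, expected="utf-8"):
    """True if the header names `expected` and no in-document declaration differs."""
    if charset_from_content_type(header_value) != expected:
        return False
    in_document = (charset_from_xml_prologue(doc_text), charset_from_meta(doc_text))
    return all(found is None or found == expected for found in in_document)
```

A call such as `declared_encodings_agree("text/html; charset=UTF-8", page_text)` returns `False` as soon as a `<meta>` element in `page_text` names a different charset.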


              > --
              > David-Sarah Hopwood ⚥
              >
              >
            • David-Sarah Hopwood
              Message 6 of 7 , Feb 18 9:26 AM
                Mike Samuel wrote:
                > 2009/2/17 David-Sarah Hopwood <david.hopwood@...>:
                >> Mike Samuel wrote:
                >>> 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
                >>>> ValidChar :: one of
                [...]
                >>>> [\uFF00-\uFFEF]
                >>> Why include FFEF?
                >> It's unassigned, and there's no particular reason to exclude it.
                >> (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
                >> for "special" characters.)
                >
                > Isn't it the reflection of fffe, the byte-order-marker?

                No, \uFEFF is the BOM, and its byte-reflection \uFFFE is a noncharacter,
                so already excluded from ValidChar.
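
The byte-reflection relationship can be checked directly. This is a small sketch assuming Python; the variable names and the `in_ranges` helper are made up for illustration, and only the two ValidChar ranges nearest these code points are transcribed.

```python
import codecs

bom = "\ufeff"

# Big-endian UTF-16 bytes of the BOM: FE FF.
bom_bytes = bom.encode("utf-16-be")

# Swapping the two bytes gives FF FE, which read big-endian is U+FFFE.
reflected = bom_bytes[::-1].decode("utf-16-be")

# The two ValidChar ranges closest to these code points.
NEARBY_RANGES = [(0xFDF0, 0xFEFE), (0xFF00, 0xFFEF)]


def in_ranges(ch):
    """True if ch falls in one of the nearby ValidChar ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in NEARBY_RANGES)


print("U+%04X" % ord(reflected), in_ranges(bom), in_ranges(reflected), in_ranges("\uffef"))
```

Neither U+FEFF nor U+FFFE falls inside those ranges, while U+FFEF does.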

                (Thought you'd spotted something I'd missed for a second, there.)

                --
                David-Sarah Hopwood ⚥
              • Mike Samuel
                Message 7 of 7 , Feb 18 1:54 PM
                  2009/2/18 David-Sarah Hopwood <david.hopwood@...>:
                  > Mike Samuel wrote:
                  >> 2009/2/17 David-Sarah Hopwood <david.hopwood@...>:
                  >>> Mike Samuel wrote:
                  >>>> 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
                  >>>>> ValidChar :: one of
                  > [...]
                  >>>>> [\uFF00-\uFFEF]
                  >>>> Why include FFEF?
                  >>> It's unassigned, and there's no particular reason to exclude it.
                  >>> (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
                  >>> for "special" characters.)
                  >>
                  >> Isn't it the reflection of fffe, the byte-order-marker?
                  >
                  > No, \uFEFF is the BOM, and its byte-reflection \uFFFE is a noncharacter,
                  > so already excluded from ValidChar.

                  Ah, quite right.

                  > (Thought you'd spotted something I'd missed for a second, there.)
                  >
                  > --
                  > David-Sarah Hopwood ⚥