ADsafe, Take 6

  • Douglas Crockford
    Message 1 of 30, Oct 16, 2007
      The next step is to secure HTML fragments. JSLint has an HTML fragment
      option. When used with ADsafe, it will accept a <div> or <iframe> and
      its contents. It will be inspected for XSS attacks and other worries.

      The <div> may contain a <script> that will also be vetted and vatted.

      The biggest open issue is policy on id's of HTML elements. I'll be
      working with our ad system people to resolve that.

      Safe HTML makes safe JS look easy. Really easy. Please let me know
      what XSS attacks get passed.

      http://www.JSLint.com/
    • collin_jackson
      Message 2 of 30, Oct 16, 2007
        <div x="\"><img onload=alert(42)
        src=http://json.org/img/json160.gif>"></div>

        --- In caplet@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
        >
        > The next step is to secure HTML fragments. JSLint has an HTML fragment
        > option. When used with ADsafe, it will accept a <div> or <iframe> and
        > its contents. It will be inspected for XSS attacks and other worries.
        >
        > The <div> may contain a <script> that will also be vetted and vatted.
        >
        > The biggest open issue is policy on id's of HTML elements. I'll be
        > working with our ad system people to resolve that.
        >
        > Safe HTML makes safe JS look easy. Really easy. Please let me know
        > what XSS attacks get passed.
        >
        > http://www.JSLint.com/
        >
      • Douglas Crockford
        Message 3 of 30, Oct 16, 2007
          --- In caplet@yahoogroups.com, "collin_jackson" <collinj@...> wrote:
          >
          > <div x="\"><img onload=alert(42)
          > src=http://json.org/img/json160.gif>"></div>

          Excellent. Keep them coming.
        • collin_jackson
          Message 4 of 30, Oct 16, 2007
            Null byte between "java" and "script" passes JSLint on Firefox despite
            being an attack on IE: <iframe src="java�script:alert(42)"></iframe>

            Also:

            <iframe src="data:text/html,<body onload=alert(42) />"></iframe>

            --- In caplet@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
            >
            > --- In caplet@yahoogroups.com, "collin_jackson" <collinj@> wrote:
            > >
            > > <div x="\"><img onload=alert(42)
            > > src=http://json.org/img/json160.gif>"></div>
            >
            > Excellent. Keep them coming.
            >
          • collin_jackson
            Message 5 of 30, Oct 16, 2007
              Also: <div style="width: expres/**/sion
              (document.body.innerHTML='gotcha')"></div>
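
A checker that searches style values for the literal word "expression" would miss the payload above, because the comment splits the keyword. As a minimal sketch of one mitigation, assuming style values are inspected as plain strings, comments can be removed before searching. The name `hasCssExpression` is invented here for illustration and is not part of JSLint or ADsafe. The sketch does not cover other obfuscations such as backslash escapes.

```javascript
// Hypothetical helper, not part of JSLint or ADsafe.
function hasCssExpression(style) {
  // Remove /* ... */ comments so that "expres/**/sion" collapses to "expression".
  const withoutComments = style.replace(/\/\*[\s\S]*?\*\//g, "");
  // Flag the expression( construct, ignoring case and spacing before the paren.
  return /expression\s*\(/i.test(withoutComments);
}
```

A read-only checker would reject the fragment when this returns true rather than try to repair it.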

              --- In caplet@yahoogroups.com, "collin_jackson" <collinj@...> wrote:
              >
              > Null byte between "java" and "script" passes JSLint on Firefox despite
              > being an attack on IE: <iframe src="java�script:alert(42)"></iframe>
              >
              > Also:
              >
              > <iframe src="data:text/html,<body onload=alert(42) />"></iframe>
              >
              > --- In caplet@yahoogroups.com, "Douglas Crockford" <douglas@> wrote:
              > >
              > > --- In caplet@yahoogroups.com, "collin_jackson" <collinj@> wrote:
              > > >
              > > > <div x="\"><img onload=alert(42)
              > > > src=http://json.org/img/json160.gif>"></div>
              > >
              > > Excellent. Keep them coming.
              > >
              >
            • Douglas Crockford
              Message 6 of 30, Oct 17, 2007
                --- In caplet@yahoogroups.com, "collin_jackson" <collinj@...> wrote:
                >
                > Null byte between "java" and "script" passes JSLint on Firefox despite
                > being an attack on IE

                I scan every line for null and other characters. I am guessing that
                the null is lost in the browser's paste process. In production,
                inspection will be done on files, so I don't think that will be a problem.
              • collin_jackson
                Message 7 of 30, Oct 17, 2007
                  I'm not pasting. I'm reading the value of a textarea into JSLint
                  directly using JavaScript.

                  See http://crypto.stanford.edu/jsonrequest/nullbyte2.html

                  It looks like Firefox is converting null bytes to Unicode character
                  65533, which isn't rejected by JSLint. So all you need to do is reject
                  Unicode character 65533 to defeat this attack.

                  (Note that null bytes vanish in IE, which is fine as long as Firefox
                  rejects them.)
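
The check proposed here could look something like the following sketch. The function name `hasForbiddenCharacter` is an assumption made for illustration; JSLint's actual character scan is not shown in this thread and may differ.

```javascript
// Hypothetical pre-scan; JSLint's real implementation may differ.
function hasForbiddenCharacter(source) {
  // U+0000 is the null byte. U+FFFD (decimal 65533) is the replacement
  // character that Firefox is reported above to substitute for it.
  return /[\u0000\uFFFD]/.test(source);
}
```

Rejecting both code points covers the case where the null survives as well as the case where the browser has already replaced it.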

                  --- In caplet@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
                  >
                  > --- In caplet@yahoogroups.com, "collin_jackson" <collinj@> wrote:
                  > >
                  > > Null byte between "java" and "script" passes JSLint on Firefox despite
                  > > being an attack on IE
                  >
                  > I scan every line for null and other characters. I am guessing that
                  > the null is lost in the browser's paste process. In production,
                  > inspection will be done on files, so I don't think that will be a problem.
                  >
                • Douglas Crockford
                  Message 8 of 30, Oct 17, 2007
                    --- In caplet@yahoogroups.com, "collin_jackson" <collinj@...> wrote:
                    >
                    > I'm not pasting. I'm reading the value of a textarea into JSLint
                    > directly using JavaScript.
                    >
                    > See http://crypto.stanford.edu/jsonrequest/nullbyte2.html
                    >
                    > It looks like Firefox is converting null bytes to Unicode character
                    > 65533, which isn't rejected by JSLint. So all you need to do is reject
                    > Unicode character 65533 to defeat this attack.
                    >
                    > (Note that null bytes vanish in IE, which is fine as long as Firefox
                    > rejects them.)

                    We'll need to test that WScript.StdIn.ReadAll passes the nulls
                    through. I think I have everything else that you identified covered.
                  • Mike Samuel
                    Message 9 of 30, Oct 17, 2007
                      RFC 3986 disallows the null byte in URIs, and says URIs are sequences of bytes, not characters, so 65533 is out of range.

                      In your attribute whitelist, can't you identify all attributes whose value is a URI or URI Reference, and restrict the unescaped value to the union of the reserved and unreserved characters and '%'?

                            unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

                            reserved    = gen-delims / sub-delims

                            gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

                            sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                                        / "*" / "+" / "," / ";" / "="

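One way to read this suggestion, as a sketch only: test each URI-valued attribute against a character class built from the RFC 3986 sets quoted above plus '%'. The names `URI_CHARACTERS` and `hasOnlyUriCharacters` are hypothetical.

```javascript
// unreserved, gen-delims and sub-delims from RFC 3986, plus "%".
// Hypothetical names; this is not code from JSLint or ADsafe.
const URI_CHARACTERS = /^[A-Za-z0-9\-._~:\/?#\[\]@!$&'()*+,;=%]*$/;

function hasOnlyUriCharacters(value) {
  // True only when every character of the value is in the allowed set.
  return URI_CHARACTERS.test(value);
}
```

This restricts characters only. A value such as javascript:alert(42) still passes, so a scheme check would be needed in addition.
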
                      cheers,
                      mike

                      On 17/10/2007, collin_jackson <collinj@...> wrote:

                      I'm not pasting. I'm reading the value of a textarea into JSLint
                      directly using JavaScript.

                      See http://crypto.stanford.edu/jsonrequest/nullbyte2.html

                      It looks like Firefox is converting null bytes to Unicode character
                      65533, which isn't rejected by JSLint. So all you need to do is reject
                      Unicode character 65533 to defeat this attack.

                      (Note that null bytes vanish in IE, which is fine as long as Firefox
                      rejects them.)

                      --- In caplet@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
                      >
                      > --- In caplet@yahoogroups.com, "collin_jackson" <collinj@> wrote:
                      > >
                      > > Null byte between "java" and "script" passes JSLint on Firefox despite
                      > > being an attack on IE
                      >
                      > I scan every line for null and other characters. I am guessing that
                      > the null is lost in the browser's paste process. In production,
                      > inspection will be done on files, so I don't think that will be a problem.
                      >


                    • David Hopwood
                      Message 10 of 30, Oct 18, 2007
                        collin_jackson wrote:
                        > Null byte between "java" and "script" passes JSLint on Firefox despite
                        > being an attack on IE: <iframe src="java�script:alert(42)"></iframe>
                        >
                        > Also:
                        >
                        > <iframe src="data:text/html,<body onload=alert(42) />"></iframe>

                        The diversity of possible attacks on HTML, and the difficulty in keeping
                        up with any changes in browsers, suggests to me that it may be a better
                        idea simply not to support direct HTML embedding. Apart from the latency
                        cost of fetching a script from a separate URL, is there any other reason
                        to support it?

                        --
                        David Hopwood <david.hopwood@...>
                      • Mike Samuel
                        Message 11 of 30, Oct 18, 2007
                          It's tough to write a useful application for a browser if you can't manipulate html.



                          On 18/10/2007, David Hopwood < david.hopwood@...> wrote:

                          collin_jackson wrote:
                          > Null byte between "java" and "script" passes JSLint on Firefox despite
                          > being an attack on IE: <iframe src="java&#65533;script:alert(42)"></iframe>
                          >
                          > Also:
                          >
                          > <iframe src="data:text/html,<body onload=alert(42) />"></iframe>

                          The diversity of possible attacks on HTML, and the difficulty in keeping
                          up with any changes in browsers, suggests to me that it may be a better
                          idea simply not to support direct HTML embedding. Apart from the latency
                          cost of fetching a script from a separate URL, is there any other reason
                          to support it?

                          --
                          David Hopwood <david.hopwood@...>


                        • David Hopwood
                          Message 12 of 30, Oct 18, 2007
                            Mike Samuel wrote:
                            > It's tough to write a useful application for a browser if you can't
                            > manipulate html.

                            The most common approach to preventing XSS attacks in user-generated content
                            is not to allow HTML in that content, but to translate some simpler mark-up
                            (e.g. BBCode or a wiki mark-up) into HTML.

                            Even an HTML-to-HTML translation (which quotes the hell out of any
                            suspicious characters in the input) is easier to do correctly than trying
                            to *filter* HTML correctly. The problem with filtering, as the attacks
                            shown so far have demonstrated, is that any tiny difference in the
                            interpretation of comments, strings, or anything else that changes the
                            "mode" of the parser, turns into an exploitable bug.

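The "quote everything suspicious" idea can be sketched as an escaping function. This illustrates the general technique and is not something proposed in the thread in this form; `escapeHtml` is a hypothetical name.

```javascript
// Hypothetical translator: every markup-significant character becomes an
// entity, so no input can change the "mode" of the browser's parser.
function escapeHtml(text) {
  const entities = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;"
  };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}
```

Because the output contains no raw angle brackets, quotes, or ampersands from the input, the translator does not need to agree with the browser about how the input would have been parsed.
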
                            --
                            David Hopwood <david.hopwood@...>
                          • collin_jackson
                            Message 13 of 30, Oct 18, 2007
                              The read-only aspect of JSLint is fairly unique and makes it somewhat
                              more useful for certain applications. I support having a tool that
                              does rewriting as an alternative to JSLint, but I don't think JSLint
                              should be allowed to do rewriting.

                              --- In caplet@yahoogroups.com, David Hopwood <david.hopwood@...> wrote:
                              >
                              > Mike Samuel wrote:
                              > > It's tough to write a useful application for a browser if you can't
                              > > manipulate html.
                              >
                              > The most common approach to preventing XSS attacks in user-generated content
                              > is not to allow HTML in that content, but to translate some simpler mark-up
                              > (e.g. BBCode or a wiki mark-up) into HTML.
                              >
                              > Even an HTML-to-HTML translation (which quotes the hell out of any
                              > suspicious characters in the input) is easier to do correctly than trying
                              > to *filter* HTML correctly. The problem with filtering, as the attacks
                              > shown so far have demonstrated, is that any tiny difference in the
                              > interpretation of comments, strings, or anything else that changes the
                              > "mode" of the parser, turns into an exploitable bug.
                              >
                              > --
                              > David Hopwood <david.hopwood@...>
                              >
                            • Mike Samuel
                              Message 14 of 30, Oct 18, 2007
                                On 18/10/2007, David Hopwood <david.hopwood@...> wrote:
                                >
                                > Mike Samuel wrote:
                                > > It's tough to write a useful application for a browser if you can't
                                > > manipulate html.
                                >
                                > The most common approach to preventing XSS attacks in user-generated content
                                > is not to allow HTML in that content, but to translate some simpler mark-up
                                > (e.g. BBCode or a wiki mark-up) into HTML.

                                There are two problems here:
                                (1) Identifying a safe subset of HTML/CSS and Javascript -- without
                                obscure extensions like expression()
                                (2) Making sure the browser interprets it the same way
                                you do -- rejecting malformed markup, conditional compilation
                                comments, etc.

                                The first is not solved by translating. You can filter to a safe
                                subset just as easily as translating to a safe subset.

                                The second is an easier problem. If you start with a validating XHTML
                                parser, and relax constraints as you convince yourself it's safe, then
                                you can be confident that the browser will produce the same parse
                                tree.



                                >
                                > Even an HTML-to-HTML translation (which quotes the hell out of any
                                > suspicious characters in the input) is easier to do correctly than trying
                                > to *filter* HTML correctly. The problem with filtering, as the attacks
                                > shown so far have demonstrated, is that any tiny difference in the
                                > interpretation of comments, strings, or anything else that changes the
                                > "mode" of the parser, turns into an exploitable bug.
                                >
                                > --
                                > David Hopwood <david.hopwood@...>
                                >
                                >
                              • Douglas Crockford
                                Message 15 of 30, Oct 19, 2007
                                  --- In caplet@yahoogroups.com, "Mike Samuel" <mikesamuel@...> wrote:

                                  > There are two problems here:
                                  > (1) Identifying a safe subset of HTML/CSS and Javascript -- without
                                  > obscure extensions like expression()
                                  > (2) The other is making sure the browser interprets it the same way
                                  > you do -- rejecting malformed markup, conditional compilation
                                  > comments, etc.
                                  >
                                  > The first is not solved by translating. You can filter to a safe
                                  > subset just as easily as translating to a safe subset.
                                  >
                                  > The second is an easier problem. If you start with a validating XHTML
                                  > parser, and relax constraints as you convince yourself it's safe, then
                                  > you can be confident that the browser will produce the same parse
                                  > tree.

                                  The set of HTML confusions is vast, but not infinite. An advantage
                                  here is that JSLint/ADsafe does not have to pass all valid HTML. I can
                                  be semidraconian in rejecting forms on the grounds that they look
                                  suspicious or faulty. That permits some alexandrian solutions. For
                                  example, I don't have to understand the amazing complexity of
                                  entities. I can simply forbid the use of ampersand in some contexts.
                                  Ampersands are sometimes useful and sometimes valid, but I can still
                                  exclude them from my subset simply because I don't want to deal with
                                  them. Having done that, I feel more confident that I won't be surprised.
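
A sketch of that kind of blunt rule, assuming a hypothetical `checkAttributeValue` hook that a checker calls once per attribute value. The name and the shape of the result object are invented for illustration.

```javascript
// Hypothetical hook: reject any attribute value containing "&" instead of
// trying to decode entities the way each browser would.
function checkAttributeValue(value) {
  if (value.indexOf("&") !== -1) {
    return { ok: false, reason: "ampersand not allowed in this context" };
  }
  return { ok: true };
}
```

The rule rejects some harmless values, which is acceptable for a checker that does not have to pass all valid HTML.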
                                • collin_jackson
                                  Message 16 of 30, Oct 19, 2007
                                    Here's another one:

                                    <iframe/src="javascript:alert(42)"></iframe>

                                    --- In caplet@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
                                    >
                                    > --- In caplet@yahoogroups.com, "collin_jackson" <collinj@> wrote:
                                    > >
                                    > > <div x="\"><img onload=alert(42)
                                    > > src=http://json.org/img/json160.gif>"></div>
                                    >
                                    > Excellent. Keep them coming.
                                    >
                                  • Adam Barth
                                    Message 17 of 30, Oct 19, 2007
                                      Why is ADsafe allowing invalid HTML at all? It seems like requiring
                                      the HTML to be well-formed is a good first step in trying to
                                      understand how it will be executed in different browsers.

                                      On 10/19/07, collin_jackson <collinj@...> wrote:
                                      > Here's another one:
                                      >
                                      > <iframe/src="javascript:alert(42)"></iframe>
                                      >
                                      > --- In caplet@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
                                      > >
                                      > > --- In caplet@yahoogroups.com, "collin_jackson" <collinj@> wrote:
                                      > > >
                                      > > > <div x="\"><img onload=alert(42)
                                      > > > src=http://json.org/img/json160.gif>"></div>
                                      > >
                                      > > Excellent. Keep them coming.
                                      > >
                                      >
                                    • Adam Barth
                                      Message 18 of 30, Oct 19, 2007
                                        One simple way to approximate this (if you didn't want to reuse
                                        someone else's code for validating HTML) would be to serialize your
                                        parsed HTML back to an octet-stream and compare it with the input
                                        (probably being tolerant of whitespace and capitalization in the
                                        appropriate places).
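
The round-trip idea can be sketched as follows. The tokenizer here is a deliberately strict toy (element names and double-quoted attributes only), and all names are hypothetical; a real checker would use its own parser. Tolerance for whitespace and capitalization is approximated crudely by collapsing whitespace and lowercasing both sides before comparing.

```javascript
// Hypothetical strict grammar: <name attr="value" ...>, </name>, or text.
const TOKEN = /<(\/?)([A-Za-z][A-Za-z0-9]*)((?:\s+[A-Za-z][A-Za-z0-9-]*="[^"<>]*")*)\s*>|[^<>]+/g;
const ATTR = /([A-Za-z][A-Za-z0-9-]*)="([^"<>]*)"/g;

function parse(html) {
  const nodes = [];
  for (const m of html.matchAll(TOKEN)) {
    if (m[2] === undefined) {
      nodes.push({ text: m[0] });
    } else {
      const attrs = [...m[3].matchAll(ATTR)].map((a) => [a[1], a[2]]);
      nodes.push({ close: m[1] === "/", name: m[2], attrs });
    }
  }
  return nodes;
}

function serialize(nodes) {
  return nodes.map((n) => {
    if (n.text !== undefined) return n.text;
    const attrs = n.attrs.map(([k, v]) => ` ${k}="${v}"`).join("");
    return `<${n.close ? "/" : ""}${n.name}${attrs}>`;
  }).join("");
}

// Crude tolerance: collapse whitespace runs and ignore case everywhere.
function normalize(s) {
  return s.replace(/\s+/g, " ").trim().toLowerCase();
}

// Anything the tokenizer skipped is missing from the serialization,
// so the comparison fails for input the grammar does not cover.
function roundTrips(html) {
  return normalize(serialize(parse(html))) === normalize(html);
}
```

With this sketch, the `<iframe/src=...>` fragment reported earlier in the thread does not round-trip, because the slash after the element name is outside the toy grammar.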

                                        On 10/19/07, Adam Barth <hk9565@...> wrote:
                                        > Why is ADsafe allowing invalid HTML at all? It seems like requiring
                                        > the HTML to be well-formed is a good first step in trying to
                                        > understand how it will be executed in different browsers.
                                        >
                                        > On 10/19/07, collin_jackson <collinj@...> wrote:
                                        > > Here's another one:
                                        > >
                                        > > <iframe/src="javascript:alert(42)"></iframe>
                                        > >
                                        > > --- In caplet@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
                                        > > >
                                        > > > --- In caplet@yahoogroups.com, "collin_jackson" <collinj@> wrote:
                                        > > > >
                                        > > > > <div x="\"><img onload=alert(42)
                                        > > > > src=http://json.org/img/json160.gif>"></div>
                                        > > >
                                        > > > Excellent. Keep them coming.
                                        > > >
                                        > >
                                        >
                                      • Larry Masinter
                                        Message 19 of 30, Oct 19, 2007

                                          I think you got it backward: URIs are sequences of characters, not bytes. And in (X)HTML, "URI" is really "IRI": the XHTML spec allows full Unicode (ISO 10646) characters, which are encoded as UTF-8 and then hex-encoded if you need an (old-fashioned) URI.
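
The conversion described here can be sketched with the standard `TextEncoder` API: encode each non-ASCII character as UTF-8, then write every resulting byte as a percent sign followed by two hex digits. The name `iriToUri` is hypothetical, and the sketch ignores special cases such as internationalized host names.

```javascript
// Hypothetical helper: percent-encode the UTF-8 bytes of non-ASCII characters.
function iriToUri(iri) {
  const utf8 = new TextEncoder(); // always encodes to UTF-8
  let uri = "";
  for (const ch of iri) {
    if (ch.codePointAt(0) < 0x80) {
      uri += ch; // ASCII characters pass through unchanged
    } else {
      for (const byte of utf8.encode(ch)) {
        uri += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
      }
    }
  }
  return uri;
}
```

For example, U+00E9 becomes %C3%A9, and character 65533 (U+FFFD) becomes %EF%BF%BD.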

                                          From: caplet@yahoogroups.com [mailto:caplet@yahoogroups.com] On Behalf Of Mike Samuel
                                          Sent: Wednesday, October 17, 2007 4:16 PM
                                          To: caplet@yahoogroups.com
                                          Subject: Re: [caplet] Re: ADsafe, Take 6

                                           

                                          RFC 3986 disallows the null byte in URIs, and says URIs are sequences of bytes, not characters, so 65533 is out of range.

                                          In your attribute whitelist, can't you identify all whose value is a URI or URI Reference, and restrict the unescaped value to the union of the reserved and unreserved characters and '%'.

                                                unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

                                                reserved    = gen-delims / sub-delims

                                                gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"


                                                sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                                                            / "*" / "+" / "," / ";" / "="
                                            


                                          cheers,
                                          mike

                                          On 17/10/2007, collin_jackson <collinj@...> wrote:

                                          I'm not pasting. I'm reading the value of a textarea into JSLint
                                          directly using JavaScript.

                                          See http://crypto.stanford.edu/jsonrequest/nullbyte2.html

                                          It looks like Firefox is converting null bytes to Unicode character
                                          65533, which isn't rejected by JSLint. So all you need to do is reject
                                          Unicode character 65533 to defeat this attack.

                                          (Note that null bytes vanish in IE, which is fine as long as Firefox
                                          rejects them.)

                                          --- In caplet@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
                                          >
                                          > --- In caplet@yahoogroups.com, "collin_jackson" <collinj@> wrote:
                                          > >
                                          > > Null byte between "java" and "script" passes JSLint on Firefox despite
                                          > > being an attack on IE
                                          >
                                          > I scan every line for null and other characters. I am guessing that
                                          > the null is lost in the browser's paste process. In production,
                                          > inspection will be done on files, so I don't think that will be a problem.
                                          >

                                           

                                        • Douglas Crockford
                                          Message 20 of 30 , Oct 19, 2007
                                            --- In caplet@yahoogroups.com, "Adam Barth" <hk9565@...> wrote:

                                            > Why is ADsafe allowing invalid HTML at all?

                                            It shouldn't. So I am grateful to Collin for reporting a case where it
                                            did.
                                          • Mike Samuel
                                            Message 21 of 30 , Oct 19, 2007
                                              Sorry. I was reading RFC 2396 (not RFC 3986), which says:

                                                 An escaped octet is encoded as a character triplet, consisting of the
                                                 percent character "%" followed by the two hexadecimal digits
                                                 representing the octet code. For example, "%20" is the escaped
                                                 encoding for the US-ASCII space character.

                                              I think that says that each part of the URI is a sequence of bytes (since only octets can be encoded as hex pairs) which is converted to a sequence of ASCII characters, so you're right in that those characters then need to be encoded for transport.

                                              But I don't understand where full unicode characters are allowed.  Ignoring non-standard extensions like %uxxxx, where are you seeing those?




                                              On 19/10/2007, Larry Masinter <lmm@...> wrote:

                                              I think you got it backward: URIs are sequences of characters, not bytes. And in (X)HTML, "URI" is really "IRI" – the XHTML spec allows full Unicode (10646) characters, which are UTF-8 encoded and then hex-encoded if you need an (old-fashioned) URI.

                                               

                                               

                                               

                                              From: caplet@yahoogroups.com [mailto:caplet@yahoogroups.com ] On Behalf Of Mike Samuel
                                              Sent: Wednesday, October 17, 2007 4:16 PM
                                              To: caplet@yahoogroups.com
                                              Subject: Re: [caplet] Re: ADsafe, Take 6

                                               

                                              RFC 3986 disallows the null byte in URIs, and says URIs are sequences of bytes, not characters, so 65533 is out of range.

                                              In your attribute whitelist, can't you identify all attributes whose value is a URI or URI Reference, and restrict the unescaped value to the union of the reserved and unreserved characters and '%'?

                                                    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

                                                    reserved    = gen-delims / sub-delims

                                                    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"


                                                    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                                                                / "*" / "+" / "," / ";" / "="
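
A minimal sketch of that check, assuming the attribute value has already been extracted and HTML-unescaped. The name hasOnlyUriCharacters is hypothetical and is not part of JSLint or ADsafe; the character class is simply the union of the unreserved and reserved sets above plus "%". It does not check that each "%" is followed by two hex digits.

```javascript
// Hypothetical helper, not JSLint/ADsafe code.
// Character class = unreserved + gen-delims + sub-delims + "%" (RFC 3986).
const URI_CHARACTERS = /^[A-Za-z0-9\-._~:\/?#\[\]@!$&'()*+,;=%]*$/;

// Returns true when every character of the value is allowed in a URI.
// Anything else (null, U+FFFD, spaces, non-ASCII) makes it return false.
function hasOnlyUriCharacters(value) {
    return URI_CHARACTERS.test(value);
}
```

With this, a value such as "java\uFFFDscript:alert(42)" would be rejected because U+FFFD is in neither set, while "http://json.org/img/json160.gif" passes.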
                                               


                                              cheers,
                                              mike

                                               


                                            • Adam Barth
                                              Message 22 of 30 , Oct 19, 2007
                                                On 10/19/07, Douglas Crockford <douglas@...> wrote:
                                                > --- In caplet@yahoogroups.com, "Adam Barth" <hk9565@...> wrote:
                                                > > Why is ADsafe allowing invalid HTML at all?
                                                >
                                                > It shouldn't. So I am grateful to Collin for reporting a case where it
                                                > did.

                                                It seems to be accepting lots of invalid HTML. For example, the simple

                                                <iframe xx="yy"></iframe>

                                                seems to pass, whereas http://validator.w3.org/check rejects

                                                <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                                                "http://www.w3.org/TR/html4/loose.dtd">
                                                <HTML>
                                                <HEAD>
                                                <TITLE>title</TITLE>
                                                </HEAD>
                                                <BODY>
                                                <iframe xx="yy"></iframe>
                                                </BODY>
                                                </HTML>

                                                You might believe the xx attribute isn't dangerous, but the HTML spec
                                                doesn't give it semantics, so we don't really know what it does. It
                                                seems much safer to restrict fragments to the HTML DTD.
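
One way to picture restricting a fragment to the DTD is an attribute whitelist per element. The sketch below lists the attributes that the HTML 4.01 Transitional DTD declares for <iframe>; the function name unknownAttributes is hypothetical, and a real ADsafe policy would presumably be stricter than the DTD (for example about style and id).

```javascript
// Attributes declared for <iframe> in the HTML 4.01 Transitional DTD.
const IFRAME_ATTRIBUTES = new Set([
    "id", "class", "style", "title",            // %coreattrs
    "longdesc", "name", "src", "frameborder",
    "marginwidth", "marginheight", "scrolling",
    "align", "height", "width"
]);

// Hypothetical helper: returns the attribute names that the whitelist
// does not know about. HTML attribute names are case-insensitive.
function unknownAttributes(attributeNames, allowed) {
    return attributeNames.filter(function (name) {
        return !allowed.has(name.toLowerCase());
    });
}
```

For the example above, unknownAttributes(["xx"], IFRAME_ATTRIBUTES) reports "xx", so the fragment would be rejected.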
                                              • David Hopwood
                                                Message 23 of 30 , Oct 19, 2007
                                                  Larry Masinter wrote:
                                                  > I think you got it backward: URIs are sequences of characters, not bytes.

                                                  URIs are sequences of characters that encode a sequence of bytes, which
                                                  *may* in turn encode a sequence of Unicode characters.

                                                  For URIs that have some server-specific part, the interpretation of the byte
                                                  sequence is up to that server. RFC 3986 *recommends* that, where they encode
                                                  a string, the encoding used should be UTF-8. However, there's no way to
                                                  enforce this (and no particular reason to enforce it). So it is valid, for
                                                  example, to have "%FF" in an URL, even though that is always an invalid byte
                                                  in UTF-8.

                                                  > and in (X)HTML, "URI" is really "IRI" – the XHTML spec allows full
                                                  > Unicode (10646) characters which are UTF8 and then hex-encoded if you need
                                                  > an (old-fashioned) URI.

                                                  XHTML still doesn't require that the sequence of bytes is valid UTF-8.

                                                  In any case, the immediate question was whether it is reasonable to reject
                                                  any input that contains 65533 (U+FFFD REPLACEMENT CHARACTER). IMHO it is:
                                                  this isn't a useful character in its own right; it indicates that an
                                                  encoding error occurred in producing the input.
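
As a rough illustration of that rejection rule, assuming the input is already a JavaScript string: the function name findForbiddenCharacter is made up here, and this is not how JSLint itself scans lines.

```javascript
// Hypothetical helper, not JSLint code.
// Returns the index of the first null or U+FFFD (decimal 65533)
// character in the text, or -1 when neither occurs.
function findForbiddenCharacter(text) {
    const match = /[\u0000\uFFFD]/.exec(text);
    return match === null ? -1 : match.index;
}
```

Checking for both characters covers the case where the browser keeps the null byte and the case where it has already been turned into the replacement character.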

                                                  --
                                                  David Hopwood <david.hopwood@...>
                                                • Mike Samuel
                                                  Message 24 of 30 , Oct 19, 2007
                                                    On 19/10/2007, David Hopwood <david.hopwood@...> wrote:
                                                    >
                                                    > Larry Masinter wrote:
                                                    > > I think you got it backward: URIs are sequences of characters, not bytes.
                                                    >
                                                    > URIs are sequences of characters that encode a sequence of bytes, which
                                                    > *may* in turn encode a sequence of Unicode characters.

                                                    I still don't understand.

                                                    My reading of the spec says that the first sequence of characters is in ASCII.

                                                    If that's the case, then an HTML validator should be able to reject
                                                    any HTML attribute of type URI whose value contains a codepoint
                                                    outside [0, 255] without making it impossible to express any valid URI.
                                                    Does that sound right?

                                                    If that's right, would it be appropriate for the error message to
                                                    recommend re-encoding the out of range characters using a %-encoding
                                                    of UTF-8? So "�" -> "%EF%BF%BD".
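
A sketch of how such a suggestion could be computed, using the standard encodeURIComponent function, which percent-encodes the UTF-8 bytes of a character. The name percentEncodeNonAscii is hypothetical. Note one assumption: this version draws the line at ASCII rather than at [0, 255], and it leaves unpaired surrogates untouched because encodeURIComponent would throw on them.

```javascript
// Matches any non-ASCII BMP character that is not a surrogate,
// or a well-formed surrogate pair.
const NON_ASCII = /[\u0080-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// Hypothetical helper: replaces each non-ASCII character with the
// %-encoding of its UTF-8 bytes, leaving ASCII characters as they are.
function percentEncodeNonAscii(value) {
    return value.replace(NON_ASCII, function (character) {
        return encodeURIComponent(character);
    });
}

// percentEncodeNonAscii("\uFFFD") returns "%EF%BF%BD"
```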



                                                    Also, on terminology, is the below right?
                                                    * An escaping is an n:1 mapping from strings in an alphabet A to
                                                    strings in an alphabet which is a subset of A.
                                                    * An encoding is a 1:1 mapping of strings over one alphabet to strings
                                                    over another alphabet.



                                                    >
                                                    > For URIs that have some server-specific part, the interpretation of the byte
                                                    > sequence is up to that server. RFC 3986 *recommends* that, where they encode
                                                    > a string, the encoding used should be UTF-8. However, there's no way to
                                                    > enforce this (and no particular reason to enforce it). So it is valid, for
                                                    > example, to have "%FF" in an URL, even though that is always an invalid byte
                                                    > in UTF-8.
                                                    >
                                                    > > and in (X)HTML, "URI" is really "IRI" – the XHTML spec allows full
                                                    > > Unicode (10646) characters which are UTF8 and then hex-encoded if you need
                                                    > > an (old-fashioned) URI.
                                                    >
                                                    > XHTML still doesn't require that the sequence of bytes is valid UTF-8.
                                                    >
                                                    > In any case, the immediate question was whether it is reasonable to reject
                                                    > any input that contains 65533 (U+FFFD REPLACEMENT CHARACTER). IMHO it is:
                                                    > this isn't a useful character in its own right; it indicates that an
                                                    > encoding error occurred in producing the input.
                                                    >
                                                    > --
                                                    > David Hopwood <david.hopwood@...>
                                                  • Larry Masinter
                                                    Message 25 of 30 , Oct 21, 2007
                                                      To answer your direct questions:

                                                      I don't know any formal definition for "escaping" except as a part of "encoding" -- you encode a sequence of bytes into (a subset of) US-ASCII by translating each (allowed) byte into its corresponding ASCII character, but translating some (disallowed) bytes into a different sequence.

                                                      I think it's reasonable to (a) reject Javascript that contains 65533 (U+FFFD REPLACEMENT CHARACTER) and (b) to not look for, parse, or handle in any special way URI references within (X)HTML attributes or content.

                                                      -----------------------------------------

                                                      Talking about this is complicated because the same concept appears with different encodings:

                                                      * The HTTP protocol does its work in sequences of bytes, but the spec is written in terms of "characters".

                                                      * The XML specification defines XML as a sequence of (Unicode) characters, encoded in UTF-8 or UTF-16 (or some other encoding) but also with &char; character entities and &#nnnn; numeric character references. XHTML as an XML language follows this; whether HTML follows this depends on the HTML version.

                                                      * The JavaScript specification (at least ECMA-262) defines JavaScript as a sequence of (Unicode) characters encoded in UTF-8 or UTF-16. E4X (ECMA-357) seems to follow XML in using character entity references and numeric character references to encode characters that would otherwise be disallowed.

                                                      * The URI specification defines a URI as a sequence of characters, taken from the repertoire of US-ASCII characters, with the encoding chosen by the protocol/format that embeds it; it also defines an encoding (%xx) for bytes that would otherwise correspond to disallowed or reserved characters.

                                                      * The IRI specification defines an IRI similarly, but allows a larger repertoire of characters.

                                                      Parsing an XML stream into an XML DOM (including HTML) will translate the UTF8, UTF16 (or other encoding) as well as character and numeric character entity references, into a sequence of characters. Parsing a Javascript string (using E4X) will apparently do the same, even though Javascript-per-se parts and XHTML-constant parts use different escaping mechanisms.

                                                      A (X)HTML validator for content embedded within Javascript should likely perform the same entity and numeric character reference decoding logic as would apply when the Javascript was read and interpreted -- resolve character entity references and numeric character references -- and then validate the results.

                                                      There are many troublesome syntactically valid URIs (and IRIs) that could appear within a URI reference in (X)HTML and (X)HTML embedded within Javascript, but I think it is part of the security requirements of the (X)HTML interpreter runtime to manage and prevent those references. Because URIs (and IRIs) can and sometimes do encode non-character byte streams, looking for or managing the URI-encoding level would be inappropriate.

                                                      Larry








                                                    • Mike Samuel
                                                      Message 26 of 30 , Oct 21, 2007
                                                        On 21/10/2007, Larry Masinter <lmm@...> wrote:
                                                        >
                                                        > To answer your direct questions:
                                                        >
                                                        > I don't know any formal definition for "escaping" except as a part of "encoding" -- you encode a sequence of bytes into (a subset of) US-ASCII by translating each (allowed) byte into its corresponding ASCII character, but translating some (disallowed) bytes into a different sequence.

                                                        Ok. I think it's useful to make a distinction between the n:1
                                                        mappings and the 1:1 mappings.
                                                        If you're escaping (which I defined as n:1), you have to unescape
                                                        before comparing strings, while you can check against an encoded
                                                        string by either decoding the one or encoding the other.
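
To make the n:1 point concrete with %-escaping: "j", "%6A" and "%6a" all denote the same character, so comparing the escaped forms directly can miss a match. A small sketch, assuming the escaped bytes are UTF-8 text (decodeURIComponent throws a URIError otherwise, for example on "%FF"); the function name is hypothetical.

```javascript
// Hypothetical helper: compares two %-escaped strings by unescaping both.
// Several escaped spellings map to one unescaped string (n:1), so the
// comparison has to happen on the unescaped side.
function sameAfterUnescaping(a, b) {
    return decodeURIComponent(a) === decodeURIComponent(b);
}

// "%6Aavascript" === "javascript"                    is false
// sameAfterUnescaping("%6Aavascript", "javascript")  is true
```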

                                                        One way to check attribute content in a markup language is to keep a
                                                        stack of escaping and encoding conventions as you examine the document
                                                        in the left to right pass.
                                                        To check whether an iframe's src's protocol is javascript:, you deal
                                                        with the following stack:

                                                            protocol <(%-escaped)> uri <(html-entity-escaped)> html-attribute <(UTF-8 encoding)> bytes

                                                        If you're sending a message and you want it to be interpreted as you
                                                        intend, then you have to make sure that the recipient of the message
                                                        will use the same escaping/encodings, and if you want to verify
                                                        properties of the message, then you have to consider every escaping,
                                                        but not necessarily every encoding.

                                                        So for the javascript: check, there are 3 points of attack, but the
                                                        encoding can be considered entirely separately leaving you two.
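
Here is a rough sketch of walking that stack from the outside in, assuming the bytes have already been decoded to a string. All function names are hypothetical, the entity table is deliberately tiny, and the sketch only handles character references that end in ";" (browsers are more lenient), so treat it as an illustration of the ordering rather than a usable filter.

```javascript
const NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: "\"" };

function fromCodePoint(code) {
    // Out-of-range references become U+FFFD instead of throwing.
    return code <= 0x10FFFF ? String.fromCodePoint(code) : "\uFFFD";
}

// Layer 1: undo HTML escaping (numeric references and a few named ones).
function unescapeHtml(text) {
    const reference = /&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z]+));/g;
    return text.replace(reference, function (whole, hex, dec, name) {
        if (hex !== undefined) { return fromCodePoint(parseInt(hex, 16)); }
        if (dec !== undefined) { return fromCodePoint(parseInt(dec, 10)); }
        return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, name)
            ? NAMED_ENTITIES[name]
            : whole;
    });
}

// Layer 2: undo %-escaping, for ASCII bytes only; other bytes stay escaped.
function unescapePercent(text) {
    return text.replace(/%([0-9a-fA-F]{2})/g, function (whole, hex) {
        const byte = parseInt(hex, 16);
        return byte < 0x80 ? String.fromCharCode(byte) : whole;
    });
}

// Layer 3: read the protocol from the fully unescaped value.
// Returns null when the value has no scheme (a relative reference).
function schemeOf(attributeValue) {
    const uri = unescapePercent(unescapeHtml(attributeValue));
    const match = /^\s*([^:\/?#]*):/.exec(uri);
    return match === null ? null : match[1].toLowerCase();
}

// schemeOf("&#106;avascript:alert(42)") returns "javascript"
// schemeOf("&#x25;6Aavascript:alert(42)") returns "javascript"
```

The second example only resolves because the HTML layer is undone before the % layer, which is the point of keeping the stack in order. In practice it would probably be safer to accept a short list of schemes such as http and https than to look for javascript: specifically.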




                                                        >
                                                        > I think it's reasonable to (a) reject Javascript that contains 65533 (U+FFFD REPLACEMENT CHARACTER) and (b) to not look for, parse, or handle in any special way URI references within (X)HTML attributes or content.

                                                        How would you detect URLs that can execute or import scripts without
                                                        distinguishing attributes that contain URIs or URI references, given
                                                        that it is a goal of ADsafe to allow iframes to external srcs?



                                                        >
                                                        > Talking about this is complicated because the same concept appears with different encodings:
                                                        >
                                                        > * The HTTP protocol does its work in sequences of bytes, but the spec is written in terms of "characters".
                                                        >
                                                        > * The XML specification defines XML as a sequence of (Unicode) characters, encoded in UTF-8 or UTF-16 (or some other encoding) but also with &char; character entities and &#nnnn; numeric character references. XHTML as an XML language follows this; whether HTML follows this depends on the HTML version.
                                                        >
                                                        > * the Javascript specification (at least ECMA-262) defines Javascript as a sequence of (Unicode) characters encoded in UTF8 or UTF16. E4X (ECMA 357) seems to follow XML in using character entity references and numeric character references to encode characters that would otherwise be disallowed.
                                                        >
                                                        > * The URI specification defines a URI as a sequence of characters, taken from the repertoire of US-ASCII characters, with the encoding chosen by the protocol/format that embeds it; it also defines an encoding (%xx) for bytes that would otherwise correspond to disallowed or reserved characters.
                                                        >
                                                        > * The IRI specification defines an IRI similarly, but allows a larger repertoire of characters.
                                                        >
                                                        > Parsing an XML stream into an XML DOM (including HTML) will translate the UTF8, UTF16 (or other encoding) as well as character and numeric character entity references, into a sequence of characters. Parsing a Javascript string (using E4X) will apparently do the same, even though Javascript-per-se parts and XHTML-constant parts use different escaping mechanisms.
                                                        >
                                                        > A (X)HTML validator for content embedded within Javascript should likely perform the same entity and numeric character reference decoding logic as would apply when the Javascript was read and interpreted -- resolve character entity references and numeric character references -- and then validate the results.
                                                        >
                                                        > There are many troublesome syntactically valid URIs (and IRIs) that could appear within a URI reference in (X)HTML and (X)HTML embedded within Javascript, but I think it is part of the security requirements of the (X)HTML interpreter runtime to manage and prevent those references. Because URIs (and IRIs) can and sometimes do encode non-character byte streams, looking for or managing the URI-encoding level would be inappropriate.
                                                        >
                                                        > Larry
                                                        >
                                                        > -----Original Message-----
                                                        > From: caplet@yahoogroups.com [mailto:caplet@yahoogroups.com] On Behalf Of Mike Samuel
                                                        > Sent: Friday, October 19, 2007 10:30 PM
                                                        > To: caplet@yahoogroups.com
                                                        > Subject: Re: [caplet] Re: ADsafe, Take 6
                                                        >
                                                        >
                                                        > On 19/10/2007, David Hopwood <david.hopwood@...> wrote:
                                                        > >
                                                        > >
                                                        > >
                                                        > >
                                                        > >
                                                        > >
                                                        > > Larry Masinter wrote:
                                                        > > > I think you got it backward: URIs are sequences of characters, not bytes.
                                                        > >
                                                        > > URIs are sequences of characters that encode a sequence of bytes, which
                                                        > > *may* in turn encode a sequence of Unicode characters.
                                                        >
                                                        > I still don't understand.
                                                        >
                                                        > My reading of the spec says that the first sequence of characters is in ASCII.
                                                        >
                                                        > If that's the case, then an HTML validator should be able to reject
                                                        > any HTML attribute of type URI whose value contains a codepoint
                                                        > outside [0, 255] without making it impossible to express any valid URI.
                                                        > Does that sound right?
                                                        >
                                                        > If that's right, would it be appropriate for the error message to
                                                        > recommend re-encoding the out of range characters using a %-encoding
                                                        > of UTF-8? So "�" -> "%EF%BF%BD".
                                                        >
                                                        > Also, on terminology, is the below right?
                                                        > * An escaping is an n:1 mapping from strings in an alphabet A to
                                                        > strings in an alphabet which is a subset of A.
                                                        > * An encoding is a 1:1 mapping of strings over one alphabet to strings
                                                        > over another alphabet.
                                                        >
                                                        > >
                                                        > >
                                                        > >
                                                        > >
                                                        > >
                                                        > >
                                                        > > For URIs that have some server-specific part, the interpretation of the byte
                                                        > > sequence is up to that server. RFC 3986 *recommends* that, where they encode
                                                        > > a string, the encoding used should be UTF-8. However, there's no way to
                                                        > > enforce this (and no particular reason to enforce it). So it is valid, for
                                                        > > example, to have "%FF" in an URL, even though that is always an invalid byte
                                                        > > in UTF-8.
                                                        > >
                                                        > > > and in (X)HTML, "URI" is really "IRI" – the XHTML spec allows full
                                                        > > > Unicode (10646) characters which are UTF8 and then hex-encoded if you need
                                                        > > > an (old-fashioned) URI.
                                                        > >
                                                        > > XHTML still doesn't require that the sequence of bytes is valid UTF-8.
                                                        > >
                                                        > > In any case, the immediate question was whether it is reasonable to reject
                                                        > > any input that contains 65533 (U+FFFD REPLACEMENT CHARACTER). IMHO it is:
                                                        > > this isn't a useful character in its own right; it indicates that an
                                                        > > encoding error occurred in producing the input.
                                                        > >
                                                        > > --
                                                        > > David Hopwood <david.hopwood@...>
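
The proposal quoted above (reject URI attribute values that contain out-of-range code points, and point the author at a %-encoding of UTF-8) might look like the following JavaScript sketch. The function names and the ASCII cutoff are assumptions made for illustration; the message mentions [0, 255], while its reading of the spec says ASCII. U+FFFD is rejected outright, following the suggestion that it signals an earlier encoding error.

```javascript
// Sketch only: hypothetical names; the ASCII cutoff (0x7F) is an assumption.

// %-encode the UTF-8 bytes of a single character, e.g. U+FFFD -> %EF%BF%BD.
function percentEncodeUtf8(ch) {
  const bytes = new TextEncoder().encode(ch);
  return Array.from(bytes, (b) =>
    "%" + b.toString(16).toUpperCase().padStart(2, "0")).join("");
}

// Returns one finding per offending code point in a URI attribute value.
function checkUriAttribute(value, maxCodePoint = 0x7f) {
  const findings = [];
  for (const ch of value) { // iterates by code point, not by UTF-16 unit
    const codePoint = ch.codePointAt(0);
    if (codePoint === 0xfffd) {
      // REPLACEMENT CHARACTER: treated as evidence of an encoding error.
      findings.push({ codePoint, action: "reject" });
    } else if (codePoint > maxCodePoint) {
      findings.push({
        codePoint,
        action: "re-encode",
        suggestion: percentEncodeUtf8(ch)
      });
    }
  }
  return findings;
}

console.log(checkUriAttribute("http://example.com/caf\u00e9"));
```

An empty result means the value passed this one check; it says nothing about whether the URI is otherwise safe.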
                                                      • Freeman, Tim
                                                        Message 27 of 30 , Oct 22, 2007
                                                          Quoting Mike Samuel:
                                                          > To check whether an iframe's src's protocol is javascript: you deal
                                                          > with the following stack
                                                          > protocol <(%-escaped)> uri <(html-entity-escaped)> html-attribute
                                                          > <(UTF-8 encoding)> bytes
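
As an illustration of the stack quoted above, here is a minimal JavaScript sketch that peels off the two escaping layers (HTML character references, then %-escapes) before looking at the scheme. It assumes the bytes have already been decoded to a string. The function names are hypothetical, only a few named entities are handled, and real browsers accept more forms (for example, references without a trailing semicolon), so this is not how ADsafe or JSLint actually performs the check.

```javascript
// Sketch only: hypothetical helpers, not the ADsafe/JSLint implementation.
// Assumes the attribute text was already decoded from bytes (e.g. UTF-8).

// A few named entities; a real HTML parser knows many more.
const NAMED = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["apos", "'"]
]);

// Escaping layer 1: resolve HTML character references in the attribute text.
function unescapeHtml(raw) {
  return raw.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return NAMED.has(body) ? NAMED.get(body) : match;
  });
}

// Escaping layer 2: resolve %xx escapes one byte at a time, so that
// sequences such as %FF (never valid UTF-8) do not raise an error.
function percentUnescape(text) {
  return text.replace(/%([0-9a-f]{2})/gi, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

// Returns the lower-cased scheme, or null when no ":" is present.
function schemeOf(rawAttribute) {
  const uri = percentUnescape(unescapeHtml(rawAttribute)).trim();
  const colon = uri.indexOf(":");
  return colon < 0 ? null : uri.slice(0, colon).toLowerCase();
}

console.log(schemeOf("&#106;avascript:alert(42)"));
console.log(schemeOf("%6Aavascript&#58;alert(42)"));
console.log(schemeOf("http://json.org/img/json160.gif"));
```

Unescaping the whole value before searching for the colon is deliberately conservative: a value such as "javascript%3Aalert(42)" is reported as having the javascript scheme even though a strict URI parser would not see a scheme there.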

                                                          Okay, I'll try to say the obvious here -- although no one individual is
                                                          responsible, we find ourselves in the middle of a big hacked-up pile of
                                                          conventions that were put together with insufficient forethought. It
                                                          shouldn't be this complicated to fetch information from a few computers
                                                          and display it on another.

                                                          Has anyone put some thought into figuring out how the web should have
                                                          been and writing it down? I feel in need of a comforting fantasy to read
                                                          just before going to bed at night. Such a document would also help to
                                                          visualize some desired destination so we can move in that direction when
                                                          we make new stuff.
                                                          -----
                                                          Tim Freeman
                                                          Email: tim.freeman@...
                                                          Desk in Palo Alto: (650) 857-2581
                                                          Home: (408) 774-1298
                                                          Cell: (408) 348-7536


                                                        • Mike Samuel
                                                          Message 28 of 30 , Oct 22, 2007
                                                            Ok. I think the time for debate has passed, but it's a slow Monday so
                                                            I'll bite :)

                                                            There are a few problems:
                                                            (1) Documents embed other documents using a melange of separators,
                                                            delimiters, and escaping conventions.
                                                            (2) Developers don't understand escaping and encoding issues, and so
                                                            apps are rife with injection vulnerabilities.
                                                            (3) Protocol designers have an incentive to gloss over such issues
                                                            because a protocol that attracts developers will survive longer than a
                                                            protocol that doesn't.

                                                            XML promised a uniform way of representing s-expressions, so any
                                                            document representable as an s-expression could embed another
                                                            document. XML runs into problems with embedding -- you have to do
                                                            tricks to use XSL to generate XSL; it's horribly verbose; and it makes
                                                            an arbitrary distinction between elements and attributes which kills
                                                            extensibility.

                                                            That aside, some terse form of consistent S-expression document
                                                            representation for all of (HTTP request/response, markup languages,
                                                            stylesheets, and ideally code) would be a good start.

                                                            Lisp-style S-expressions avoid escaping conventions entirely, and so
                                                            address (2) to some degree. Which is easier to suss out?
                                                            Content-type:text/html

                                                            <html><script>document.write('<head><title>foo<\/title><\/head>')</script></html>
                                                            where from the DOM you have to treat the content of script as an
                                                            opaque string, or
                                                            ('http-response,
                                                            ('headers, ('Content-type, 'text/html)),
                                                            ('body
                                                            ('html, ('script, ('operator-call, 'document, 'write, '('head,
                                                            ('title, 'foo)))))))
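To make the contrast concrete, here is a minimal Python sketch. Nested tuples stand in for the S-expression form, and the standard library's html.parser stands in for "the DOM"; the helper names (find, TagCollector) are made up for illustration and are not from any proposal in this thread.

```python
from html.parser import HTMLParser

# Nested tuples stand in for the S-expression form of the response.
response = (
    "http-response",
    ("headers", ("Content-type", "text/html")),
    ("body",
     ("html",
      ("script",
       ("operator-call", "document", "write",
        ("head", ("title", "foo")))))),
)

def find(node, tag):
    """Yield every subtree whose first element is `tag` (hypothetical helper)."""
    if isinstance(node, tuple):
        if node and node[0] == tag:
            yield node
        for child in node[1:]:
            yield from find(child, tag)

# The embedded document is reachable by an ordinary tree walk; no unescaping.
titles = list(find(response, "title"))

class TagCollector(HTMLParser):
    """Records the start tags the HTML parser actually reports."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

html_doc = ("<html><script>document.write("
            "'<head><title>foo<\\/title><\\/head>')</script></html>")
collector = TagCollector()
collector.feed(html_doc)
collector.close()

print(titles)          # the title subtree is found directly
print(collector.tags)  # the parser never reports a title tag: script body is opaque text
```

In the tuple form the inner document is just more structure. In the HTML form it is a string inside a script, and a checker has to re-parse that string under JavaScript's quoting rules before it can see the title at all.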

                                                            The problem with that single consistent representation is that you
                                                            need an envelope to specify the charset and any other encoding steps.
                                                            Even if we reworked all clients to use the UTF-8 encoding of Unicode,
                                                            or had gateways to handle legacy formats, the envelope would still need
                                                            to specify compression schemes, signatures, and the like.

                                                            But it should be possible to move all the information not needed to
                                                            decode the body from a byte string out of the envelope without
                                                            requiring a hard distinction between transport and persistence
                                                            formats. Lightweight envelopes are nice, because they let you
                                                            sidestep the debate between those who want human-readable protocols
                                                            and those who care about size.

                                                            The other major problem is efficient representation of binary data.
                                                            One way is the content: url -- the reference to the binary can itself
                                                            carry the binary, which is horrible: it brings in encoding and
                                                            reintroduces escaping into all layers. The other way is the email way --
                                                            the envelope contains multiple parts which share a namespace and which
                                                            can reference one another using relative URIs. Apple's data and resource
                                                            forks might also suggest ways to allow a uniform structured part to
                                                            reference blobs.
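As a rough illustration of "the email way", the sketch below uses Python's standard email package to build a multipart/related message in which a structured HTML part refers to a binary part by Content-ID. Treating cid: references as the kind of intra-envelope reference meant here is an assumption, and the blob is placeholder bytes rather than a real image.

```python
from email.message import EmailMessage

# Placeholder bytes standing in for real binary data (not a valid image).
blob = b"\x89PNG\r\n\x1a\n\x00\x01\x02"

msg = EmailMessage()
# The structured part refers to the blob by name instead of inlining it.
msg.set_content('<html><body><img src="cid:logo"></body></html>',
                subtype="html")
# add_related() turns the message into multipart/related and attaches the blob.
msg.add_related(blob, maintype="image", subtype="png", cid="<logo>")

for part in msg.iter_parts():
    print(part.get_content_type(), part["Content-ID"])
```

The markup never has to escape the binary; only the envelope layer decides how the blob is carried (base64 here, by the library's default).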

                                                            cheers,
                                                            mike



                                                            On 22/10/2007, Freeman, Tim <tim.freeman@...> wrote:
                                                            > Quoting Mike Samuel:
                                                            > > To check whether an iframe's src's protocol is javascript: you deal
                                                            > > with the following stack
                                                            > > protocol <(%-escaped)> uri <(html-entity-escaped)> html-attribute
                                                            > > <(UTF-8 encoding)> bytes
                                                            >
                                                            > Okay, I'll try to say the obvious here -- although no one individual is
                                                            > responsible, we find ourselves in the middle of a big hacked-up pile of
                                                            > conventions that were put together with insufficient forethought. It
                                                            > shouldn't be this complicated to fetch information from a few computers
                                                            > and display it on another.
                                                            >
                                                            > Has anyone put some thought into figuring out how the web should have
                                                            > been and writing it down? I feel in need of a comforting fantasy to read
                                                            > just before going to bed at night. Such a document would also help to
                                                            > visualize some desired destination so we can move in that direction when
                                                            > we make new stuff.
                                                            >
                                                            > -----
                                                            > Tim Freeman
                                                            > Email: tim.freeman@...
                                                            > Desk in Palo Alto: (650) 857-2581
                                                            > Home: (408) 774-1298
                                                            > Cell: (408) 348-7536
                                                            >
                                                            >
                                                            > > -----Original Message-----
                                                            > > From: caplet@yahoogroups.com [mailto:caplet@yahoogroups.com]
                                                            > > On Behalf Of Mike Samuel
                                                            > > Sent: Sunday, October 21, 2007 21:11
                                                            > > To: caplet@yahoogroups.com
                                                            > > Subject: Re: [caplet] Re: ADsafe, Take 6
                                                            > >
                                                            > > On 21/10/2007, Larry Masinter <lmm@...> wrote:
                                                            > > > To answer your direct questions:
                                                            > > >
                                                            > > > I don't know any formal definition for "escaping" except as a part
                                                            > > > of "encoding" -- you encode a sequence of bytes into (a subset of)
                                                            > > > US-ASCII by translating each (allowed) byte into its corresponding
                                                            > > > ASCII character, but translating some (disallowed) bytes into a
                                                            > > > different sequence.
                                                            > >
                                                            > > Ok. I think it's useful to make a distinction between the n:1
                                                            > > mappings and the 1:1 mappings.
                                                            > > If you're escaping (which I defined as n:1), you have to unescape
                                                            > > before comparing strings, while you can check against an encoded
                                                            > > string by either decoding the one or encoding the other.
                                                            > >
                                                            > > One way to check attribute content in a markup language is to keep a
                                                            > > stack of escaping and encoding conventions as you examine the document
                                                            > > in the left to right pass.
                                                            > > To check whether an iframe's src's protocol is javascript: you deal
                                                            > > with the following stack
                                                            > > protocol <(%-escaped)> uri <(html-entity-escaped)> html-attribute
                                                            > > <(UTF-8 encoding)> bytes
                                                            > >
                                                            > > If you're sending a message and you want it to be interpreted as you
                                                            > > intend, then you have to make sure that the recipient of the message
                                                            > > will use the same escaping/encodings, and if you want to verify
                                                            > > properties of the message, then you have to consider every escaping,
                                                            > > but not necessarily every encoding.
                                                            > >
                                                            > > So for the javascript: check, there are 3 points of attack, but the
                                                            > > encoding can be considered entirely separately leaving you two.
                                                            > >
                                                            > >
                                                            > >
                                                            > >
                                                            > > >
                                                            > > > I think it's reasonable to (a) reject Javascript that contains
                                                            > > > 65533 (U+FFFD REPLACEMENT CHARACTER) and (b) to not look for, parse,
                                                            > > > or handle in any special way URI references within (X)HTML
                                                            > > > attributes or content.
                                                            > >
                                                            > > How would you detect urls that can execute or import scripts without
                                                            > > distinguishing attributes that contain URIs or URI references, given
                                                            > > that it is a goal of ADSafe to allow iframes to external srcs?
                                                            > >
                                                            > >
                                                            > >
                                                            > > >
                                                            > > > Talking about this is complicated because the same concept
                                                            > > appears with different encodings:
                                                            > > >
                                                            > > > * The HTTP protocol does its work in sequences of bytes,
                                                            > > but the spec is written in terms of "characters".
                                                            > > >
                                                            > > > * The XML specification defines XML as a sequence of (Unicode)
                                                            > > > characters, encoded in UTF8 or UTF16 (or some other encoding), but
                                                            > > > also with &char; character entities and &#xxx; numeric character
                                                            > > > references. XHTML as an XML language follows this; whether HTML
                                                            > > > follows this depends on the HTML version.
                                                            > > >
                                                            > > > * the Javascript specification (at least ECMA-262) defines
                                                            > > Javascript as a sequence of (Unicode) characters encoded in
                                                            > > UTF8 or UTF16. E4X (ECMA 357) seems to follow XML in using
                                                            > > character entity references and numeric character references
                                                            > > to encode characters that would otherwise be disallowed.
                                                            > > >
                                                            > > > * The URI specification defines a URI as a sequence of
                                                            > > characters, taken from the repertoire of US-ASCII characters,
                                                            > > with the encoding chosen by the protocol/format that embeds
                                                            > > it; it also defines an encoding (%xx) for bytes that would
                                                            > > otherwise correspond to disallowed or reserved characters.
                                                            > > >
                                                            > > > * The IRI specification defines an IRI similarly, but
                                                            > > allows a larger repertoire of characters.
                                                            > > >
                                                            > > > Parsing an XML stream into an XML DOM (including HTML)
                                                            > > will translate the UTF8, UTF16 (or other encoding) as well as
                                                            > > character and numeric character entity references, into a
                                                            > > sequence of characters. Parsing a Javascript string (using
                                                            > > E4X) will apparently do the same, even though
                                                            > > Javascript-per-se parts and XHTML-constant parts use
                                                            > > different escaping mechanisms.
                                                            > > >
                                                            > > > A (X)HTML validator for content embedded within Javascript
                                                            > > should likely perform the same entity and numeric character
                                                            > > reference decoding logic as would apply when the Javascript
                                                            > > was read and interpreted -- resolve character entity
                                                            > > references and numeric character references -- and then
                                                            > > validate the results.
                                                            > > >
                                                            > > > There are many troublesome syntactically valid URIs (and
                                                            > > IRIs) that could appear within a URI reference in (X)HTML and
                                                            > > (X)HTML embedded within Javascript, but I think it is part of
                                                            > > the security requirements of the (X)HTML interpreter runtime
                                                            > > to manage and prevent those references. Because URIs (and
                                                            > > IRIs) can and sometimes do encode non-character byte streams,
                                                            > > looking for or managing the URI-encoding level would be inappropriate.
                                                            > > >
                                                            > > > Larry
                                                            > > >
                                                            > > > -----Original Message-----
                                                            > > > From: caplet@yahoogroups.com
                                                            > > [mailto:caplet@yahoogroups.com] On Behalf Of Mike Samuel
                                                            > > > Sent: Friday, October 19, 2007 10:30 PM
                                                            > > > To: caplet@yahoogroups.com
                                                            > > > Subject: Re: [caplet] Re: ADsafe, Take 6
                                                            > > >
                                                            > > >
                                                            > > > On 19/10/2007, David Hopwood
                                                            > > <david.hopwood@...> wrote:
                                                            > > > >
                                                            > > > >
                                                            > > > >
                                                            > > > >
                                                            > > > >
                                                            > > > >
                                                            > > > > Larry Masinter wrote:
                                                            > > > > > I think you got it backward: URIs are sequences of
                                                            > > characters, not bytes.
                                                            > > > >
                                                            > > > > URIs are sequences of characters that encode a sequence
                                                            > > of bytes, which
                                                            > > > > *may* in turn encode a sequence of Unicode characters.
                                                            > > >
                                                            > > > I still don't understand.
                                                            > > >
                                                            > > > My reading of the spec says that the first sequence of
                                                            > > characters is in ASCII.
                                                            > > >
                                                            > > > If that's the case, then an HTML validator should be able to reject
                                                            > > > any HTML attribute of type URI whose value contains a codepoint
                                                            > > > outside [0, 255] without making it impossible to express any
                                                            > > > valid URI.
                                                            > > > Does that sound right?
                                                            > > >
                                                            > > > If that's right, would it be appropriate for the error message to
                                                            > > > recommend re-encoding the out of range characters using a
                                                            > > > %-encoding of UTF-8? So "�" -> "%EF%BF%BD".
                                                            > > >
                                                            > > > Also, on terminology, is the below right?
                                                            > > > * An escaping is an n:1 mapping from strings in an alphabet A to
                                                            > > > strings in an alphabet which is a subset of A.
                                                            > > > * An encoding is a 1:1 mapping of strings over one
                                                            > > alphabet to strings
                                                            > > > over another alphabet.
                                                            > > >
                                                            > > > >
                                                            > > > >
                                                            > > > >
                                                            > > > >
                                                            > > > >
                                                            > > > >
                                                            > > > > For URIs that have some server-specific part, the
                                                            > > interpretation of the byte
                                                            > > > > sequence is up to that server. RFC 3986 *recommends*
                                                            > > that, where they encode
                                                            > > > > a string, the encoding used should be UTF-8. However,
                                                            > > there's no way to
                                                            > > > > enforce this (and no particular reason to enforce it).
                                                            > > So it is valid, for
                                                            > > > > example, to have "%FF" in an URL, even though that is
                                                            > > always an invalid byte
                                                            > > > > in UTF-8.
                                                            > > > >
                                                            > > > > > and in (X)HTML, "URI" is really "IRI" - the XHTML
                                                            > > spec allows full
                                                            > > > > > Unicode (10646) characters which are UTF8 and then
                                                            > > hex-encoded if you need
                                                            > > > > > an (old-fashioned) URI.
                                                            > > > >
                                                            > > > > XHTML still doesn't require that the sequence of bytes
                                                            > > is valid UTF-8.
                                                            > > > >
                                                            > > > > In any case, the immediate question was whether it is
                                                            > > reasonable to reject
                                                            > > > > any input that contains 65533 (U+FFFD REPLACEMENT
                                                            > > CHARACTER). IMHO it is:
                                                            > > > > this isn't a useful character in its own right; it
                                                            > > indicates that an
                                                            > > > > encoding error occurred in producing the input.
                                                            > > > >
                                                            > > > > --
                                                            > > > > David Hopwood <david.hopwood@...>
                                                            > >
                                                            > >
                                                          • Larry Masinter
                                                            Message 29 of 30 , Oct 23, 2007
                                                              On standards:

                                                              The benefit of HTTP and XML and HTML is not that they are a well-designed protocol, syntax, and language, but that there are many different and (more-or-less) interoperable implementations for many operating systems and languages, plus well-deployed support infrastructure, with enough general agreement about them at the lower layers that you can get on with defining the next layer up. So a "let's redesign them to be cleaner" effort isn't helpful, really. You'd have to be 10 times or 100 times better before getting traction.

                                                              Protocol designers don't "gloss over" escaping; protocol designers are software developers (or maybe developers-gone-bad) for whom escaping is generally an ugly after-the-fact design addition, or a compromise between allowing everything to be encoded and letting simple cases be encoded concisely. Think of it as Huffman coding at the design level.

                                                              <p>&lt;html&gt;... &lt;/html&gt;</p>

                                                              is how you write html in html, not because &#xxx; and &symbolname; are wonderful quoting mechanisms, but because the &entity; syntax was already there, and inventing another one for what seemed like an uncommon case appeared unnecessary.
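For readers who want to see the entity mechanism at work, a minimal Python sketch using the standard html module follows. The variable names are illustrative only.

```python
import html

inner = "<html>... </html>"

# Quote the inner document with entity references so the outer document
# treats it as text rather than markup.
embedded = "<p>" + html.escape(inner) + "</p>"
print(embedded)  # <p>&lt;html&gt;... &lt;/html&gt;</p>

# Unescaping that one layer recovers the original text exactly.
recovered = html.unescape(html.escape(inner))
print(recovered == inner)  # True
```

Each extra level of embedding adds one more round of escape(), which is the bookkeeping the next section is about.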

                                                              On quoting:

                                                              No matter what your escaping and encoding system, developers will have problems with them: you either understand the general principle or you don't.

                                                              I know (from ancient experience) that most programmers learning Lisp had trouble thinking about X, (QUOTE X), and (EVAL X). The problem is keeping track of the different layers of interpretation; it isn't the syntax.

                                                              Common Lisp added many other escaping conventions: `(let ((,x "abc\"def") (y ',z)) ,w) so it's hard to claim that S-expressions have consistent delimiters.

                                                              On layering of escaping/quoting:

                                                              The multiple layered quoting systems work well enough, because each layer does its own escaping/encoding and unescaping/unencoding and tools either hide or assist with the process. It's only when you're writing a program trying to process multiple layers simultaneously that you have trouble.

                                                              The reason that ADsafe is having trouble is that it is trying to do filtering without actually using the normal layer software for parsing and interpretation, and so skips what turns out to be necessary complexity. Try to write a regular expression that determines whether a Lisp program might divide by zero, and you'd have similar problems.
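A small Python sketch of the difference, using only standard-library parsers as stand-ins for "the normal layer software". This is not how ADsafe or JSLint works, and it is not a complete filter (browsers apply further normalization that is not modelled here); SrcCollector is a made-up name.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class SrcCollector(HTMLParser):
    """Collects iframe src values after the HTML layer has decoded them."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.srcs.append(value)

# The scheme is hidden behind a numeric character reference (&#115; is "s").
raw = '<iframe src="java&#115;cript:alert(1)"></iframe>'

# Pattern matching on the raw bytes skips the entity layer and misses it.
naive_hit = "javascript:" in raw.lower()

# Letting each layer do its own decoding exposes the real scheme.
collector = SrcCollector()
collector.feed(raw)
collector.close()
schemes = [urlsplit(src).scheme for src in collector.srcs]

print(naive_hit, schemes)  # False ['javascript']
```

The HTML parser undoes the entity layer and the URL parser then sees the scheme, so the check runs on what the next layer will actually interpret.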


                                                              On checking URLs:

                                                              I think you can't check for invalid URLs by examining a program's syntax because (a) programs can construct URLs, and you can't check for invalid (vs. valid) URLs any more than you can do all array bounds checking at compile time, and (b) the rules for what constitutes a "safe" URL are complicated and evolving. After all, a URL is just a reference to a registry of protocols, which requires the registering body to define some syntax for how the URL syntax might identify something or invoke some protocol or process. Each URL scheme has its own syntax and story for what might be "safe" to execute in different contexts, but that depends as much on the implementation of the URL-interpreter as anything else.

                                                              If you're going to do dynamic URL safety checking, there's not much point in doing syntactic checking, because you'll get lots of false positives ("this is unsafe" when it isn't) and won't catch any more problems syntactically than would be caught by the run-time check.
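A minimal sketch of what a run-time check might look like, in Python. The allow-list policy and the function name is_allowed are assumptions for illustration; as noted above, real rules for what counts as "safe" are more complicated and scheme-specific.

```python
from urllib.parse import urlsplit

# Hypothetical policy: only these schemes may be loaded.
ALLOWED_SCHEMES = {"http", "https"}

def is_allowed(url: str) -> bool:
    """Check the URL at the point of use, after the program has built it."""
    return urlsplit(url.strip()).scheme in ALLOWED_SCHEMES

# A URL assembled at run time; no literal "javascript:" appears in the
# program text for a syntactic checker to find.
constructed = "java" + "script" + ":alert(1)"

print(is_allowed(constructed))               # False
print(is_allowed("http://www.JSLint.com/"))  # True
```

Under this policy relative references (empty scheme) are rejected too; whether that is desirable depends on the embedding context.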

                                                              Larry
                                                            • Mike Samuel
                                                              Message 30 of 30 , Oct 23, 2007
                                                                On 23/10/2007, Larry Masinter <lmm@...> wrote:
                                                                >
                                                                > On standards:
                                                                >
                                                                > The benefit of HTTP and XML and HTML is not that they are well-designed protocol and syntax and language, but that there are many different and (more-or-less) interoperable implementations for many operating systems and languages and well-deployed support infrastructure; with enough general agreement about them at the lower layers that you can get on with it defining the next layer up. So a "let's redesign them to be cleaner" effort isn't helpful, really. You'd have to be 10 times or 100 times better before getting traction.
                                                                >
                                                                > Protocol designers don't "gloss over" escaping; protocol designers are software developers (or maybe developers-gone-bad) for whom escaping is generally an ugly after-the-fact design addition or compromise between allowing everything to be encoded but letting simple cases be encoded concisely. Think of it as Huffman coding at the design level.

                                                                Maybe I'm being horribly unfair to protocol designers, but implementors do gloss over escaping.

                                                                An example is entities in URIs embedded in HTML.
                                                                <a href="foo?bar=a&baz=b">
                                                                is invalid HTML, but browser implementors, faced with a bunch of
                                                                markup writers who don't understand escaping, decided that they should
                                                                guess what it means, so it means something vastly different from
                                                                <a href="foo?bar=a&=c">
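
A small sketch of the guessing, using Python's html module as a stand-in for a lenient decoder. Browsers differ in the details for attribute values, so treat this as an illustration of the ambiguity rather than a description of any one browser. The &copy=c example and the attribute_markup helper are mine, chosen for illustration.

```python
import html


def attribute_markup(url):
    """Embed a URL in an href attribute, escaping it for the HTML layer."""
    return '<a href="%s">' % html.escape(url, quote=True)


# Done properly, the ampersand is escaped on the way in and restored on the
# way out, so the URL layer never notices the HTML layer.
escaped = attribute_markup("foo?bar=a&baz=b")
# escaped == '<a href="foo?bar=a&amp;baz=b">'

# Left unescaped, the decoder has to guess. "baz" is not an entity name, so
# the text passes through unchanged ...
guess_1 = html.unescape("foo?bar=a&baz=b")
# guess_1 == 'foo?bar=a&baz=b'

# ... but "copy" is an entity name, and this decoder accepts it without the
# trailing semicolon, so the parameter name silently becomes a symbol.
guess_2 = html.unescape("foo?bar=a&copy=c")
# guess_2 == 'foo?bar=a\u00a9=c'

if __name__ == "__main__":
    print(escaped, guess_1, guess_2, sep="\n")
```

Whether a parameter survives depends on whether its name happens to collide with an entity name, which is not something a markup writer should have to know.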


                                                                > <p>&lt;html&gt;... &lt;/html&gt;</p>
                                                                >
                                                                > is how you write html in html, not because &#xxx; and &symbolname; are wonderful quoting mechanisms, but because the &entity; syntax was already there, and inventing another one for what seemed like an uncommon case appeared unnecessary.

                                                                That's how it's done today, and that's the source of a huge number of problems.

                                                                And a point I was trying to make (don't remember if I made it) was
                                                                that embedding one structured language in another shouldn't require
                                                                escaping. You should be able to include the parse tree directly using
                                                                some kind of consistent (quote ...) syntax.


                                                                > On quoting:
                                                                >
                                                                > No matter what your escaping and encoding system, developers will have problems with them: you either understand the general principle or you don't.

                                                                Agreed. But there's a third possibility:

                                                                I understand them and I've got deadlines and it'll usually work unless
                                                                some lunatic gives their kid a name with apostrophes, so I'll not
                                                                think through it now, and then come back to it once I've gotten some
                                                                sleep which'll happen never.


                                                                > I know (from ancient experience) that most programmers learning Lisp had trouble thinking about X, (QUOTE X) and (EVAL X). The problem is keeping track of the different layers of interpretation; it isn't the syntax.

                                                                I think there's something fundamentally different about embedding by
                                                                escaping and embedding by attaching another language's parse
                                                                tree to your own. Escaping requires clients of your language to be
                                                                able to also parse any embeddable languages, and agree on the parse
                                                                tree that any given string should produce.
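
As a rough approximation of what I mean, compare building markup by string concatenation with building a tree and letting the serializer worry about the wire format. The sketch below uses Python's xml.etree.ElementTree. It is only an analogy, since the URL is still an opaque string in the tree rather than a parsed subtree, and the link helper is hypothetical.

```python
import xml.etree.ElementTree as ET


def link(href, text):
    """Build an <a> node; values are attached as data, never pasted as markup."""
    anchor = ET.Element("a", {"href": href})
    anchor.text = text
    return anchor


paragraph = ET.Element("p")
paragraph.append(link("foo?bar=a&baz=b", "1 < 2"))

# Escaping happens once, at serialization time, and is the serializer's job.
markup = ET.tostring(paragraph, encoding="unicode")
# markup == '<p><a href="foo?bar=a&amp;baz=b">1 &lt; 2</a></p>'

# Parsing the output gives back the same values that were attached.
round_trip = ET.fromstring(markup)[0]

if __name__ == "__main__":
    print(markup)
    print(round_trip.get("href"), round_trip.text)
```

The author of the link never writes an entity by hand, so there is no layer of escaping to forget.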


                                                                > Common Lisp added many other escaping conventions: `(let ((,x "abc\"def") (y ',z)) ,w) so it's hard to claim that S-expressions have consistent delimiters.

                                                                Fair enough, but that's only a concern if you are embedding using escaping.

                                                                > On layering of escaping/quoting:
                                                                >
                                                                > The multiple layered quoting systems work well enough, because each layer does its own escaping/encoding and unescaping/unencoding and tools either hide or assist with the process. It's only when you're writing a program trying to process multiple layers simultaneously that you have trouble.

                                                                Quite right. If you're trying to identify a "secure" subset of any
                                                                system (like web applications) which uses multiple languages then you
                                                                have two choices:
                                                                - identify a safe subset of each language individually
                                                                - deal with all languages at once and try to identify a safe subset of the union

                                                                The first approach is certainly easier but has
                                                                least-common-denominator problems -- you have to exclude things that
                                                                might be allowable under the second approach.

                                                                An example
                                                                <a href="javascript:foo()">clicky</a>

                                                                There are three languages in play here: Javascript, URIs, and HTML.

                                                                If you consider the three individually, you have to conclude that the
                                                                HTML is safe, the URI is not, and you never consider the javascript
                                                                itself.

                                                                If you deal with all languages at once, then you can apply your
                                                                javascript verification recursively to the URI.

                                                                To do that you have to parse all three languages and then hope that
                                                                your parse trees agree with browsers' interpretations. My point is
                                                                that's easier if you have a unified parse tree representation for all
                                                                languages that appear in a document.
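
A minimal sketch of the two approaches applied to the link above, assuming Python's html.parser for the HTML layer. Everything else here is hypothetical: verify_js is a toy stand-in for a real Javascript verifier (it accepts only a zero-argument call to an allow-listed function), and the scheme handling only approximates what browsers do.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

ALLOWED_FUNCTIONS = {"foo"}  # hypothetical allow-list
SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):(.*)$", re.DOTALL)
C0_AND_SPACE = "".join(chr(c) for c in range(0x21))


def verify_js(source):
    """Toy stand-in for a Javascript verifier: allow-listed f() calls only."""
    match = re.fullmatch(r"\s*([A-Za-z_]\w*)\(\)\s*;?\s*", source)
    return bool(match) and match.group(1) in ALLOWED_FUNCTIONS


def split_uri(value):
    """Return (scheme, rest), roughly mimicking browser whitespace handling."""
    cleaned = re.sub(r"[\t\n\r]", "", value.strip(C0_AND_SPACE))
    match = SCHEME.match(cleaned)
    if match:
        return match.group(1).lower(), match.group(2)
    return None, cleaned  # no scheme: treat as a relative reference


def href_is_safe(value, cross_layer):
    scheme, rest = split_uri(value)
    if scheme is None or scheme in ("http", "https"):
        return True
    if scheme == "javascript" and cross_layer:
        # Cross-layer: undo the URI layer, then verify the Javascript inside.
        return verify_js(unquote(rest))
    return False  # per-layer: every javascript: URI is rejected unseen


class LinkChecker(HTMLParser):
    def __init__(self, cross_layer):
        super().__init__()
        self.cross_layer = cross_layer
        self.safe = True

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value is not None:
                if not href_is_safe(value, self.cross_layer):
                    self.safe = False


def fragment_is_safe(fragment, cross_layer):
    checker = LinkChecker(cross_layer)
    checker.feed(fragment)
    checker.close()
    return checker.safe
```

With cross_layer=False the link is rejected because the URI layer alone cannot tell foo() from anything else; with cross_layer=True it is accepted, and a link to alert(42) is still rejected. Every step assumes this code splits the layers the same way the browser does, which is the part that worries me.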


                                                                > The reason that ADsafe is having trouble is that it is trying to do filtering without actually using the normal layer software for parsing and interpretation, and so skips what turns out to be necessary complexity. Try to write a regular expression that determines whether a Lisp program might divide by zero, and you'd have similar problems.
                                                                >
                                                                >
                                                                > On checking URLs:
                                                                >
                                                                > I think you can't check for invalid URLs by examining a program's syntax because (a) programs can construct URLs, and you can't check for invalid (vs. valid) URLs any more than you can do all array bounds checking at compile time and (b) the rules for what constitutes a "safe" URL are complicated and evolving. After all, a URL is just a reference to a registry of protocols, which requires the registering body to define some syntax for how the URL might identify something or invoke some protocol or process. Each URL scheme has its own syntax and story for what might be "safe" to execute in different contexts, but that depends as much on the implementation of the URL-interpreter as anything else.

                                                                I disagree. URLs are more than a pointer into a registry of handlers.
                                                                Specifically, javascript: and data: URLs contain data that can be
                                                                classified.

                                                                Again, you'll get fewer false negatives if you do that classification
                                                                in the context of the larger document.
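
For instance, a data: URL carries its own media type and payload, so a checker can pull both out and decide what to do with the contents. A rough sketch follows. The parsing follows the data:[<mediatype>][;base64],<data> layout, while the classify policy and its category names are purely hypothetical.

```python
import base64
from urllib.parse import unquote_to_bytes


def parse_data_url(url):
    """Split a data: URL into (media_type, decoded_bytes)."""
    scheme, _, rest = url.partition(":")
    if scheme.lower() != "data" or "," not in rest:
        raise ValueError("not a data: URL")
    header, _, payload = rest.partition(",")
    params = header.split(";")
    # An omitted media type defaults to text/plain.
    media_type = params[0].strip().lower() or "text/plain"
    is_base64 = any(p.strip().lower() == "base64" for p in params[1:])
    raw = unquote_to_bytes(payload)  # undo the URI layer's percent-encoding
    body = base64.b64decode(raw) if is_base64 else raw
    return media_type, body


def classify(url):
    """Hypothetical policy: decide what further checking the payload needs."""
    media_type, _body = parse_data_url(url)
    if media_type == "text/plain":
        return "inert"
    if media_type in ("text/html", "text/javascript", "application/javascript"):
        # The payload is another document or program; a fuller checker would
        # hand the decoded body to the HTML or Javascript verifier here.
        return "needs-verification"
    return "rejected"
```

A call such as classify("data:,hello") falls in the first bucket, while an HTML payload would be handed back to the same verifier that is checking the enclosing document.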


                                                                > If you're going to do dynamic URL safety checking, there's not much point in doing syntactic checking, because you'll get lots of false positives ("this is unsafe" when it isn't) and won't catch any more problems syntactically than would be caught by the run-time check.

                                                                There's always reason to do static checking since it lets you skip
                                                                runtime checks :) But runtime checks are out of scope for ADsafe.


                                                                > Larry