Re: [caplet] Re: ADsafe, Take 6

  • Mike Samuel
    Oct 21, 2007
      On 21/10/2007, Larry Masinter <lmm@...> wrote:
      > To answer your direct questions:
      >
      > I don't know any formal definition for "escaping" except as a part of "encoding" -- you encode a sequence of bytes into (a subset of) US-ASCII by translating each (allowed) byte into its corresponding ASCII character, but translating some (disallowed) bytes into a different sequence.

      Ok. I think it's useful to make a distinction between the n:1
      mappings and the 1:1 mappings.
      If you're escaping (which I defined as n:1), you have to unescape
      before comparing strings, while you can check against an encoded
      string by either decoding the one or encoding the other.
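      A minimal sketch of that distinction, assuming Python's standard
      urllib.parse.unquote as the %-unescaper; the sample strings are
      made up for illustration:

```python
from urllib.parse import unquote

# Several %-escaped spellings correspond to one unescaped string.
spellings = ["javascript", "%6Aavascript", "%6a%61vascript"]

# Comparing the raw spellings misses two of the three.
raw_matches = [s == "javascript" for s in spellings]

# Unescaping first makes all three compare equal.
unescaped_matches = [unquote(s) == "javascript" for s in spellings]

# UTF-8 is an encoding: check by decoding one side or encoding the other.
received = b"caf\xc3\xa9"
expected = "caf\u00e9"
same_by_decoding = received.decode("utf-8") == expected
same_by_encoding = received == expected.encode("utf-8")
```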

      One way to check attribute content in a markup language is to keep a
      stack of escaping and encoding conventions as you examine the document
      in the left to right pass.
      To check whether an iframe's src's protocol is javascript:, you deal
      with the following stack:
      protocol <(%-escaped)> uri <(html-entity-escaped)> html-attribute <(UTF-8 encoding)> bytes
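      Peeling that stack from the bytes upward might look roughly like
      the sketch below. The function name is_javascript_src is
      hypothetical, the U+FFFD rejection follows the suggestion made
      elsewhere in this thread, and whether a given browser actually
      honours %-escapes in the protocol is an assumption this sketch
      does not settle; it only errs on the conservative side.

```python
import html
from urllib.parse import unquote


def is_javascript_src(attr_bytes: bytes) -> bool:
    """Hypothetical check: does a raw src attribute name the javascript: protocol?"""
    # Layer 1: bytes -> html-attribute text (UTF-8 is an encoding).
    attr_text = attr_bytes.decode("utf-8", errors="replace")
    if "\ufffd" in attr_text:
        # U+FFFD signals an encoding error (or was in the input); reject it.
        raise ValueError("attribute contains U+FFFD REPLACEMENT CHARACTER")
    # Layer 2: html-attribute -> uri (undo HTML entity escaping).
    uri = html.unescape(attr_text)
    # Layer 3: uri -> protocol (undo %-escaping, then split at the first ':').
    head, sep, _rest = unquote(uri).partition(":")
    if not sep:
        return False  # no ':' at all, so no protocol to inspect
    return head.strip().lower() == "javascript"
```

      Each escaping layer is a separate place where "javascript" can be
      spelled differently, so the comparison happens only after both
      have been undone.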

      If you're sending a message and you want it to be interpreted as you
      intend, then you have to make sure that the recipient of the message
      will use the same escaping/encodings, and if you want to verify
      properties of the message, then you have to consider every escaping,
      but not necessarily every encoding.

      So for the javascript: check, there are three points of attack, but the
      encoding can be considered entirely separately, leaving you two.




      >
      > I think it's reasonable to (a) reject Javascript that contains 65533 (U+FFFD REPLACEMENT CHARACTER) and (b) to not look for, parse, or handle in any special way URI references within (X)HTML attributes or content.

      How would you detect URLs that can execute or import scripts without
      distinguishing attributes that contain URIs or URI references, given
      that it is a goal of ADsafe to allow iframes to external srcs?



      >
      > Talking about this is complicated because the same concept appears with different encodings:
      >
      > * The HTTP protocol does its work in sequences of bytes, but the spec is written in terms of "characters".
      >
      > * The XML specification defines XML as a sequence of (Unicode) characters, encoded in UTF8 or UTF16 (or some other encoding) but also with &char; character entities and &#nnnn; numeric character references. XHTML as an XML language follows this; whether HTML follows this depends on the HTML version.
      >
      > * the Javascript specification (at least ECMA-262) defines Javascript as a sequence of (Unicode) characters encoded in UTF8 or UTF16. E4X (ECMA 357) seems to follow XML in using character entity references and numeric character references to encode characters that would otherwise be disallowed.
      >
      > * The URI specification defines a URI as a sequence of characters, taken from the repertoire of US-ASCII characters, with the encoding chosen by the protocol/format that embeds it; it also defines an encoding (%xx) for bytes that would otherwise correspond to disallowed or reserved characters.
      >
      > * The IRI specification defines an IRI similarly, but allows a larger repertoire of characters.
      >
      > Parsing an XML stream into an XML DOM (including HTML) will translate the UTF8, UTF16 (or other encoding) as well as character and numeric character entity references, into a sequence of characters. Parsing a Javascript string (using E4X) will apparently do the same, even though Javascript-per-se parts and XHTML-constant parts use different escaping mechanisms.
      >
      > A (X)HTML validator for content embedded within Javascript should likely perform the same entity and numeric character reference decoding logic as would apply when the Javascript was read and interpreted -- resolve character entity references and numeric character references -- and then validate the results.
      >
      > There are many troublesome syntactically valid URIs (and IRIs) that could appear within a URI reference in (X)HTML and (X)HTML embedded within Javascript, but I think it is part of the security requirements of the (X)HTML interpreter runtime to manage and prevent those references. Because URIs (and IRIs) can and sometimes do encode non-character byte streams, looking for or managing the URI-encoding level would be inappropriate.
      >
      > Larry
      >
      > -----Original Message-----
      > From: caplet@yahoogroups.com [mailto:caplet@yahoogroups.com] On Behalf Of Mike Samuel
      > Sent: Friday, October 19, 2007 10:30 PM
      > To: caplet@yahoogroups.com
      > Subject: Re: [caplet] Re: ADsafe, Take 6
      >
      >
      > On 19/10/2007, David Hopwood <david.hopwood@...> wrote:
      > > Larry Masinter wrote:
      > > > I think you got it backward: URIs are sequences of characters, not bytes.
      > >
      > > URIs are sequences of characters that encode a sequence of bytes, which
      > > *may* in turn encode a sequence of Unicode characters.
      >
      > I still don't understand.
      >
      > My reading of the spec says that the first sequence of characters is in ASCII.
      >
      > If that's the case, then an HTML validator should be able to reject
      > any HTML attribute of type URI whose value contains a codepoint
      > outside [0, 255] without making it impossible to express any valid URI.
      > Does that sound right?
      >
      > If that's right, would it be appropriate for the error message to
      > recommend re-encoding the out of range characters using a %-encoding
      > of UTF-8? So "�" -> "%EF%BF%BD".
      >
      > Also, on terminology, is the below right?
      > * An escaping is an n:1 mapping from strings in an alphabet A to
      > strings in an alphabet which is a subset of A.
      > * An encoding is a 1:1 mapping of strings over one alphabet to strings
      > over another alphabet.
      >
      > > For URIs that have some server-specific part, the interpretation of the byte
      > > sequence is up to that server. RFC 3986 *recommends* that, where they encode
      > > a string, the encoding used should be UTF-8. However, there's no way to
      > > enforce this (and no particular reason to enforce it). So it is valid, for
      > > example, to have "%FF" in an URL, even though that is always an invalid byte
      > > in UTF-8.
      > >
      > > > and in (X)HTML, "URI" is really "IRI" – the XHTML spec allows full
      > > > Unicode (10646) characters which are UTF8 and then hex-encoded if you need
      > > > an (old-fashioned) URI.
      > >
      > > XHTML still doesn't require that the sequence of bytes is valid UTF-8.
      > >
      > > In any case, the immediate question was whether it is reasonable to reject
      > > any input that contains 65533 (U+FFFD REPLACEMENT CHARACTER). IMHO it is:
      > > this isn't a useful character in its own right; it indicates that an
      > > encoding error occurred in producing the input.
      > >
      > > --
      > > David Hopwood <david.hopwood@...>