
RE: [caplet] Re: ADsafe, Take 6

  • Larry Masinter
    Oct 21, 2007
      To answer your direct questions:

      I don't know of any formal definition of "escaping" except as part of "encoding" -- you encode a sequence of bytes into (a subset of) US-ASCII by translating each (allowed) byte into its corresponding ASCII character, but translating some (disallowed) bytes into a different sequence.
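      As a concrete sketch of that byte-to-ASCII translation (the function name and the allowed-set predicate below are illustrative, not from any spec):

```javascript
// Illustrative sketch: encode a byte sequence into US-ASCII by passing
// allowed bytes through and translating disallowed bytes into "%XX".
function percentEncodeBytes(bytes, isAllowed) {
  let out = "";
  for (const b of bytes) {
    out += isAllowed(b)
      ? String.fromCharCode(b) // allowed byte -> its ASCII character
      : "%" + b.toString(16).toUpperCase().padStart(2, "0"); // disallowed -> %XX
  }
  return out;
}

// RFC 3986 "unreserved" bytes: ALPHA / DIGIT / "-" / "." / "_" / "~"
const unreserved = b =>
  (b >= 0x41 && b <= 0x5a) || (b >= 0x61 && b <= 0x7a) ||
  (b >= 0x30 && b <= 0x39) || [0x2d, 0x2e, 0x5f, 0x7e].includes(b);
```

      For example, percentEncodeBytes([0x61, 0x2f, 0xff], unreserved) yields "a%2F%FF".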

      I think it's reasonable to (a) reject Javascript that contains 65533 (U+FFFD REPLACEMENT CHARACTER) and (b) to not look for, parse, or handle in any special way URI references within (X)HTML attributes or content.
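      Check (a) is cheap to implement; a minimal sketch (the function name is mine):

```javascript
// Sketch of check (a): flag source text containing U+FFFD, which is not
// useful in its own right and usually signals an upstream decoding error.
function containsReplacementChar(source) {
  return source.includes("\uFFFD");
}
```

      So containsReplacementChar("var x = 1;") is false, while any source that went through a lossy decode step and picked up U+FFFD would be rejected.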

      -----------------------------------------

      Talking about this is complicated because the same concept appears with different encodings:

      * The HTTP protocol does its work in sequences of bytes, but the spec is written in terms of "characters".


      * The XML specification defines XML as a sequence of (Unicode) characters, encoded in UTF-8 or UTF-16 (or some other encoding) but also with &name; character entity references and &#nnnn; numeric character references. XHTML as an XML language follows this; whether HTML follows this depends on the HTML version.

      * the Javascript specification (at least ECMA-262) defines Javascript as a sequence of (Unicode) characters encoded in UTF8 or UTF16. E4X (ECMA 357) seems to follow XML in using character entity references and numeric character references to encode characters that would otherwise be disallowed.

      * The URI specification defines a URI as a sequence of characters, taken from the repertoire of US-ASCII characters, with the encoding chosen by the protocol/format that embeds it; it also defines an encoding (%xx) for bytes that would otherwise correspond to disallowed or reserved characters.

      * The IRI specification defines an IRI similarly, but allows a larger repertoire of characters.
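      One way to see that the URI's %xx level encodes bytes rather than characters: JavaScript's built-in decodeURIComponent assumes those bytes are UTF-8 and throws when they are not.

```javascript
// %xx escapes denote bytes; decodeURIComponent interprets them as UTF-8.
decodeURIComponent("%E2%82%AC"); // "€" -- these three bytes are valid UTF-8
// decodeURIComponent("%FF");    // throws URIError: 0xFF never occurs in UTF-8
```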

      Parsing an XML stream into an XML DOM (including HTML) will translate the UTF8, UTF16 (or other encoding) as well as character and numeric character entity references, into a sequence of characters. Parsing a Javascript string (using E4X) will apparently do the same, even though Javascript-per-se parts and XHTML-constant parts use different escaping mechanisms.
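      A minimal sketch of that reference-resolution step (decodeCharRefs is a stand-in for a full XML entity resolver; it handles only numeric references and the five predefined entities):

```javascript
// Resolve &#nnnn; / &#xhhhh; numeric references and XML's five
// predefined entities into characters; leave anything unrecognized alone.
function decodeCharRefs(text) {
  const named = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };
  return text.replace(/&(#[xX]?[0-9a-fA-F]+|\w+);/g, (match, body) => {
    if (body[0] === "#") {
      const hex = body[1] === "x" || body[1] === "X";
      return String.fromCodePoint(parseInt(body.slice(hex ? 2 : 1), hex ? 16 : 10));
    }
    return body in named ? named[body] : match;
  });
}
```

      For example, decodeCharRefs("a &lt; b &#38; c") yields "a < b & c" -- the same character sequence an XML parser would hand to the DOM.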

      A (X)HTML validator for content embedded within Javascript should likely perform the same entity and numeric character reference decoding logic as would apply when the Javascript was read and interpreted -- resolve character entity references and numeric character references -- and then validate the results.
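      In other words (a hedged sketch -- both the decoder and validate are placeholders for real implementations):

```javascript
// Decode numeric character references the way the runtime would,
// then hand the decoded text to the actual validator.
function decodeNumericRefs(text) {
  return text
    .replace(/&#([0-9]+);/g, (_, d) => String.fromCodePoint(parseInt(d, 10)))
    .replace(/&#x([0-9a-fA-F]+);/g, (_, h) => String.fromCodePoint(parseInt(h, 16)));
}

function validateEmbeddedHtml(fragment, validate) {
  return validate(decodeNumericRefs(fragment));
}
```

      Here validateEmbeddedHtml("&#x41;&#66;", s => /^[A-Z]+$/.test(s)) returns true, because the references decode to "AB" before the validation predicate runs.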

      There are many troublesome syntactically valid URIs (and IRIs) that could appear within a URI reference in (X)HTML and (X)HTML embedded within Javascript, but I think it is part of the security requirements of the (X)HTML interpreter runtime to manage and prevent those references. Because URIs (and IRIs) can and sometimes do encode non-character byte streams, looking for or managing the URI-encoding level would be inappropriate.

      Larry

      -----Original Message-----
      From: caplet@yahoogroups.com [mailto:caplet@yahoogroups.com] On Behalf Of Mike Samuel
      Sent: Friday, October 19, 2007 10:30 PM
      To: caplet@yahoogroups.com
      Subject: Re: [caplet] Re: ADsafe, Take 6

      On 19/10/2007, David Hopwood <david.hopwood@...> wrote:
      >
      > Larry Masinter wrote:
      > > I think you got it backward: URIs are sequences of characters, not bytes.
      >
      > URIs are sequences of characters that encode a sequence of bytes, which
      > *may* in turn encode a sequence of Unicode characters.

      I still don't understand.

      My reading of the spec says that the first sequence of characters is in ASCII.

      If that's the case, then an HTML validator should be able to reject
      any HTML attribute of type URI whose value contains a codepoint
      outside [0, 255] without making it impossible to express any valid URI.
      Does that sound right?

      If that's right, would it be appropriate for the error message to
      recommend re-encoding the out-of-range characters using the %-encoding
      of their UTF-8 bytes? So U+FFFD ("�") -> "%EF%BF%BD".
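      (For what it's worth, JavaScript's built-in encodeURIComponent already performs exactly that re-encoding: UTF-8 bytes, %-escaped.)

```javascript
encodeURIComponent("\uFFFD"); // "%EF%BF%BD" -- UTF-8 bytes of U+FFFD, %-escaped
encodeURIComponent("\u00E9"); // "%C3%A9"    -- same recipe for "é"
```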



      Also, on terminology, are the definitions below right?
      * An escaping is an n:1 mapping from strings in an alphabet A to
      strings in an alphabet which is a subset of A.
      * An encoding is a 1:1 mapping of strings over one alphabet to strings
      over another alphabet.



      >
      > For URIs that have some server-specific part, the interpretation of the byte
      > sequence is up to that server. RFC 3986 *recommends* that, where they encode
      > a string, the encoding used should be UTF-8. However, there's no way to
      > enforce this (and no particular reason to enforce it). So it is valid, for
      > example, to have "%FF" in an URL, even though that is always an invalid byte
      > in UTF-8.
      >
      > > and in (X)HTML, "URI" is really "IRI" – the XHTML spec allows full
      > > Unicode (10646) characters which are UTF8 and then hex-encoded if you need
      > > an (old-fashioned) URI.
      >
      > XHTML still doesn't require that the sequence of bytes is valid UTF-8.
      >
      > In any case, the immediate question was whether it is reasonable to reject
      > any input that contains 65533 (U+FFFD REPLACEMENT CHARACTER). IMHO it is:
      > this isn't a useful character in its own right; it indicates that an
      > encoding error occurred in producing the input.
      >
      > --
      > David Hopwood <david.hopwood@...>