RE: [caplet] Re: ADsafe, Take 6
Oct 21, 2007
To answer your direct questions:
I don't know of any formal definition for "escaping" except as a part of "encoding" -- you encode a sequence of bytes into (a subset of) US-ASCII by translating each (allowed) byte into its corresponding ASCII character, but translating each (disallowed) byte into a different sequence of allowed characters.
Talking about this is complicated because the same concept appears with different encodings:
* The HTTP protocol does its work in sequences of bytes, but the spec is written in terms of "characters".
* The XML specification defines XML as a sequence of (Unicode) characters, encoded in UTF-8 or UTF-16 (or some other encoding) but also with &name; character entities and numeric character references. XHTML as an XML language follows this; whether HTML follows this depends on the HTML version.
* The URI specification defines a URI as a sequence of characters, taken from the repertoire of US-ASCII characters, with the encoding chosen by the protocol/format that embeds it; it also defines an encoding (%xx) for bytes that would otherwise correspond to disallowed or reserved characters.
* The IRI specification defines an IRI similarly, but allows a larger repertoire of characters.
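
To make the %xx point concrete, here is a rough sketch using Python's standard urllib.parse (my choice of tool for illustration, not something the specs prescribe). The sample bytes are made up. It escapes a byte sequence into ASCII characters and back, and shows that the recovered bytes need not be valid UTF-8:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Made-up sample: UTF-8 bytes for "café", a space, and a lone 0xFF byte.
raw = b"caf\xc3\xa9 \xff"

# Allowed bytes map to their ASCII characters; everything else becomes %xx.
escaped = quote_from_bytes(raw, safe="")
print(escaped)

# The escaping is reversible at the byte level.
restored = unquote_to_bytes(escaped)
print(restored == raw)

# Whether those bytes mean anything as text is a separate question:
# 0xFF can never appear in well-formed UTF-8.
try:
    restored.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False
print(is_utf8)
```

The escaped form uses only US-ASCII characters, so it can be carried in any character encoding that the embedding format chooses.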
From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of Mike Samuel
Sent: Friday, October 19, 2007 10:30 PM
Subject: Re: [caplet] Re: ADsafe, Take 6
On 19/10/2007, David Hopwood <david.hopwood@...> wrote:
> Larry Masinter wrote:
> > I think you got it backward: URIs are sequences of characters, not bytes.
> URIs are sequences of characters that encode a sequence of bytes, which
> *may* in turn encode a sequence of Unicode characters.
I still don't understand.
My reading of the spec says that the first sequence of characters is in ASCII.
If that's the case, then an HTML validator should be able to reject
any HTML attribute of type URI whose value contains a codepoint
outside [0, 255] without making it impossible to express any valid URI.
Does that sound right?
If that's right, would it be appropriate for the error message to
recommend re-encoding the out of range characters using a %-encoding
of UTF-8? So "�" -> "%EF%BF%BD".
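
A minimal sketch of such a check, in Python for illustration only. The function name check_uri_attribute is hypothetical, and the cutoff is an assumption: it treats anything outside US-ASCII (code points 128 and up) as out of range, so a validator using the [0, 255] range above would change the threshold.

```python
from urllib.parse import quote

ASCII_LIMIT = 128  # assumption: only US-ASCII code points are accepted as-is


def check_uri_attribute(value):
    """Hypothetical validator helper.

    Returns (ok, suggestion). When ok is False, suggestion is the value
    with each out-of-range character replaced by the %-encoding of its
    UTF-8 bytes. ASCII characters, including existing %xx escapes and
    reserved characters, are left untouched.
    """
    if all(ord(ch) < ASCII_LIMIT for ch in value):
        return True, value
    suggestion = "".join(
        ch if ord(ch) < ASCII_LIMIT else quote(ch, safe="")
        for ch in value
    )
    return False, suggestion


ok, suggestion = check_uri_attribute("http://example.com/\ufffd")
print(ok, suggestion)
```

The suggestion is only a recommendation for the error message; the sketch does not decide whether U+FFFD itself ought to be rejected outright.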
Also, on terminology, is the below right?
* An escaping is an n:1 mapping from strings in an alphabet A to
strings in an alphabet which is a subset of A.
* An encoding is a 1:1 mapping of strings over one alphabet to strings
over another alphabet.
> For URIs that have some server-specific part, the interpretation of the byte
> sequence is up to that server. RFC 3986 *recommends* that, where they encode
> a string, the encoding used should be UTF-8. However, there's no way to
> enforce this (and no particular reason to enforce it). So it is valid, for
> example, to have "%FF" in an URL, even though that is always an invalid byte
> in UTF-8.
> > and in (X)HTML, "URI" is really "IRI" – the XHTML spec allows full
> > Unicode (10646) characters which are UTF8 and then hex-encoded if you need
> > an (old-fashioned) URI.
> XHTML still doesn't require that the sequence of bytes is valid UTF-8.
> In any case, the immediate question was whether it is reasonable to reject
> any input that contains 65533 (U+FFFD REPLACEMENT CHARACTER). IMHO it is:
> this isn't a useful character in its own right; it indicates that an
> encoding error occurred in producing the input.
> David Hopwood <david.hopwood@...>