Re: JSON and the Unicode Standard
--- In email@example.com, "Douglas Crockford" <douglas@...> wrote:
> A receiver can do what it chooses to with the character codes it receives. If it wants to delete them or reject them, that is its business.
Not if it's Unicode. It is a common misconception that "Unicode" is just a set of code points, like, say, ASCII or EBCDIC. It is not. With ASCII, you can "delete" characters from a stream and it's still ASCII. Code points in Unicode have semantics, and deleting them can alter the meaning of the string in surprising and unexpected ways.
Previous versions of the Unicode standard had a clause that permitted deleting code points from a string (though it was not recommended, à la SHOULD NOT). Later versions of the Unicode standard do not permit this, and it has been verboten for some time.
Some of the most compelling reasons why deleting characters is forbidden are covered by the section "Non-Visual Security Issues": http://www.unicode.org/reports/tr36/#Canonical_Represenation .
There is no language in RFC 4627 that I can find that supports your interpretation. There is an awful lot of language in the Unicode standard that unambiguously says that you cannot "delete" characters from a Unicode "string". There are also compelling arguments in TR#36 for why arbitrarily deleting characters is a huge mistake.
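To make the point concrete, here is a minimal Python sketch of how deleting code points can change what a string means. The filter and the helper names (naive_filter_passes, delete_code_points) are hypothetical and only for illustration; they are not taken from TR#36 or from any JSON implementation.

```python
def naive_filter_passes(text: str) -> bool:
    # Hypothetical security check: reject text containing a script tag.
    return "<script>" not in text


def delete_code_points(text: str, unwanted: set) -> str:
    # The "just delete the characters I don't like" approach.
    return "".join(ch for ch in text if ch not in unwanted)


# U+FFFE is a noncharacter. The filter sees no "<script>" here...
payload = "<scr\ufffeipt>"
passed_filter = naive_filter_passes(payload)

# ...but a later stage that silently deletes U+FFFE recreates the tag.
cleaned = delete_code_points(payload, {"\ufffe"})

# Deleting a combining mark (U+0301 COMBINING ACUTE ACCENT) turns
# the decomposed form of "résumé" into a different word, "resume".
decomposed = "re\u0301sume\u0301"
stripped = delete_code_points(decomposed, {"\u0301"})

print(passed_filter, cleaned, stripped)
```

The deletion step is harmless on ASCII, but here it produces text that an earlier check never saw, which is the sort of surprise I mean.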
> But a JSON channel should not interfere with or bias the communication.
This statement is contrary to what you just said. If it is deleting characters, it is obviously interfering and biasing the communication.
A standard, and its interpretation, should strive to be unambiguous. An interpretation that boils down to "An implementation MAY interfere with or bias the communication, but an implementation SHOULD NOT interfere with or bias the communication" is meaningless and nonsensical.
> It should faithfully deliver what the sender sent, provided that it conforms to the JSON grammar.
Which must be encoded as Unicode. Again, Unicode IS NOT, and MUST NOT be treated as, a stream of Unicode code points. That's not Unicode. I freely admit that this is a belief that I once had. However, after a few years of dealing with low-level Unicode string processing (where Unicode means "The Unicode Standard"), I no longer hold this view. It's much more complicated and much more nuanced than people realize.
> If the sender wants to send characters that some consortium considers indecent, and if the receiver wants to receive them, then that is their business.
I don't have a problem with this. I would have a problem with such a setup claiming "strictly RFC 4627 conforming" (or some language implying 4627 conformance).
My specific point is this: I strongly believe that RFC 4627 requires Unicode and, by implication, processing said Unicode in a Unicode Standard conforming way. Therefore, in order to claim "RFC 4627 conformance", one must also process and handle the JSON in a way that is "Unicode Standard conforming".
You don't HAVE to do this, obviously, but then you can no longer claim RFC 4627 conformance.
If I may make a suggestion: perhaps an informal "JSON Best Practices" document could be started that catalogs and records these types of things. The document would be totally non-normative, but would be a fantastic resource for those who need to implement JSON parsers and generators. It would also help implementations converge on behavior that lets them interoperate more reliably. Since it would be non-normative, it wouldn't have any "requirements" weight to it, but I can tell you such a document would have been a big help to me.
--- In firstname.lastname@example.org, Dave Gamble <davegamble@...> wrote:
> To save people looking it up:
> ECMA-262, section 7.6:
> Two IdentifierName that are canonically equivalent according to the
> Unicode standard are not equal unless they are represented by the
> exact same sequence of code units (in other words, conforming
> ECMAScript implementations are only required to do bitwise comparison
> on IdentifierName values). The intent is that the incoming source text
> has been converted to normalised form C before it reaches the
> compiler.
> ECMAScript implementations may recognize identifier characters defined
> in later editions of the Unicode Standard. If portability is a
> concern, programmers should only employ identifier characters defined
> in Unicode 3.0.
There is another relevant section (ECMA-262, 8.4 The String Type, pg 28):
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
NOTE The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.
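For anyone who hasn't run into this before, a small Python sketch of what the two quoted sections are talking about. It uses the standard unicodedata module; the helper utf16_code_units is a name I made up for illustration.

```python
import unicodedata

# Two canonically equivalent spellings of "é".
precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # "e" + COMBINING ACUTE ACCENT

# A bitwise (code unit by code unit) comparison says they differ.
bitwise_equal = precomposed == decomposed

# After conversion to Normalisation Form C they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed


def utf16_code_units(s: str) -> int:
    # Number of 16-bit code units the string occupies in UTF-16.
    return len(s.encode("utf-16-le")) // 2


# U+1D11E MUSICAL SYMBOL G CLEF is one code point but two UTF-16
# code units (a surrogate pair), so "length" depends on what you count.
clef = "\U0001D11E"
clef_units = utf16_code_units(clef)

print(bitwise_equal, nfc_equal, len(clef), clef_units)
```

So a comparison that only looks at code units gives a different answer from one that respects canonical equivalence, which is exactly why the standard says it expects text to be normalised before it arrives.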
> I think it's fairly clear that a JSON parser has ABSOLUTELY NO
> BUSINESS poking around with actual data strings; Douglas has been very
> clear that you are to pass them bit-identical to the recipient. On the
> other hand, there's an argument for some kind of sanitation when it
> comes to object member names.
> I'm really tempted by the idea of a JSON-secure spec, which clamps
> down on these details.
I disagree with your first statement. The ECMA-262 standard, at least in my opinion, tries to sidestep a lot of these issues. It makes a fairly clear distinction between "what happens inside the ECMA-262 environment (which it obviously has near total control over)" and "what happens outside the ECMA-262 environment".
IMHO, the ECMA-262 standard advocates that "stuff that happens outside the ECMA-262 environment should be treated as if it is NFC".
Since the sine qua non of JSON is the interchange of information between different environments and implementations, it must address any issues that can and will cause difficulties. Like it or not, the fact that it's Unicode means these things can and will happen, and it's simply not practical to expect or insist that every implementation treat JSON Strings as "just a simple array of Unicode Code Points".
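Here is the kind of difficulty I have in mind, sketched in Python with the standard json module. The two member names below are canonically equivalent but are different code point sequences. The nfc_keys helper is purely hypothetical, one possible policy and not something RFC 4627 specifies.

```python
import json
import unicodedata

# "café" spelled two ways: precomposed U+00E9, and "e" + U+0301.
doc = '{"caf\\u00e9": 1, "cafe\\u0301": 2}'

# A parser that treats strings as plain code point arrays sees
# two distinct member names.
obj = json.loads(doc)
distinct_names = len(obj)


def nfc_keys(parsed: dict) -> dict:
    # Hypothetical policy: normalise member names to NFC and refuse
    # objects whose names collide after normalisation.
    out = {}
    for name, value in parsed.items():
        normalised = unicodedata.normalize("NFC", name)
        if normalised in out:
            raise ValueError("canonically equivalent member names: %r" % name)
        out[normalised] = value
    return out


try:
    nfc_keys(obj)
    collision_detected = False
except ValueError:
    collision_detected = True

print(distinct_names, collision_detected)
```

Two implementations that make different choices here will disagree about how many members that object has, and nothing in the JSON text itself tells them which one is right.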
> Arguing the Unicode details is decidedly NOT compatible with the
> "spirit" of JSON, which Douglas has been very clear about; a
> lightweight, simple, modern data representation.
I completely agree that these details are NOT compatible with the "spirit" of JSON.
But... so what? Unicode is not simple. I'm not the one who made it that way, but the way that RFC 4627 is written, you must deal with it. There are ways RFC 4627 could have been written such that the JSON to be parsed is considered a stream of 8-bit bytes, and therefore stripped of its Unicode semantics (if any). However, it very clearly and plainly says "JSON text SHALL be encoded in Unicode.", which pretty much kills the idea that you can just treat it as raw bytes.
There's a saying about formalized standards: The standard is right. Even its mistakes.
As an aside, there is an RFC for "Unicode Format for Network Interchange", RFC 5198 (http://tools.ietf.org/html/rfc5198). It is 18 pages long. RFC 4627 is just 9 pages.
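As an illustration that RFC 4627 really does think in terms of Unicode encodings rather than raw bytes: section 3 describes telling the encoding apart by the pattern of zero octets in the first four octets, since the first two characters of a JSON text are always ASCII. A rough Python sketch of that rule follows; the function name detect_json_encoding is my own, and treating inputs shorter than four octets as UTF-8 is my assumption.

```python
import json


def detect_json_encoding(data: bytes) -> str:
    # Look at which of the first four octets are zero (RFC 4627, section 3).
    head = data[:4]
    if len(head) < 4:
        return "utf-8"          # assumption: too short for the patterns below
    nulls = tuple(b == 0 for b in head)
    if nulls == (True, True, True, False):
        return "utf-32-be"      # 00 00 00 xx
    if nulls == (True, False, True, False):
        return "utf-16-be"      # 00 xx 00 xx
    if nulls == (False, True, True, True):
        return "utf-32-le"      # xx 00 00 00
    if nulls == (False, True, False, True):
        return "utf-16-le"      # xx 00 xx 00
    return "utf-8"              # xx xx xx xx


def parse_json_bytes(data: bytes):
    # Decode using the detected encoding, then parse the resulting text.
    return json.loads(data.decode(detect_json_encoding(data)))


text = '{"a": 1}'
detected = {enc: detect_json_encoding(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-16-be",
                        "utf-32-le", "utf-32-be")}
print(detected)
```

Once the bytes have been decoded this way you are holding Unicode text, and everything the Unicode Standard says about that text applies.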
Actually, I would encourage people to read RFC 5198. I'm not sure I agree with all of it, but it goes over a lot of the issues I think are very relevant to this conversation. It's great background info if you're not familiar with the details.