Re: JSON and the Unicode Standard

  • johne_ganz
    Message 1 of 35, Feb 25, 2011
      --- In json@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
      > --- In json@yahoogroups.com, "johne_ganz" <john.engelhart@> wrote:
      > >
      > > --- In json@yahoogroups.com, "Douglas Crockford" <douglas@> wrote:
      > > >
      > > > A receiver can do what it chooses to with the character codes it receives. If it wants to delete them or reject them, that is its business.
      > >
      > > Not if it's Unicode. It is a common misconception that "Unicode" is just a set of code points, like say ASCII or EBCDIC.
      > For JSON's purpose, Unicode is just a set of code points.

      Not according to RFC 4627 it isn't. Section 3, Encoding, "JSON text SHALL be encoded in Unicode.", where SHALL is interpreted via RFC 2119 (i.e., SHALL is synonymous with MUST).

      I appreciate that your interpretation may have been your original intent, but the scope of the language in the standard is far, far greater than "JSON text SHALL be interpreted as a stream of disjoint Unicode code points.", which is what you are arguing that the standard means.

      Unless you can make a compelling argument with language from the RFC 4627 standard, the standard clearly and plainly says that the JSON text is encoded in Unicode. This means that the text must conform to the Unicode Standard, and its rules for processing and handling text MUST (via the use of SHALL in RFC 4627) be followed.

      > By receiver I mean the program that ultimately receives the message. It can interpret it and process it or damage it or ignore it as it will. What it does with the data is none of my business. The JSON channel itself must do none of those things.

      Surely you realize that in practice, this is not the way that things are done. All of the JSON libraries are effectively "part of the JSON channel".

      There is a clear demarcation point where a piece of text has ceased to be JSON and has (usually) become an instantiated data structure in the host language.

      What the "language" does with the data, and how, is not relevant to RFC 4627. The "language" may examine the JSON data and its keys and manipulate them in any way it chooses. But at this point, the data has very clearly ceased to be "JSON".

      Every JSON implementation I'm aware of that takes the form of a library for a host language could be interpreted to be "the program that ultimately receives the message". The libraries parse the JSON and transliterate it into a form usable by the host language. What the host language, or a program written by someone to enumerate or manipulate the data structure instantiated from the original JSON, does with that data is obviously outside the scope of RFC 4627.

      My pedantic point is: A JSON implementation, in the form of a library that provides bindings between a host language and JSON (of which there are many), MUST NOT arbitrarily delete characters in the original JSON. Furthermore, any such implementation MUST interpret the original JSON text in accordance with the Unicode Standard. Just as RFC 4627 gives a grammar and rules for how to interpret JSON, the Unicode Standard has rules for how to interpret text encoded as Unicode. Unicode is not just a simple set of code points.

      Another issue is normalization. In particular, the way normalization is handled for the "key" portion of an "object" (i.e., {"key": "value"}) can dramatically alter the meaning and contents of the object. For example:

      "\u212b": "one",
      "\u0041\u030a": "two",
      "\u00c5": "three"

      Are these three keys distinct? Should there be a requirement that they MUST be handled and interpreted such that they are distinct? Does that requirement extend past the "channel" demarcation point (i.e., not a JSON library or communication channel used to interchange the JSON between two hosts) to the "host language"?

      In case it is not obvious, under the rules of Unicode NFC (Normalization Form C), all three of the keys above will become "\u00c5" after NFC processing.
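The NFC claim above can be checked directly. This is a minimal sketch, assuming Python and its standard `unicodedata` module; neither appears in the thread, and the variable names are only illustrative.

```python
import unicodedata

# The three keys from the example above, written as Python string escapes.
keys = ["\u212b", "\u0041\u030a", "\u00c5"]

# Normalize each key to Unicode Normalization Form C.
nfc_keys = [unicodedata.normalize("NFC", k) for k in keys]

# Compared code point by code point, the three keys are all different.
distinct_before = len(set(keys))

# After NFC, every key is the same one-code-point string U+00C5.
distinct_after = len(set(nfc_keys))

print(distinct_before, distinct_after)
```

A parser that compares keys as raw code point sequences sees three members; anything that normalizes to NFC first sees one.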

      A first-order approximation would seem to suggest that a JSON implementation "should" use the precomposed form for keys, and that when an object contains non-precomposed keys which, once converted to their precomposed form, duplicate other keys, the behavior is undefined.
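One way an implementation could surface such collisions, rather than leave them undefined, is to check each object's keys as it is parsed. The sketch below assumes Python's standard `json` and `unicodedata` modules; the function name `loads_rejecting_nfc_collisions` is hypothetical, and rejecting the input is just one possible policy, not something RFC 4627 prescribes.

```python
import json
import unicodedata


def loads_rejecting_nfc_collisions(text):
    """Parse JSON, raising ValueError if two keys of one object match after NFC."""

    def check_pairs(pairs):
        seen = {}  # NFC form -> key exactly as it appeared in the JSON text
        for key, _value in pairs:
            nfc = unicodedata.normalize("NFC", key)
            if nfc in seen:
                raise ValueError(
                    f"keys {seen[nfc]!r} and {key!r} collide after NFC"
                )
            seen[nfc] = key
        # Keys are returned exactly as received; nothing is rewritten.
        return dict(pairs)

    # object_pairs_hook is called once per JSON object with its (key, value) pairs.
    return json.loads(text, object_pairs_hook=check_pairs)


doc = '{"\\u212b": "one", "\\u0041\\u030a": "two", "\\u00c5": "three"}'
try:
    loads_rejecting_nfc_collisions(doc)
    collided = False
except ValueError:
    collided = True
```

Note that the check only reports the collision; it leaves the strings themselves untouched, so objects without colliding keys pass through with their keys unchanged.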

      Again, this is another point where the use of Unicode introduces an awful lot of non-obvious dependencies. The Unicode Standard has a lot to say about what it means for two strings to "compare equal", and since JSON specifies what is essentially a key/value hash table, it is critically important to define what "equal" means for a key. If the keys were ASCII or binary, this would probably be a non-issue, but it's a pretty big one when you're dealing with Unicode.

      > Tell you what. If you ever encounter a real problem, we will deal with that.

      This is a rather snarky comment, and to be blunt, unprofessional and unfair.

      Every point I've raised here is something that an implementor of a JSON library will likely encounter. As an implementor of such a library (for Objective-C), everything I've raised here is something that took an enormous amount of time and consideration.

      In my case, I've had to deal with the subtle nuances of what happens to a Unicode string when I parse it and then hand that parsed string off to another library to instantiate a string object. I have no control over how this external library (a combination of Foundation and Core Foundation) deals with or interprets various aspects of the Unicode Standard. For the sake of argument, if this external library automatically precomposes all strings it instantiates, and I have to use those instantiated strings as the keys in an NSDictionary (the equivalent of a JSON object), I've got some problems.
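The "for the sake of argument" scenario can be simulated in a few lines. This is a sketch in Python, not Objective-C, and `library_make_string` is a hypothetical stand-in for an external string constructor that precomposes its input; it makes no claim about how Foundation or Core Foundation actually behave.

```python
import unicodedata


def library_make_string(s):
    # Hypothetical external constructor that NFC-normalizes every string.
    return unicodedata.normalize("NFC", s)


# Key/value pairs as a parser might hand them over, in document order.
parsed_pairs = [("\u212b", "one"), ("\u0041\u030a", "two"), ("\u00c5", "three")]

table = {}
for key, value in parsed_pairs:
    # Each insert silently overwrites the previous one, because all three
    # keys become the same string once the library has normalized them.
    table[library_make_string(key)] = value

print(len(table), table["\u00c5"])
```

Three members went in and one came out, with no error raised anywhere. Which value survives depends only on insertion order.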

      Your snarky comment ignores the real-world complexities that one faces when attempting to create an "RFC 4627 compliant" JSON implementation, at least if one is trying to do so "the right way" as opposed to writing a quick-hack JSON implementation.

      For someone who is creating a JSON library or some other form of a JSON implementation, the corner cases are usually far more important than the obvious, common case.
    • johne_ganz
      Message 35 of 35, Mar 3, 2011
        --- In json@yahoogroups.com, Dave Gamble <davegamble@...> wrote:
        > To save people looking it up:
        > ECMA-262, section 7.6:
        > Two IdentifierName that are canonically equivalent according to the
        > Unicode standard are not equal unless they are represented by the
        > exact same sequence of code units (in other words, conforming
        > ECMAScript implementations are only required to do bitwise comparison
        > on IdentifierName values). The intent is that the incoming source text
        > has been converted to normalised form C before it reaches the
        > compiler.
        > ECMAScript implementations may recognize identifier characters defined
        > in later editions of the Unicode Standard. If portability is a
        > concern, programmers should only employ identifier characters defined
        > in Unicode 3.0.

        There is another relevant section (ECMA-262, 8.4 The String Type, pg 28)

        When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

        NOTE The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.
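To make the quoted text concrete, the sketch below shows what "sequences of undifferentiated 16-bit unsigned integers" looks like for two strings. It assumes Python and its standard `json` module; `utf16_code_units` is a hypothetical helper, and Python's own strings are indexed by code point rather than by UTF-16 code unit, which is why the helper is needed at all.

```python
import json


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+1D11E (musical G clef) lies outside the Basic Multilingual Plane, so JSON
# escapes it as a surrogate pair. Python's parser joins the pair into one
# code point.
clef = json.loads('"\\ud834\\udd1e"')
clef_units = utf16_code_units(clef)

# The decomposed and precomposed spellings of A-with-ring are different
# code unit sequences; nothing here normalizes them.
decomposed_units = utf16_code_units("\u0041\u030a")
precomposed_units = utf16_code_units("\u00c5")

print(len(clef), len(clef_units), decomposed_units == precomposed_units)
```

An environment that numbers characters by UTF-16 code unit sees the clef as two elements, and a bitwise comparison treats the two spellings of A-with-ring as unequal.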

        > I think it's fairly clear that a JSON parser has ABSOLUTELY NO
        > BUSINESS poking around with actual data strings; Douglas has been very
        > clear that you are to pass them bit-identical to the recipient. On the
        > other hand, there's an argument for some kind of sanitation when it
        > comes to object member names.
        > I'm really tempted by the idea of a JSON-secure spec, which clamps
        > down on these details.

        I disagree with your first statement. The ECMA-262 standard, at least in my opinion, tries to sidestep a lot of these issues. It makes a fairly clear distinction between "what happens inside the ECMA-262 environment (which it obviously has near total control over)" and "what happens outside the ECMA-262 environment".

        IMHO, the ECMA-262 standard advocates that "stuff that happens outside the ECMA-262 environment should be treated as if it is NFC".

        Since the sine qua non of JSON is the interchange of information between different environments and implementations, it must address any issues that can and will cause difficulties. Like it or not, the fact that it's Unicode means these things can and will happen, and it's simply not practical to expect or insist that every implementation treat JSON Strings as "just a simple array of Unicode Code Points".

        > Arguing the Unicode details is decidedly NOT compatible with the
        > "spirit" of JSON, which Douglas has been very clear about; a
        > lightweight, simple, modern data representation.

        I completely agree that these details are NOT compatible with the "spirit" of JSON.

        But... so what? Unicode is not simple. I'm not the one who made it that way, but the way that RFC 4627 is written, you must deal with it. There are ways RFC 4627 could have been written such that the JSON to be parsed is considered a stream of 8-bit bytes, and therefore stripped of its Unicode semantics (if any). However, it very clearly and plainly says "JSON text SHALL be encoded in Unicode.", which pretty much kills the idea that you can just treat it as raw bytes.

        There's a saying about formalized standards: The standard is right. Even its mistakes.

        As an aside, there is an RFC for "Unicode Format for Network Interchange", RFC 5198 (http://tools.ietf.org/html/rfc5198). It is 18 pages long. RFC 4627 is just 9 pages.

        Actually, I would encourage people to read RFC 5198. I'm not sure I agree with all of it, but it goes over a lot of the issues I think are very relevant to this conversation. It's great background info if you're not familiar with the details.