
Re: JSON and the Unicode Standard

  • johne_ganz
    Message 1 of 35, Feb 25, 2011
      --- In json@yahoogroups.com, David Graham <david.malcom.graham@...> wrote:
      >
      > 3. Ruby and Java consider combined characters to be unequal to their single
      > codepoint counterparts. The é character, for example, can be a 2 byte
      > single codepoint form of \u00e9 or a 3 byte two codepoint form of
      > \u0065\u0301.
      >
      >
      > In Ruby, "\u00e9" == "\u0065\u0301" => false.
      >
      >
      > So, given a Ruby Hash (or Java Map) like this:
      >
      >
      > {"\u00e9" => 1, "\u0065\u0301" => 2}
      >
      > => {"é"=>1, "é"=>2}
      >
      >
      > A JSON serializer that performed Unicode normalization on this Hash object
      > would corrupt the data in some way. The two keys would become equal, so
      > which value gets serialized: 1 or 2?

      This is my point. It can happen on the serialization side, but (at least in my opinion) it is much more likely to happen on the deserialization side.

      What happens to a JSON deserializer that relies on external libraries (say, ICU), or, in object-oriented programming, a "string" class to handle all this, or whose string handling is in some other way completely out of the control of the person writing the JSON deserializer?

      How many people do you think actually checked to make sure that these external code dependencies guarantee that they will not mutilate or otherwise transform the original string into some Unicode-equivalent form?

      From the Unicode Standard, Section 2.12, Equivalent Sequences and Normalization-

      If an application or user attempts to distinguish between canonically equivalent sequences, as shown in the first example in Figure 2-23, there is no guarantee that other applications would recognize the same distinctions. To prevent the introduction of interoperability problems between applications, such distinctions must be avoided wherever possible. Making distinctions between compatibly equivalent sequences is less problematical. However, in restricted contexts, such as the use of identifiers, avoiding compatibly equivalent sequences reduces possible security issues. See Unicode Technical Report #36, "Unicode Security Considerations."

      In other words, the Unicode Standard says that the behavior you are observing is not guaranteed. This means there is a very real possibility that a JSON implementation that depends on external code to handle strings (e.g., ICU, or a "string" object in object-oriented languages) cannot reasonably ensure that said code does not convert the string argument into a form the Unicode Standard considers equivalent.

      The practical implication is that the behavior you are seeing is contrary to the requirements and expectations set forth in the Unicode Standard. It seems reasonable to assume that the Unicode-conformant external libraries a JSON implementation uses are under no obligation whatsoever to treat a Unicode string in the way you have described.
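
      To make the failure mode concrete, here is a minimal sketch (Python, using only the standard unicodedata module; the store_member helper is hypothetical, not taken from any real JSON library) of what can happen when a deserializer, or the string code it depends on, normalizes object member names to NFC:

        import unicodedata

        # Two canonically equivalent spellings of "é": a single code point
        # (U+00E9) versus a base letter plus a combining acute (U+0065 U+0301).
        precomposed = "\u00e9"
        decomposed = "\u0065\u0301"
        print(precomposed == decomposed)      # False: compared code point by code point

        # A hypothetical deserializer step that normalizes member names before storing them.
        def store_member(obj, name, value):
            obj[unicodedata.normalize("NFC", name)] = value

        obj = {}
        store_member(obj, precomposed, 1)
        store_member(obj, decomposed, 2)
        print(obj)                            # {'é': 2} -- one key left; the value 1 is silently gone

      Neither the parser nor its caller ever asked for normalization; it only takes one layer in the chain deciding that canonically equivalent strings are interchangeable for the data to change.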

      > In my opinion, this means JSON parsers and generators must not perform
      > normalization. They must respect the data stored in the JSON byte stream as
      > is.

      It's trivial for a parser to respect the data stored in the JSON byte stream.

      While I'm sure there are exceptions to this, I'd be willing to bet that the majority of JSON parsers hand the parsed string off to some "create a string" piece of code. It seems reasonable to assume that this "create a string" piece of code is Unicode-aware. The two codebases are probably disjoint, with the string-handling code focused on Unicode Standard conformance, and that conformance does not require it to "respect [the original string]".
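
      For example (a hedged sketch; the InternedString class below is hypothetical and stands in for whatever "create a string" layer a parser delegates to), nothing stops such a layer from storing a canonically equivalent form of what it was given:

        import unicodedata

        # Hypothetical string type a parser might hand its parsed text to.
        # The Unicode Standard does not forbid it from normalizing on construction.
        class InternedString:
            def __init__(self, text):
                self.text = unicodedata.normalize("NFC", text)

        s = InternedString("\u0065\u0301")    # the parser faithfully decoded e + combining acute
        print(s.text == "\u00e9")             # True: the stored value is the precomposed form

      The parser did everything right at the byte level; the transformation happened one layer down, where the parser has no say.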

      In fact, for my parser (JSONKit), which is Objective-C based and uses NSString to represent the JSON String objects, it is not practical for me to create a JSON parser that "respects the data stored in the JSON byte stream". The NSString class makes no such guarantees in its documentation, nor does the Unicode Standard. It would be extremely non-trivial for me to meet a "respects the data stored in the JSON byte stream" requirement, at least in the sense that the behavior is deterministic.
    • johne_ganz
      Message 35 of 35, Mar 3, 2011
        --- In json@yahoogroups.com, Dave Gamble <davegamble@...> wrote:
        >
        > To save people looking it up:
        >
        > ECMA-262, section 7.6:
        >
        > Two IdentifierName that are canonically equivalent according to the
        > Unicode standard are not equal unless they are represented by the
        > exact same sequence of code units (in other words, conforming
        > ECMAScript implementations are only required to do bitwise comparison
        > on IdentifierName values). The intent is that the incoming source text
        > has been converted to normalised form C before it reaches the
        > compiler.
        >
        > ECMAScript implementations may recognize identifier characters defined
        > in later editions of the Unicode Standard. If portability is a
        > concern, programmers should only employ identifier characters defined
        > in Unicode 3.0.

        There is another relevant section (ECMA-262, 8.4 The String Type, pg 28)

        When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

        NOTE The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.
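
        The behavior described here is easy to observe outside ECMAScript as well. As a rough analogue (Python's str is a sequence of code points rather than UTF-16 code units, but the point carries over), string operations compare and measure the stored sequence as-is and never normalize behind your back:

          import unicodedata

          s1 = "\u00e9"            # precomposed é: one element
          s2 = "\u0065\u0301"      # e + combining acute: two elements

          print(len(s1), len(s2))                          # 1 2   -- lengths reflect the stored sequence
          print(s1 == s2)                                  # False -- element-wise comparison, no equivalence
          print(unicodedata.normalize("NFC", s2) == s1)    # True  -- equal only after explicit normalization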

        > I think it's fairly clear that a JSON parser has ABSOLUTELY NO
        > BUSINESS poking around with actual data strings; Douglas has been very
        > clear that you are to pass them bit-identical to the recipient. On the
        > other hand, there's an argument for some kind of sanitation when it
        > comes to object member names.
        > I'm really tempted by the idea of a JSON-secure spec, which clamps
        > down on these details.

        I disagree with your first statement. The ECMA-262 standard, at least in my opinion, tries to sidestep a lot of these issues. It makes a fairly clear distinction between "what happens inside the ECMA-262 environment (which it obviously has near total control over)" and "what happens outside the ECMA-262 environment".

        IMHO, the ECMA-262 standard advocates that "stuff that happens outside the ECMA-262 environment should be treated as if it is NFC".

        Since the sine qua non of JSON is the interchange of information between different environments and implementations, it must address any issues that can and will cause difficulties. Like it or not, the fact that it's Unicode means these things can and will happen, and it's simply not practical to expect or insist that every implementation treat JSON Strings as "just a simple array of Unicode Code Points".

        > Arguing the Unicode details is decidedly NOT compatible with the
        > "spirit" of JSON, which Douglas has been very clear about; a
        > lightweight, simple, modern data representation.

        I completely agree that these details are NOT compatible with the "spirit" of JSON.

        But... so what? Unicode is not simple. I'm not the one who made it that way, but the way RFC 4627 is written, you must deal with it. There are ways RFC 4627 could have been written such that the JSON to be parsed is considered a stream of 8-bit bytes, and therefore stripped of its Unicode semantics (if any). However, it very clearly and plainly says "JSON text SHALL be encoded in Unicode.", which pretty much kills the idea that you can just treat it as raw bytes.
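
        For what it's worth, RFC 4627 section 3 even spells out how a parser is expected to discover which Unicode encoding form it has been handed: since the first two characters of a JSON text are always ASCII, the pattern of zero octets in the first four bytes identifies UTF-8, UTF-16 (BE/LE), or UTF-32 (BE/LE). A rough sketch of that logic (the function name here is mine, not from the RFC):

          def detect_json_encoding(octets):
              # Per RFC 4627 section 3: 00 00 00 xx is UTF-32BE, 00 xx 00 xx is UTF-16BE,
              # xx 00 00 00 is UTF-32LE, xx 00 xx 00 is UTF-16LE, anything else is UTF-8.
              b = octets[:4]
              if len(b) == 4:
                  if b[0] == 0 and b[1] == 0 and b[2] == 0:
                      return "utf-32-be"
                  if b[0] == 0 and b[2] == 0:
                      return "utf-16-be"
                  if b[1] == 0 and b[2] == 0 and b[3] == 0:
                      return "utf-32-le"
                  if b[1] == 0 and b[3] == 0:
                      return "utf-16-le"
              return "utf-8"

          print(detect_json_encoding(b'{"a": 1}'))                     # utf-8
          print(detect_json_encoding('{"a": 1}'.encode("utf-16-le")))  # utf-16-le

        In other words, the very first thing a conforming parser does is interpret the octets as Unicode; the "raw byte stream" view never really exists.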

        There's a saying about formalized standards: the standard is right, even its mistakes.

        As an aside, there is an RFC for "Unicode Format for Network Interchange", RFC 5198 (http://tools.ietf.org/html/rfc5198). It is 18 pages long. RFC 4627 is just 9 pages.

        Actually, I would encourage people to read RFC 5198. I'm not sure I agree with all of it, but it goes over a lot of the issues I think are very relevant to this conversation. It's great background info if you're not familiar with the details.