Re: JSON and the Unicode Standard

  • johne_ganz
    Message 1 of 35 , Feb 25, 2011
      --- In json@yahoogroups.com, Tatu Saloranta <tsaloranta@...> wrote:
      > On Fri, Feb 25, 2011 at 3:09 PM, johne_ganz <john.engelhart@...> wrote:
      > ...
      > > Unicode is not just a simple set of code points.
      > This is a true statement, although the more practical question seems to
      > be what is the practical relationship of JSON with Unicode
      > specification.

      True. It would seem, at least to me, that this is one of those nuanced points that either

      a) Has not been given the proper consideration by Unicode (ostensibly) experts.

      b) The Unicode standard has evolved in such a way since the publication of RFC 4627 that it may require revisiting the issue.

      > I think your suggestions for clarifying some parts do make sense,
      > although it may be hard to reconcile basic differences between full
      > Unicode support, and goals of simplicity for JSON.

      I'm all for simplicity, and for a "less is more" philosophy. Unfortunately, RFC 4627 allows for two strictly RFC 4627 compliant implementations to "generate" wildly different results (where "generate" here means the JSON is parsed and interpreted in such a way that the two implementations have what reasonable people would consider "very different semantics").

      Numbers are another corner case. RFC 4627 only describes how to parse a decimal representation of numbers (both integer and floating-point). In practice this means that a strictly conforming RFC 4627 JSON implementation can use an 8, 16, 32, or 64 bit "native primitive" to represent integers. It's perfectly valid JSON to have integer numbers that require 128 or 256 bits in order to represent them. To me, a serialization format such as JSON should make some effort to ensure that the values contained within it will be properly interpreted by any and all JSON implementations. I've seen several JSON implementations that use a 32-bit C99 primitive type to represent the parsed numbers. This is a problem for anything that wants to parse contemporary Twitter JSON, as the IDs are > 2^32 at this point. The desire to "keep it simple" has to be balanced against real-world practical needs: when your IDs exceed 2^32, it is a legitimate question to ask "Are JSON implementations going to handle this value correctly?" If not 2^32, then when?
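To illustrate the integer-width concern, here is a minimal sketch in Python. The ID value and the helper names `as_uint32` and `as_double` are made up for illustration; they simulate what a parser backed by a 32-bit unsigned integer or by an IEEE 754 double might do, and do not describe any particular JSON library.

```python
import json

# A made-up ID just past the 32-bit range (2**32 + 1).
doc = '{"id": 4294967297}'

# Python's json module parses integers into arbitrary-precision ints,
# so the value survives intact here.
exact = json.loads(doc)["id"]


def as_uint32(n):
    """Simulate storing n in a 32-bit unsigned C integer (wraps modulo 2**32)."""
    return n & 0xFFFFFFFF


def as_double(n):
    """Simulate a parser that stores every number as an IEEE 754 double."""
    return int(float(n))


print(exact)             # 4294967297
print(as_uint32(exact))  # 1

# Doubles represent every integer exactly only up to 2**53.
big = 2**53 + 1
print(as_double(big) == big)  # False
```

Both simulated parsers accept the same valid JSON text, yet neither one hands the application the value that was written.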

      > >
      > > Another issue is normalization.  In particular, the way normalization is handled for the "key" portion of an "object" (i.e., {"key": "value"}) can dramatically alter the meaning and contents of the object.  For example:
      > >
      > > {
      > > "\u212b": "one",
      > > "\u0041\u030a": "two",
      > > "\u00c5": "three"
      > > }
      > >
      > > Are these three keys distinct?  Should there be a requirement that they MUST be handled and interpreted such that they are distinct?  Does that requirement extend past the "channel" demarcation point (i.e., not a JSON library or communication channel used to interchange the JSON between two hosts) to the "host language"?
      > >
      > > In case it is not obvious, under the rules of Unicode NFC (Normalization Form C), all three of the keys above will become "\u00c5" after NFC processing.
      > For what it is worth, I have not seen a single JSON parser that would
      > do such normalization; and the only XML parser I recall even trying
      > proper Unicode code point normalization was XOM. This is not an
      > argument against proper handling, but rather an observation regarding
      > how much of a practical issue it seems to be.

      I have not seen a JSON implementation / parser that does such normalization.

      On the other hand, I very strongly suspect that whether or not such normalization is taking place is not up to the writer of that parser. In my particular case (JSONKit, for Objective-C), I pass the parsed JSON String to the NSString class to instantiate an object.

      I have ZERO control over what and how NSString interprets or manipulates the parsed JSON String that finally becomes the instantiated object that is ostensibly the same as the original JSON String used to create it. It could be that NSString decides that the instantiated object is always converted to its precomposed form. Objective-C is flexible enough that someone might decide to swizzle in some logic at run time that forces all strings to be precomposed before being handed off to the main NSString instantiation method.
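As a rough illustration of what that would mean for the three keys quoted earlier in this message, here is a sketch in Python rather than Objective-C. The loop that normalizes every key stands in for a hypothetical host string layer that precomposes strings; it is an assumption for the example, not a description of how NSString behaves.

```python
import json
import unicodedata

# The three keys from the example object, written with JSON \u escapes.
doc = r'{"\u212b": "one", "\u0041\u030a": "two", "\u00c5": "three"}'

# Python's json module does no normalization, so the keys stay distinct.
parsed = json.loads(doc)
print(len(parsed))  # 3

# Hypothetical host layer that converts every string to NFC before storing it.
normalized = {}
for key, value in parsed.items():
    normalized[unicodedata.normalize("NFC", key)] = value

print(len(normalized))       # 1
print(normalized["\u00c5"])  # three
```

Two of the three values are silently lost, and which one survives depends only on the order in which the keys happened to be processed.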

      > Nor have I seen feature requests to support normalization (XOM
      > implements it because its author is very ambitious wrt supporting
      > standards, it is very respectable achievement), during time I have
      > spend maintaining XML and JSON parser/generator implementations.
      > Do others have difference experiences?

      I don't have a particular opinion on the matter one way or the other, except to highlight the point that in many practical, real-world situations, whether or not such things take place may not be under the control of the JSON parser.

      I also suspect that it's one of those things that most people haven't really given a whole lot of consideration to: they just hand the parsed string over to "the Unicode string handling code", and that's that. Most people may not realize that such string handling code may subtly alter the original Unicode text as a result (à la precomposing the string).

      > So to me it seems that most likely high-level clarifications regarding
      > normalization aspects would be:
      > (a) Whether to do normalization or not is up to implementation
      > (normalization is left out of scope, on purpose), or
      > (b) Say that with JSON no normalization would be done (which would be
      > more at odds with unicode spec)
      > Why? Just because I see very little chance of anything more ambitious
      > having effect on implementations (beyond small number that are willing
      > to tackle such complexity). While it would seem wrong to punt the
      > issue, there is the practical question of whether full solution would
      > matter.

      I can guarantee you that the practical question of whether a full solution would matter will be answered the first time someone exploits it in a security vulnerable way that results in a major security fiasco.

      Then it will be with 20/20 hindsight, and the question will be "Why didn't anyone address (this behavior) that allowed two keys that were not bit for bit identical, but became identical after converting them to their precomposed form, and the security checks allowed the decomposed form through because it assumed that everything was in precomposed form?"
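The failure mode described above can be sketched in a few lines of Python. Everything here is hypothetical: the reserved key, the `BLOCKED` set, and the functions `naive_check`, `downstream_store`, and `safer_check` are invented for the example and do not come from any real JSON library.

```python
import unicodedata

# Hypothetical reserved key, stored in precomposed (NFC) form.
BLOCKED = {"\u00c5dmin"}


def naive_check(key):
    """Code-point comparison that assumes the input is already precomposed."""
    return key not in BLOCKED


def downstream_store(key):
    """Hypothetical later layer that normalizes to NFC before using the key."""
    return unicodedata.normalize("NFC", key)


def safer_check(key):
    """Normalize first, then compare, so both forms are treated alike."""
    return unicodedata.normalize("NFC", key) not in BLOCKED


# Same text as the blocked key, but sent in decomposed form.
decomposed = "A\u030admin"

print(naive_check(decomposed))                  # True: the check lets it through
print(downstream_store(decomposed) in BLOCKED)  # True: it becomes the blocked key
print(safer_check(decomposed))                  # False: rejected
```

The check and the consumer disagree about when two strings are "the same", and that disagreement is the entire vulnerability.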

      Unfortunately, the use of Unicode coupled with the fact that most JSON implementations are dependent on external code for their Unicode support means that this is an extremely non-trivial issue. I can't think of a simple solution to the problem at the moment; I can only point out that it exists.

      > My guess is that about the last thing implementers would want was a
      > mandate to support full Unicode 4.0 (and above) normalization rules.
      > It would just mean that there would be the specification in one
      > corner; and implementations, practically none of which would be
      > compliant.

      You really ought to read:



      Microsoft Security Bulletin (MS00-078): Patch Available for 'Web Server Folder Traversal' Vulnerability (http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx, http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2000-0884)

      Creating Arbitrary Shellcode In Unicode Expanded Strings (http://www.net-security.org/article.php?id=144)

      There's a long history of "Those little Unicode details aren't really important" causing huge security problems later on.
    • johne_ganz
      Message 35 of 35 , Mar 3, 2011
        --- In json@yahoogroups.com, Dave Gamble <davegamble@...> wrote:
        > To save people looking it up:
        > ECMA-262, section 7.6:
        > Two IdentifierName that are canonically equivalent according to the
        > Unicode standard are not equal unless they are represented by the
        > exact same sequence of code units (in other words, conforming
        > ECMAScript implementations are only required to do bitwise comparison
        > on IdentifierName values). The intent is that the incoming source text
        > has been converted to normalised form C before it reaches the
        > compiler.
        > ECMAScript implementations may recognize identifier characters defined
        > in later editions of the Unicode Standard. If portability is a
        > concern, programmers should only employ identifier characters defined
        > in Unicode 3.0.

        There is another relevant section (ECMA-262, 8.4 The String Type, pg 28)

        When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

        NOTE The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.
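The practical effect of comparing Strings as "sequences of undifferentiated 16-bit unsigned integers" can be sketched as follows. This uses Python as a stand-in, not an ECMAScript engine, and the helper `utf16_units` is a made-up name for the example.

```python
import unicodedata


def utf16_units(s):
    """Return the string as a list of 16-bit UTF-16 code units."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


precomposed = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

print(utf16_units(precomposed))  # [197]
print(utf16_units(decomposed))   # [65, 778]

# A code-unit comparison sees two different strings...
print(utf16_units(precomposed) == utf16_units(decomposed))  # False

# ...even though they are canonically equivalent under NFC.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# A character outside the BMP occupies two code units (a surrogate pair).
print([hex(u) for u in utf16_units("\U0001d11e")])  # ['0xd834', '0xdd1e']
```

This is why the NOTE leans on input being normalized before the program sees it: the String operations themselves will never reconcile the two forms.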

        > I think it's fairly clear that a JSON parser has ABSOLUTELY NO
        > BUSINESS poking around with actual data strings; Douglas has been very
        > clear that you are to pass them bit-identical to the recipient. On the
        > other hand, there's an argument for some kind of sanitation when it
        > comes to object member names.
        > I'm really tempted by the idea of a JSON-secure spec, which clamps
        > down on these details.

        I disagree with your first statement. The ECMA-262 standard, at least in my opinion, tries to sidestep a lot of these issues. It makes a fairly clear distinction between "what happens inside the ECMA-262 environment (which it obviously has near total control over)" and "what happens outside the ECMA-262 environment".

        IMHO, the ECMA-262 standard advocates that "stuff that happens outside the ECMA-262 environment should be treated as if it is NFC".

        Since the sine qua non of JSON is the interchange of information between different environments and implementations, it must address any issues that can and will cause difficulties. Like it or not, the fact that it's Unicode means these things can and will happen, and it's simply not practical to expect or insist that every implementation treat JSON Strings as "just a simple array of Unicode Code Points".

        > Arguing the Unicode details is decidedly NOT compatible with the
        > "spirit" of JSON, which Douglas has been very clear about; a
        > lightweight, simple, modern data representation.

        I completely agree that these details are NOT compatible with the "spirit" of JSON.

        But.... so what? Unicode is not simple. I'm not the one who made it that way, but the way that RFC 4627 is written, you must deal with it. There are ways RFC 4627 could have been written such that the JSON to be parsed is considered a stream of 8 bit bytes, and therefore stripped of its Unicode semantics (if any). However, it very clearly and plainly says "JSON text SHALL be encoded in Unicode.", which pretty much kills the idea that you can just treat it as raw bytes.

        There's a saying about formalized standards: The standard is right. Even its mistakes.

        As an aside, there is an RFC for "Unicode Format for Network Interchange", RFC 5198 (http://tools.ietf.org/html/rfc5198). It is 18 pages long. RFC 4627 is just 9 pages.

        Actually, I would encourage people to read RFC 5198. I'm not sure I agree with all of it, but it goes over a lot of the issues I think are very relevant to this conversation. It's great background info if you're not familiar with the details.