Re: JSON and the Unicode Standard

  • johne_ganz
    Message 1 of 35 , Mar 1, 2011
      --- In json@yahoogroups.com, Tatu Saloranta <tsaloranta@...> wrote:
      > On Fri, Feb 25, 2011 at 8:01 PM, johne_ganz <john.engelhart@...> wrote:
      > > --- In json@yahoogroups.com, Tatu Saloranta <tsaloranta@> wrote:
      > > my particular case (JSONKit, for Objective-C), I pass the parsed JSON String to the NSString class to instantiate an object.
      > >
      > > I have ZERO control over what and how NSString interprets or manipulates the parsed JSON String that finally becomes the instantiated object that is ostensibly the same as the original JSON String used to create it.  It could be that NSString decides that the instantiated object is
      > > always converted to its precomposed form.  Objective-C is flexible enough where someone might decide to swizzle in some logic at run time that forces all strings to be precomposed before being handed off to the main NSString instantiation method.
      > Ok. But in this case, would JSON specification itself help a lot? I
      > understand that this is problematic, in that different platforms can
      > choose different default (and possible opaque dealing).

      It is my opinion that the answer is "Yes". The standard must address some of the issues introduced by the use of Unicode (see below). Then there is the practical real world issue that many JSON implementations are going to use external code to manage the "Unicode part", and I think it's fair to say that that external code is going to be focused on Unicode Standard compliance rather than implementing semantics that are useful or even desired for RFC 4627 compliance.

      Please, don't get me wrong, I honestly wish that the whole thing could be treated as some sort of ideal "extended ASCII" that was for all practical purposes synonymous with "binary". This would be much, much simpler. But that's not Unicode.

      > ...
      > > I don't have a particular opinion on the matter one way or the other other than to highlight the point that in many practical, real-world situations, whether or not such things take place may not be under the control of the JSON parser.
      > > I also suspect that it's one of those things that most people haven't really given a whole lot of consideration to; they just hand the parsed string over to "the Unicode string handling code", and that's that. Most people may not realize that such string handling code may subtly alter the original Unicode text as a result (à la precomposing the string).
      > Right. And if specification says nothing, it can uncover real
      > complexities and ambiguities.

      Yes. The use of Unicode, and the language surrounding the issue of Unicode in RFC 4627 means there are some very real complexities and ambiguities. The particular example that comes to mind is

      "What does it mean for two keys (or names in RFC 4627 nomenclature) to compare equal?"

      For example:

      { // Example #1
      "Ä" : "launch nukes",
      "Ä" : "do not launch nukes"
      }

      Do these keys "compare equal"?

      { // Example #2
      "\u00C4" : "launch nukes",
      "A\u0308" : "do not launch nukes"
      }

      How about this?
      Is it "identical" to example #1?
      Do the keys in example #2 compare equal?
      Do the keys in example #2 compare equal to their respective keys in example #1?
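
The questions above can be made concrete with a minimal sketch. It assumes Python's standard `json` and `unicodedata` modules as a stand-in for "the Unicode string handling code"; nothing in RFC 4627 prescribes this behavior, and the variable names are purely illustrative.

```python
import json
import unicodedata

# Example #2 from the post, written with escapes so an editor cannot alter it.
doc = '{"\\u00C4": "launch nukes", "A\\u0308": "do not launch nukes"}'

parsed = json.loads(doc)
raw_keys = list(parsed)

# Code point for code point, the two names differ, so both members survive.
raw_equal = raw_keys[0] == raw_keys[1]

# After NFC normalization both names become U+00C4 and collide.
nfc_keys = [unicodedata.normalize("NFC", k) for k in raw_keys]
nfc_equal = nfc_keys[0] == nfc_keys[1]

# Re-keying by the normalized name silently keeps only the last value.
collapsed = {unicodedata.normalize("NFC", k): v for k, v in parsed.items()}

print(len(parsed), raw_equal, nfc_equal, collapsed)
```

Whether the object has two members or one therefore depends on whether anything normalizes the names between the wire and the comparison.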

      From ECMA-262, "ECMAScript Language Specification", 5th Edition / December 2009, page 11, section 6 "Source Text":

      ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. The text is expected to have been normalised to Unicode Normalised Form C (canonical composition), as described in Unicode Technical Report #15.

      So let's say you're using a (Java|ECMA)Script editor to edit your JSON.
      And the editor happens to follow this advice, as given in the ECMA-262 document.

      What happens to example #1 in this case?

      > ...
      > >> to tackle such complexity). While it would seem wrong to punt the
      > >> issue, there is the practical question of whether full solution would
      > >> matter.
      > >
      > > I can guarantee you that the practical question of whether a full solution would matter will be answered the first time someone exploits it in a security vulnerable way that results in a major security fiasco.
      > I would be interested in how you would see this leading to security
      > issues, outside of problems specific String handling on platforms has.

      It doesn't necessarily have anything to do with a platform's string handling; it has to do with Unicode.

      "A": 1,
      "A": 2,
      "𝖠": 3,
      "Å": 4,
      "Å": 5,
      "Å": 6,
      "𝖠̊": 7

      Unicode vastly complicates the above. If one uses a Unicode-aware editor to edit the above, it is perfectly fine for it to mangle it so that it is not precisely the Unicode I pasted. In fact, it wouldn't surprise me if this groups.yahoo.com software washes it through a bit of Unicode processing and what finally appears isn't exactly what I put in.
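
As a rough illustration of how such look-alike names behave, here is a sketch that counts distinct keys under different comparison rules. The exact code points of the list above cannot be recovered from the archived text, so the code points below are assumptions chosen to resemble it: LATIN CAPITAL LETTER A, MATHEMATICAL SANS-SERIF CAPITAL A, precomposed A WITH RING ABOVE, ANGSTROM SIGN, and two sequences using COMBINING RING ABOVE.

```python
import unicodedata

# Assumed code points; the original post's exact characters are unknown.
names = [
    "A",                 # U+0041 LATIN CAPITAL LETTER A
    "\U0001D5A0",        # MATHEMATICAL SANS-SERIF CAPITAL A
    "\u00C5",            # LATIN CAPITAL LETTER A WITH RING ABOVE
    "\u212B",            # ANGSTROM SIGN
    "A\u030A",           # A followed by COMBINING RING ABOVE
    "\U0001D5A0\u030A",  # sans-serif A followed by COMBINING RING ABOVE
]

def distinct(form=None):
    """Count distinct names, optionally after normalizing to the given form."""
    if form is None:
        return len(set(names))
    return len({unicodedata.normalize(form, n) for n in names})

raw_count = distinct()         # compared code point for code point
nfc_count = distinct("NFC")    # canonical composition
nfkc_count = distinct("NFKC")  # compatibility composition

print(raw_count, nfc_count, nfkc_count)
```

The same six names yield a different number of "equal" keys depending on which rule the comparing code happens to apply.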

      One also needs to switch to the mindset of a security person, not someone who is interested in writing a JSON specification or parser implementation.

      Security people love to sell and stick magic boxes that sit in the network, usually between you and the bad, evil internet. One particular brand of voodoo, known as the firewall, will occasionally sanitize or reject data from the bad, outside internet.

      Now imagine you're a security person, and you're buying or making one of these magic boxes. You know some of the issues involved and that various JSON implementations are all over the map when it comes to how they deal with the corner cases, and these corner cases can dramatically alter what it means for two keys to "compare equal". Which way are you going to come down on the issue?
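
A sketch of the kind of mismatch at stake, under stated assumptions: `gateway_allows` and `backend_keys` are hypothetical names, the blocked-key policy is invented for illustration, and the "backend" is assumed to normalize names to NFC before using them. It does not describe any real product.

```python
import json
import unicodedata

# Hypothetical policy: the filtering box rejects documents containing this name.
BLOCKED_KEYS = {"\u00C4"}

def gateway_allows(text):
    """Hypothetical filter that compares names code point for code point."""
    return not any(key in BLOCKED_KEYS for key in json.loads(text))

def backend_keys(text):
    """Hypothetical consumer whose string library normalizes names to NFC."""
    return {unicodedata.normalize("NFC", key) for key in json.loads(text)}

# The name is sent in decomposed form: A followed by COMBINING DIAERESIS.
payload = '{"A\\u0308": "launch nukes"}'

allowed = gateway_allows(payload)  # the filter sees no blocked name
seen = backend_keys(payload)       # the consumer sees the blocked name

print(allowed, seen == BLOCKED_KEYS)
```

The two components disagree about what the document contains, even though each is internally consistent.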

      > Or are you equally concerned in general about parser implementation
      > quality (which is understandable), above and beyond question of what
      > JSON specification says? At least to me it would seem more likely that
      > issues would be outside of realm of core specification itself.

      Don't care about particular implementations.

      Keep in mind there's a huge difference between what the spec says and what people do.

      The spec should be "right", for some strong definition of right. It should also not exist solely in some idealized vacuum, but be tempered with the practical, real world issues that real world implementations of the standard have to deal with. It should represent "the best possible" at the time the standard was forged, incorporating the wisdom and experience of those who actually have to deal with and implement whatever the standard represents so that those who come after, who may not have similar levels of experience or willingness to thoroughly examine all the issues can use the standard with some degree of safety and confidence.

      > > Then it will be with 20/20 hindsight, and the question will be "Why didn't anyone address (this behavior) that allowed two keys that were not bit for bit identical, but became identical after converting them to their precomposed form, and the security checks allowed the
      > > decomposed form through because it assumed that everything was in precomposed form?"
      > I can see how this can be problematic from side of applications that
      > make assumptions on uniqueness. And also that it is important that
      > parsers will clearly define how they handle things -- not all parsers
      > necessarily even check for uniqueness for same byte patterns, much
      > less for normalization (and I think this is even allowed by the spec,
      > i.e. uniqueness checks are not mandated).

      I am in violent disagreement with this entire premiss.

      > > There's a long history of "Those little Unicode details aren't really important" causing huge security problems later on.
      > Thank you. While I had heard about issues with request to
      > non-canonical UTF-8 code sequences (which were discussed to have such
      > issues), I admit I had not heard much about issue regarding
      > normalization.

      I would also recommend downloading the Unicode Standard (http://www.unicode.org/versions/Unicode6.0.0/UnicodeStandard-6.0.pdf) and doing a simple search for "security". This will give you a list of pages that are probably the most relevant to what I'm talking about.

      And keep in mind that those issues are directly related to JSON because JSON is "encoded as Unicode". Anything that treats JSON as Unicode, such as a text editor or linked library like ICU, is going to follow the rules and recommendations of the Unicode Standard. This means in the real world, JSON is likely to be washed through one of these libraries and be exposed to the Unicode standard, and that standard DOES NOT require it to preserve the exact sequence of bytes as Douglas Crockford thinks it should.

      Even the official ECMA recommendation says that it expects "the source to be normalised to Unicode Normalised Form C". It's one thing to write code that manipulates data and bytes that are (for some definition of) "local" to that instance of the program at that point in time. It's an entirely different thing when you start slinging bytes between machines or need the bytes to be archived and possibly processed by a different program.
    • johne_ganz
      Message 35 of 35 , Mar 3, 2011
        --- In json@yahoogroups.com, Dave Gamble <davegamble@...> wrote:
        > To save people looking it up:
        > ECMA-262, section 7.6:
        > Two IdentifierName that are canonically equivalent according to the
        > Unicode standard are not equal unless they are represented by the
        > exact same sequence of code units (in other words, conforming
        > ECMAScript implementations are only required to do bitwise comparison
        > on IdentifierName values). The intent is that the incoming source text
        > has been converted to normalised form C before it reaches the
        > compiler.
        > ECMAScript implementations may recognize identifier characters defined
        > in later editions of the Unicode Standard. If portability is a
        > concern, programmers should only employ identifier characters defined
        > in Unicode 3.0.

        There is another relevant section (ECMA-262, 8.4 The String Type, pg 28)

        When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

        NOTE The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.
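
The caveat about escape sequences in the NOTE applies to JSON text as well. A small sketch, again assuming Python's standard `json` and `unicodedata` modules rather than any ECMAScript implementation: the JSON text itself is already in NFC, yet the string produced by parsing it is not.

```python
import json
import unicodedata

# The text is pure ASCII: the combining mark is spelled as an escape.
source = '{"A\\u0308": 1}'

# Normalizing the text changes nothing, so it is already in NFC.
source_is_nfc = unicodedata.normalize("NFC", source) == source

# Parsing resolves the escape and yields a decomposed name.
key = next(iter(json.loads(source)))
key_is_nfc = unicodedata.normalize("NFC", key) == key

print(source_is_nfc, key_is_nfc)
```

Normalizing at the point where text enters the environment therefore does not guarantee that parsed strings are normalized.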

        > I think it's fairly clear that a JSON parser has ABSOLUTELY NO
        > BUSINESS poking around with actual data strings; Douglas has been very
        > clear that you are to pass them bit-identical to the recipient. On the
        > other hand, there's an argument for some kind of sanitation when it
        > comes to object member names.
        > I'm really tempted by the idea of a JSON-secure spec, which clamps
        > down on these details.

        I disagree with your first statement. The ECMA-262 standard, at least in my opinion, tries to side step a lot of these issues. It makes a fairly clear distinction between "what happens inside the ECMA-262 environment (which it obviously has near total control over)" and "what happens outside the ECMA-262 environment".

        IMHO, the ECMA-262 standard advocates that "stuff that happens outside the ECMA-262 environment should be treated as if it is NFC".

        Since the sine qua non of JSON is the interchange of information between different environments and implementations, it must address any issues that can and will cause difficulties. Like it or not, the fact that it's Unicode means these things can and will happen, and it's simply not practical to expect or insist that every implementation treat JSON Strings as "just a simple array of Unicode Code Points".

        > Arguing the Unicode details is decidedly NOT compatible with the
        > "spirit" of JSON, which Douglas has been very clear about; a
        > lightweight, simple, modern data representation.

        I completely agree that these details are NOT compatible with the "spirit" of JSON.

        But.... so what? Unicode is not simple. I'm not the one who made it that way, but the way that RFC 4627 is written, you must deal with it. There are ways RFC 4627 could have been written such that the JSON to be parsed is considered a stream of 8 bit bytes, and therefore stripped of its Unicode semantics (if any). However, it very clearly and plainly says "JSON text SHALL be encoded in Unicode.", which pretty much kills the idea that you can just treat it as raw bytes.

        There's a saying about formalized standards: The standard is right. Even its mistakes.

        As an aside, there is an RFC for "Unicode Format for Network Interchange", RFC 5198 (http://tools.ietf.org/html/rfc5198). It is 18 pages long. RFC 4627 is just 9 pages.

        Actually, I would encourage people to read RFC 5198. I'm not sure I agree with all of it, but it goes over a lot of the issues I think are very relevant to this conversation. It's great background info if you're not familiar with the details.