
JSON and the Unicode Standard

  • johne_ganz
    Message 1 of 35, Feb 23, 2011
      In RFC 4627, Section 3 Encoding, it states:
      "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
      Unicode is defined as: The Unicode Consortium, "The Unicode Standard
      Version 4.0", 2003, <http://www.unicode.org/versions/Unicode4.1.0/>.
      Is it safe to assume that RFC 4627 implies "The minimum Unicode Standard
      is version 4.0", or does it mean "The Unicode Standard as defined in
      version 4.0, and ONLY version 4.0" (i.e., later versions of the Unicode
      standard are non-RFC 4627 conforming)? The standard is silent on this
      point, but I believe a "best practices" interpretation is "The minimum
      Unicode Standard is version 4.0" with the implicit assumption that the
      Unicode Standard is strongly motivated to preserve backwards
      compatibility. Is this the accepted interpretation?
      Furthermore, I interpret the quoted RFC 4627 section to imply: where
      RFC 4627 conflicts with the Unicode Standard, the Unicode Standard's
      interpretation shall be used unless explicitly and unambiguously
      superseded by RFC 4627. Otherwise, by referencing the Unicode
      Standard, RFC 4627 incorporates it into the requirements for JSON.
      In other words, JSON is built on top of Unicode. When defining JSON,
      the author(s) of RFC 4627 were aware of conflicts between what they
      were defining (JSON) and the Unicode Standard (at the time, v4.0),
      and explicitly called out any exceptions that JSON requires.
      Assuming this is a valid interpretation, it places a number of
      requirements on a JSON implementation that are non-obvious from just
      reading RFC 4627. For example, from the Unicode Standard (note: I'm
      using the latest version at the time of this writing, 6.0), Chapter 3,
      Conformance, Section 3.4, Characters and Encoding
      (http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf):
      C2 A process shall not interpret a noncharacter code point as an
      abstract character. [ed: this is C5 in v4.0. The text appears to be
      identical.]
      D14 Noncharacter: A code point that is permanently reserved for
      internal use and that should never be interchanged. Noncharacters
      consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10_16
      [ed: base 16]) and the values U+FDD0..U+FDEF. [ed: this is D7b in v4.0.
      The text appears to be identical.]
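
      That D14 definition translates directly into a predicate. A minimal
      sketch in Python (the function name is mine, not something defined by
      the standard):

      def is_noncharacter(cp: int) -> bool:
          """True if cp is one of the 66 noncharacters per D14."""
          # U+FDD0..U+FDEF: 32 contiguous noncharacters.
          if 0xFDD0 <= cp <= 0xFDEF:
              return True
          # U+nFFFE and U+nFFFF: the last two code points of each of the
          # 17 planes (n from 0 to 10 hex).
          return (cp & 0xFFFE) == 0xFFFE and cp <= 0x10FFFF

      assert is_noncharacter(0xFFFE)
      assert is_noncharacter(0xFDD0)
      assert is_noncharacter(0x1FFFF)
      assert not is_noncharacter(0xFFFD)  # U+FFFD is an assigned character
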
      Unicode Standard (6.0), Chapter 2 General Structure, Section 2.13
      Special Characters and Noncharacters - Special Noncharacter Code Points:
      The Unicode Standard contains a number of code points that are
      intentionally not used to represent assigned characters. These code
      points are known as noncharacters. They are permanently reserved for
      internal use and should never be used for open interchange of Unicode
      text. For more information on noncharacters, see Section 16.7,
      Noncharacters. [ed: have not compared this to v4.0]
      Unicode Standard (6.0), Chapter 16 Special Areas and Format Characters,
      Section 16.7 Noncharacters:
      Applications are free to use any of these noncharacter code points
      internally but should never attempt to exchange them. If a noncharacter
      is received in open interchange, an application is not required to
      interpret it in any way. It is good practice, however, to recognize it
      as a noncharacter and to take appropriate action, such as replacing it
      with U+FFFD replacement character, to indicate the problem in the text.
      It is not recommended to simply delete noncharacter code points from
      such text, because of the potential security issues caused by deleting
      uninterpreted characters. [ed: have not compared this to v4.0]
      ---------
      This means strings like "\ufffe", "\ufdd0", and "\ud83f\udfff" are
      "noncharacters", and a plain reading of the standard clearly implies
      that they are in some way "invalid" (I quote the term because the
      Unicode Standard has a lot to say about how to deal with this). While
      the examples given are the \u escaped variety, it should be obvious
      that the same code points U+FFFE, U+FDD0, and U+1FFFF encoded in
      their UTF-* representations are also "invalid". In UTF-8, these
      would be <EF BF BE>, <EF B7 90>, and <F0 9F BF BF>.
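
      Those byte sequences are easy to verify mechanically; a quick check in
      Python (note that the codec encodes noncharacters without complaint;
      their "invalidity" is a matter of interchange policy, not of UTF-8
      well-formedness):

      for cp in (0xFFFE, 0xFDD0, 0x1FFFF):
          print("U+%04X -> %s" % (cp, chr(cp).encode('utf-8').hex(' ').upper()))
      # U+FFFE -> EF BF BE
      # U+FDD0 -> EF B7 90
      # U+1FFFF -> F0 9F BF BF
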
      Unicode Standard (6.0), Chapter 3 Conformance, Section 3.9 Unicode
      Encoding Forms (http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf)
      covers a lot of these details. In particular, the section "Best
      Practices for Using U+FFFD" gives details on using the special U+FFFD
      replacement character to replace the "invalid" Unicode. For example,
      the \u escape sequence of '\ud83f\udfff' in a quoted string would be
      replaced with a single U+FFFD.
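
      For illustration, that replacement could be done as a post-parse pass;
      a minimal sketch in Python, reusing the is_noncharacter() predicate
      from above (the policy layer is mine; json.loads itself passes
      noncharacters through untouched):

      import json

      def replace_noncharacters(s):
          # Map each noncharacter to U+FFFD, one-for-one.
          return ''.join('\ufffd' if is_noncharacter(ord(c)) else c for c in s)

      parsed = json.loads('"\\ud83f\\udfff"')  # the escaped pair decodes to U+1FFFF
      assert parsed == '\U0001ffff'
      assert replace_noncharacters(parsed) == '\ufffd'  # one code point in, one out
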
      It is important to note that there have been extensive changes to
      section 3.9 between 4.0 and 6.0. Some of these are due to various
      security issues (http://www.unicode.org/faq/security.html).
      Is the above the "generally agreed on" interpretation of how things
      should be done?
      Is it safe to say that a "strictly RFC 4627 conforming JSON
      implementation" MUST also be "strictly Unicode Standard conforming"
      (at least in terms of Chapter 3 of the Unicode Standard,
      "Conformance")?
      Is there an opinion on whether JSON that is used for interchange
      SHOULD NOT or MUST NOT contain "noncharacters"? That is to say, a
      JSON generator should/must not create JSON with noncharacters, and a
      parser should/must either reject such JSON as invalid or replace the
      noncharacters with U+FFFD? There is technically a difference between
      JSON used for interchange and JSON not used for interchange, since
      the Unicode Standard allows an implementation to use the
      noncharacters as "internal, private" code points, but those code
      points should not be present in Unicode text that (for some
      reasonable definition of) "leaves the implementation". Personally, I
      don't think such a distinction should be made for JSON, or is really
      even meaningful, and all JSON should/must be treated as
      "interchange".
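
      To make the should/must question concrete, here is one way a parser
      could expose the choice, sketched in Python on top of the helpers
      above (the wrapper and its policy argument are hypothetical, not part
      of any standard API):

      import json

      def loads_checked(text, noncharacters='reject'):
          """json.loads plus a noncharacter policy: 'reject' or 'replace'."""
          def check(s):
              if noncharacters == 'replace':
                  return replace_noncharacters(s)
              for c in s:
                  if is_noncharacter(ord(c)):
                      raise ValueError("noncharacter U+%04X in JSON string"
                                       % ord(c))
              return s
          def walk(v):
              if isinstance(v, str):
                  return check(v)
              if isinstance(v, list):
                  return [walk(x) for x in v]
              if isinstance(v, dict):
                  # Object member names are checked too.
                  return {check(k): walk(x) for k, x in v.items()}
              return v
          return walk(json.loads(text))
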
      The Unicode Standard, and in particular later versions of the standard,
      for all practical purposes make it a requirement that "characters MUST
      NOT be deleted". One course of action is to simply not accept a string
      and report an error, and another is to replace a bad or malformed
      character with U+FFFD. There are some very compelling security related
      reasons for doing this. Is there an opinion that an RFC 4627 JSON
      implementation "MUST NOT arbitrarily delete characters" as well? (This
      is a somewhat complicated issue, see
      http://www.unicode.org/faq/security.html and
      http://www.unicode.org/reports/tr36/ for more info, in particular UTR#36
      - Section 3 "Non-Visual Security Issues").


    • johne_ganz
      Message 35 of 35, Mar 3, 2011
        --- In json@yahoogroups.com, Dave Gamble <davegamble@...> wrote:
        >
        > To save people looking it up:
        >
        > ECMA-262, section 7.6:
        >
        > Two IdentifierName that are canonically equivalent according to the
        > Unicode standard are not equal unless they are represented by the
        > exact same sequence of code units (in other words, conforming
        > ECMAScript implementations are only required to do bitwise comparison
        > on IdentifierName values). The intent is that the incoming source text
        > has been converted to normalised form C before it reaches the
        > compiler.
        >
        > ECMAScript implementations may recognize identifier characters defined
        > in later editions of the Unicode Standard. If portability is a
        > concern, programmers should only employ identifier characters defined
        > in Unicode 3.0.

        There is another relevant section (ECMA-262, 8.4 The String Type, p. 28):

        When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

        NOTE The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.
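
        To make the "sequences of 16-bit code units" model concrete: a single non-BMP code point counts as two elements under this numbering. A quick illustration (Python is used here only to count the code units):

        s = '\U0001FFFF'  # one code point, U+1FFFF
        units = len(s.encode('utf-16-be')) // 2
        print(units)  # 2 -- the same value ECMAScript reports for s.length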

        > I think it's fairly clear that a JSON parser has ABSOLUTELY NO
        > BUSINESS poking around with actual data strings; Douglas has been very
        > clear that you are to pass them bit-identical to the recipient. On the
        > other hand, there's an argument for some kind of sanitation when it
        > comes to object member names.
        > I'm really tempted by the idea of a JSON-secure spec, which clamps
        > down on these details.

        I disagree with your first statement. The ECMA-262 standard, at least in my opinion, tries to sidestep a lot of these issues. It makes a fairly clear distinction between "what happens inside the ECMA-262 environment (which it obviously has near total control over)" and "what happens outside the ECMA-262 environment".

        IMHO, the ECMA-262 standard advocates that "stuff that happens outside the ECMA-262 environment should be treated as if it is NFC".

        Since the sine qua non of JSON is the interchange of information between different environments and implementations, it must address any issues that can and will cause difficulties. Like it or not, the fact that it's Unicode means these things can and will happen, and it's simply not practical to expect or insist that every implementation treat JSON Strings as "just a simple array of Unicode Code Points".
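
        As a concrete example of how the "simple array of code points" model breaks down: RFC 4627's \u escapes inherit ECMAScript's code unit view, so a lone surrogate is expressible in JSON source text even though it is not a Unicode scalar value. A sketch in Python (CPython's json module happens to accept this; other parsers reject it, which is exactly the kind of interoperability gap at issue):

        import json

        s = json.loads('"\\ud800"')  # parses without error in CPython
        assert len(s) == 1 and ord(s) == 0xD800

        # The result cannot round-trip through any Unicode encoding form:
        try:
            s.encode('utf-8')
        except UnicodeEncodeError as e:
            print(e.reason)  # "surrogates not allowed"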

        > Arguing the Unicode details is decidedly NOT compatible with the
        > "spirit" of JSON, which Douglas has been very clear about; a
        > lightweight, simple, modern data representation.

        I completely agree that these details are NOT compatible with the "spirit" of JSON.

        But... so what? Unicode is not simple. I'm not the one who made it that way, but the way that RFC 4627 is written, you must deal with it. There are ways RFC 4627 could have been written such that the JSON to be parsed is considered a stream of 8-bit bytes, and therefore stripped of its Unicode semantics (if any). However, it very clearly and plainly says "JSON text SHALL be encoded in Unicode.", which pretty much kills the idea that you can just treat it as raw bytes.

        There's a saying about formalized standards: The standard is right. Even its mistakes.

        As an aside, there is an RFC for "Unicode Format for Network Interchange", RFC 5198 (http://tools.ietf.org/html/rfc5198). It is 18 pages long. RFC 4627 is just 9 pages.

        Actually, I would encourage people to read RFC 5198. I'm not sure I agree with all of it, but it goes over a lot of the issues I think are very relevant to this conversation. It's great background info if you're not familiar with the details.