
JSON strings cannot point to post-BMP Unicode codepoints?

  • Shriramana Sharma
    Message 1 of 8 , Apr 7, 2013
      Hello. I am thinking of using JSON for storing the data produced by my program.

      As I was reading through the website (json.org) I notice that the
      specification of strings mentions that \u followed by *four* hex
      digits can be used to represent Unicode codepoints. However, if that
      restriction of *four* hex digits is meant to be enforced, then it
      means that post-BMP codepoints (such as 0x11005 BRAHMI LETTER A)
      cannot be represented in such strings directly, but that they have to
      be manually (i.e. by the program outputting JSON-ed data) decomposed
      into their equivalent UTF16 surrogate pairs (for instance, 0xd804
      0xdc05).
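For readers who want to see the decomposition mentioned above, here is a minimal sketch in Python 3. The function name `surrogate_pair_escape` is made up for this illustration and is not part of any library; it simply applies the standard UTF-16 arithmetic to produce the two `\u` escapes for a code point above U+FFFF.

```python
import json


def surrogate_pair_escape(code_point: int) -> str:
    """Return the twelve-character JSON escape for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000       # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return "\\u{:04X}\\u{:04X}".format(high, low)


brahmi_a = "\U00011005"  # BRAHMI LETTER A
escaped = surrogate_pair_escape(ord(brahmi_a))
print(escaped)

# Round trip: a JSON parser turns the two escapes back into one character.
print(json.loads('"' + escaped + '"') == brahmi_a)
```

In practice a JSON library normally does this step for you, so the program producing the data rarely has to compute surrogates by hand.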

      IMHO this is an unnecessary restriction. Modern standards (for
      instance Python 3, C11, C++11) allow post-BMP codepoints to be
      represented in string literals, using a capital U as in \U00011005. In
      fact, in C/C++ it is *prohibited* to use surrogate code points as part
      of a string literal. (A good idea which eliminates the possibility of
      unpaired surrogates altogether.)

      As a researcher interested in ancient scripts of South India I have to
      handle these SMP codepoints often, even entire texts in such scripts.
      Can JSON not support the \Uxxxxxxxx notation?

      http://en.wikipedia.org/wiki/JSON says: "The default character
      encoding for JSON is UTF8; it also supports UTF16 and UTF32." but I'm
      not sure about it because it is not mentioned explicitly on the
      json.org page and it is also not very clear to me as to what exactly
      that statement means. Does it mean that even though there is no \U
      notation, I can directly input post-BMP codepoints as part of the
      string literals? The json.org page does say "any-Unicode-character".
      In this case even the \u notation is only there as a just-in-case?
      (Even if so, why not \U too just-in-case?)

      TIA for your kind explanations and comments,

      --
      Shriramana Sharma
    • douglascrockford
      Message 2 of 8 , Apr 7, 2013
        From RFC 4627 http://www.ietf.org/rfc/rfc4627.txt?number=4627

        To escape an extended character that is not in the Basic Multilingual
        Plane, the character is represented as a twelve-character sequence,
        encoding the UTF-16 surrogate pair. So, for example, a string
        containing only the G clef character (U+1D11E) may be represented as
        "\uD834\uDD1E".
      • David Heiko Kolf
        Message 3 of 8 , Apr 7, 2013
          Shriramana Sharma wrote:
          > http://en.wikipedia.org/wiki/JSON says: "The default character
          > encoding for JSON is UTF8; it also supports UTF16 and UTF32." but I'm
          > not sure about it because it is not mentioned explicitly on the
          > json.org page and it is also not very clear to me as to what exactly
          > that statement means. Does it mean that even though there is no \U
          > notation, I can directly input post-BMP codepoints as part of the
          > string literals? The json.org page does say "any-Unicode-character".
          > In this case even the \u notation is only there as a just-in-case?
          > (Even if so, why not \U too just-in-case?)

          Yes, you can put post-BMP codepoints directly as part of the string
          literals. If there are control characters outside of the BMP in your
          text they could still be encoded as surrogate pairs. This isn't the
          decision of JSON -- JavaScript uses UTF16 internally for all strings.

          Best Regards,

          David Kolf
        • John Cowan
          Message 4 of 8 , Apr 7, 2013
            Shriramana Sharma scripsit:

            > However, if that restriction of *four* hex digits is meant to be
            > enforced, then it means that post-BMP codepoints (such as 0x11005
            > BRAHMI LETTER A) cannot be represented in such strings directly, but
            > that they have to be manually (i.e. by the program outputting JSON-ed
            > data) decomposed into their equivalent UTF16 surrogate pairs (for
            > instance, 0xd804 0xdc05).

            It can be represented either as the actual character, 4 bytes in any
            of UTF-8, UTF-16, or UTF-32; or else as two consecutive ASCII escapes:
            "\uD804\uDC05".

            > IMHO this is an unnecessary restriction.

            JSON is backward compatible by design with ECMAScript 3, which does not
            support the \U escape.

            > Does it mean that even though there is no \U notation, I can directly
            > input post-BMP codepoints as part of the string literals?

            Correct.
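A rough way to check the two representations discussed in this reply, assuming Python 3 and its standard `json` module (any conforming parser should behave the same way for the decoding step; the variable names are only for illustration):

```python
import json

brahmi_a = "\U00011005"  # BRAHMI LETTER A

# The escaped form and the raw character decode to the same string.
from_escapes = json.loads('"\\uD804\\uDC05"')
from_raw = json.loads('"' + brahmi_a + '"')
print(from_escapes == from_raw == brahmi_a)

# Size of this one character in each encoding (little-endian, no BOM).
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, len(brahmi_a.encode(encoding)))

# ensure_ascii=False writes the character itself;
# the default setting writes the pair of \u escapes instead.
print(json.dumps(brahmi_a, ensure_ascii=False) == '"' + brahmi_a + '"')
print(json.dumps(brahmi_a))
```

Either output is the same JSON string value; the choice only affects whether the file stays pure ASCII.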

            > In this case even the \u notation is only there as a just-in-case?

            Just so.

            --
            John Cowan  cowan@...
            There are three kinds of people in the world:
            those who can count, and those who can't.
          • douglascrockford
            Message 5 of 8 , Apr 7, 2013
              JavaScript, Java, and many other languages were developed at a time when Unicode was going to be a 16-bit character set. Unicode later grew into a 21-bit character set.

              JSON took its representation of strings from JavaScript.
            • Dennis Gearon
              Message 6 of 8 , Apr 7, 2013
                There's some contradiction in the json RFC. If the encoding 'shall be Unicode'
                and default is UTF-8 as is stated, then ALL normal planes, including those
                outside of the BMP can be encoded w/o any special escaping (excluding the
                special set chars escaped for JSON). UTF16 doesn't apply, right?

                See http://en.wikipedia.org/wiki/Plane_%28Unicode%29, where it says it is NOT a
                UTF-8 limit, being inside the BMP, but UTF16. It goes further to say that with
                ONLY 4 bytes, utf-8 can represent twice as many code points as UTF16 using
                surrogate pairs.


                If I have read everything correctly.

                Dennis Gearon


                Never, ever approach a computer saying or even thinking "I will just do this
                quickly."




                ________________________________
                From: douglascrockford <douglas@...>
                To: json@yahoogroups.com
                Sent: Sun, April 7, 2013 11:30:30 AM
                Subject: [json] Re: JSON strings cannot point to post-BMP Unicode codepoints?


                JavaScript, Java, and many other languages, were developed at a time when
                Unicode was going to be a 16-bit character set. Unicode later grew into a 21-bit
                character set.


                JSON took its representation of strings from JavaScript.




              • John Cowan
                Message 7 of 8 , Apr 8, 2013
                  Dennis Gearon scripsit:

                  > There's some contradiction in the json RFC. If the encoding 'shall be
                  > Unicode' and default is UTF-8 as is stated, then ALL normal planes,
                  > including those outside of the BMP can be encoded w/o any special
                  > escaping (excluding the special set chars escaped for JSON).

                  That's right. However, escapes are handy for representing stray Unicode
                  characters that aren't easy to type, just as in HTML or XML. Unlike
                  those languages, JSON requires two consecutive escapes to represent a
                  non-BMP character.

                  What's ambiguous is whether a JSON document like

                  ["\uD800"]

                  with an unpaired escaped surrogate, is valid or not. It is valid in
                  JavaScript. Crockford says it was not his intention to rule it out,
                  and I say it is implicitly forbidden by the definition in section 1 that
                  a string is a sequence of zero or more Unicode characters, because U+D800
                  is not a Unicode character.
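How a particular parser treats such a document is implementation-specific. As one data point only, here is a small probe using Python 3's standard `json` module; it says nothing about what the specification intends, and other parsers may reject the input outright.

```python
import json

# Parse the document with the unpaired escaped surrogate.
parsed = json.loads('["\\uD800"]')
lone = parsed[0]
print(len(lone), hex(ord(lone)))

# The resulting string holds a lone surrogate, which cannot be
# written out as well-formed UTF-8.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```

With the CPython versions I am aware of, the parse succeeds and the failure only shows up later, when the string has to be encoded. That delayed failure is part of why the ambiguity matters in practice.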

                  > UTF16 doesn't apply, right?

                  UTF-16 is a perfectly cromulent encoding for JSON, though probably not
                  much used.

                  > See http://en.wikipedia.org/wiki/Plane_%28Unicode%29, where it says
                  > it is NOT a UTF-8 limit, being inside the BMP, but UTF16. It goes
                  > further to say that with ONLY 4 bytes, utf-8 can represent twice as
                  > many code points as UTF16 using surrogate pairs.

                  UTF-8 and UTF-16 can represent the exact same range of code points,
                  namely 0-10FFFF excluding D800-DFFF. Any UTF-8 byte sequence that
                  purports to represent any other code point has been illegal for a long
                  time now.
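The shared range can be checked with a short sketch, assuming Python 3 and its strict built-in UTF-8 codec. The helper `utf8_decodes` is a name invented for this example.

```python
def utf8_decodes(data: bytes) -> bool:
    """Report whether the strict UTF-8 decoder accepts these bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The highest code point, U+10FFFF, fits in both encodings.
top = chr(0x10FFFF)
print(len(top.encode("utf-8")), len(top.encode("utf-16-le")))

print(utf8_decodes(b"\xf4\x8f\xbf\xbf"))  # the UTF-8 form of U+10FFFF
print(utf8_decodes(b"\xf4\x90\x80\x80"))  # bytes that would mean U+110000
print(utf8_decodes(b"\xed\xa0\x80"))      # bytes that would mean U+D800
```

The last two byte sequences follow the old UTF-8 bit pattern but name code points outside the permitted range, so a strict decoder refuses them.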

                  --
                  John Cowan  cowan@...  http://www.ccil.org/~cowan
                  We pledge allegiance to the penguin and to the intellectual
                  property regime for which he stands, one world under Linux,
                  with free music and open source software for all.
                  --Julian Dibbell on Brazil, edited
                • Shriramana Sharma
                  Message 8 of 8 , Apr 8, 2013
                    Hello people and thanks for your responses. I hope I understand
                    correctly that compatibility with JavaScript and the ECMAScript
                    standard dictates that it would not be advisable for JSON to
                    unilaterally add the extension of \U, but since any valid Unicode
                    characters can be part of string literals (encoded in the appropriate
                    encoding) I guess this is not too much of a problem. Thank you.