Re: JSON strings cannot point to post-BMP Unicode codepoints?

  • douglascrockford
    Message 1 of 8 , Apr 7 11:30 AM
      JavaScript, Java, and many other languages were developed at a time when Unicode was going to be a 16-bit character set. Unicode later grew into a 21-bit character set.

      JSON took its representation of strings from JavaScript.
    • Dennis Gearon
      Message 2 of 8 , Apr 7 9:41 PM
        There's some contradiction in the JSON RFC. If the encoding 'shall be Unicode'
        and the default is UTF-8, as is stated, then ALL normal planes, including those
        outside of the BMP, can be encoded without any special escaping (excluding the
        special set of characters escaped for JSON). UTF-16 doesn't apply, right?

        See http://en.wikipedia.org/wiki/Plane_%28Unicode%29, where it says that being
        confined to the BMP is NOT a UTF-8 limit but a UTF-16 one. It goes further to
        say that with ONLY 4 bytes, UTF-8 can represent twice as many code points as
        UTF-16 using surrogate pairs.


        If I have read everything correctly.

        Dennis Gearon


        Never, ever approach a computer saying or even thinking "I will just do this
        quickly."




        ________________________________
        From: douglascrockford <douglas@...>
        To: json@yahoogroups.com
        Sent: Sun, April 7, 2013 11:30:30 AM
        Subject: [json] Re: JSON strings cannot point to post-BMP Unicode codepoints?


        JavaScript, Java, and many other languages, were developed at a time when
        Unicode was going to be a 16-bit character set. Unicode later grew into a 21-bit
        character set.


        JSON took its representation of strings from JavaScript.




      • John Cowan
        Message 3 of 8 , Apr 8 2:11 AM
          Dennis Gearon scripsit:

          > There's some contradiction in the JSON RFC. If the encoding 'shall be
          > Unicode' and the default is UTF-8, as is stated, then ALL normal planes,
          > including those outside of the BMP, can be encoded without any special
          > escaping (excluding the special set of characters escaped for JSON).

          That's right. However, escapes are handy for representing stray Unicode
          characters that aren't easy to type, just as in HTML or XML. Unlike
          those languages, JSON requires two consecutive escapes to represent a
          non-BMP character.
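
To make the two-escape rule concrete, here is a minimal sketch in Python (my choice of language; nothing in the thread prescribes one). It decodes a JSON document whose only string is U+1D11E written as a surrogate pair, then rebuilds the pair by hand. The helper name to_surrogate_escapes is hypothetical, made up for this illustration.

```python
import json

# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so its escaped form
# in JSON is the UTF-16 surrogate pair D834 DD1E, written as two \u escapes.
escaped_doc = r'["\uD834\uDD1E"]'
decoded = json.loads(escaped_doc)[0]

# The parser joins the two escapes into a single code point.
print(len(decoded), hex(ord(decoded)))


def to_surrogate_escapes(cp: int) -> str:
    """Hypothetical helper: build the two JSON escapes for a non-BMP code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return "\\u{:04X}\\u{:04X}".format(high, low)


print(to_surrogate_escapes(0x1D11E))
```

In HTML or XML the same character would be a single reference (&#x1D11E;), which is the difference being pointed out here.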

          What's ambiguous is whether a JSON document like

          ["\uD800"]

          with an unpaired escaped surrogate, is valid or not. It is valid in
          JavaScript. Crockford says it was not his intention to rule it out,
          and I say it is implicitly forbidden by the definition in section 1 that
          a string is a sequence of zero or more Unicode characters, because U+D800
          is not a Unicode character.
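
The ambiguity shows up in practice. The sketch below assumes CPython's standard json module, which, as far as I know, accepts the lone escaped surrogate rather than rejecting it; other parsers may well refuse the document. It then shows that the resulting string cannot be written out as UTF-8, which is one way of seeing why U+D800 is not an ordinary character.

```python
import json

doc = r'["\uD800"]'

# CPython's json module parses this without complaint (an observation about
# one implementation, not a statement about what the RFC requires).
value = json.loads(doc)[0]
is_lone_surrogate = 0xD800 <= ord(value) <= 0xDFFF

# A lone surrogate is not a Unicode scalar value, so strict UTF-8 encoding fails.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(is_lone_surrogate, encodable)
```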

          > UTF16 doesn't apply, right?

          UTF-16 is a perfectly cromulent encoding for JSON, though probably not
          much used.

          > See http://en.wikipedia.org/wiki/Plane_%28Unicode%29, where it says
          > that being confined to the BMP is NOT a UTF-8 limit but a UTF-16 one.
          > It goes further to say that with ONLY 4 bytes, UTF-8 can represent
          > twice as many code points as UTF-16 using surrogate pairs.

          UTF-8 and UTF-16 can represent the exact same range of code points,
          namely 0-10FFFF excluding D800-DFFF. Any UTF-8 byte sequence that
          purports to represent any other code point has been illegal for a long
          time now.
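
A quick check of that range, sketched in Python under the assumption that its built-in codecs follow the current UTF-8 and UTF-16 definitions (the function name is_well_formed_utf8 is hypothetical): the highest code point, U+10FFFF, takes four bytes in either encoding, and byte sequences that would decode to anything above it, or to a surrogate, are rejected.

```python
top = chr(0x10FFFF)

# The last valid code point needs 4 bytes in both encodings.
utf8 = top.encode("utf-8")        # F4 8F BF BF
utf16 = top.encode("utf-16-be")   # surrogate pair DBFF DFFF
print(utf8.hex(), utf16.hex())


def is_well_formed_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


beyond_range = b"\xf4\x90\x80\x80"         # would be U+110000
encoded_surrogate = b"\xed\xa0\x80"        # would be U+D800
five_byte_form = b"\xf8\x88\x80\x80\x80"   # old 5-byte pattern

print(is_well_formed_utf8(utf8))
print(is_well_formed_utf8(beyond_range))
print(is_well_formed_utf8(encoded_surrogate))
print(is_well_formed_utf8(five_byte_form))
```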

          --
          John Cowan    cowan@...    http://www.ccil.org/~cowan
          We pledge allegiance to the penguin and to the intellectual property
          regime for which he stands, one world under Linux, with free music
          and open source software for all. --Julian Dibbell on Brazil, edited
        • Shriramana Sharma
          Message 4 of 8 , Apr 8 6:27 AM
            Hello people, and thanks for your responses. I hope I understand
            correctly that compatibility with JavaScript and the ECMAScript
            standard means it would not be advisable for JSON to
            unilaterally add a \U escape as an extension. But since any valid
            Unicode character can appear directly in a string literal (encoded
            in the document's encoding), I guess this is not too much of a
            problem. Thank you.
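
For the record, a small sketch of that point in Python (an assumption on my part; the thread names no language): the same non-BMP character can sit unescaped in a UTF-8 encoded JSON document or be written as a surrogate-pair escape, and both forms parse to the same value. The ensure_ascii flag is the standard json.dumps option that switches between the two outputs.

```python
import json

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# Unescaped: the character appears directly in the document, encoded as UTF-8.
raw_text = json.dumps([clef], ensure_ascii=False)
raw_bytes = raw_text.encode("utf-8")

# Escaped: the default output is pure ASCII and uses a surrogate pair.
escaped_text = json.dumps([clef])

print(raw_bytes)
print(escaped_text)

# Both documents carry the same string.
same_value = json.loads(raw_bytes) == json.loads(escaped_text) == [clef]
print(same_value)
```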