Re: [json] Re: JSON strings cannot point to post-BMP Unicode codepoints?
- There's some contradiction in the json RFC. If the encoding 'shall be Unicode'
and default is UTF-8 as is stated, then ALL normal planes, including those
outside of the BMP can be encoded w/o any special escaping (excluding the
special set chars escaped for JSON). UTF16 doesn't apply, right?
See http://en.wikipedia.org/wiki/Plane_%28Unicode%29, where it says it is NOT a
UTF-8 limit, being inside the BMP, but UTF16. It goes further to say that with
ONLY 4 bytes, utf-8 can represent twice as many code points as UTF16 using
surrogate pairs.
If I have read everything correctly.
Never, ever approach a computer saying or even thinking "I will just do this
From: douglascrockford <douglas@...>
Sent: Sun, April 7, 2013 11:30:30 AM
Subject: [json] Re: JSON strings cannot point to post-BMP Unicode codepoints?
Unicode was going to be a 16-bit character set. Unicode later grew into a 21-bit
- Dennis Gearon scripsit:
> There's some contradiction in the json RFC. If the encoding 'shall be
> Unicode' and default is UTF-8 as is stated, then ALL normal planes,
> including those outside of the BMP can be encoded w/o any special
> escaping (excluding the special set chars escaped for JSON).

That's right. However, escapes are handy for representing stray Unicode
characters that aren't easy to type, just as in HTML or XML. Unlike
those languages, JSON requires two consecutive escapes to represent a
What's ambiguous is whether a JSON document like
with an unpaired escaped surrogate, is valid or not. It is valid in
and I say it is implicitly forbidden by the definition in section 1 that
a string is a sequence of zero or more Unicode characters, because U+D800
is not a Unicode character.
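
To make the "two consecutive escapes" and the unpaired-surrogate question concrete, here is a minimal Python sketch. It is an illustration only, not a ruling on what the RFC permits: the helper names `to_surrogate_pair` and `json_escape` are made up for this example, U+1D11E is an arbitrary non-BMP character, and the lone `\uD800` document is an assumed stand-in for the kind of input under discussion. Python's standard `json` module is just one implementation, and its leniency reflects that library's choice.

```python
import json

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

def json_escape(code_point):
    """Hypothetical helper: the two consecutive \\u escapes JSON needs."""
    high, low = to_surrogate_pair(code_point)
    return "\\u%04X\\u%04X" % (high, low)

# A paired escape decodes to a single non-BMP character.
escaped = json_escape(0x1D11E)
decoded = json.loads('"' + escaped + '"')
print(escaped, len(decoded), hex(ord(decoded)))

# An unpaired escaped surrogate: Python's json module accepts it and
# returns a string holding the lone surrogate code point.
lone = json.loads('"\\uD800"')

# That string cannot be written out as UTF-8, since U+D800 is not a
# Unicode character (scalar value).
try:
    lone.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
print(repr(lone), utf8_ok)
```

Other parsers may reject the second document outright, which is exactly the ambiguity described above.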
> UTF16 doesn't apply, right?

UTF-16 is a perfectly cromulent encoding for JSON, though probably not

> See http://en.wikipedia.org/wiki/Plane_%28Unicode%29, where it says
> it is NOT a UTF-8 limit, being inside the BMP, but UTF16. It goes
> further to say that with ONLY 4 bytes, utf-8 can represent twice as
> many code points as UTF16 using surrogate pairs.

UTF-8 and UTF-16 can represent the exact same range of code points,
namely 0-10FFFF excluding D800-DFFF. Any UTF-8 byte sequence that
purports to represent any other code point has been illegal for a long
John Cowan  cowan@...  http://www.ccil.org/~cowan
We pledge allegiance to the penguin
and to the intellectual property regime
for which he stands, one world under
Linux, with free music and open source
software for all. --Julian Dibbell on Brazil, edited
- Hello people and thanks for your responses. I hope I understand
standard dictates that it would not be advisable for JSON to
unilaterally add the extension of \U, but since any valid Unicode
characters can be part of string literals (encoded in the appropriate
encoding) I guess this is not too much of a problem. Thank you.
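
As a rough check of that conclusion, the following Python sketch writes the same non-BMP character into a JSON document twice, once as a literal character (to be encoded as UTF-8) and once as a pair of `\u` escapes, and confirms that both parse to the same string. The choice of U+1D11E and the variable names are assumptions made for the example; `ensure_ascii` is a parameter of Python's standard `json.dumps`.

```python
import json

clef = "\U0001D11E"  # a character outside the BMP, chosen arbitrarily

# Literal form: the character appears directly in the document text.
raw_doc = json.dumps([clef], ensure_ascii=False)
raw_bytes = raw_doc.encode("utf-8")

# Escaped form: json.dumps defaults to ensure_ascii=True and emits a
# surrogate pair as two consecutive \u escapes.
escaped_doc = json.dumps([clef])

# Both documents decode to the same one-character string.
same = json.loads(raw_bytes.decode("utf-8")) == json.loads(escaped_doc)

# The character takes 4 bytes in UTF-8 and 4 bytes (two 16-bit units)
# in UTF-16, so neither encoding reaches code points the other cannot.
utf8_len = len(clef.encode("utf-8"))
utf16_len = len(clef.encode("utf-16-be"))

print(escaped_doc, same, utf8_len, utf16_len)
```

So a `\U` extension would add nothing a conforming document cannot already express, at least in this implementation.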