JSON strings cannot point to post-BMP Unicode codepoints?
- Apr 7, 2013
Hello. I am thinking of using JSON for storing the data produced by my program.
As I was reading through the website (json.org) I noticed that the
specification of strings mentions that \u followed by *four* hex
digits can be used to represent Unicode codepoints. However, if that
restriction of *four* hex digits is meant to be enforced, then it
means that post-BMP codepoints (such as 0x11005 BRAHMI LETTER A)
cannot be represented in such strings directly, but that they have to
be manually (i.e. by the program outputting JSON-ed data) decomposed
into their equivalent UTF16 surrogate pairs (for instance, 0xd804
IMHO this is an unnecessary restriction. Modern standards (for
instance Python 3, C11, C++11) allow post-BMP codepoints to be
represented in string literals, using a capital U as in \U00011005. In
fact, in C/C++ it is *prohibited* to use surrogate code points as part
of a string literal. (A good idea which eliminates the possibility of
unpaired surrogates altogether.)
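
To make the manual decomposition concrete, here is a minimal Python 3 sketch of what a program emitting JSON would have to do if only the four-digit \u form is available. The helper names to_surrogate_pair and json_escape are hypothetical, made up for illustration; the arithmetic is the standard UTF-16 surrogate encoding.

```python
def to_surrogate_pair(codepoint):
    """Split a post-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("not a post-BMP code point")
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def json_escape(codepoint):
    """Hypothetical helper: render the pair as two four-digit \\u escapes."""
    high, low = to_surrogate_pair(codepoint)
    return "\\u{:04x}\\u{:04x}".format(high, low)


# Python 3 accepts the eight-digit \U form in string literals.
brahmi_a = "\U00011005"  # BRAHMI LETTER A
print(json_escape(ord(brahmi_a)))
```

With a \Uxxxxxxxx notation in JSON, none of this bookkeeping would be needed on the writing side.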
As a researcher interested in ancient scripts of South India I have to
handle these SMP codepoints often, even entire texts in such scripts.
Can JSON not support the \Uxxxxxxxx notation?
http://en.wikipedia.org/wiki/JSON says: "The default character
encoding for JSON is UTF8; it also supports UTF16 and UTF32." but I'm
not sure about it because it is not mentioned explicitly on the
json.org page and it is also not very clear to me as to what exactly
that statement means. Does it mean that even though there is no \U
notation, I can directly input post-BMP codepoints as part of the
string literals? The json.org page does say "any-Unicode-character".
In this case even the \u notation is only there as a just-in-case?
(Even if so, why not \U too just-in-case?)
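
One way to probe this question is a quick experiment with Python 3's standard json module. This is only a sketch, and it assumes that module is representative; other JSON libraries may behave differently. It writes the same string once with escapes and once with the raw character, then checks that both forms parse back to the original.

```python
import json

brahmi_a = "\U00011005"  # BRAHMI LETTER A, a post-BMP code point

# Default output is ASCII-only: the character is written as \u escapes.
escaped = json.dumps(brahmi_a)

# With ensure_ascii=False the character is written into the string as-is.
raw = json.dumps(brahmi_a, ensure_ascii=False)

# Both forms decode back to the same one-character string.
assert json.loads(escaped) == brahmi_a
assert json.loads(raw) == brahmi_a

# The raw form can be stored as UTF-8 bytes and read back unchanged.
utf8_bytes = raw.encode("utf-8")
assert json.loads(utf8_bytes.decode("utf-8")) == brahmi_a

print(escaped)
print(len(utf8_bytes))
```

If other parsers accept the raw form the same way, the \u notation would indeed be needed only when the output has to stay within ASCII.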
TIA for your kind explanations and comments,