1940 JSON strings cannot point to post-BMP Unicode codepoints?

  • Shriramana Sharma
    Apr 7, 2013
      Hello. I am thinking of using JSON for storing the data produced by my program.

      While reading through the website (json.org), I noticed that the
      specification of strings says that \u followed by *four* hex
      digits can be used to represent Unicode codepoints. However, if that
      restriction of *four* hex digits is meant to be enforced, then it
      means that post-BMP codepoints (such as 0x11005 BRAHMI LETTER A)
      cannot be represented in such strings directly, but that they have to
      be manually (i.e. by the program outputting JSON-ed data) decomposed
      into their equivalent UTF-16 surrogate pairs (for instance, 0xd804
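
The decomposition described above can be sketched in a few lines. This is only an illustration, assuming Python 3; the function names `to_surrogate_pair` and `json_escape` are hypothetical and not part of any JSON library.

```python
from typing import Tuple


def to_surrogate_pair(codepoint: int) -> Tuple[int, int]:
    """Split a post-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only code points above the BMP have a surrogate pair")
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def json_escape(codepoint: int) -> str:
    """Build the pair of \\uXXXX escapes for one post-BMP code point."""
    high, low = to_surrogate_pair(codepoint)
    return "\\u{:04x}\\u{:04x}".format(high, low)


# U+11005 BRAHMI LETTER A
print([hex(unit) for unit in to_surrogate_pair(0x11005)])  # ['0xd804', '0xdc05']
print(json_escape(0x11005))                                # \ud804\udc05
```

For 0x11005 the offset is 0x1005, so the high surrogate is 0xD800 + 4 and the low surrogate is 0xDC00 + 5.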

      IMHO this is an unnecessary restriction. Modern languages (for
      instance Python 3, C11, C++11) allow post-BMP codepoints to be
      represented in string literals, using a capital U as in \U00011005. In
      fact, in C/C++ it is *prohibited* to name a surrogate code point with
      \u or \U in a string literal. (A good idea, which eliminates the
      possibility of unpaired surrogates altogether.)
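
To make the \U notation concrete, here is a small sketch of how it behaves in Python 3 string literals. It assumes CPython 3.3 or later, where a string is a sequence of code points. Note that Python, unlike C/C++, does accept a lone surrogate in a literal; it only fails when the string is encoded.

```python
# \U takes exactly eight hex digits and yields a single code point.
brahmi_a = "\U00011005"
print(len(brahmi_a))         # 1
print(hex(ord(brahmi_a)))    # 0x11005

# A lone surrogate is accepted in the literal itself...
lone = "\ud804"

# ...but it cannot be encoded as UTF-8, so it never reaches a file intact.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)             # False
```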

      As a researcher interested in ancient scripts of South India I have to
      handle these SMP codepoints often, even entire texts in such scripts.
      Can JSON not support the \Uxxxxxxxx notation?

      http://en.wikipedia.org/wiki/JSON says: "The default character
      encoding for JSON is UTF8; it also supports UTF16 and UTF32." but I'm
      not sure about it because it is not mentioned explicitly on the
      json.org page, and it is also not clear to me what exactly that
      statement means. Does it mean that even though there is no \U
      notation, I can directly input post-BMP codepoints as part of the
      string literals? The json.org page does say "any-Unicode-character".
      In that case, is even the \u notation only there just in case?
      (Even if so, why not \U too, just in case?)
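
One way to check both possibilities is to see what an existing JSON implementation does. The sketch below uses Python 3's standard `json` module and its `ensure_ascii` parameter; it shows that library's behaviour only and is not a statement about what json.org requires.

```python
import json

brahmi_a = "\U00011005"  # BRAHMI LETTER A, outside the BMP

# Default (ensure_ascii=True): the character is written as two \u escapes.
escaped = json.dumps(brahmi_a)
print(escaped)  # "\ud804\udc05"

# ensure_ascii=False: the character itself goes into the JSON text and is
# stored as four bytes once the text is encoded as UTF-8.
literal = json.dumps(brahmi_a, ensure_ascii=False)
print(len(literal.encode("utf-8")))  # 6 (two quotes plus four bytes)

# Both forms decode back to the same single code point.
print(json.loads(escaped) == json.loads(literal) == brahmi_a)  # True
```

In this implementation, at least, the direct form and the escaped surrogate-pair form are interchangeable on input.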

      TIA for your kind explanations and comments,

      Shriramana Sharma