
Encoding JSON in UTF-16 or UTF-32

  • Paul J. Lucas
    Message 1 of 2 , Jul 24 8:31 PM
      The JSON RFC, section 2.5, says in part:

      > To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

      Assume I have a valid reason to encode JSON as UTF-16BE (which is allowed). When doing so, is it still necessary to escape characters that are not in the Basic Multilingual Plane? E.g., instead of this:

      00 5C 00 75 00 44 00 38 00 33 00 34 00 5C 00 75 00 44 00 44 00 31 00 45
         \     u     D     8     3     4     \     u     D     D     1     E

      which is the 24-byte UTF-16BE byte sequence for \uD834\uDD1E, is it legal to do this:

      D8 34 DD 1E

      i.e., use the 4-byte UTF-16BE values directly?

      Similarly, if I were to encode the same JSON string as UTF-32BE, could I simply use the code-point value directly:

      00 01 D1 1E

      ?

      - Paul
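The byte sequences in the question can be reproduced with a short Python sketch. This is only an illustration; the variable names are made up for the example, and Python 3.8 or later is assumed for `bytes.hex` with a separator.

```python
# G clef, U+1D11E, a character outside the Basic Multilingual Plane
clef = "\U0001D11E"

# The twelve-character escape sequence from the RFC, as literal text
escaped_text = r"\uD834\uDD1E"

# Encode the escaped text and the raw character in the encodings discussed
escaped_utf16be = escaped_text.encode("utf-16-be")
direct_utf16be = clef.encode("utf-16-be")
direct_utf32be = clef.encode("utf-32-be")

# Print each byte sequence as space-separated uppercase hex
for label, data in [
    ("escaped, UTF-16BE", escaped_utf16be),
    ("direct, UTF-16BE", direct_utf16be),
    ("direct, UTF-32BE", direct_utf32be),
]:
    print(f"{label}: {len(data)} bytes: {data.hex(' ').upper()}")
```

The escaped form takes 24 bytes, while the direct forms take 4 bytes each.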
    • Petri Lehtinen
      Message 2 of 2 , Jul 25 9:23 PM
        Paul J. Lucas wrote:
        > Assume I have a valid reason to encode JSON as UTF-16BE (which is
        > allowed). When doing so, is it still necessary to escape characters
        > that are not in the Basic Multilingual Plane? E.g., instead of this:
        >
        > 00 5C 00 75 00 44 00 38 00 33 00 34 00 5C 00 75 00 44 00 44 00 31 00 45
        >    \     u     D     8     3     4     \     u     D     D     1     E
        >
        > which is the 24-byte UTF-16BE byte sequence for \uD834\uDD1E, is it
        > legal to do this:
        >
        > D8 34 DD 1E
        >
        > i.e., use the 4-byte UTF-16BE values directly?
        >
        > Similarly, if I were to encode the same JSON string as UTF-32BE, could I simply use the code-point value directly:
        >
        > 00 01 D1 1E
        >
        > ?

        Yes you can. Regardless of encoding, you only have to escape ", \ and
        the control characters U+0000 to U+001F, and any other character can
        be represented directly.

        Petri
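As a rough check of the answer above, the following sketch uses Python's standard `json` module (one implementation among many, chosen here only for illustration) to show that the direct and escaped forms decode to the same string when the JSON text is encoded as UTF-16BE. The variable names are hypothetical.

```python
import json

# G clef, U+1D11E, outside the Basic Multilingual Plane
clef = "\U0001D11E"

# JSON text with the character written directly, encoded as UTF-16BE
direct_doc = json.dumps(clef, ensure_ascii=False).encode("utf-16-be")

# JSON text with the character escaped as a surrogate pair
# (json.dumps escapes all non-ASCII characters by default)
escaped_doc = json.dumps(clef).encode("utf-16-be")

# Both documents decode to the same one-character string
decoded_direct = json.loads(direct_doc.decode("utf-16-be"))
decoded_escaped = json.loads(escaped_doc.decode("utf-16-be"))
print(decoded_direct == decoded_escaped == clef)

# The characters that must always be escaped: the quotation mark,
# the backslash, and the control characters U+0000 to U+001F
mandatory = json.dumps('a"b\\c\x01', ensure_ascii=False)
print(mandatory)
```

With `ensure_ascii=False`, the module still escapes the quotation mark, the backslash, and the control character, and leaves everything else as is.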