
Unicode

  • Douglas Crockford
    Message 1 of 1 , Jan 17, 2006
      Unicode defines a set of character code points between U+0000 and U+10FFFF, organized into 17 planes of 64K each.
      Plane  Abbr.  Name
      0      BMP    Basic Multilingual Plane
      1      SMP    Supplementary Multilingual Plane
      2      SIP    Supplementary Ideographic Plane
      14     SSP    Supplementary Special-purpose Plane
      15            Private Use Plane
      16            Private Use Plane
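
Because each plane holds 64K (0x10000) code points, the plane number is simply the code point shifted right by 16 bits. A minimal Python sketch of that arithmetic follows; the function name `plane_of` is made up for illustration.

```python
def plane_of(code_point):
    """Return the Unicode plane (0..16) that contains a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    # Each plane spans 0x10000 code points, so the plane is the top bits.
    return code_point >> 16


print(plane_of(ord("A")))   # 0  (BMP)
print(plane_of(0x1D11E))    # 1  (SMP, MUSICAL SYMBOL G CLEF)
print(plane_of(0x20000))    # 2  (SIP)
print(plane_of(0x10FFFF))   # 16 (second Private Use Plane)
```
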
      There are 3 encoding schemes. Each is able to represent any sequence of characters. They differ in their relative efficiency and convenience.
      Unit type  Encoding  Units per character
      byte       UTF-8     1-4
      short      UTF-16    1-2
      int        UTF-32    1
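
The unit counts can be checked with Python's built-in codecs. This is only a sketch: `unit_counts` is a hypothetical helper, and the sample characters are my own choices, not taken from the message.

```python
def unit_counts(ch):
    """Return (UTF-8 bytes, UTF-16 shorts, UTF-32 ints) for one character."""
    return (
        len(ch.encode("utf-8")),            # 1 byte per unit
        len(ch.encode("utf-16-be")) // 2,   # 2 bytes per short, no BOM
        len(ch.encode("utf-32-be")) // 4,   # 4 bytes per int, no BOM
    )


for ch in ["A", "\u00e9", "\u4e2d", "\U0001D11E"]:
    print("U+%04X" % ord(ch), unit_counts(ch))
# U+0041 (1, 1, 1)
# U+00E9 (2, 1, 1)
# U+4E2D (3, 1, 1)
# U+1D11E (4, 2, 1)
```
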
      When Java and JavaScript were designed, Unicode was only going to have a single plane, which would allow all characters to be represented in 16 bits. Then Unicode grew. Much of the complexity of Unicode now is in supporting the extended characters that are not in the Basic Multilingual Plane.

      In UTF-32, support for the extended characters is trivial. All characters can be represented in 32 bits.

      In UTF-8, support for extended characters is easy. An extended character is represented as a 4-byte sequence. (Most Asian characters are represented as a 3-byte sequence. Most European characters outside ASCII are represented as a 2-byte sequence. ASCII is represented as a single byte.)

      In UTF-16, which is the internal encoding used in Java and JavaScript, an extended character is represented as a surrogate pair.
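
The message does not spell out how a surrogate pair is formed. The sketch below uses the standard UTF-16 rule: subtract 0x10000 from the code point, then put the top 10 bits into a high surrogate (0xD800..0xDBFF) and the bottom 10 bits into a low surrogate (0xDC00..0xDFFF). The function names are illustrative only.

```python
def to_surrogate_pair(code_point):
    """Split an extended code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # fits in 20 bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)
print("%04X %04X" % (high, low))               # D834 DD1E
print("%X" % from_surrogate_pair(high, low))   # 1D11E
```
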

      When converting UTF-8 to UTF-16, an extended character must be converted from a 4-byte sequence to a surrogate pair of two shorts. When converting UTF-16 to UTF-8, a surrogate pair of two shorts must be converted into a 4-byte sequence.
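
As a rough illustration of that round trip, the sketch below leans on Python's built-in codecs rather than hand-written bit twiddling, and uses `struct` to expose the UTF-16 data as 16-bit units. Both helper names are hypothetical, and the big-endian byte order is an arbitrary choice for the example.

```python
import struct


def utf8_to_utf16_units(data):
    """Decode UTF-8 bytes and return the text as a list of UTF-16 shorts."""
    raw = data.decode("utf-8").encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


def utf16_units_to_utf8(units):
    """Encode a list of UTF-16 shorts as UTF-8 bytes."""
    raw = struct.pack(">%dH" % len(units), *units)
    return raw.decode("utf-16-be").encode("utf-8")


# U+1D11E is 4 bytes in UTF-8 and one surrogate pair in UTF-16.
units = utf8_to_utf16_units(b"\xf0\x9d\x84\x9e")
print(["%04X" % u for u in units])          # ['D834', 'DD1E']
print(utf16_units_to_utf8(units).hex())     # f09d849e
```
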

      JSON, because it is based on JavaScript, uses \uXXXX as an alternate representation for characters in strings. The four hex digits constrain each escape to a single UTF-16 unit: all characters between U+0000 and U+FFFF may be represented by one \uXXXX unit. Extended characters may be represented as a \uXXXX\uXXXX surrogate pair.
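
Python's standard `json` module shows this behavior directly. By default `json.dumps` escapes every non-ASCII character, so a BMP character becomes one \uXXXX unit and an extended character becomes a surrogate pair; `json.loads` joins the pair back into one character. The sample strings are my own.

```python
import json

# A BMP character needs a single \uXXXX escape.
bmp_escaped = json.dumps("\u00e9")
print(bmp_escaped)                  # "\u00e9"

# An extended character (U+1D11E) is written as a surrogate pair.
escaped = json.dumps("\U0001D11E")
print(escaped)                      # "\ud834\udd1e"

# Parsing the pair yields a single character again.
decoded = json.loads(escaped)
print(len(decoded), "%X" % ord(decoded))   # 1 1D11E
```
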


