
Re: [json] Re: Escaping unicode characters

Expand Messages
    Message 1 of 9, Aug 11, 2005
      Douglas Crockford wrote:
      > The \uXXXX notation must be used for control characters that do not
      > have a shorter escape convention. This is the only case where \uXXXX
      > notation is required.
      >
      > The \uXXXX notation may be used for any character.


      As long as we're on this topic, what about Unicode characters that don't fit
      into 16 bits -- the so-called "supplementary characters"?
      http://www.unicode.org/glossary/#supplementary_character

      Of the languages that JSON is intended to be a subset of, what do they do for
      supplementary characters?

      --
      Text by me above is hereby placed in the public domain

      Cheers,
      --MarkM
    • Roland H. Alden
      Message 2 of 9, Aug 11, 2005
        > As long as we're on this topic, what about Unicode characters that
        > don't fit into 16 bits -- the so-called "supplementary characters"?

        I think you mean "surrogate" characters; characters that require two 16
        bit code points appearing "sequentially"?

        If JSON isn't processing text but just carrying a Unicode payload for
        some higher layer software to deal with, then one would not think this
        bump in the otherwise smooth Unicode road would be important. In
        expressing string literals it should not be necessary for software at
        the layer of JSON to understand glyphic rendering issues.
      • Mark Miller
        Message 3 of 9, Aug 11, 2005
          Roland H. Alden wrote:
          >>fit into 16 bits -- the so-called "supplementary characters"?
          >
          > I think you mean "surrogate" characters; characters that require two 16
          > bit code points appearing "sequentially"?

          By the terminology of the Unicode Glossary, <http://www.unicode.org/glossary>,
          I do indeed mean "supplementary characters". There are no "surrogate
          characters" only "surrogate code points". The characters whose code points fit
          in 16 bits are the "BMP characters". The others are the "supplementary
          characters". To encode Unicode characters in UTF-16, i.e., in 16-bit "code
          units", all the BMP characters encode as their code points. All the
          supplementary characters encode as a pair of surrogate code units. This is
          possible because there are no characters whose code points are the surrogate
          code points.
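
          A minimal sketch of that mapping, in modern Python 3 (my
          illustration, not part of the original message); U+1D11E, MUSICAL
          SYMBOL G CLEF, is just a convenient supplementary character to use
          as an example:

            def to_surrogate_pair(code_point):
                """Split a supplementary code point into (high, low)
                UTF-16 surrogate code units."""
                assert 0x10000 <= code_point <= 0x10FFFF
                offset = code_point - 0x10000      # 20-bit value
                high = 0xD800 + (offset >> 10)     # top 10 bits
                low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
                return high, low

            print([hex(u) for u in to_surrogate_pair(0x1D11E)])
            # ['0xd834', '0xdd1e']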


          > If JSON isn't processing text but just carrying a Unicode payload for
          > some higher layer software to deal with, then one would not think this
          > bump in the otherwise smooth Unicode road would be important. In
          > expressing string literals it should not be necessary for software at
          > the layer of JSON to understand glyphic rendering issues.

          Unicode has at least three distinct layers of abstraction: code points / code
          units, characters, and glyphs. Multiple characters may indeed combine to form
          a glyph. This is completely distinct from the combination of multiple UTF-16
          code units to form a character.

          The historical problem here is that the Unicode standard used to say that all
          characters had 16-bit code points. Unicode early adopters, like Java, got
          screwed by believing them. Java 1.5 made the painful reconciliation with
          modern Unicode by saying that a Java "char" remains 16 bits but is not a
          character -- it is a UTF-16 code unit.
          http://java.sun.com/j2se/1.5.0/docs/guide/intl/overview.html#textrep
          Java, Javascript
          http://www-306.ibm.com/software/globalization/topics/javascript/processing.jsp
          and w3c DOM
          http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html#ID-C74D1578
          define string indexing in terms of counting UTF-16 code units.

          OTOH, Python
          http://www.python.org/peps/pep-0263.html
          http://www.python.org/peps/pep-0261.html
          and w3c XPath
          http://www.w3.org/TR/xpath#strings
          define string indexing in terms of counting Unicode characters. The IBM ICU
          library
          http://oss.software.ibm.com/icu4j/doc/com/ibm/icu/text/UTF16.html#findCodePointOffset(java.lang.String,%20int)
          supports both.
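
          The practical difference between the two indexing models shows up
          with a single supplementary character. A sketch of mine in modern
          Python 3 (which indexes by character; the UTF-16 view is simulated
          by encoding), not taken from the original message:

            s = "a\U0001D11Eb"                      # 'a', G clef, 'b'

            # Character-oriented indexing (Python, XPath):
            print(len(s))                            # 3
            print(hex(ord(s[1])))                    # 0x1d11e

            # Code-unit-oriented indexing (Java, Javascript, DOM),
            # simulated by counting 16-bit units in the UTF-16 encoding:
            print(len(s.encode("utf-16-le")) // 2)   # 4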

          This is an unholy mess. I suggest doing what E does
          http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
          which is state that, until a clear consensus emerges, JSON only supports BMP
          characters. JSON is supposed to be a subset of both Javascript and Python, and
          only the BMP characters are treated alike by both.

          Full disclosure: Since E does take the "only BMP" stance for now, if JSON
          takes this stance, E can continue to claim support for JSON. If JSON somehow
          allows supplementary characters, then E will no longer be able to make this
          claim. But then neither would one of Python or Javascript.

          --
          Text by me above is hereby placed in the public domain

          Cheers,
          --MarkM
        • Douglas Crockford
          Message 4 of 9, Aug 11, 2005
            > I suggest doing what E does
            > http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
            > which is state that, until a clear consensus emerges, JSON only
            > supports BMP characters. JSON is supposed to be a subset of both
            > Javascript and Python, and only the BMP characters are treated
            > alike by both.
            >
            > Full disclosure: Since E does take the "only BMP" stance for now,
            > if JSON takes this stance, E can continue to claim support for
            > JSON. If JSON somehow allows supplementary characters, then E
            > will no longer be able to make this claim. But then neither would
            > one of Python or Javascript.

            I don't think that JSON needs to care. It comes down to two questions:

            (A) How does a sender encode a supplementary character in UTF-16?

            (B) What does a receiver that is only able to handle BMP do with
            supplementary characters?

            The answer to (A) is obvious: use the two character surrogate
            encoding.

            I think the answer to (B) is the same.

            There are many languages, such as Java and JavaScript and E, that are
            unable to strictly do the right thing, but for now, the surrogate hack
            is the state of the art.

            JSON's interest is to get the data from here to there without
            distortion. JSON should be able to pass all of the Unicode characters,
            including the extended characters. If a receiver chooses to filter
            them out or replace them with surrogate pairs, that's its business.
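
            For what it's worth, Python's json module -- which postdates this
            2005 thread and is not cited in it -- behaves exactly this way:
            supplementary characters pass through, and the ASCII-escaped form
            is a surrogate pair. A small illustrative sketch:

              import json

              clef = "\U0001D11E"          # one supplementary character
              escaped = json.dumps(clef)
              print(escaped)               # "\ud834\udd1e"
              print(json.loads(escaped) == clef)   # True: no distortion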
          • Mark Miller
            Message 5 of 9, Aug 11, 2005
              Douglas Crockford wrote:
              > I don't think that JSON needs to care. It comes down to two questions:
              >
              > (A) How does a sender encode a supplementary character in UTF-16?

              > The answer to (A) is obvious: use the two character surrogate
              > encoding.

              The surrogate encoding uses two UTF-16 surrogate *code units* to encode a
              single character. These 16-bit surrogates are *not* characters.

              Your answer to (A), once rephrased, is indeed the correct answer to the (A)
              question. But this is only relevant to JSON if you define a JSON string as
              Java or Javascript does: as a \u encoding of a sequence of UTF-16 code units.
              If a JSON string is a \u encoding of a sequence of characters, as it is in
              Python, then the UTF-16 question is not relevant. But the existing JSON spec
              provides no way to do an Ascii encoding of the supplementary characters (such
              as Python's \U encoding).
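
              The incompatibility is easy to see in modern Python 3 (a sketch
              of my own, not from the message; the 2005-era Python 2 narrow
              builds behaved differently): the \U escape names one character,
              while a Javascript-style surrogate-pair escape names two lone
              surrogates that Python does not rejoin.

                supplementary = "\U0001D11E"   # Python's \U escape: 1 character
                surrogates = "\ud834\udd1e"    # \u surrogate pair: 2 code points
                print(len(supplementary))      # 1
                print(len(surrogates))         # 2
                print(supplementary == surrogates)   # False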


              > (B) What does a receiver that is only able to handle BMP do with
              > supplementary characters?
              >
              > I think the answer to (B) is the same.
              >
              > There are many languages, such as Java and JavaScript and E, that are
              > unable to strictly do the right thing, but for now, the surrogate hack
              > is the state of the art.
              >
              > JSON's interest is to get the data from here to there without
              > distortion. JSON should be able to pass all of the Unicode characters,
              > including the extended characters. If a receiver chooses to filter
              > them out or replace them with surrogate pairs, that's its business.

              To live up to this fine goal, JSON needs to define an Ascii encoding of the
              supplementary characters. From (A), perhaps your intended answer is: Use the
              \u encoding of the UTF-16 code point encoding of the supplementary characters.
              This is Java's answer. AFAIK, it would be compatible with Javascript but not
              with Python. This would be an adequate answer -- Java and Javascript both live
              with it. I think the important thing is to make a definite choice.
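
              For completeness, here is the decoding side of that choice, as a
              hedged sketch (the function name and constants are illustrative,
              not from any spec text): a reader that meets two \uXXXX escapes
              forming a surrogate pair recombines them into one supplementary
              code point.

                def from_surrogate_pair(high, low):
                    """Combine UTF-16 high and low surrogates into a code point."""
                    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
                    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

                print(hex(from_surrogate_pair(0xD834, 0xDD1E)))   # 0x1d11e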

              --
              Text by me above is hereby placed in the public domain

              Cheers,
              --MarkM
            • Douglas Crockford
              Message 6 of 9, Aug 11, 2005
                > > JSON's interest is to get the data from here to there without
                > > distortion. JSON should be able to pass all of the Unicode characters,
                > > including the extended characters. If a receiver chooses to filter
                > > them out or replace them with surrogate pairs, that's its business.

                > To live up to this fine goal, JSON needs to define an Ascii
                > encoding of the supplementary characters. From (A), perhaps
                > your intended answer is: Use the \u encoding of the UTF-16
                > code point encoding of the supplementary characters. This is
                > Java's answer. AFAIK, it would be compatible with Javascript
                > but not with Python. This would be an adequate answer -- Java
                > and Javascript both live with it. I think the important thing
                > is to make a definite choice.

                That is the answer.
              • jemptymethod
                Message 7 of 9, Aug 12, 2005
                  --- In json@yahoogroups.com, Mark Miller <markm@c...> wrote:
                  >JSON is supposed to be a subset of both Javascript and Python

                  I don't know from Python, but it seems to me JSON has drifted
                  significantly from Javascript. These discussions are all well and
                  good, but if it means that the JSON spec is modified to the point
                  where it supports constructs that cannot be interpreted by Javascript,
                  then it will cease being a subset thereof. Instead, JSON will become
                  an entity unto itself, rather than "Javascript Object Notation".