
RE: [json] Re: Escaping unicode characters

  • Roland H. Alden
    Message 1 of 9, Aug 11, 2005
      > As long as we're on this topic, what about Unicode characters that don't
      > fit into 16 bits -- the so-called "supplementary characters"?

      I think you mean "surrogate" characters; characters that require two 16
      bit code points appearing "sequentially"?

      If JSON isn't processing text but just carrying a Unicode payload for
      some higher layer software to deal with, then one would not think this
      bump in the otherwise smooth Unicode road would be important. In
      expressing string literals it should not be necessary for software at
      the layer of JSON to understand glyphic rendering issues.
    • Mark Miller
      Message 2 of 9, Aug 11, 2005
        Roland H. Alden wrote:
        >> fit into 16 bits -- the so-called "supplementary characters"?
        >
        > I think you mean "surrogate" characters; characters that require two 16
        > bit code points appearing "sequentially"?

        By the terminology of the Unicode Glossary, <http://www.unicode.org/glossary>,
        I do indeed mean "supplementary characters". There are no "surrogate
        characters", only "surrogate code points". The characters whose code points fit
        in 16 bits are the "BMP characters". The others are the "supplementary
        characters". To encode Unicode characters in UTF-16, i.e., in 16-bit "code
        units", all the BMP characters encode as their code points. All the
        supplementary characters encode as a pair of surrogate code units. This is
        possible because there are no characters whose code points are the surrogate
        code points.
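
        For concreteness, here is a minimal sketch (in present-day Python 3) of the
        surrogate-pair arithmetic just described, using U+1D11E (MUSICAL SYMBOL G CLEF)
        as an arbitrary example of a supplementary character:

          cp = 0x1D11E                      # code point above 0xFFFF, i.e. outside the BMP
          v = cp - 0x10000                  # 20-bit value to split across two code units
          high = 0xD800 + (v >> 10)         # high (lead) surrogate code unit
          low = 0xDC00 + (v & 0x3FF)        # low (trail) surrogate code unit
          print(hex(high), hex(low))        # 0xd834 0xdd1e

          # Cross-check against Python's own UTF-16 encoder (little-endian, no BOM).
          import struct
          assert struct.pack('<HH', high, low) == chr(cp).encode('utf-16-le')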


        > If JSON isn't processing text but just carrying a Unicode payload for
        > some higher layer software to deal with, then one would not think this
        > bump in the otherwise smooth Unicode road would be important. In
        > expressing string literals it should not be necessary for software at
        > the layer of JSON to understand glyphic rendering issues.

        Unicode has at least three distinct layers of abstraction: code points / code
        units, characters, and glyphs. Multiple characters may indeed combine to form
        a glyph. This is completely distinct from the combination of multiple UTF-16
        code units to form a character.
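
        As a small illustration of the character/glyph layer (a Python 3 sketch; the
        particular characters are arbitrary):

          import unicodedata

          combined = "e\u0301"                      # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
          print(len(combined))                      # 2 characters, usually rendered as one glyph
          precomposed = unicodedata.normalize("NFC", combined)
          print(len(precomposed), precomposed)      # 1 é  -- NFC folds the pair into U+00E9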

        The historical problem here is that the Unicode standard used to say that all
        characters had 16-bit code points. Unicode early adopters, like Java, got
        screwed by believing them. Java 1.5 made the painful reconciliation with
        modern Unicode of saying that a Java "char" remains 16 bits but is not a
        character -- it is a UTF-16 code unit.
        http://java.sun.com/j2se/1.5.0/docs/guide/intl/overview.html#textrep
        Java, Javascript
        http://www-306.ibm.com/software/globalization/topics/javascript/processing.jsp
        and w3c DOM
        http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html#ID-C74D1578
        define string indexing in terms of counting UTF-16 code units.

        OTOH, Python
        http://www.python.org/peps/pep-0263.html
        http://www.python.org/peps/pep-0261.html
        and w3c XPath
        http://www.w3.org/TR/xpath#strings
        define string indexing in terms of counting Unicode characters. The IBM ICU
        library
        http://oss.software.ibm.com/icu4j/doc/com/ibm/icu/text/UTF16.html#findCodePointOffset(java.lang.String,%20int)
        supports both.
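
        A short Python 3 sketch of the two counting conventions; the second figure
        stands in for what Java, Javascript, and the DOM would report:

          s = "a\U0001D11Eb"                        # 'a', MUSICAL SYMBOL G CLEF, 'b'
          print(len(s))                             # 3 -- characters (Python, XPath)
          print(len(s.encode("utf-16-le")) // 2)    # 4 -- UTF-16 code units (Java, Javascript, DOM)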

        This is an unholy mess. I suggest doing what E does
        http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
        which is to state that, until a clear consensus emerges, JSON only supports BMP
        characters. JSON is supposed to be a subset of both Javascript and Python, and
        only the BMP characters are treated alike by both.

        Full disclosure: Since E does take the "only BMP" stance for now, if JSON
        takes this stance, E can continue to claim support for JSON. If JSON somehow
        allows supplementary characters, then E will no longer be able to make this
        claim. But then neither would one of Python or Javascript.

        --
        Text by me above is hereby placed in the public domain

        Cheers,
        --MarkM
      • Douglas Crockford
        Message 3 of 9, Aug 11, 2005
          > I suggest doing what E does
          > http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
          > which is to state that, until a clear consensus emerges, JSON only supports BMP
          > characters. JSON is supposed to be a subset of both Javascript and Python, and
          > only the BMP characters are treated alike by both.
          >
          > Full disclosure: Since E does take the "only BMP" stance for now, if JSON
          > takes this stance, E can continue to claim support for JSON. If JSON somehow
          > allows supplementary characters, then E will no longer be able to make this
          > claim. But then neither would one of Python or Javascript.

          I don't think that JSON needs to care. It comes down to two questions:

          (A) How does a sender encode a supplementary character in UTF-16?

          (B) What does a receiver that is only able to handle BMP do with
          supplementary characters?

          The answer to (A) is obvious: use the two character surrogate
          encoding.

          I think the answer to (B) is the same.

          There are many languages, such as Java and JavaScript and E, that are
          unable to strictly do the right thing, but for now, the surrogate hack
          is the state of the art.

          JSON's interest is to get the data from here to there without
          distortion. JSON should be able to pass all of the Unicode characters,
          including the extended characters. If a receiver chooses to filter
          them out or replace them with surrogate pairs, that's its business.
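
          As an illustration of that answer in practice, a minimal sketch using Python 3's
          standard json module (U+1D11E is an arbitrary supplementary character):

            import json

            # On decode, the two \u escapes of a surrogate pair are recombined into one
            # character; on encode (default ensure_ascii), the pair is emitted again.
            assert json.loads('"\\ud834\\udd1e"') == "\U0001D11E"
            assert json.dumps("\U0001D11E") == '"\\ud834\\udd1e"'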
        • Mark Miller
          Message 4 of 9, Aug 11, 2005
            Douglas Crockford wrote:
            > I don't think that JSON needs to care. It comes down to two questions:
            >
            > (A) How does a sender encode a supplementary character in UTF-16?

            > The answer to (A) is obvious: use the two character surrogate
            > encoding.

            The surrogate encoding uses two UTF-16 surrogate *code units* to encode a
            single character. These 16-bit surrogates are *not* characters.

            Your answer to (A), once rephrased, is indeed the correct answer to the (A)
            question. But this is only relevant to JSON if you define a JSON string as
            Java or Javascript does: as a \u encoding of a sequence of UTF-16 code units.
            If a JSON string is a \u encoding of a sequence of characters, as it is in
            Python, then the UTF-16 question is not relevant. But the existing JSON spec
            provides no way to do an Ascii encoding of the supplementary characters (such
            as Python's \U encoding).
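
            A small sketch of the distinction in present-day Python 3 (whose strings are
            sequences of code points, so unpaired surrogates stay unpaired):

              s_char = "\U0001D11E"            # one supplementary character, via Python's \U escape
              s_pair = "\ud834\udd1e"          # two unpaired surrogate code points, via \u escapes
              print(len(s_char), len(s_pair))  # 1 2
              print(s_char == s_pair)          # False -- Python does not rejoin the surrogates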


            > (B) What does a receiver that is only able to handle BMP do with
            > supplementary characters?
            >
            > I think the answer to (B) is the same.
            >
            > There are many languages, such as Java and JavaScript and E, that are
            > unable to strictly do the right thing, but for now, the surrogate hack
            > is the state of the art.
            >
            > JSON's interest is to get the data from here to there without
            > distortion. JSON should be able to pass all of the Unicode characters,
            > including the extended characters. If a receiver chooses to filter
            > them out or replace them with surrogate pairs, that's its business.

            To live up to this fine goal, JSON needs to define an Ascii encoding of the
            supplementary characters. From (A), perhaps your intended answer is: Use the
            \u encoding of the UTF-16 code point encoding of the supplementary characters.
            This is Java's answer. AFAIK, it would be compatible with Javascript but not
            with Python. This would be an adequate answer -- Java and Javascript both live
            with it. I think the important thing is to make a definite choice.

            --
            Text by me above is hereby placed in the public domain

            Cheers,
            --MarkM
          • Douglas Crockford
            Message 5 of 9, Aug 11, 2005
              > > JSON's interest is to get the data from here to there without
              > > distortion. JSON should be able to pass all of the Unicode characters,
              > > including the extended characters. If a receiver chooses to filter
              > > them out or replace them with surrogate pairs, that's its business.

              > To live up to this fine goal, JSON needs to define an Ascii encoding of the
              > supplementary characters. From (A), perhaps your intended answer is: Use the
              > \u encoding of the UTF-16 code point encoding of the supplementary characters.
              > This is Java's answer. AFAIK, it would be compatible with Javascript but not
              > with Python. This would be an adequate answer -- Java and Javascript both live
              > with it. I think the important thing is to make a definite choice.

              That is the answer.
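
              A minimal sketch of what a receiver does under that answer: rejoin the high
              and low surrogate values read from two \u escapes into the one supplementary
              character they encode (the helper name here is just for illustration):

                def join_surrogates(high: int, low: int) -> str:
                    # expects high in 0xD800..0xDBFF and low in 0xDC00..0xDFFF
                    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
                    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

                print(hex(ord(join_surrogates(0xD834, 0xDD1E))))   # 0x1d11e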
            • jemptymethod
              Message 6 of 9, Aug 12, 2005
                --- In json@yahoogroups.com, Mark Miller <markm@c...> wrote:
                > JSON is supposed to be a subset of both Javascript and Python

                I don't know from Python, but it seems to me JSON has drifted
                significantly from Javascript. These discussions are all well and
                good, but if it means that the JSON spec is modified to the point
                where it supports constructs that cannot be interpreted by Javascript,
                then it will cease being a subset thereof. Instead, JSON will become
                an entity unto itself, rather than "Javascript Object Notation".