
Re: Escaping unicode characters

  • Douglas Crockford
    Message 1 of 9, Aug 10, 2005
      > The spec reads...
      >
      > "A string is a collection of zero or more Unicode characters, wrapped
      > in double quotes, using backslash escapes. A character is represented
      > as a single character string."
      >
      > Which implies all strings are always unicode. When should a JSON
      > writer use the \uXXXX notation?
      >
      > For any character not in ascii? That seems prohibitively expensive to
      > create and transmit.

      The \uXXXX notation must be used for control characters that do not
      have a shorter escape convention. This is the only case where \uXXXX
      notation is required.

      The \uXXXX notation may be used for any character.
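
      To illustrate (a minimal sketch; it assumes a modern Python with its
      standard json module, as one example of a conforming encoder):

          import json

          # U+0007 (BEL) has no short escape, so it must be written as \u0007;
          # U+000A (newline) has the shorter escape \n.
          print(json.dumps("bell\x07 newline\n"))
          # prints: "bell\u0007 newline\n"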
    • Mark Miller
      Message 2 of 9, Aug 11, 2005
        Douglas Crockford wrote:
        > The \uXXXX notation must be used for control characters that do not
        > have a shorter escape convention. This is the only case where \uXXXX
        > notation is required.
        >
        > The \uXXXX notation may be used for any character.


        As long as we're on this topic, what about Unicode characters that don't fit
        into 16 bits -- the so-called "supplementary characters"?
        http://www.unicode.org/glossary/#supplementary_character

        Of the languages that JSON is intended to be a subset of, what do they do for
        supplementary characters?

        --
        Text by me above is hereby placed in the public domain

        Cheers,
        --MarkM
      • Roland H. Alden
        Message 3 of 9, Aug 11, 2005
          > As long as we're on this topic, what about Unicode characters that don't
          > fit into 16 bits -- the so-called "supplementary characters"?

          I think you mean "surrogate" characters; characters that require two 16
          bit code points appearing "sequentially"?

          If JSON isn't processing text but just carrying a Unicode payload for
          some higher layer software to deal with, then one would not think this
          bump in the otherwise smooth Unicode road would be important. In
          expressing string literals it should not be necessary for software at
          the layer of JSON to understand glyphic rendering issues.
        • Mark Miller
          Message 4 of 9, Aug 11, 2005
            Roland H. Alden wrote:
            >> fit into 16 bits -- the so-called "supplementary characters"?
            >
            > I think you mean "surrogate" characters; characters that require two 16
            > bit code points appearing "sequentially"?

            By the terminology of the Unicode Glossary, <http://www.unicode.org/glossary>,
            I do indeed mean "supplementary characters". There are no "surrogate
            characters" only "surrogate code points". The characters whose code points fit
            in 16 bits are the "BMP characters". The others are the "supplementary
            characters". To encode Unicode characters in UTF-16, i.e., in 16-bit "code
            units", all the BMP characters encode as their code points. All the
            supplementary characters encode as a pair of surrogate code units. This is
            possible because there are no characters whose code points are the surrogate
            code points.
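
            The arithmetic is mechanical (a rough Python sketch; the function name
            is just for illustration):

                def to_surrogates(cp):
                    # Split a supplementary code point (>= 0x10000) into a
                    # high/low pair of UTF-16 surrogate code units.
                    assert 0x10000 <= cp <= 0x10FFFF
                    cp -= 0x10000
                    high = 0xD800 + (cp >> 10)     # lead surrogate, U+D800..U+DBFF
                    low = 0xDC00 + (cp & 0x3FF)    # trail surrogate, U+DC00..U+DFFF
                    return high, low

                print([hex(u) for u in to_surrogates(0x1D11E)])
                # ['0xd834', '0xdd1e']  -- MUSICAL SYMBOL G CLEF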


            > If JSON isn't processing text but just carrying a Unicode payload for
            > some higher layer software to deal with, then one would not think this
            > bump in the otherwise smooth Unicode road would be important. In
            > expressing string literals it should not be necessary for software at
            > the layer of JSON to understand glyphic rendering issues.

            Unicode has at least three distinct layers of abstraction: code points / code
            units, characters, and glyphs. Multiple characters may indeed combine to form
            a glyph. This is completely distinct from the combination of multiple UTF-16
            code units to form a character.
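
            For instance (a one-line Python sketch; the characters chosen are just
            examples):

                s = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT
                print(len(s))    # 2: two characters, though most renderers draw one glyph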

            The historical problem here is that the Unicode standard used to say that all
            characters had 16-bit code points. Unicode early adopters, like Java, got
            screwed by believing them. Java 1.5 made the painful reconciliation with
            modern Unicode by saying that a Java "char" remains 16 bits but is not a
            character -- it is a UTF-16 code unit.
            http://java.sun.com/j2se/1.5.0/docs/guide/intl/overview.html#textrep
            Java, Javascript
            http://www-306.ibm.com/software/globalization/topics/javascript/processing.jsp
            and w3c DOM
            http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html#ID-C74D1578
            define string indexing in terms of counting UTF-16 code units.

            OTOH, Python
            http://www.python.org/peps/pep-0263.html
            http://www.python.org/peps/pep-0261.html
            and w3c XPath
            http://www.w3.org/TR/xpath#strings
            define string indexing in terms of counting Unicode characters. The IBM ICU
            library
            http://oss.software.ibm.com/icu4j/doc/com/ibm/icu/text/UTF16.html#findCodePointOffset(java.lang.String,%20int)
            supports both.
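
            The difference is easy to see with a supplementary character (a sketch
            assuming a modern Python, which counts characters; the UTF-16 length
            stands in for what Java or Javascript would report):

                s = "\U0001D11E"                        # MUSICAL SYMBOL G CLEF
                print(len(s))                           # 1: Python counts characters
                print(len(s.encode("utf-16-be")) // 2)  # 2: UTF-16 code units, as Java/Javascript count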

            This is an unholy mess. I suggest doing what E does
            http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
            which is state that, until a clear consensus emerges, JSON only supports BMP
            characters. JSON is supposed to be a subset of both Javascript and Python, and
            only the BMP characters are treated alike by both.

            Full disclosure: Since E does take the "only BMP" stance for now, if JSON
            takes this stance, E can continue to claim support for JSON. If JSON somehow
            allows supplementary characters, then E will no longer be able to make this
            claim. But then neither would one of Python or Javascript.

            --
            Text by me above is hereby placed in the public domain

            Cheers,
            --MarkM
          • Douglas Crockford
            Message 5 of 9, Aug 11, 2005
              > I suggest doing what E does
              > http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
              > which is state that, until a clear consensus emerges, JSON only supports BMP
              > characters. JSON is supposed to be a subset of both Javascript and Python, and
              > only the BMP characters are treated alike by both.
              >
              > Full disclosure: Since E does take the "only BMP" stance for now, if JSON
              > takes this stance, E can continue to claim support for JSON. If JSON somehow
              > allows supplementary characters, then E will no longer be able to make this
              > claim. But then neither would one of Python or Javascript.

              I don't think that JSON needs to care. It comes down to two questions:

              (A) How does a sender encode a supplementary character in UTF-16?

              (B) What does a receiver that is only able to handle BMP do with
              supplementary characters?

              The answer to (A) is obvious: use the two character surrogate
              encoding.

              I think the answer to (B) is the same.

              There are many languages, such as Java and JavaScript and E, that are
              unable to strictly do the right thing, but for now, the surrogate hack
              is the state of the art.

              JSON's interest is to get the data from here to there without
              distortion. JSON should be able to pass all of the Unicode characters,
              including the extended characters. If a receiver chooses to filter
              them out or replace them with surrogate pairs, that's its business.
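
              Concretely, the JSON text carries the two \u escapes of the surrogates
              (a sketch; it assumes a modern Python, whose json module happens to
              produce exactly this form):

                  import json
                  # U+1D11E escapes as the \u forms of its two UTF-16 surrogates.
                  print(json.dumps("\U0001D11E"))
                  # prints: "\ud834\udd1e"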
            • Mark Miller
              Message 6 of 9, Aug 11, 2005
                Douglas Crockford wrote:
                > I don't think that JSON needs to care. It comes down to two questions:
                >
                > (A) How does a sender encode a supplementary character in UTF-16?

                > The answer to (A) is obvious: use the two character surrogate
                > encoding.

                The surrogate encoding uses two UTF-16 surrogate *code units* to encode a
                single character. These 16-bit surrogates are *not* characters.

                Your answer to (A), once rephrased, is indeed the correct answer to the (A)
                question. But this is only relevant to JSON if you define a JSON string as
                Java or Javascript does: as a \u encoding of a sequence of UTF-16 code units.
                If a JSON string is a \u encoding of a sequence of characters, as it is in
                Python, then the UTF-16 question is not relevant. But the existing JSON spec
                provides no way to do an ASCII encoding of the supplementary characters (such
                as Python's \U encoding).
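
                For comparison, Python's two literal forms (a sketch; only the
                4-hex-digit form has a counterpart in the JSON grammar):

                    bmp = "\u266b"               # BEAMED EIGHTH NOTES: 4 hex digits, fits the BMP
                    supp = "\U0001D11E"          # MUSICAL SYMBOL G CLEF: 8 hex digits, supplementary
                    print(len(bmp), len(supp))   # 1 1 -- both are single characters to Python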


                > (B) What does a receiver that is only able to handle BMP do with
                > supplementary characters?
                >
                > I think the answer to (B) is the same.
                >
                > There are many languages, such as Java and JavaScript and E, that are
                > unable to strictly do the right thing, but for now, the surrogate hack
                > is the state of the art.
                >
                > JSON's interest is to get the data from here to there without
                > distortion. JSON should be able to pass all of the Unicode characters,
                > including the extended characters. If a receiver chooses to filter
                > them out or replace them with surrogate pairs, that's its business.

                To live up to this fine goal, JSON needs to define an ASCII encoding of the
                supplementary characters. From (A), perhaps your intended answer is: Use the
                \u encoding of the UTF-16 code unit encoding of the supplementary characters.
                This is Java's answer. AFAIK, it would be compatible with Javascript but not
                with Python. This would be an adequate answer -- Java and Javascript both live
                with it. I think the important thing is to make a definite choice.
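
                Under that choice, a reader that understands the convention reassembles
                the pair into one character (a sketch; a modern Python json module
                already behaves this way, whatever indexing convention the host
                language uses):

                    import json
                    decoded = json.loads('"\\ud834\\udd1e"')
                    print(decoded == "\U0001D11E")   # True: the surrogate pair becomes one character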

                --
                Text by me above is hereby placed in the public domain

                Cheers,
                --MarkM
              • Douglas Crockford
                Message 7 of 9, Aug 11, 2005
                  > > JSON's interest is to get the data from here to there without
                  > > distortion. JSON should be able to pass all of the Unicode characters,
                  > > including the extended characters. If a receiver chooses to filter
                  > > them out or replace them with surrogate pairs, that's its business.

                  > To live up to this fine goal, JSON needs to define an ASCII encoding of the
                  > supplementary characters. From (A), perhaps your intended answer is: Use the
                  > \u encoding of the UTF-16 code unit encoding of the supplementary characters.
                  > This is Java's answer. AFAIK, it would be compatible with Javascript but not
                  > with Python. This would be an adequate answer -- Java and Javascript both live
                  > with it. I think the important thing is to make a definite choice.

                  That is the answer.
                • jemptymethod
                  Message 8 of 9, Aug 12, 2005
                    --- In json@yahoogroups.com, Mark Miller <markm@c...> wrote:
                    >JSON is supposed to be a subset of both Javascript and Python

                    I don't know from Python, but it seems to me JSON has drifted
                    significantly from Javascript. These discussions are all well and
                    good, but if it means that the JSON spec is modified to the point
                    where it supports constructs that cannot be interpreted by Javascript,
                    then it will cease being a subset thereof. Instead, JSON will become
                    an entity unto itself, rather than "Javascript Object Notation".