
Escaping Unicode characters

  • patrickdlogan
    Message 1 of 9, Aug 9, 2005
      The spec reads...

      "A string is a collection of zero or more Unicode characters, wrapped
      in double quotes, using backslash escapes. A character is represented
      as a single character string."

      Which implies all strings are always Unicode. When should a JSON
      writer use the \uXXXX notation?

      For any character not in ASCII? That seems prohibitively expensive to
      create and transmit.
    • Douglas Crockford
      Message 2 of 9, Aug 10, 2005
        > The spec reads...
        >
        > "A string is a collection of zero or more Unicode characters, wrapped
        > in double quotes, using backslash escapes. A character is represented
        > as a single character string."
        >
        > Which implies all strings are always Unicode. When should a JSON
        > writer use the \uXXXX notation?
        >
        > For any character not in ASCII? That seems prohibitively expensive to
        > create and transmit.

        The \uXXXX notation must be used for control characters that do not
        have a shorter escape convention. This is the only case where \uXXXX
        notation is required.

        The \uXXXX notation may be used for any character.
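        A minimal sketch of that rule, in present-day Python 3 (added here for
        illustration; the helper name json_string is made up): short escapes
        where they exist, \uXXXX only for the remaining control characters,
        and every other character written literally.

          SHORT_ESCAPES = {
              '"': '\\"', '\\': '\\\\', '\b': '\\b', '\f': '\\f',
              '\n': '\\n', '\r': '\\r', '\t': '\\t',
          }

          def json_string(s):
              out = ['"']
              for ch in s:
                  if ch in SHORT_ESCAPES:
                      out.append(SHORT_ESCAPES[ch])    # the shorter escape conventions
                  elif ord(ch) < 0x20:
                      out.append('\\u%04x' % ord(ch))  # control chars with no short form: \uXXXX required
                  else:
                      out.append(ch)                   # any other character may be left unescaped
              out.append('"')
              return ''.join(out)

          # Non-ASCII text such as 'café' passes through unescaped; only quotes,
          # backslashes, and control characters are escaped.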
      • Mark Miller
        Message 3 of 9, Aug 11, 2005
          Douglas Crockford wrote:
          > The \uXXXX notation must be used for control characters that do not
          > have a shorter escape convention. This is the only case where \uXXXX
          > notation is required.
          >
          > The \uXXXX notation may be used for any character.


          As long as we're on this topic, what about Unicode characters that don't fit
          into 16 bits -- the so-called "supplementary characters"?
          http://www.unicode.org/glossary/#supplementary_character

          Of the languages that JSON is intended to be a subset of, what do they do for
          supplementary characters?

          --
          Text by me above is hereby placed in the public domain

          Cheers,
          --MarkM
        • Roland H. Alden
          Message 4 of 9, Aug 11, 2005
            > As long as we're on this topic, what about Unicode characters that don't
            > fit into 16 bits -- the so-called "supplementary characters"?

            I think you mean "surrogate" characters; characters that require two
            16-bit code points appearing "sequentially"?

            If JSON isn't processing text but just carrying a Unicode payload for
            some higher layer software to deal with, then one would not think this
            bump in the otherwise smooth Unicode road would be important. In
            expressing string literals it should not be necessary for software at
            the layer of JSON to understand glyphic rendering issues.
          • Mark Miller
            Message 5 of 9, Aug 11, 2005
              Roland H. Alden wrote:
              >> fit into 16 bits -- the so-called "supplementary characters"?
              >
              > I think you mean "surrogate" characters; characters that require two 16
              > bit code points appearing "sequentially"?

              By the terminology of the Unicode Glossary, <http://www.unicode.org/glossary>,
              I do indeed mean "supplementary characters". There are no "surrogate
              characters" only "surrogate code points". The characters whose code points fit
              in 16 bits are the "BMP characters". The others are the "supplementary
              characters". To encode Unicode characters in UTF-16, i.e., in 16-bit "code
              units", all the BMP characters encode as their code points. All the
              supplementary characters encode as a pair of surrogate code units. This is
              possible because there are no characters whose code points are the surrogate
              code points.
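              A small illustration of that distinction (present-day Python 3, added
              here as a sketch): one supplementary character is one code point, but
              becomes two surrogate code units when encoded as UTF-16.

                ch = '\U0001D11E'               # MUSICAL SYMBOL G CLEF, a supplementary character
                assert len(ch) == 1             # one character, code point U+1D11E

                units = ch.encode('utf-16-be')  # UTF-16, big-endian, no BOM
                pairs = [units[i:i + 2].hex().upper() for i in range(0, len(units), 2)]
                print(pairs)                    # ['D834', 'DD1E'] -- a surrogate pair, two 16-bit code units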


              > If JSON isn't processing text but just carrying a Unicode payload for
              > some higher layer software to deal with, then one would not think this
              > bump in the otherwise smooth Unicode road would be important. In
              > expressing string literals it should not be necessary for software at
              > the layer of JSON to understand glyphic rendering issues.

              Unicode has at least three distinct layers of abstraction: code points / code
              units, characters, and glyphs. Multiple characters may indeed combine to form
              a glyph. This is completely distinct from the combination of multiple UTF-16
              code units to form a character.

              The historical problem here is that the Unicode standard used to say that all
              characters had 16-bit code points. Unicode early adopters, like Java, got
              screwed by believing them. Java 1.5 made the painful reconciliation with
              modern Unicode of saying that a Java "char" remains 16 bits but is not a
              character -- it is a UTF-16 code unit.
              http://java.sun.com/j2se/1.5.0/docs/guide/intl/overview.html#textrep
              Java, Javascript
              http://www-306.ibm.com/software/globalization/topics/javascript/processing.jsp
              and w3c DOM
              http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html#ID-C74D1578
              define string indexing in terms of counting UTF-16 code units.

              OTOH, Python
              http://www.python.org/peps/pep-0263.html
              http://www.python.org/peps/pep-0261.html
              and w3c XPath
              http://www.w3.org/TR/xpath#strings
              define string indexing in terms of counting Unicode characters. The IBM ICU
              library
              http://oss.software.ibm.com/icu4j/doc/com/ibm/icu/text/UTF16.html#findCodePointOffset(java.lang.String,%20int)
              supports both.
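              A short sketch of that mismatch (again present-day Python 3, purely for
              illustration): Python counts characters, while a UTF-16-based string
              type, as in Java or Javascript, counts 16-bit code units.

                s = 'a\U0001D11Eb'                            # 'a', MUSICAL SYMBOL G CLEF, 'b'

                characters = len(s)                           # the Python / XPath view
                code_units = len(s.encode('utf-16-be')) // 2  # the Java / Javascript / DOM view

                print(characters, code_units)                 # 3 4 -- the supplementary character costs two code units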

              This is an unholy mess. I suggest doing what E does
              http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
              which is to state that, until a clear consensus emerges, JSON only supports BMP
              characters. JSON is supposed to be a subset of both Javascript and Python, and
              only the BMP characters are treated alike by both.

              Full disclosure: Since E does take the "only BMP" stance for now, if JSON
              takes this stance, E can continue to claim support for JSON. If JSON somehow
              allows supplementary characters, then E will no longer be able to make this
              claim. But then neither would one of Python or Javascript.

              --
              Text by me above is hereby placed in the public domain

              Cheers,
              --MarkM
            • Douglas Crockford
              Message 6 of 9, Aug 11, 2005
                > I suggest doing what E does
                > http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
                > which is to state that, until a clear consensus emerges, JSON only
                > supports BMP characters. JSON is supposed to be a subset of both
                > Javascript and Python, and only the BMP characters are treated alike
                > by both.
                >
                > Full disclosure: Since E does take the "only BMP" stance for now, if
                > JSON takes this stance, E can continue to claim support for JSON. If
                > JSON somehow allows supplementary characters, then E will no longer
                > be able to make this claim. But then neither would one of Python or
                > Javascript.

                I don't think that JSON needs to care. It comes down to two questions:

                (A) How does a sender encode a supplementary character in UTF-16?

                (B) What does a receiver that is only able to handle BMP do with
                supplementary characters?

                The answer to (A) is obvious: use the two character surrogate
                encoding.

                I think the answer to (B) is the same.

                There are many languages, such as Java and JavaScript and E, that are
                unable to strictly do the right thing, but for now, the surrogate hack
                is the state of the art.

                JSON's interest is to get the data from here to there without
                distortion. JSON should be able to pass all of the Unicode characters,
                including the extended characters. If a receiver chooses to filter
                them out or replace them with surrogate pairs, that's its business.
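                A sketch of the surrogate encoding being referred to (the standard
                UTF-16 arithmetic, written out in Python 3 for illustration; the
                function name is made up):

                  def surrogate_escapes(code_point):
                      # \uXXXX\uXXXX spelling of a supplementary code point
                      assert 0x10000 <= code_point <= 0x10FFFF
                      v = code_point - 0x10000
                      high = 0xD800 + (v >> 10)   # high (lead) surrogate
                      low = 0xDC00 + (v & 0x3FF)  # low (trail) surrogate
                      return '\\u%04x\\u%04x' % (high, low)

                  # surrogate_escapes(0x1D11E) == '\\ud834\\udd1e'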
              • Mark Miller
                Message 7 of 9, Aug 11, 2005
                  Douglas Crockford wrote:
                  > I don't think that JSON needs to care. It comes down to two questions:
                  >
                  > (A) How does a sender encode a supplementary character in UTF-16?

                  > The answer to (A) is obvious: use the two character surrogate
                  > encoding.

                  The surrogate encoding uses two UTF-16 surrogate *code units* to encode a
                  single character. These 16-bit surrogates are *not* characters.

                  Your answer to (A), once rephrased, is indeed the correct answer to the (A)
                  question. But this is only relevant to JSON if you define a JSON string as
                  Java or Javascript does: as a \u encoding of a sequence of UTF-16 code units.
                  If a JSON string is a \u encoding of a sequence of characters, as it is in
                  Python, then the UTF-16 question is not relevant. But the existing JSON spec
                  provides no way to do an Ascii encoding of the supplementary characters (such
                  as Python's \U encoding).
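                  For comparison, a small illustration (present-day Python 3): the \U
                  escape names a supplementary character directly as one character,
                  something the JSON grammar of the time has no counterpart for.

                    s = '\U0001D11E'  # Python's 8-digit \U escape: one character
                    assert len(s) == 1 and ord(s) == 0x1D11E
                    # JSON has no \UXXXXXXXX escape; its nearest ASCII spelling is the
                    # surrogate-pair form "\ud834\udd1e" discussed here.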


                  > (B) What does a receiver that is only able to handle BMP do with
                  > supplementary characters?
                  >
                  > I think the answer to (B) is the same.
                  >
                  > There are many languages, such as Java and JavaScript and E, that are
                  > unable to strictly do the right thing, but for now, the surrogate hack
                  > is the state of the art.
                  >
                  > JSON's interest is to get the data from here to there without
                  > distortion. JSON should be able to pass all of the Unicode characters,
                  > including the extended characters. If a receiver chooses to filter
                  > them out or replace them with surrogate pairs, that's its business.

                  To live up to this fine goal, JSON needs to define an Ascii encoding of the
                  supplementary characters. From (A), perhaps your intended answer is: Use the
                  \u encoding of the UTF-16 code point encoding of the supplementary characters.
                  This is Java's answer. AFAIK, it would be compatible with Javascript but not
                  with Python. This would be an adequate answer -- Java and Javascript both live
                  with it. I think the important thing is to make a definite choice.
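                  A sketch of the choice being proposed, using today's standard json
                  module (which post-dates this thread) purely as an illustration:
                  supplementary characters are written as the \u escapes of their
                  UTF-16 surrogate pair, and a reader reassembles the pair.

                    import json

                    s = '\U0001D11E'
                    encoded = json.dumps(s)        # '"\\ud834\\udd1e"' with the default ensure_ascii=True
                    decoded = json.loads(encoded)  # the surrogate pair is reassembled into one character

                    assert decoded == s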

                  --
                  Text by me above is hereby placed in the public domain

                  Cheers,
                  --MarkM
                • Douglas Crockford
                  Message 8 of 9, Aug 11, 2005
                    > > JSON's interest is to get the data from here to there without
                    > > distortion. JSON should be able to pass all of the Unicode characters,
                    > > including the extended characters. If a receiver chooses to filter
                    > > them out or replace them with surrogate pairs, that's its business.

                    > To live up to this fine goal, JSON needs to define an Ascii
                    > encoding of the supplementary characters. From (A), perhaps your
                    > intended answer is: Use the \u encoding of the UTF-16 code point
                    > encoding of the supplementary characters. This is Java's answer.
                    > AFAIK, it would be compatible with Javascript but not with Python.
                    > This would be an adequate answer -- Java and Javascript both live
                    > with it. I think the important thing is to make a definite choice.

                    That is the answer.
                  • jemptymethod
                    Message 9 of 9, Aug 12, 2005
                      --- In json@yahoogroups.com, Mark Miller <markm@c...> wrote:
                      >JSON is supposed to be a subset of both Javascript and Python

                      I don't know from Python, but it seems to me JSON has drifted
                      significantly from Javascript. These discussions are all well and
                      good, but if it means that the JSON spec is modified to the point
                      where it supports constructs that cannot be interpreted by
                      Javascript, then it will cease being a subset thereof. Instead,
                      JSON will become an entity unto itself, rather than "JavaScript
                      Object Notation".