
Re: [json] Re: Escaping unicode characters

  • Mark Miller
    Message 1 of 9 , Aug 11, 2005
      Roland H. Alden wrote:
      >>fit into 16 bits -- the so-called "supplementary characters"?
      >
      > I think you mean "surrogate" characters; characters that require two 16
      > bit code points appearing "sequentially"?

      By the terminology of the Unicode Glossary, <http://www.unicode.org/glossary>,
      I do indeed mean "supplementary characters". There are no "surrogate
      characters" only "surrogate code points". The characters whose code points fit
      in 16 bits are the "BMP characters". The others are the "supplementary
      characters". To encode Unicode characters in UTF-16, i.e., in 16-bit "code
      units", all the BMP characters encode as their code points. All the
      supplementary characters encode as a pair of surrogate code units. This is
      possible because there are no characters whose code points are the surrogate
      code points.
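      The pairing arithmetic described above can be sketched in Python (a
      sketch for illustration; U+1D11E, MUSICAL SYMBOL G CLEF, is just one
      example of a supplementary character):

```python
# Sketch: splitting a supplementary code point into its UTF-16
# surrogate pair, per the UTF-16 encoding form.

def to_surrogate_pair(code_point):
    """Return the (high, low) surrogate code units for a non-BMP code point."""
    assert code_point > 0xFFFF, "BMP characters encode as themselves"
    v = code_point - 0x10000       # 20 bits remain
    high = 0xD800 + (v >> 10)      # lead surrogate: 0xD800..0xDBFF
    low = 0xDC00 + (v & 0x3FF)     # trail surrogate: 0xDC00..0xDFFF
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF) is a supplementary character:
print([hex(u) for u in to_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
```

      This works precisely because, as noted above, no characters are assigned
      to the surrogate code points, so a decoder can recognize the pair
      unambiguously.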


      > If JSON isn't processing text but just carrying a Unicode payload for
      > some higher layer software to deal with, then one would not think this
      > bump in the otherwise smooth Unicode road would be important. In
      > expressing string literals it should not be necessary for software at
      > the layer of JSON to understand glyphic rendering issues.

      Unicode has at least three distinct layers of abstraction: code points / code
      units, characters, and glyphs. Multiple characters may indeed combine to form
      a glyph. This is completely distinct from the combination of multiple UTF-16
      code units to form a character.

      The historical problem here is that the Unicode standard used to say that
      all characters had 16-bit code points. Unicode early adopters, like Java,
      got screwed by believing it. Java 1.5 made the painful reconciliation with
      modern Unicode of saying that a Java "char" remains 16 bits but is not a
      character -- it is a UTF-16 code unit.
      http://java.sun.com/j2se/1.5.0/docs/guide/intl/overview.html#textrep
      Java, Javascript
      http://www-306.ibm.com/software/globalization/topics/javascript/processing.jsp
      and w3c DOM
      http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html#ID-C74D1578
      define string indexing in terms of counting UTF-16 code units.

      OTOH, Python
      http://www.python.org/peps/pep-0263.html
      http://www.python.org/peps/pep-0261.html
      and w3c XPath
      http://www.w3.org/TR/xpath#strings
      define string indexing in terms of counting Unicode characters. The IBM ICU
      library
      http://oss.software.ibm.com/icu4j/doc/com/ibm/icu/text/UTF16.html#findCodePointOffset(java.lang.String,%20int)
      supports both.
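      The difference in what gets counted is easy to see in modern Python (a
      sketch; Python 3 counts characters, and the UTF-16 count can be
      recovered by encoding):

```python
# Sketch: one string, two lengths -- characters vs UTF-16 code units.
s = "G\U0001D11E"  # 'G' plus the supplementary character U+1D11E

chars = len(s)                                 # Python/XPath-style: characters
utf16_units = len(s.encode("utf-16-le")) // 2  # Java/Javascript-style: code units
print(chars, utf16_units)  # 2 3
```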

      This is an unholy mess. I suggest doing what E does
      http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
      which is state that, until a clear consensus emerges, JSON only supports BMP
      characters. JSON is supposed to be a subset of both Javascript and Python, and
      only the BMP characters are treated alike by both.

      Full disclosure: Since E does take the "only BMP" stance for now, if JSON
      takes this stance, E can continue to claim support for JSON. If JSON somehow
      allows supplementary characters, then E will no longer be able to make this
      claim. But then neither would one of Python or Javascript.

      --
      Text by me above is hereby placed in the public domain

      Cheers,
      --MarkM
    • Douglas Crockford
      Message 2 of 9 , Aug 11, 2005
        > I suggest doing what E does
        > http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
        > which is state that, until a clear consensus emerges, JSON only
        > supports BMP characters. JSON is supposed to be a subset of both
        > Javascript and Python, and only the BMP characters are treated alike
        > by both.
        >
        > Full disclosure: Since E does take the "only BMP" stance for now, if
        > JSON takes this stance, E can continue to claim support for JSON. If
        > JSON somehow allows supplementary characters, then E will no longer
        > be able to make this claim. But then neither would one of Python or
        > Javascript.

        I don't think that JSON needs to care. It comes down to two questions:

        (A) How does a sender encode a supplementary character in UTF-16?

        (B) What does a receiver that is only able to handle BMP do with
        supplementary characters?

        The answer to (A) is obvious: use the two character surrogate
        encoding.

        I think the answer to (B) is the same.

        There are many languages, such as Java and JavaScript and E, that are
        unable to strictly do the right thing, but for now, the surrogate hack
        is the state of the art.

        JSON's interest is to get the data from here to there without
        distortion. JSON should be able to pass all of the Unicode characters,
        including the extended characters. If a receiver chooses to filter
        them out or replace them with surrogate pairs, that's its business.
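        A modern sketch of the sender side, using Python's json module (an
        implementation choice for illustration, not something the thread
        prescribes):

```python
import json

# Sketch: a supplementary character leaves the sender as a
# \u-escaped surrogate pair, and round-trips without distortion.
clef = "\U0001D11E"      # U+1D11E, a supplementary character
print(json.dumps(clef))  # "\ud834\udd1e"

assert json.loads(json.dumps(clef)) == clef
```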
      • Mark Miller
        Message 3 of 9 , Aug 11, 2005
          Douglas Crockford wrote:
          > I don't think that JSON needs to care. It comes down to two questions:
          >
          > (A) How does a sender encode a supplementary character in UTF-16?

          > The answer to (A) is obvious: use the two character surrogate
          > encoding.

          The surrogate encoding uses two UTF-16 surrogate *code units* to encode a
          single character. These 16-bit surrogates are *not* characters.

          Your answer to (A), once rephrased, is indeed the correct answer to the (A)
          question. But this is only relevant to JSON if you define a JSON string as
          Java or Javascript does: as a \u encoding of a sequence of UTF-16 code
          units. If a JSON string is a \u encoding of a sequence of characters,
          as it is in Python, then the UTF-16 question is not relevant. But the
          existing JSON spec
          provides no way to do an Ascii encoding of the supplementary characters (such
          as Python's \U encoding).
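          The distinction is visible in Python itself (a modern Python 3
          sketch): its \u and \U escapes name characters, so a surrogate pair
          written as two \u escapes does not collapse into one character:

```python
# Sketch: Python escapes denote characters, not UTF-16 code units.
a = "\U0001D11E"      # Python's 8-digit \U escape: one character
b = "\ud834\udd1e"    # two \u escapes: two (lone) surrogate code points

print(a == b)          # False -- Python does not join the pair
print(len(a), len(b))  # 1 2
```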


          > (B) What does a receiver that is only able to handle BMP do with
          > supplementary characters?
          >
          > I think the answer to (B) is the same.
          >
          > There are many languages, such as Java and JavaScript and E, that are
          > unable to strictly do the right thing, but for now, the surrogate hack
          > is the state of the art.
          >
          > JSON's interest is to get the data from here to there without
          > distortion. JSON should be able to pass all of the Unicode characters,
          > including the extended characters. If a receiver chooses to filter
          > them out or replace them with surrogate pairs, that's its business.

          To live up to this fine goal, JSON needs to define an Ascii encoding of the
          supplementary characters. From (A), perhaps your intended answer is: Use the
          \u encoding of the UTF-16 code point encoding of the supplementary characters.
          This is Java's answer. AFAIK, it would be compatible with Javascript but not
          with Python. This would be an adequate answer -- Java and Javascript both live
          with it. I think the important thing is to make a definite choice.
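          A modern sketch of that choice in practice: even Python's json
          module, in a language that otherwise counts characters, reassembles
          \u surrogate-pair escapes into a single supplementary character:

```python
import json

# Sketch: a JSON parser joins \u surrogate-pair escapes into one character.
decoded = json.loads('"\\ud834\\udd1e"')
print(len(decoded))       # 1
print(hex(ord(decoded)))  # 0x1d11e
```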

          --
          Text by me above is hereby placed in the public domain

          Cheers,
          --MarkM
        • Douglas Crockford
          Message 4 of 9 , Aug 11, 2005
            > > JSON's interest is to get the data from here to there without
            > > distortion. JSON should be able to pass all of the Unicode characters,
            > > including the extended characters. If a receiver chooses to filter
            > > them out or replace them with surrogate pairs, that's its business.

            > To live up to this fine goal, JSON needs to define an Ascii
            > encoding of the supplementary characters. From (A), perhaps your
            > intended answer is: Use the \u encoding of the UTF-16 code point
            > encoding of the supplementary characters. This is Java's answer.
            > AFAIK, it would be compatible with Javascript but not with Python.
            > This would be an adequate answer -- Java and Javascript both live
            > with it. I think the important thing is to make a definite choice.

            That is the answer.
          • jemptymethod
            Message 5 of 9 , Aug 12, 2005
              --- In json@yahoogroups.com, Mark Miller <markm@c...> wrote:
              >JSON is supposed to be a subset of both Javascript and Python

              I don't know from Python, but it seems to me JSON has drifted
              significantly from Javascript. These discussions are all well and
              good, but if it means that the JSON spec is modified to the point
              where it supports constructs that cannot be interpreted by
              Javascript, then it will cease being a subset thereof. JSON will
              become an entity unto itself, rather than "Javascript Object
              Notation".