
Re: Escaping unicode characters

  • Douglas Crockford
    Message 1 of 9 , Aug 11, 2005
      > I suggest doing what E does
      > http://www.erights.org/data/common-syntax/baking-chars.html#only_bmp
      > which is state that, until a clear consensus emerges, JSON only supports
      > BMP characters. JSON is supposed to be a subset of both Javascript and
      > Python, and only the BMP characters are treated alike by both.
      >
      > Full disclosure: Since E does take the "only BMP" stance for now, if JSON
      > takes this stance, E can continue to claim support for JSON. If JSON
      > somehow allows supplementary characters, then E will no longer be able to
      > make this claim. But then neither would one of Python or Javascript.

      I don't think that JSON needs to care. It comes down to two questions:

      (A) How does a sender encode a supplementary character in UTF-16?

      (B) What does a receiver that is only able to handle BMP do with
      supplementary characters?

      The answer to (A) is obvious: use the two character surrogate
      encoding.

      I think the answer to (B) is the same.

      There are many languages, such as Java and JavaScript and E, that are
      unable to strictly do the right thing, but for now, the surrogate hack
      is the state of the art.

      JSON's interest is to get the data from here to there without
      distortion. JSON should be able to pass all of the Unicode characters,
      including the extended characters. If a receiver chooses to filter
      them out or replace them with surrogate pairs, that's its business.
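
Editorial illustration (not part of the original message): the surrogate encoding referred to in (A) is simple arithmetic on the code point. The sketch below assumes standard UTF-16 rules; the function names are hypothetical and chosen only for this example.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```

A receiver limited to BMP-sized units can store the two values as they are; a receiver that understands supplementary characters can recombine them with the second function.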
    • Mark Miller
      Message 2 of 9 , Aug 11, 2005
        Douglas Crockford wrote:
        > I don't think that JSON needs to care. It comes down to two questions:
        >
        > (A) How does a sender encode a supplementary character in UTF-16?

        > The answer to (A) is obvious: use the two character surrogate
        > encoding.

        The surrogate encoding uses two UTF-16 surrogate *code units* to encode a
        single character. These 16-bit surrogates are *not* characters.

        Your answer to (A), once rephrased, is indeed the correct answer to the (A)
        question. But this is only relevant to JSON if you define a JSON string as
        Java or Javascript does: as a \u encoding of a sequence of UTF-16 code units.
        If a JSON string is a \u encoding of a sequence of characters, as it is in
        Python, then the UTF-16 question is not relevant. But the existing JSON spec
        provides no way to do an Ascii encoding of the supplementary characters (such
        as Python's \U encoding).
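
Editorial illustration (not part of the original message): the distinction between characters and UTF-16 code units can be seen in a few lines of Python. This sketch assumes Python 3, where a string is a sequence of code points; the variable names are hypothetical.

```python
import struct

# Python's \U escape names the supplementary character directly.
clef = "\U0001D11E"

# As a sequence of characters, the string has one element.
char_count = len(clef)

# Encoded as UTF-16 (big-endian, no BOM), the same character
# occupies two 16-bit code units: a high and a low surrogate.
raw = clef.encode("utf-16-be")
code_units = struct.unpack(">2H", raw)

print(char_count, [hex(u) for u in code_units])
```

Neither of the two code units is a character on its own, which is the point of the correction above.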


        > (B) What does a receiver that is only able to handle BMP do with
        > supplementary characters?
        >
        > I think the answer to (B) is the same.
        >
        > There are many languages, such as Java and JavaScript and E, that are
        > unable to strictly do the right thing, but for now, the surrogate hack
        > is the state of the art.
        >
        > JSON's interest is to get the data from here to there without
        > distortion. JSON should be able to pass all of the Unicode characters,
        > including the extended characters. If a receiver chooses to filter
        > them out or replace them with surrogate pairs, that's its business.

        To live up to this fine goal, JSON needs to define an Ascii encoding of the
        supplementary characters. From (A), perhaps your intended answer is: Use the
        \u encoding of the UTF-16 code point encoding of the supplementary characters.
        This is Java's answer. AFAIK, it would be compatible with Javascript but not
        with Python. This would be an adequate answer -- Java and Javascript both live
        with it. I think the important thing is to make a definite choice.

        --
        Text by me above is hereby placed in the public domain

        Cheers,
        --MarkM
      • Douglas Crockford
        Message 3 of 9 , Aug 11, 2005
          > > JSON's interest is to get the data from here to there without
          > > distortion. JSON should be able to pass all of the Unicode characters,
          > > including the extended characters. If a receiver chooses to filter
          > > them out or replace them with surrogate pairs, that's its business.

          > To live up to this fine goal, JSON needs to define an Ascii encoding of the
          > supplementary characters. From (A), perhaps your intended answer is: Use the
          > \u encoding of the UTF-16 code point encoding of the supplementary characters.
          > This is Java's answer. AFAIK, it would be compatible with Javascript but not
          > with Python. This would be an adequate answer -- Java and Javascript both live
          > with it. I think the important thing is to make a definite choice.

          That is the answer.
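
Editorial illustration (not part of the original message): the choice agreed on here, writing a supplementary character as two \u escapes for its surrogate pair, is what Python's standard `json` module produces today with its default ASCII-only output. This is a minimal sketch assuming Python 3; the variable names are hypothetical.

```python
import json

clef = "\U0001D11E"  # one supplementary character

# With the default ensure_ascii=True, the encoder writes the
# character as a pair of \u escapes, one per UTF-16 code unit.
encoded = json.dumps(clef)
print(encoded)

# The decoder recombines an escaped surrogate pair into a single
# character; hex digits are accepted in either case.
decoded = json.loads(encoded)
decoded_upper = json.loads('"\\uD834\\uDD1E"')
print(decoded == clef, decoded_upper == clef)
```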
        • jemptymethod
          Message 4 of 9 , Aug 12, 2005
            --- In json@yahoogroups.com, Mark Miller <markm@c...> wrote:
            >JSON is supposed to be a subset of both Javascript and Python

            I don't know from Python, but it seems to me JSON has drifted
            significantly from Javascript. These discussions are all well and
            good, but if it means that the JSON spec is modified to the point
            where it supports constructs that cannot be interpreted by
            Javascript, then it will cease being a subset thereof. JSON will
            instead become an entity unto itself, rather than "Javascript
            Object Notation".