[Fwd: Re: [sgf-std] Re: SGF 4 encoding issue]]

  • Arno Hollosi
    Message 1 of 1 , Mar 14, 2004
      Forwarded for Lauri Paatero <lauri dot paatero at iki dot fi>

      Hi all,

      This discussion is for those who would like to get character
      set handling logically consistent. It seems quite possible to make
      a working parser in other ways, as long as one limits the set of
      possible character sets a bit.

      I do not know if *EBCDIC* would be a valid CA character set. If it is,
      it is quite a good test bench for the possible interpretations.

      I did a bit of thinking about this when implementing the GOWrite 2 parser.

      Quote from spec:
      > SGF uses the US ASCII char-set for all its property identifiers and
      > property values,

      This is quite clear.

      > except SimpleText & Text. For SimpleText & Text the charset is
      > defined using the CA
      > <http://www.red-bean.com/sgf/properties.html#CA> property.

      "]" is NOT SimpleText or Text, it is part of SGF syntax. So it must be
      US ASCII.


      1. The order of properties in a node is arbitrary.
      This means that the CA property may come after a property containing
      SimpleText or Text;
      for example, the player name PW may come first, and only then CA appears.
      So file parsing must (mostly) be done before the CA character set
      is known.

      2. Private properties may use US ASCII or the CA character set.
      Other applications cannot know which encoding is used, so the
      interpretation of private properties cannot differ between the two.


      The only sane interpretation (one that seems to work in all cases) seems to be:

      First, parse the whole file structure using US ASCII.
      In this phase SimpleText & Text values are treated as US ASCII in order to
      find the "]" indicating the end of the value. So escaping of "]" in
      SimpleText & Text is based on interpreting them as US ASCII.

      Then, for each SimpleText or Text value, take the original data (before the
      US ASCII interpretation) and interpret it using the character set found in CA.
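
      The two phases described above can be sketched in Python. This is only
      an illustration under stated assumptions: the name parse_two_phase, the
      small TEXT_PROPS set and the default_charset parameter are made up for
      the example, the tree structure ("(", ";", ")") is ignored, and soft
      line breaks and composed values are not handled.

```python
import re

# Phase 1 patterns work on raw bytes and only rely on ASCII syntax characters.
PROP = re.compile(rb"([A-Z]+)((?:\s*\[(?:\\.|[^\]\\])*\])+)", re.S)
VALUE = re.compile(rb"\[((?:\\.|[^\]\\])*)\]", re.S)

# Hypothetical subset of properties whose values are SimpleText or Text.
TEXT_PROPS = {"C", "GN", "PB", "PW"}


def parse_two_phase(data: bytes, default_charset: str = "ISO-8859-1"):
    # Phase 1: split identifiers and raw value bytes, finding "]" as a byte.
    raw = [(m.group(1).decode("ascii"), VALUE.findall(m.group(2)))
           for m in PROP.finditer(data)]

    # CA may appear anywhere among the properties, so look it up afterwards.
    charset = default_charset
    for ident, values in raw:
        if ident == "CA":
            charset = values[0].decode("ascii").strip()

    # Phase 2: remove byte-level escapes, then decode the original bytes.
    result = []
    for ident, values in raw:
        encoding = charset if ident in TEXT_PROPS else "ascii"
        decoded = [re.sub(rb"\\(.)", rb"\1", v, flags=re.S).decode(encoding)
                   for v in values]
        result.append((ident, decoded))
    return result


data = "(;PW[Bö]CA[ISO-8859-1]C[a\\]b])".encode("iso-8859-1")
print(parse_two_phase(data))
# [('PW', ['Bö']), ('CA', ['ISO-8859-1']), ('C', ['a]b'])]
```

      Note that PW is decoded correctly even though CA comes after it. A
      consequence of this scheme is that a writer has to escape every 0x5D
      byte in a value, even one that is part of a multi-byte character.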

      Some further comments:

      16 bit unicode
      If a file consists entirely of 16-bit Unicode characters, it seems it cannot
      possibly be an SGF 4 file, as the file structure is not US ASCII (unless all
      interleaving NUL bytes can be discarded...).
      Anyway, in the case of 16-bit Unicode, the CA property would be quite funny:
      why use 8-bit characters in a 16-bit Unicode file?
      I cannot see how this could make any sense.
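
      The interleaved NUL bytes can be seen directly. A small Python
      illustration, assuming a little-endian UTF-16 file without a byte
      order mark (the sample SGF string is made up):

```python
sgf_text = "(;C[a])"
as_utf16 = sgf_text.encode("utf-16-le")

print(as_utf16)             # b'(\x00;\x00C\x00[\x00a\x00]\x00)\x00'
print(b"(;" in as_utf16)    # False: the ASCII start marker is not present as bytes
print(as_utf16[::2])        # b'(;C[a])' once the NUL bytes are discarded
```

      Dropping every second byte only recovers the text as long as every
      character fits in one byte, which is exactly the case where 16-bit
      encoding gains nothing.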

      Character set with states
      I think some character sets have (control?) characters affecting the
      interpretation of the following characters. So character interpretation
      has state.
      Here the problem is whether this state is carried over between SimpleText
      & Text values, or is reset to the default after each SimpleText & Text value.
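
      ISO-2022-JP is one such stateful encoding: escape sequences switch
      between ASCII and JIS X 0208, and the bytes in between mean different
      things depending on the current mode. The Python sketch below (sample
      text chosen only for illustration) decodes the same payload bytes once
      with a fresh decoder and once with a decoder that has already seen the
      escape sequence:

```python
import codecs

encoded = "日本".encode("iso2022_jp")
# ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII.
switch_in, payload, switch_out = encoded[:3], encoded[3:-3], encoded[-3:]

# A decoder in its default state treats the payload as plain ASCII.
fresh = payload.decode("iso2022_jp")

# A decoder that carries state over from an earlier escape sequence
# reads the same bytes as two-byte JIS characters.
decoder = codecs.getincrementaldecoder("iso2022_jp")()
decoder.decode(switch_in)
carried = decoder.decode(payload)

print(repr(fresh), repr(carried))
```

      If state were kept from one value to the next, a value that forgot to
      switch back to ASCII would change the meaning of every later value.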

      Reading whole file as character set CA
      - Data before the first (; may be impossible to interpret using the CA
      character set. For example, UTF-8 cannot interpret all mail headers
      encoded using ISO-8859-1.
      - Private properties might not be possible to interpret using the CA
      character set. (I am not quite sure if this is allowed in SGF 4...)
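
      The first point is easy to reproduce. In this Python illustration the
      header line is made up; the byte 0xFC (ü in ISO-8859-1) is not valid
      UTF-8, so decoding the leading data with the CA charset fails:

```python
header = "Subject: Grüße aus Wien".encode("iso-8859-1")

try:
    header.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print("cannot decode as UTF-8:", err.reason)

print(decoded_ok)   # False
```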

      Some open issues I have not thought about too much
      I am not sure about the encoding of ":":
      Detection of ":" could be done using US ASCII or the CA-defined charset.
      This affects both detection and escaping of ":" within SimpleText & Text.


      Arno Hollosi wrote:

      >> The SGF 4 specification says that the CA property defines the
      >> charset used to encode property values of types "text" and
      >> "simpletext" and that property values themselves are to be
      >> delimited by the '[' and ']' characters. The problem I'm having is
      >> deciding when a textual property value is done -- that is, how does
      >> one recognize the terminating ']' character while reading a
      >> sequence of characters encoded in a non-ASCII encoding (or ASCII
      >> preserving encoding, such as ISO-8859-1)? Does one assume that the
      >> terminating ']' is expressed in the encoding specified in the CA
      >> property, or is the terminating ']' always in ASCII? If the
      >> terminating ']' is in ASCII, how can you tell it's the terminating
      >> ']' when the text encoding permits a 0x5D byte (']' in ASCII) as
      >> the initial octet of an encoded character?
      > I see what you mean. Have you actually encountered that problem? I
      > thought that the most common encodings (ISO-8859-X, UTF-8 and UTF-16)
      > don't have a problem with this. If you have a counter-example I am
      > eager to learn about it (for ISO & UTF-8 I'm sure there is no such
      > example, but for UTF-16 I'm not 100% sure).
      > The trick is, that you read the property values according to the
      > encoding: e.g. let's say the text is "C[Bö]" (I hope you see the
      > Umlaut-o) and let's say the encoding XY turns this into: "C [ 0x21 0xc7
      > 0x5d ]". When you parse the value according to the XY encoding and
      > read it character by character (not byte by byte!) then the first
      > character will be "0x21"='B' and the second will be "0xc7 0x5d"='ö'.
      > When you then read the ']' you know you are finished.
      > Hm, just thinking: this way may not work when in UTF-16, because it
      > will not read the ']' but this and the next byte and make some funny
      > character out of it.
      > Well actually then, the spec should be updated and state that
      > beginning with the first property *after* the CA property the
      > encoding is valid for all of the SGF file not just for its values.
      > I guess there can be exceptions for UTF-8 and UTF-16, when they use
      > their magic numbers at the beginning of the file.
      > @list: does anyone else have an idea how to solve this issue?
      > /Arno