
Re: SGF 4 encoding issue

  • Arno Hollosi
    Message 1 of 1, Mar 6, 2004
      Hello there,

      > The SGF 4 specification says that the CA property defines the charset
      > used to encode property values of types "text" and "simpletext" and that
      > property values themselves are to be delimited by the '[' and ']'
      > characters. The problem I'm having is deciding when a textual property
      > value is done -- that is, how does one recognize the terminating ']'
      > character while reading a sequence of characters encoded in a non-ASCII
      > encoding (or ASCII preserving encoding, such as ISO-8859-1)? Does one
      > assume that the terminating ']' is expressed in the encoding specified
      > in the CA property, or is the terminating ']' always in ASCII? If the
      > terminating ']' is in ASCII, how can you tell it's the terminating ']'
      > when the text encoding permits a 0x5D byte (']' in ASCII) as the initial
      > octet of an encoded character?

      I see what you mean.
      Have you actually encountered that problem?
      I thought that the most common encodings (ISO-8859-X, UTF-8 and UTF-16)
      don't have a problem with this. If you have a counter-example I am eager
      to learn about it (for ISO & UTF-8 I'm sure there is no such example,
      but for UTF-16 I'm not 100% sure).

      The trick is that you read the property values according to the
      encoding. For example, let's say the text is "C[Bö]" (I hope you can
      see the o-umlaut) and let's say some encoding XY turns this into
      "C [ 0x21 0xc7 0x5d ]". When you parse the value according to the XY
      encoding and read it character by character (not byte by byte!), the
      first character will be "0x21"='B' and the second will be
      "0xc7 0x5d"='ö'. When you then read the ']', you know you are finished.

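      The character-by-character reading described above can be sketched
      in Python. This is only an illustration, not anything taken from the
      SGF spec: Shift_JIS stands in for the hypothetical encoding XY (it
      has two-byte characters whose second byte is 0x5D), the function
      names are made up, and SGF backslash escapes are ignored.

```python
# Sketch only: Shift_JIS plays the role of the hypothetical "XY" encoding.

def read_value_bytewise(data: bytes, start: int) -> bytes:
    """Naive reader: stop at the first 0x5D byte at or after `start`."""
    end = data.index(b"]", start)
    return data[start:end]


def read_value_charwise(data: bytes, start: int, encoding: str) -> str:
    """Decode first, then stop at the first ']' *character*."""
    text = data[start:].decode(encoding)
    return text[: text.index("]")]


# U+30BE (KATAKANA LETTER ZO) is two bytes in Shift_JIS; the second is 0x5D.
raw = "C[B\u30be]".encode("shift_jis")
value_start = raw.index(b"[") + 1

naive = read_value_bytewise(raw, value_start)                 # cut short inside U+30BE
decoded = read_value_charwise(raw, value_start, "shift_jis")  # full value
```

      The byte-wise reader stops in the middle of the two-byte character,
      while the character-wise reader returns the whole value.
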
      Hm, just thinking: this approach may not work with UTF-16, because
      the reader will not see the ']' on its own. It will combine that byte
      with the next one and make some funny character out of them.
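
      A small Python sketch of that UTF-16 concern, assuming a file whose
      structure ('[', ']', ';') is plain ASCII while the value bytes are
      UTF-16 (little-endian is an arbitrary choice here, and the sample
      bytes are made up):

```python
# Hypothetical mixed file: ASCII structure, UTF-16-LE bytes inside the value.
value = "B\u00f6".encode("utf-16-le")       # bytes 42 00 F6 00
mixed = b"C[" + value + b"];B[aa]"

body = mixed[2:]                             # everything after "C["
# Read three 16-bit code units: 'B', o-umlaut, and whatever comes next.
units = [body[i:i + 2].decode("utf-16-le") for i in range(0, 6, 2)]
third = units[2]                             # built from the ASCII bytes "];"

# Conversely, a genuine UTF-16 character can itself contain a 0x5D byte,
# so scanning for a raw 0x5D byte is not reliable either.
has_5d = 0x5D in "\u5d00".encode("utf-16-le")
```

      Here `third` is not ']' but the single code point U+3B5D, so a
      reader working in 16-bit units never notices the end of the value.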

      Well, actually, the spec should then be updated to state that,
      beginning with the first property *after* the CA property, the
      encoding applies to the whole SGF file, not just to its values.
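
      One possible reading of that rule, sketched in Python. The function
      name `decode_after_ca` and the regular expression are illustrative
      assumptions, not spec text; the sketch assumes everything up to and
      including the CA property is plain ASCII, and falls back to
      ISO-8859-1 (the CA default in FF[4]) when no CA property is found.

```python
import re


def decode_after_ca(data: bytes) -> str:
    """Decode the ASCII head up to CA[...], then the rest with that charset."""
    match = re.search(rb"CA\[([^\]]*)\]", data)
    if match is None:
        # No CA property: assume the FF[4] default charset.
        return data.decode("iso-8859-1")
    charset = match.group(1).decode("ascii")
    head = data[: match.end()].decode("ascii")
    # Everything after CA[...], structure included, uses the named charset.
    return head + data[match.end():].decode(charset)


sample = b"(;FF[4]CA[utf-16-le]" + "C[B\u00f6])".encode("utf-16-le")
text = decode_after_ca(sample)
```

      Once the whole remainder is decoded this way, the parser only ever
      looks at characters, so the closing ']' is unambiguous.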

      I guess there can be exceptions for UTF-8 and UTF-16, when they use
      their magic numbers at the beginning of the file.
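
      Assuming the "magic numbers" mean the Unicode byte order mark at the
      start of the file, detecting them could look roughly like this in
      Python. The helper name `sniff_bom` is made up for this sketch, and
      UTF-32 (whose little-endian mark begins with the same two bytes as
      the UTF-16 one) is deliberately not handled.

```python
import codecs

# (byte order mark, codec name) pairs checked at the start of the file.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data: bytes):
    """Return (codec name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


data = codecs.BOM_UTF16_LE + "(;C[B\u00f6])".encode("utf-16-le")
encoding, skip = sniff_bom(data)
# With a BOM, the whole file can be decoded before any parsing starts.
text = data[skip:].decode(encoding)
```

      A file without such a mark returns `(None, 0)` and would still need
      the CA property to determine its charset.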

      @list: does anyone else have an idea how to solve this issue?

      /Arno