Re: SGF 4 encoding issue
Hello there,
I see what you mean.
> The SGF 4 specification says that the CA property defines the charset
> used to encode property values of types "text" and "simpletext" and that
> property values themselves are to be delimited by the '[' and ']'
> characters. The problem I'm having is deciding when a textual property
> value is done -- that is, how does one recognize the terminating ']'
> character while reading a sequence of characters encoded in a non-ASCII
> encoding (or ASCII preserving encoding, such as ISO-8859-1)? Does one
> assume that the terminating ']' is expressed in the encoding specified
> in the CA property, or is the terminating ']' always in ASCII? If the
> terminating ']' is in ASCII, how can you tell it's the terminating ']'
> when the text encoding permits a 0x5D byte (']' in ASCII) as the initial
> octet of an encoded character?
Have you actually encountered that problem?
I thought that the most common encodings (ISO-8859-X, UTF-8 and UTF-16)
don't have a problem with this. If you have a counter-example I am eager
to learn about it. For ISO-8859-X and UTF-8 I'm sure there is no such
example, since ISO-8859-X is single-byte and every byte of a multi-byte
UTF-8 sequence is 0x80 or above, but for UTF-16 I'm not 100% sure.
The trick is that you read the property values according to the
encoding. E.g. let's say the text is "C[Bö]" (I hope you see the
umlaut-o) and let's say some encoding XY turns this into:
"C [ 0x21 0xc7 0x5d ]". When you parse the value according to the XY
encoding and read it character by character (not byte by byte!), the
first character will be "0x21"='B' and the second will be
"0xc7 0x5d"='ö'. When you then read the ']' you know you are finished.
Hm, just thinking: this approach may not work with UTF-16, because the
decoder will not read the single-byte ']' on its own. It will take that
byte together with the next one and make some funny character out of
the pair.
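To make both points concrete, here is a minimal Python sketch of the character-by-character reading. The function name read_value is made up for illustration, and Shift_JIS stands in for the hypothetical encoding XY because it really does allow 0x5D as the second byte of a character (U+30BE is 0x83 0x5D). Escaping is reduced to "a backslash protects the next character"; the spec's soft line break rules are left out.

```python
import codecs

def read_value(data: bytes, start: int, encoding: str):
    """Decode one property value; data[start] is the byte right after '['.
    Returns (text, index of the byte after the closing ']')."""
    decoder = codecs.getincrementaldecoder(encoding)()
    chars = []
    escaped = False
    for i in range(start, len(data)):
        # Feed one byte; the decoder yields "" until a character is complete.
        for ch in decoder.decode(data[i:i + 1]):
            if escaped:
                chars.append(ch)
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == "]":
                return "".join(chars), i + 1
            else:
                chars.append(ch)
    raise ValueError("unterminated property value")

# Shift_JIS: a byte scan finds 0x5D inside the character, the decoder does not.
zo = "\u30be"                                  # KATAKANA LETTER ZO
data = b"C[" + zo.encode("shift_jis") + b"]"
naive_end = data.index(b"]", 2)                # 3, in the middle of the character
text, end = read_value(data, 2, "shift_jis")   # (zo, 5)

# UTF-16 values inside ASCII structure: the lone ']' byte is paired with
# whatever byte follows it, so the terminator is never seen as a character.
funny = b"];".decode("utf-16-le")              # one code unit, 0x3B5D
mixed = b"C[" + "B".encode("utf-16-le") + b"];)"
try:
    read_value(mixed, 2, "utf-16-le")
    utf16_terminated = True
except ValueError:
    utf16_terminated = False
```

With Shift_JIS the decoder-based reader returns the whole value, while the plain byte scan stops one byte too early. With UTF-16 values embedded in ASCII structure the reader runs off the end, which is the case I was worried about.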
Well, actually, then the spec should be updated to state that, beginning
with the first property *after* the CA property, the encoding applies
to all of the SGF file, not just to its property values.
I guess there can be exceptions for UTF-8 and UTF-16 when they use
their magic numbers (byte order marks) at the beginning of the file.
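Checking for those magic numbers could look roughly like this in Python. This is only a sketch: sniff_bom is a hypothetical helper, not anything the spec defines, and UTF-32 is ignored even though its little-endian mark starts with the same two bytes as the UTF-16 one.

```python
import codecs

# Byte order marks to look for, paired with the encoding each one implies.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no mark is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

with_bom = sniff_bom(codecs.BOM_UTF8 + b"(;GM[1])")
without_bom = sniff_bom(b"(;GM[1])")
```

If a mark is found, the whole file, structure included, could be decoded with that encoding before parsing. Otherwise the parser would fall back to reading ASCII structure and looking for CA.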
@list: does anyone else have an idea how to solve this issue?