Re: [XSL-FO] Managing Character Sets (was question .....warning signs!!)
- Dave Pawson wrote:
> At 15:30 28/10/2003, you wrote:
>>If the input to your process is always XML then encoding shouldn't be an
>>issue *as long as* your XML processor can handle the input encodings.
>>This is because XML documents are required to declare their encoding if
>>it is not UTF-8, therefore there should never be any question as to the
>>encoding of a document on input.
> That may be a requirement, but it's a considerate supplier who provides it, Eliot.
> Too many apps fall over on one odd character in 10,000, which is .... not
> easy to find,
> yet totally stops the application.
> Hence the caution.
But this may be a bad data issue, not an encoding issue, in many cases: that
is, the file is a UTF-8 file but happens to have a bad byte in it
somewhere, for whatever reason, or is a valid UTF-8 file but contains a
non-XML character (remember that XML excludes a few Unicode characters,
although now I can't remember why). In that case you have to work with
your data supplier to figure out why the problem is occurring.
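Finding that one odd character in 10,000 doesn't have to mean eyeballing the file. A minimal sketch (in Python, as one possible tool for the job): decode the bytes as UTF-8 and report where the first failure is, then scan for characters that XML 1.0 forbids. The function name and the sample inputs are illustrative, not from any particular tool.

```python
import re

def first_problem(data: bytes):
    """Return (kind, offset) for the first bad byte or non-XML char, or None."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as e:
        return ("bad byte", e.start)          # byte offset of the bad sequence
    # XML 1.0 allows tab, LF, CR, and most chars from U+0020 up, excluding
    # surrogates and U+FFFE/U+FFFF; everything else is the "non-XML" set.
    m = re.search(r"[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]",
                  text)
    if m:
        return ("non-XML char", m.start())    # character offset this time
    return None

print(first_problem(b"ok \xff oops"))                 # ('bad byte', 3)
print(first_problem("ok \x07 bell".encode("utf-8")))  # ('non-XML char', 3)
print(first_problem(b"all fine"))                     # None
```

With the offset in hand you at least know exactly which byte to show the supplier.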
I understand the issue--I've had it myself--but my point is that it has
nothing to do with XSL-FO or whether XSL-FO is an appropriate technology.
This is an issue of supplier/client management, for which we can offer
no help on this list.
> And plead with suppliers to take the same care, and hope their applications
> don't transcode on the way through (which I believe some MS tools do?)
I avoid MS tools whenever possible, so I wouldn't know. But this should
not be an issue for conforming XML processors. Before you get to XML
you're on your own :-)
> I've found SC Unipad a good buy, almost solely for this purpose.
> Export as text from Adobe Reader 6, and try to mark up.
If you're doing this, you are not getting paid enough. But this is a
data conversion issue, not an XML or XSL-FO issue, and there's really
nothing short of divine intervention that will help you here. I feel
your pain, my friend.
> Unipad shows the chars, allows a substitution, and enables work to continue.
> Great tool.
Unipad is indeed a life saver. The only time I find it annoying is when
it automatically renders numeric character references and "\unnnn"
notation as characters, making it hard to see what you've really got.
For that problem I find Stylus Studio to be quite valuable, as it
handles Unicode very well but doesn't do any automatic rendering.
I also use Unipad to convert from Unicode to ASCII with numeric
character references so I can then use Textpad to do regular expression
search and replace.
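That Unicode-to-ASCII-with-references round trip can also be scripted. A rough equivalent in Python (just a sketch of the same idea, not what Unipad actually does internally): escape everything outside ASCII as decimal numeric character references so an ASCII-only editor can do the regex work, then unescape afterwards.

```python
import re

def to_ncr(text: str) -> str:
    # xmlcharrefreplace turns every non-ASCII char into a decimal NCR.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

def from_ncr(text: str) -> str:
    # Reverse the trip; handles decimal references only, which is all
    # to_ncr produces.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

s = "na\u00efve caf\u00e9"
escaped = to_ncr(s)
print(escaped)                  # na&#239;ve caf&#233;
assert from_ncr(escaped) == s   # lossless round trip
```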
>>Also, the Java SDK comes with a utility called native2ascii that can
>>convert between any of the encodings that your Java installation
>>supports, which is pretty much all of them you are likely to encounter
>>(unless you're dealing with really old and obscure national language
>>encodings, in which case you've already had to develop the expertise
>>needed to handle encoding issues).
> ??? or the remnants of the 18th Century? We still receive EBCDIC (if that's
> how it's spelled)
I know there are standalone EBCDIC-to-ASCII converters available, but
that is an edge case.
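For what it's worth, it needn't even be a standalone converter: most scripting runtimes ship EBCDIC codecs. A sketch in Python, assuming cp037 (the common US/Canada EBCDIC code page; your supplier's variant may well differ):

```python
# Round-trip a string through EBCDIC code page 037.
ebcdic = "HELLO".encode("cp037")
print(ebcdic.hex())              # c8c5d3d3d6 -- EBCDIC bytes, not ASCII
print(ebcdic.decode("cp037"))    # HELLO
```

native2ascii from the Java SDK does the same job via Java's charset tables, without writing any code.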
> IIRC unipad isn't too clever on inputs? It has gone the clean route
> and said I only want Unicode input.
Not true, although its encoding detection can be a little weak. But if
you know the encoding, you can tell it what encoding to use when you open
the file. It's usually pretty obvious if you've guessed wrong.
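"Obvious if you've guessed wrong" is literal: a wrong guess either throws an error or produces visible mojibake. In Python terms, for example:

```python
# Bytes that are valid Latin-1 but not valid UTF-8.
data = "caf\u00e9".encode("latin-1")
print(data.decode("latin-1"))             # café  (right guess)
try:
    data.decode("utf-8")                  # wrong guess: 0xE9 can't stand alone
except UnicodeDecodeError as e:
    print("bad guess at byte", e.start)   # bad guess at byte 3
```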
W. Eliot Kimber