Re: [XSL-FO] Managing Character Sets (was question .....warning signs!!)

  • W. Eliot Kimber
    Message 1 of 5 , Oct 28, 2003
      Dave Pawson wrote:

      > At 15:30 28/10/2003, you wrote:
      >>If the input to your process is always XML then encoding shouldn't be an
      >>issue *as long as* your XML processor can handle the input encodings.
      >>This is because XML documents are required to declare their encoding if
      >>it is not UTF-8 or UTF-16, therefore there should never be any question
      >>as to the encoding of a document on input.
      > That may be a requirement, but it's a considerate supplier who provides it, Eliot.
      > Too many apps fall over on one odd character in 10000, which is ... not
      > easy to find,
      > yet totally stops the application.
      > Hence the caution.

      But this may be a bad data issue, not an encoding issue in many cases:
      that is, the file is a UTF-8 file but happens to have a bad byte in it
      somewhere, for whatever reason, or is a valid UTF-8 file but has a
      non-XML character (remember that XML excludes a few Unicode characters,
      although now I can't remember why). In that case you have to work with
      your data supplier to figure out why the problem is occurring.
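
For what it's worth, here is a minimal Python sketch of how one might locate either kind of problem before going back to the supplier. The function name `find_problems` is my own invention, the character ranges are those of the XML 1.0 `Char` production, and only the first undecodable byte is reported, so treat it as a starting point rather than a finished tool:

```python
import re

# Characters permitted by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NON_XML = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_problems(data: bytes):
    """Return (byte_offset, description) pairs for a supposedly UTF-8 file.

    Hypothetical helper: reports the first invalid UTF-8 byte if decoding
    fails, otherwise every character that XML 1.0 does not allow.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded
        return [(err.start, f"invalid UTF-8 byte 0x{data[err.start]:02X}")]

    problems = []
    for match in _NON_XML.finditer(text):
        # Convert the character index back to a byte offset in the file
        offset = len(text[:match.start()].encode("utf-8"))
        problems.append((offset, f"non-XML character U+{ord(match.group()):04X}"))
    return problems

if __name__ == "__main__":
    print(find_problems(b"ok \xff bad"))          # a stray byte
    print(find_problems("a\x0cb".encode("utf-8")))  # a form feed, valid UTF-8 but not XML
```

The byte offset is what you want when you open the file in a hex viewer or report the problem to whoever produced the data.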

      I understand the issue--I've had it myself, but my point is that it's
      nothing to do with XSL-FO and whether or not it's an appropriate technology.

      This is an issue of supplier/client management, for which we can offer
      no help on this list.

      > And plead with suppliers to take the same care, and hope their applications
      > don't transcode on the way through (which I believe some MS tools do?)

      I avoid MS tools whenever possible, so I wouldn't know. But this should
      not be an issue for conforming XML processors. Before you get to XML
      you're on your own :-)

      > I've found SC Unipad a good buy, almost solely for this purpose.
      > Export as text from Adobe reader 6, and try to markup.

      If you're doing this, you are not getting paid enough. But this is a
      data conversion issue, not an XML or XSL-FO issue and there's really
      nothing short of divine intervention that will help you here. I feel
      your pain, my friend.

      > Unipad shows the chars, allows a substitution, and enables work to continue.
      > Great tool.

      Unipad is indeed a life saver. The only time I find it annoying is when
      it automatically renders numeric character references and "\unnnn"
      notation as characters, making it hard to see what you've really got.
      For that problem I find Stylus Studio to be quite valuable, as it
      handles Unicode very well but doesn't do any automatic
      reference-to-glyph mapping.

      I also use Unipad to convert from Unicode to ASCII with numeric
      character references so I can then use Textpad to do regular expression
      search and replace.
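
If you don't have Unipad to hand, the same Unicode-to-ASCII conversion can be approximated in a couple of lines of Python. This is only a sketch, and `to_ascii_with_ncrs` is a name I made up; it relies on Python's built-in `xmlcharrefreplace` error handler, which emits decimal references and leaves existing `&` and `<` characters untouched:

```python
def to_ascii_with_ncrs(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character
    reference, leaving ASCII characters (including & and <) unchanged."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

if __name__ == "__main__":
    print(to_ascii_with_ncrs("caf\u00e9 \u20ac5"))
```

The result is plain ASCII, so any regular-expression tool can work on it without caring about encodings.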

      >>Also, the Java SDK comes with a utility called native2ascii that can
      >>convert between any of the encodings that your Java installation
      >>supports, which is pretty much all of them you are likely to encounter
      >>(unless you're dealing with really old and obscure national language
      >>encodings, in which case you've already had to develop the expertise
      >>needed to handle encoding issues).
      > ??? or the remnants of the 18th Century? We still receive EBCDIC (if that's
      > how it's spelled)

      I know there are standalone EBCDIC-to-ASCII converters available, but
      that is an edge case.
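
As an illustration only: if the encoding is known, transcoding is usually just decode-then-encode. The sketch below uses Python rather than native2ascii, the function name `transcode` is hypothetical, and it assumes the EBCDIC data is code page 037 (Python's `cp037` codec). Real EBCDIC files come in many code pages, so that assumption needs checking with the supplier:

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target.

    Raises UnicodeDecodeError if the data is not valid in the source encoding.
    """
    return data.decode(source).encode(target)

if __name__ == "__main__":
    # "Hello" in EBCDIC code page 037 (assumed code page, see above)
    ebcdic = b"\xC8\x85\x93\x93\x96"
    print(transcode(ebcdic, "cp037"))
```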

      > IIRC unipad isn't too clever on inputs? It has gone the clean route
      > and said I only want Unicode input.

      Not true, although its encoding detection can be a little weak. But if
      you know the encoding, you can tell it what encoding to use when you open
      the file. It's usually pretty obvious if you've guessed wrong.
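
To show what "obvious" looks like, here is a small Python sketch (nothing to do with how Unipad itself works) that decodes the same bytes under several candidate encodings so you can compare the results by eye. The function name `preview` and the candidate list are just assumptions for the example:

```python
def preview(data: bytes, encodings):
    """Decode data under each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to None
    when the bytes are not valid in that encoding.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

if __name__ == "__main__":
    sample = "caf\u00e9".encode("utf-8")
    for name, text in preview(sample, ["utf-8", "latin-1", "ascii"]).items():
        print(f"{name:8} {text!r}")
```

A wrong guess tends to show up either as an outright decoding failure or as telltale pairs of accented characters where a single character should be.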


      W. Eliot Kimber
      Innodata Isogen