Re: [XSL-FO] Managing Character Sets (was question .....warning signs!!)

  • W. Eliot Kimber
    Message 1 of 5, Oct 28, 2003
      Dave Pawson wrote:

      > At 15:30 28/10/2003, you wrote:
      >
      >
      >>If the input to your process is always XML then encoding shouldn't be an
      >>issue *as long as* your XML processor can handle the input encodings.
      >>This is because XML documents are required to declare their encoding if
      >>it is not UTF-8, therefore there should never be any question as to the
      >>encoding of a document on input.
      >
      >
      > That may be a requirement, but it's a considerate supplier who provides it, Eliot.
      > Too many apps fall over on one odd character in 10000, which is not
      > easy to find,
      > yet totally stops the application.
      > Hence the caution.

      But this may be a bad data issue, not an encoding issue in many cases:
      that is, the file is a UTF-8 file but happens to have a bad byte in it
      somewhere, for whatever reason, or is a valid UTF-8 file but has a
      non-XML character (remember that XML excludes a few Unicode characters,
      although now I can't remember why). In that case you have to work with
      your data supplier to figure out why the problem is occurring.
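
To make the two cases above concrete, here is a minimal Python sketch that tells a bad UTF-8 byte apart from a well-formed but non-XML character. The function name `diagnose` and the tuple layout are made up for illustration; the character class is the Char production from XML 1.0. Note that the first offset is a byte offset and the second is a character offset.

```python
import re

# Complement of the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NON_XML = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def diagnose(data: bytes):
    """Return a list of (kind, offset, detail) problems found in data.

    Hypothetical helper: reports the first undecodable byte if the data
    is not valid UTF-8, otherwise every character XML 1.0 does not allow.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset of the first offending byte.
        return [("bad-utf8-byte", exc.start, hex(data[exc.start]))]
    # Valid UTF-8: look for characters outside the XML Char production.
    return [
        ("non-xml-char", m.start(), f"U+{ord(m.group()):04X}")
        for m in _NON_XML.finditer(text)
    ]
```

Either result gives you something specific to take back to the supplier: an offset and the offending byte or code point.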

      I understand the issue--I've had it myself, but my point is that it's
      nothing to do with XSL-FO and whether or not it's an appropriate technology.

      This is an issue of supplier/client management, for which we can offer
      no help on this list.

      > And plead with suppliers to take the same care, and hope their applications
      > don't transcode on the way through (which I believe some MS tools do?)

      I avoid MS tools whenever possible, so I wouldn't know. But this should
      not be an issue for conforming XML processors. Before you get to XML
      you're on your own :-)

      > I've found SC Unipad a good buy, almost solely for this purpose.
      > Export as text from Adobe reader 6, and try to markup.

      If you're doing this, you are not getting paid enough. But this is a
      data conversion issue, not an XML or XSL-FO issue and there's really
      nothing short of divine intervention that will help you here. I feel
      your pain my friend.

      > Unipad shows the chars, allows a substitution, and enables work to continue.
      > Great tool.

      Unipad is indeed a life saver. The only time I find it annoying is when
      it automatically renders numeric character references and "\unnnn"
      notation as characters, making it hard to see what you've really got.
      For that problem I find Stylus Studio to be quite valuable, as it
      handles Unicode very well but doesn't do any automatic
      reference-to-glyph mapping.

      I also use Unipad to convert from Unicode to ASCII with numeric
      character references so I can then use Textpad to do regular expression
      search and replace.
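
If you don't have Unipad to hand, the same Unicode-to-ASCII step can be sketched in Python with the standard `xmlcharrefreplace` error handler, which writes decimal numeric character references. The function name `to_ascii_ncr` is just an illustrative choice.

```python
def to_ascii_ncr(text: str) -> str:
    """Return text as pure ASCII, with every non-ASCII character
    replaced by a decimal numeric character reference (&#NNNN;)."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # U+00E9 and U+20AC become references; ASCII passes through unchanged.
    print(to_ascii_ncr("caf\u00e9 \u20ac5"))
```

The result is plain ASCII, so any regular-expression tool can search it without worrying about encodings.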

      >
      >
      >>Also, the Java SDK comes with a utility called native2ascii that can
      >>convert between any of the encodings that your Java installation
      >>supports, which is pretty much all of them you are likely to encounter
      >>(unless you're dealing with really old and obscure national language
      >>encodings, in which case you've already had to develop the expertise
      >>needed to handle encoding issues).
      >
      >
      > ??? or the remnants of the 18th Century? We still receive EBCDIC (if that's
      > how it's spelled)

      I know there are standalone EBCDIC-to-ASCII converters available, but
      that is an edge case.
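
For what it's worth, a rough sketch of such a conversion in Python, assuming the data is in EBCDIC code page 037 (US/Canada). That is an assumption: real feeds may use a different code page such as cp500 or cp1140, and you have to ask the supplier which one. The function name is hypothetical.

```python
def ebcdic_to_utf8(data: bytes, codepage: str = "cp037") -> bytes:
    """Transcode EBCDIC bytes to UTF-8.

    The default code page (cp037) is only an assumption; pass the one
    your supplier actually uses.
    """
    return data.decode(codepage).encode("utf-8")


if __name__ == "__main__":
    # The bytes C8 C5 D3 D3 D6 spell HELLO in code page 037.
    print(ebcdic_to_utf8(b"\xc8\xc5\xd3\xd3\xd6"))
```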

      >
      > IIRC unipad isn't too clever on inputs? It has gone the clean route
      > and said I only want Unicode input.

      Not true, although its encoding detection can be a little weak. But if
      you know the encoding, you can tell it what encoding to use when you open
      the file. It's usually pretty obvious if you've guessed wrong.
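
The same "try it and look" approach can be sketched outside Unipad: decode the bytes under each candidate encoding and compare the results by eye. The helper name `preview` and the candidate list are illustrative only.

```python
def preview(data: bytes, encodings):
    """Decode data under each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to
    None where the bytes are not valid in that encoding.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    sample = b"caf\xe9"  # one accented character, stored as a single byte
    for name, text in preview(sample, ["utf-8", "latin-1", "cp437"]).items():
        print(name, repr(text))
```

A failed decode rules a candidate out immediately; single-byte encodings never fail this way, so for those you have to read the output and see whether the accented characters look right.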

      Cheers,

      E.
      --
      W. Eliot Kimber
      Innodata Isogen
      eliot@...
      www.isogen.com