Re: Method calls failing due to malformed utf-8?

  • Eric Promislow
    Message 1 of 4 , Dec 14, 2004
      It's failing because the character with value 3 is an invalid XML character,
      even if it's encoded as a character reference.

      From the XML spec:

      Character Range
      [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
      [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character,
      excluding the surrogate blocks, FFFE, and FFFF. */
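As a rough illustration of production [2], the ranges can be written as a small Perl check. This is only a sketch; the function name `is_xml_char` is made up for this example and is not part of SOAP::Lite or XML::Parser.

```perl
use strict;
use warnings;

# Return 1 if the code point matches production [2] (Char) of XML 1.0,
# 0 otherwise. The name is_xml_char is hypothetical.
sub is_xml_char {
    my ($cp) = @_;
    return 1 if $cp == 0x9 || $cp == 0xA || $cp == 0xD;
    return 1 if $cp >= 0x20    && $cp <= 0xD7FF;
    return 1 if $cp >= 0xE000  && $cp <= 0xFFFD;
    return 1 if $cp >= 0x10000 && $cp <= 0x10FFFF;
    return 0;
}

for my $cp (0x3, 0x9, 0x20, 0xFFFE) {
    printf "U+%04X is %s\n", $cp, is_xml_char($cp) ? 'legal' : 'illegal';
}
```

The character with value 3 falls below #x20 and is not one of #x9, #xA or #xD, so it is reported as illegal.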

      Fine, you say, I'll just use a character reference to represent a character
      outside this range. Not so fast:

      4.1 Character and Entity References

      [Definition: A character reference refers to a specific character in
      the ISO/IEC 10646 character set, for example one not directly
      accessible from available input devices.]
      Character Reference
      [66] CharRef ::= '&#' [0-9]+ ';'
      | '&#x' [0-9a-fA-F]+ ';' [WFC: Legal Character]

      Well-formedness constraint: Legal Character

      Characters referred to using character references MUST match the
      production for Char.
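The Legal Character constraint can be sketched as a scan over the raw XML text that lists every character reference whose value does not match Char. The helper names `is_xml_char` and `illegal_char_refs` are invented for this example, and it ignores CDATA sections and comments for brevity.

```perl
use strict;
use warnings;

# True if the code point matches production [2] (Char). Hypothetical helper.
sub is_xml_char {
    my ($cp) = @_;
    return ($cp == 0x9 || $cp == 0xA || $cp == 0xD
        || ($cp >= 0x20    && $cp <= 0xD7FF)
        || ($cp >= 0xE000  && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF)) ? 1 : 0;
}

# Return every character reference in $xml that violates the
# Legal Character well-formedness constraint. Hypothetical helper.
sub illegal_char_refs {
    my ($xml) = @_;
    my @bad;
    while ($xml =~ /&#(x[0-9a-fA-F]+|[0-9]+);/g) {
        my $ref = $1;
        # A leading "x" marks a hexadecimal reference, otherwise decimal.
        my $cp = substr($ref, 0, 1) eq 'x' ? hex(substr($ref, 1)) : $ref + 0;
        push @bad, "&#$ref;" unless is_xml_char($cp);
    }
    return @bad;
}

my @bad = illegal_char_refs('<Title>Eternity Ring &#x3; 12,990</Title>');
print "illegal reference: $_\n" for @bad;
```

Running a check like this over a saved response is one way to find out which references the parser will refuse before it dies on them.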

      This means that there is no way to represent characters that do not fall in the
      range specified in production [2] within XML. You need to use an external
      encoding, like base64 (utf-8 would be an internal encoding, as the parser
      is processing it).
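A minimal sketch of the external-encoding idea, using the core MIME::Base64 module. It assumes both ends agree that the element carries base64 text; the title string is just the sample value from this thread with the control character written out.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Sample value containing the character with value 3.
my $title = "Eternity Ring \x03 12,990";

# Sender side: the second argument '' suppresses the trailing newline.
my $encoded = encode_base64($title, '');
my $xml     = "<Title>$encoded</Title>";

# Receiver side: after parsing, decode the element content again.
my ($payload) = $xml =~ m{<Title>(.*?)</Title>};
my $decoded   = decode_base64($payload);

print "round trip ok\n" if $decoded eq $title;
```

The XML now contains only base64 characters, all of which match Char, so the parser has nothing to object to. The cost is that the receiver has to know which elements to decode.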

      Unfortunately this breaks the loosely coupled nature of SOAP.

      One solution is to tell SOAP::Lite to use a parser that allows non-XML
      characters if they are encoded as character references, as the .NET 1.0
      and 1.1 parsers do (apparently this will be fixed in Whidbey).

      Another is to grab the code before SOAP hands it to the parser for
      deserializing, and do a pass over it, something like:

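      # Escape the "&" of references to characters below 32, so the parser
      # sees plain text such as "&amp;#x3;" instead of an illegal reference.
      $data =~ s/&(#x0*1?[0-9a-fA-F];)/&amp;$1/g;
      $data =~ s/&(#0*[12]?\d;)/&amp;$1/g;
      $data =~ s/&(#0*3[01];)/&amp;$1/g;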

      and then post-process this code after the XML parser returns it to
      your application.
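The two passes described above might look roughly like this. It is a sketch under some assumptions: only references to control characters below 32 other than tab, line feed and carriage return are treated as illegal, and the names `ref_to_cp`, `is_illegal_control`, `escape_illegal_refs` and `restore_refs` are invented for the example.

```perl
use strict;
use warnings;

# Convert the body of a reference ("x3" or "3") to a code point.
sub ref_to_cp {
    my ($ref) = @_;
    return substr($ref, 0, 1) eq 'x' ? hex(substr($ref, 1)) : $ref + 0;
}

# Control characters below 32 that production [2] does not allow.
sub is_illegal_control {
    my ($cp) = @_;
    return $cp < 0x20 && $cp != 0x9 && $cp != 0xA && $cp != 0xD;
}

# Pass 1, before parsing: "&#x3;" becomes the literal text "&amp;#x3;".
sub escape_illegal_refs {
    my ($xml) = @_;
    $xml =~ s{&#(x[0-9a-fA-F]+|[0-9]+);}{
        is_illegal_control(ref_to_cp($1)) ? "&amp;#$1;" : "&#$1;"
    }ge;
    return $xml;
}

# Pass 2, after parsing: the value now holds the text "&#x3;";
# replace it with the actual character.
sub restore_refs {
    my ($text) = @_;
    $text =~ s{&#(x[0-9a-fA-F]+|[0-9]+);}{
        my $cp = ref_to_cp($1);
        is_illegal_control($cp) ? chr($cp) : "&#$1;"
    }ge;
    return $text;
}

my $safe = escape_illegal_refs('<Title>Eternity Ring &#x3; 12,990</Title>');
print "$safe\n";
```

One caveat: if the original data already contained the literal text `&amp;#x3;`, the second pass cannot tell it apart from an escaped reference and will convert it as well.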

      - Eric (making my annual foray into this list).

      > Hi Group - boy am I glad I've found you guys!

      > I'm interfacing to a third party web service using SOAP::Lite and
      > have run into a problem with my method calls failing on certain
      > requests.

      > The code fragment I use is like this;

      > my $method = SOAP::Data->name('GetInformation')
      >     ->attr({xmlns => 'http://someservice/soap/'});

      > my $result = $soap->call($method => @params);

      > and it is this method call that is sometimes failing - depending on
      > the parameters I pass. Oh the 3rd party is a .NET service but I
      > believe I have accommodated that correctly as the system seems to
      > work OK most of the time.

      > The error I get is;

      > reference to invalid character number at line 1, column 5267, byte
      > 5267 at E:/Perl/site/lib/XML/Parser.pm line 187

      > The XML returned begins like this;

      > <?xml version="1.0" encoding="utf-8"?><soap:Envelope
      > xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
      > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      > xmlns:xsd="http://www.w3.org/2001/XMLSchema"><soap:Body>

      > I am assuming - perhaps rashly - that this is because the XML
      > document contains a character which isn't UTF-8?

      > I note that in the area where the parser complains, I have a line
      > that looks like this;

      > <Title>Eternity Ring &#x3; 12,990</Title>

      > Firstly, is my diagnosis correct? I'm guessing that &#x3; isn't a
      > valid UTF-8 character - or am I wrong here. I'm a bit out of my
      > depth.

      > Secondly (if my assumption is true) - whilst I can report this to
      > the service provider and hope they clean their data, in the meantime
      > is there a workaround for this that allows me to relax the rules
      > that the soap::lite parser is enforcing. At present I just put an
      > eval block around the method call to trap the error and stop
      > additional processing of this item - but I'd really prefer to be
      > able to process the data anyway.

      > Any clues, links, advice much appreciated.

      > Many Thanks
      > Roger
      > UK

      > PS. My system is Win2003 server / Perl 5.8.3 build 809 / SOAP-LITE
      > 0.55
    • eric-amick@comcast.net
      Message 2 of 4 , Dec 15, 2004

        > I am assuming - perhaps rashly - that this is because the XML
        > document contains a character which isn't UTF-8?
        >
        > I note that in the area where the parser complains, I have a
        > line that looks like this;
        >
        > <Title>Eternity Ring &#x3; 12,990</Title>
        >
        > Firstly, is my diagnosis correct? I'm guessing that &#x3; isn't a
        > valid UTF-8 character - or am I wrong here. I'm a bit out of my
        > depth.

        You're close. It's perfectly valid Unicode (UTF-8 is just a method of encoding Unicode); the problem is that XML 1.0 does not allow that character, so a conforming XML processor must reject it even when it is written as a character reference.
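To illustrate the distinction, here is a small sketch using the core Encode module. It shows that the character with value 3 encodes to UTF-8 and decodes back without complaint, so the objection really does come from the XML rules rather than from the encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "\x{3}";

# FB_CROAK makes Encode die on malformed data; LEAVE_SRC keeps the
# source string from being modified in place.
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
my $bytes = encode('UTF-8', $char,  $check);
my $back  = decode('UTF-8', $bytes, $check);

printf "encoded to %d byte(s), round trip %s\n",
    length($bytes), $back eq $char ? 'ok' : 'failed';
```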


        > Secondly (if my assumption is true) - whilst I can report this to
        > the service provider and hope they clean their data, in the meantime
        > is there a workaround for this that allows me to relax the rules
        > that the soap::lite parser is enforcing. At present I just put an
        > eval block around the method call to trap the error and stop
        > additional processing of this item - but I'd really prefer to be
        > able to process the data anyway.

        You could probably modify SOAP::Lite to allow all characters in the range \x00-\x1f, but I have no idea if it would break anything else.
         
        --
        Eric Amick
        Columbia, MD
      • Roger
        Message 3 of 4 , Dec 15, 2004
          Ooops, I posted this reply about 8 hours ago but think I
          accidentally posted it to the author in error. Apologies to Eric and
          the group. Here's what I meant to say!

          <eric.promislow@g...> wrote:

          > Another [solution] is to grab the code before SOAP hands it to the
          > parser for deserializing, and do a pass over it, something like:
          >
          > $data =~ s/&(#x0*1?[0-9a-fA-F];)/&amp;$1/g;
          > $data =~ s/&(#0*[12]?\d;)/&amp;$1/g;
          > $data =~ s/&(#0*3[01];)/&amp;$1/g;
          >
          > and then post-process this code after the XML parser returns it to
          > your application.
          >

          Yes, I like this plan - but I fear it is beyond my soap::lite
          knowledge. From my googling exploits, I imagine it has something to
          do with SOAP::Transport::HTTP::Client ... but does anyone know HOW I
          can intercept the XML as Eric suggests and then pass on the
          translated version to the parser?
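One possible shape for such an interception is a subclass that filters the XML and then delegates to the normal deserializer. This is a sketch, not a tested SOAP::Lite recipe: `Stub::Deserializer` is a hypothetical stand-in so that the example runs on its own, and the assumption is that with SOAP::Lite the parent class would be SOAP::Deserializer, whose `deserialize` method receives the raw response text.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real parser class, so this sketch runs
# without SOAP::Lite installed. Assumption: in real use the parent
# would be SOAP::Deserializer.
package Stub::Deserializer;
sub new         { my $class = shift; return bless {}, $class }
sub deserialize { my ($self, $xml) = @_; return "parsed: $xml" }

package Filtering::Deserializer;
our @ISA = ('Stub::Deserializer');

sub deserialize {
    my ($self, $xml) = @_;
    # Escape references to control characters below 32, leaving
    # tab (9), line feed (10) and carriage return (13) alone.
    $xml =~ s{&(#x0*(?:[0-8bBcCeEfF]|1[0-9a-fA-F]);)}{&amp;$1}g;
    $xml =~ s{&(#0*(?:[0-8]|1[124-9]|2[0-9]|3[01]);)}{&amp;$1}g;
    # Hand the cleaned text to the parent class for normal parsing.
    return $self->SUPER::deserialize($xml);
}

package main;

my $d = Filtering::Deserializer->new;
print $d->deserialize('<Title>Eternity Ring &#x3; 12,990</Title>'), "\n";
```

If the assumption about SOAP::Deserializer holds for your version, the filtering object would be handed to the client through its `deserializer` accessor, for example `$soap->deserializer(Filtering::Deserializer->new)`. Check the SOAP::Lite documentation before relying on this.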

          Any clues appreciated.

          Roger
          London, UK