Problem with national characters in response after upgrading to Perl 5.8

  • Anund Lie
    Message 1 of 1, Apr 10, 2003
      After upgrading to Perl 5.8, I've got problems with non-ASCII
      characters in SOAP responses. The problem appears to be
      another (the other?) occurrence of one that's already worked
      around in SOAP::Transport::HTTP::send_receive, and the immediate
      fix is also the same.

      The problem is:
      - The XML content string to be output to the transport
      may arrive at the transport in UTF-8 form. In that case,
      the expectation is that the bytes making up the UTF-8
      encoded form of the string are to be output.

      - However, the UTF-8 flag is on, and since character
      semantics apply, the length of the string is computed
      as the number of characters, not the number of bytes.
      Consequently, the computed content-length for HTTP is
      wrong (in Perl 5.6.1).

      - In Perl 5.8, things are even worse: On output, with
      binmode in effect, the UTF-8 encoding is decoded.
      The net effect is to send the content with ISO-8859-1
      encoding instead.
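
      The three points above can be reproduced outside SOAP::Lite.
      What follows is a minimal sketch, not code from the module: the
      sample text is made up, and an in-memory file stands in for the
      socket that a transport would write to.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative sample text (an assumption, not from the original report):
# 10 characters, three of them outside US-ASCII.
my $text = "bl\x{e5}b\x{e6}rgr\x{f8}t";
utf8::upgrade($text);    # force the internal UTF-8 flag on

my $char_len = length($text);                     # counts characters
my $byte_len = length(encode('UTF-8', $text));    # counts UTF-8 octets

# Print the flagged string to a byte-oriented handle (no encoding layer),
# as a transport would. The in-memory file stands in for the socket.
my $wire = '';
open(my $fh, '>', \$wire) or die "cannot open in-memory file: $!";
binmode($fh);
print {$fh} $text;
close($fh);

printf "characters: %d, UTF-8 bytes: %d, bytes written: %d\n",
    $char_len, $byte_len, length($wire);
```

      On Perl 5.8 and later, the character count and the number of
      bytes written should agree with each other but not with the
      UTF-8 byte count, because the handle receives the ISO-8859-1
      form of the string.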

      The problem is worked around by re-packing the content
      string ($envelope) to drop the UTF-8 flag, before
      generating the HTTP request in the send_receive method:

      # what to do? we calculate proper content-length (using
      # bytelength() function from SOAP::Utils) and then drop utf8 mark
      # from string (doing pack with 'C0A*' modifier) if length and
      # bytelength are not the same
      my $bytelength = SOAP::Utils::bytelength($envelope);
      $envelope = pack('C0A*', $envelope)
          if !$SOAP::Constants::DO_NOT_USE_LWP_LENGTH_HACK &&
             length($envelope) != $bytelength;

      This solves the problem with content length, and,
      incidentally, also the problem with emitting the wrong
      bytes, which did not surface until Perl 5.8.

      The problem I ran into is the same, but applies to the HTTP
      server generating its response instead of the HTTP client
      generating its request. The simplest fix appears to be
      to add the same code to re-pack the response string in
      SOAP::Transport::HTTP::Server::make_response, i.e. at the
      start of make_response insert:

      my $bytelength = SOAP::Utils::bytelength($response);
      $response = pack('C0A*', $response)
          if !$SOAP::Constants::DO_NOT_USE_LWP_LENGTH_HACK &&
             length($response) != $bytelength;

      The same problem might apply to other transports that I
      haven't tested.

      However, I think the fixes are in the wrong place, and I'm
      also not sure it is right to blame LWP. The basic problem
      is that the distinction between character and byte semantics
      for Perl strings introduces an ambiguity: When a UTF-8 string
      occurs, does one mean the character string (forget about the
      UTF-8 encoding), or does one mean the actual bytes of the
      UTF-8 encoding? Most of the time, the first interpretation
      is the intention, but in some cases, notably in conjunction
      with IO and network protocols, one is interested in the
      encoded version of the string, and that is surely the case here.

      Perl's preference is for character semantics, except
      that input/output in 5.6.1 breaks that and falls back
      to byte semantics. This is perhaps the real reason
      for the confusion.

      In 5.8, the preference for character semantics is more
      firm, and the problem case above is actually handled
      in a consistent (though wrong) way: The content length
      as computed is equal to the number of bytes emitted,
      but the actual bytes emitted are wrong (ISO-8859-1
      instead of UTF-8 encoding).

      The proper fix is to make it unambiguous that the
      intention is to emit the bytes of the UTF-8 encoding,
      not the characters. The current fix is to drop
      the UTF-8 flag from the string without changing the
      bytes (the encoding). However, doing this in the
      transport handler just fixes the symptom.
      (It also limits the set of encodings that can
      be handled to UTF-8, US-ASCII and the native
      8-bit charset, with major trouble to be
      expected if that is not ISO-8859-1.)
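
      One way to make the intention explicit is to encode the
      character string into octets with the Encode module, and to
      take the content length from those octets. The sketch below
      assumes a helper named encode_for_wire, which is hypothetical
      and not part of SOAP::Lite; the sample text is also made up.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# Hypothetical helper (not part of SOAP::Lite): turn a character string
# into the octets of the named charset, and return the octets together
# with their length, which is what a Content-Length header needs.
sub encode_for_wire {
    my ($chars, $charset) = @_;
    my $octets = encode($charset, $chars);
    return ($octets, length $octets);
}

my $chars = "gr\x{f8}t";    # illustrative text: 4 characters

my ($utf8,   $utf8_len)   = encode_for_wire($chars, 'UTF-8');
my ($latin1, $latin1_len) = encode_for_wire($chars, 'ISO-8859-1');

printf "UTF-8: %d bytes, ISO-8859-1: %d bytes, flag on result: %s\n",
    $utf8_len, $latin1_len, (is_utf8($utf8) ? 'on' : 'off');
```

      Because the charset is an argument, this approach is not tied
      to UTF-8 or to the native 8-bit charset, and the result never
      carries the UTF-8 flag, so length() counts bytes.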

      The first issue to decide is whether the SOAP
      data should be passed to SOAP::Lite in character
      or encoded form. My preference is clearly for
      character semantics, and I think that is the
      intention of the current version also, but it
      is a bit ambiguous for the time being: UTF-8
      strings work as expected, but it is not clear
      whether that is because they are interpreted
      with character semantics or because byte
      semantics apply and the default encoding is UTF-8.

      Given character semantics on input, the next step
      is to determine which component of SOAP::Lite is
      in charge of encoding the string:

      1) The serializer: After all, it does generate the
      encoding attribute in the XML declaration. In that
      case, SOAP::Serializer::xmlize should assume character
      semantics for all its input strings and return
      encoded strings (without the UTF-8 flag) on its
      output. The tests in the current fixes would never
      be true, and that code could be removed.

      2) The transport driver: This would better accommodate
      transport libraries that do their own encoding
      and decoding. For instance, PerlIO layers
      could be used directly on socket and file handles.
      In this case, the send_receive and handle methods
      receive character strings, not byte strings.
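
      Option 2 can be sketched with a PerlIO encoding layer. This
      is only an illustration under stated assumptions: the envelope
      text is made up, and an in-memory file stands in for the socket
      or file handle that a transport would own.

```perl
use strict;
use warnings;

# Illustrative character string (an assumption, not a real SOAP envelope).
my $envelope = "<name>gr\x{f8}t</name>";

# The handle carries the encoding layer, so the caller passes characters
# and the layer emits UTF-8 octets.
my $wire = '';
open(my $out, '>:encoding(UTF-8)', \$wire)
    or die "cannot open in-memory file: $!";
print {$out} $envelope;
close($out);

my $wire_len = length($wire);
printf "characters passed in: %d, bytes emitted: %d\n",
    length($envelope), $wire_len;
```

      A transport that must announce the length up front, as HTTP
      does with Content-Length, would still need the encoded length
      before writing, so a layer alone probably fits best where the
      length is not required in advance.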

      OK, so much for my ramblings... I hope at least
      the simple fix can make it back into a future version
      of SOAP::Lite. As for the more long-term fix:
      If there's sufficient interest, some consensus that
      this is a sensible thing to do and no duplication
      of work someone else has already started, I might
      go ahead and make a first cut at it.

      - Anund