Problem with national characters in response after upgrading to Perl 5.8
After upgrading to Perl 5.8, I've got problems with non-ASCII
characters in SOAP responses. The problem appears to be
another (the other?) occurrence of one that's already worked
around in SOAP::Transport::HTTP::send_receive, and the immediate
fix is also the same.
The problem is:
- The XML content string to be output to the transport
may arrive at the transport in UTF-8 form. In that case,
the expectation is that the bytes making up the UTF-8
encoded form of the string are to be output.
- However, the UTF-8 flag is on, and since character
semantics apply, the length of the string is computed
as the number of characters, not the number of bytes.
Consequently, the computed content-length for HTTP is
wrong (in Perl 5.6.1).
- In Perl 5.8, things are even worse: On output, with
binmode in effect, the UTF-8 encoding is decoded.
The net effect is to send the content in ISO-8859-1
encoding instead of UTF-8.
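
The three points above can be reproduced in a few lines. This is only a sketch: the sample envelope is made up, and an in-memory file handle stands in for the HTTP socket (an assumption for illustration, not what SOAP::Lite does).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Made-up sample envelope with one non-ASCII character (U+00E9).
my $envelope = "<name>Ren\x{e9}</name>";
utf8::upgrade($envelope);    # internal UTF-8 form, UTF-8 flag on

my $char_count = length($envelope);                   # character semantics
my $byte_count = length(encode('UTF-8', $envelope));  # bytes of the UTF-8 form

# Stand-in for the socket: an in-memory handle in binmode, with no
# encoding layer, as a transport writing raw bytes would use.
my $sent = '';
open my $fh, '>', \$sent or die "cannot open in-memory handle: $!";
binmode $fh;
print {$fh} $envelope;
close $fh or die "close failed: $!";

printf "characters=%d utf8_bytes=%d bytes_sent=%d\n",
    $char_count, $byte_count, length($sent);
```

On Perl 5.8 and later the character count and the number of bytes sent agree (17 each here), but the UTF-8 form needs 18 bytes: the e-acute goes out as the single ISO-8859-1 byte 0xE9.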
The problem is worked around by re-packing the content
string ($envelope) to drop the UTF-8 flag, before
generating the HTTP request in the send_receive method:
    # what to do? we calculate proper content-length (using
    # bytelength() function from SOAP::Utils) and then drop utf8 mark
    # from string (doing pack with 'C0A*' modifier) if length and
    # bytelength are not the same
    my $bytelength = SOAP::Utils::bytelength($envelope);
    $envelope = pack('C0A*', $envelope)
        if !$SOAP::Constants::DO_NOT_USE_LWP_LENGTH_HACK &&
           length($envelope) != $bytelength;
This solves the problem with content length, and,
incidentally, also the problem with emitting the wrong
bytes, which did not surface until Perl 5.8.
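
To show the effect of the workaround in isolation, here is a stand-alone sketch that does not need SOAP::Lite. The names byte_length and drop_utf8_flag are hypothetical: the first is my stand-in for SOAP::Utils::bytelength, and the second uses utf8::encode in place of the pack('C0A*', ...) trick, on the assumption that both are meant to keep the UTF-8 bytes and clear the flag.

```perl
use strict;
use warnings;

# Stand-in for SOAP::Utils::bytelength (assumption): the length in
# bytes of the string's internal representation.
sub byte_length {
    my ($string) = @_;
    use bytes;               # byte semantics for the rest of this block
    return length($string);
}

# Hypothetical helper mirroring the workaround: only touch the string
# when its character length and byte length differ.
sub drop_utf8_flag {
    my ($string) = @_;
    utf8::encode($string) if length($string) != byte_length($string);
    return $string;
}

my $envelope = "<name>Ren\x{e9}</name>";   # made-up sample
utf8::upgrade($envelope);                  # UTF-8 flag on

my $octets = drop_utf8_flag($envelope);
printf "Content-Length: %d\n", length($octets);
```

After the call, length() counts the 18 bytes of the UTF-8 form, so the Content-Length header and the bytes written agree. A string that holds native 8-bit characters without the UTF-8 flag passes through unchanged, just as with the original test.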
The problem I ran into is the same, but applies to the HTTP
server generating its response instead of the HTTP client
generating its request. The simplest fix appears to be
to add the same code to re-pack the response string in
SOAP::Transport::HTTP::Server::make_response, i.e. at the
start of make_response insert:
    my $bytelength = SOAP::Utils::bytelength($response);
    $response = pack('C0A*', $response)
        if !$SOAP::Constants::DO_NOT_USE_LWP_LENGTH_HACK &&
           length($response) != $bytelength;
The same problem might apply to other transports.
However, I think the fixes are in the wrong place, and I'm
also not sure it is right to blame LWP. The basic problem
is that the distinction between character and byte semantics
for Perl strings introduces an ambiguity: When a UTF-8 string
occurs, does one mean the character string (forget about the
UTF-8 encoding), or does one mean the actual bytes of the
UTF-8 encoding? Most of the time, the first interpretation
is the intention, but in some cases, notably in conjunction
with IO and network protocols, one is interested in the
encoded version of the string, and that is surely the case here.
Perl's preference is for character semantics, except
that input/output in 5.6.1 breaks that and falls back
to byte semantics. This is perhaps the real reason
for the confusion.
In 5.8, the preference for character semantics is more
firm, and the problem case above is actually handled
in a consistent (though wrong) way: The content length
as computed is equal to the number of bytes emitted,
but the actual bytes emitted are wrong (ISO-8859-1
instead of UTF-8 encoding).
The proper fix is to make it unambiguous that the
intention is to emit the bytes of the UTF-8 encoding,
not the characters. The current fix is to drop
the UTF-8 flag from the string without changing the
bytes (the encoding). However, doing this in the
transport handler just fixes the symptom.
(It also limits the set of encodings that can
be handled to UTF-8, US-ASCII and the native
8-bit charset, with major trouble to be
expected if that is not ISO-8859-1.)
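
As a sketch of what "unambiguous" could look like, the string can be encoded explicitly with the core Encode module, and the length taken from the result. The helper name encode_for_wire and the sample text are hypothetical; nothing like this exists in SOAP::Lite as far as I know.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into the exact octets
# to send, and return them together with their length.
sub encode_for_wire {
    my ($chars, $encoding) = @_;
    # FB_CROAK: die on characters the encoding cannot represent.
    # LEAVE_SRC: do not modify the input string.
    my $octets = encode($encoding, $chars,
                        Encode::FB_CROAK() | Encode::LEAVE_SRC());
    return ($octets, length($octets));
}

my $text = "Ren\x{e9} \x{20ac}";    # e-acute and the euro sign

my (undef, $utf8_len)   = encode_for_wire($text, 'UTF-8');
my (undef, $latin9_len) = encode_for_wire($text, 'ISO-8859-15');
my $latin1_ok = eval { encode_for_wire($text, 'ISO-8859-1'); 1 };

printf "UTF-8: %d bytes, ISO-8859-15: %d bytes, ISO-8859-1: %s\n",
    $utf8_len, $latin9_len, $latin1_ok ? 'ok' : 'rejected';
```

The same six characters take 9 bytes in UTF-8 and 6 in ISO-8859-15, and ISO-8859-1 is rejected because it has no euro sign. With an explicit encoding step, the choice of encoding is no longer tied to Perl's internal representation.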
The first issue to decide is whether the SOAP
data should be passed to SOAP::Lite in character
or encoded form. My preference is clearly for
character semantics, and I think that is the
intention of the current version also, but it
is a bit ambiguous for the time being: UTF-8
strings work as expected, but it is not clear
whether that is because they are interpreted
with character semantics or because byte
semantics apply and the default encoding is UTF-8.
Given character semantics on input, the next step
is to determine which component of SOAP::Lite is
in charge of encoding the string:
1) The serializer: After all, it does generate the
encoding attribute in the XML declaration. In that
case, SOAP::Serializer::xmlize should assume character
semantics for all its input strings and return
encoded strings (without the UTF-8 flag) on its
output. The tests in the current fixes would never
be true, and that code could be removed.
2) The transport driver: This would better accommodate
transport libraries that do their own encoding
and decoding. For instance, PerlIO layers
could be used directly on socket and file handles.
In this case, the send_receive and handle methods
receive character strings, not byte strings.
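
Option 2 can be sketched with a PerlIO encoding layer. Again this is an illustration under assumptions: an in-memory handle stands in for the socket, and the envelope is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $envelope = "<name>Ren\x{e9}</name>";   # character string, made up

# In-memory handle standing in for the socket (assumption). The
# :encoding layer converts characters to UTF-8 bytes on output.
my $wire = '';
open my $out, '>:encoding(UTF-8)', \$wire
    or die "cannot open in-memory handle: $!";
print {$out} $envelope;
close $out or die "close failed: $!";

printf "wrote %d bytes for %d characters\n",
    length($wire), length($envelope);
```

The handle receives a character string and emits the 18 bytes of its UTF-8 form. Note that a transport working this way can no longer take Content-Length from length() of the character string; it has to count bytes after the layer has done its work, or encode up front as in option 1.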
OK, so much for my ramblings... I hope at least
the simple fix can make it back into a future version
of SOAP::Lite. As for the more long-term fix:
If there's sufficient interest, some consensus that
this is a sensible thing to do and no duplication
of work someone else has already started, I might
go ahead and make a first cut at it.