Semi OT: UTF-8 handling
Feb 25, 2004
It seems that the UTF-8 support in Perl is still transitional. By that
I mean that there are situations where you can find strings being
converted back and forth between UTF-8 and the local character set
(Latin-1 in my case) several times as they pass through the system.
Here's a chain I've observed on one of my machines:
DB      -> daemon -> HTTP -> ASP     -> Browser
Latin-1    UTF-8             Latin-1    UTF-8
(View with a fixed-space font.)
DB is a special-purpose database we use; there's some Latin-1 encoded
data in it.
daemon is a background process written in Perl that sits between the
special database and the Apache::ASP code. When it pulls the data in
from the database, Perl upconverts the data to UTF-8 on systems like Red
Hat Linux 9 where the LANG variable is set to something like en_US.UTF-8.
The daemon uses HTTP::Daemon to interface with the ASP code. We do it
this way for reasons that aren't germane to the discussion. What's
important is that in the ASP code, the LANG variable is unset for
whatever reason. Therefore, Perl seems to convert the UTF-8 encoded
data back into Latin-1, probably within the HTTP parsing code. It's
clear, at least, that it's in Latin-1 throughout the ASP processing.
The data finally seems to be converted back to UTF-8 by Apache before
sending it off to the browser. Presumably this is because modern
browsers advertise UTF-8 support.
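For anyone trying to picture what these conversions do to the bytes, here is a minimal sketch using the Encode module. It only reproduces the byte-level effect of each hop described above; it is not the code that runs in our daemon or in Apache::ASP, and the sample value and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample value: "cafe" with an e-acute, as Latin-1 bytes
# (the accented letter is the single byte 0xE9).
my $from_db = "caf\xE9";

# Hop 1 (daemon): Latin-1 bytes become UTF-8 bytes (0xE9 turns into 0xC3 0xA9).
my $on_wire = encode('UTF-8', decode('ISO-8859-1', $from_db));

# Hop 2 (ASP side): the UTF-8 bytes are turned back into Latin-1 bytes.
my $in_asp = encode('ISO-8859-1', decode('UTF-8', $on_wire));

# Hop 3 (on the way to the browser): Latin-1 bytes become UTF-8 bytes again.
my $to_browser = encode('UTF-8', decode('ISO-8859-1', $in_asp));

# Print each stage as hex so the changes are visible.
printf "%-8s %s\n", $_->[0], unpack('H*', $_->[1])
    for [db => $from_db], [wire => $on_wire],
        [asp => $in_asp], [browser => $to_browser];
```

The "wire" and "browser" lines come out identical, which is the point: the middle hop back to Latin-1 does no useful work.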
Right now, we're coping okay with these conversions. The only concern
is that the conversion from UTF-8 back to Latin-1 is unnecessary. Some
day, we might decide to go in and force things to maintain the data in
UTF-8 all the way through the chain beyond that first conversion, for
efficiency. Does anyone know how we can force Perl to keep the data in
UTF-8 format, even when the LANG variable isn't set?
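One direction that may be worth trying, offered as an untested-in-our-setup sketch rather than a known fix: name the encoding explicitly on each filehandle with PerlIO layers, so that nothing depends on LANG. The in-memory handles and the sample data below are stand-ins so the snippet runs anywhere; in the real daemon the handles would be whatever talks to the database and to the HTTP side.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a record from the database: Latin-1 bytes.
my $db_bytes = "caf\xE9\n";

# Name the input encoding on the handle instead of letting LANG decide.
open my $in, '<:encoding(ISO-8859-1)', \$db_bytes or die "open in: $!";
my $text = <$in>;    # now a Perl character string
close $in;

# Name the output encoding explicitly as well.
my $wire = '';
open my $out, '>:encoding(UTF-8)', \$wire or die "open out: $!";
print {$out} $text;
close $out;

# $wire now holds UTF-8 bytes, whatever the environment says.
print unpack('H*', $wire), "\n";
```

On the receiving side, the same idea would presumably mean decoding the incoming bytes with Encode's decode('UTF-8', ...) once and keeping them as character strings from there on, but we have not checked where Apache::ASP or the HTTP code would let us hook that in.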
Incidentally, we see a different conversion chain on Red Hat 7.2, which
uses Perl 5.6.1 and Apache 1.3. The data seems to stay in Latin-1 until
sometime within the ASP code, where it's converted to UTF-8. Very
strange, but since the last conversion is the only one that matters to
our code, it works out for the best in our case. Just FYI, for the mail
archive diggers. :)