
1606 Semi OT: UTF-8 handling

  • Warren Young
    Feb 25, 2004
      It seems that the UTF-8 support in Perl is still transitional. By that
      I mean that there are situations where you can find strings being
      converted back and forth between UTF-8 and the local character set
      (Latin-1 in my case) several times as they pass through the system.

      Here's a chain I've observed on one of my machines:

      DB      -> daemon -> HTTP -> ASP     -> Browser
      Latin-1    UTF-8             Latin-1    UTF-8

      (View with a fixed-space font.)

      DB is a special-purpose database we use; there's some Latin-1 encoded
      data in it.

      daemon is a background process written in Perl that sits between the
      special database and the Apache::ASP code. When it pulls the data in
      from the database, Perl upconverts the data to UTF-8 on systems like Red
      Hat Linux 9 where the LANG variable is set to something like en_US.UTF-8.

      The daemon uses HTTP::Daemon to interface with the ASP code. We do it
      this way for reasons that aren't germane to the discussion. What's
      important is that in the ASP code, the LANG variable is unset for
      whatever reason. Therefore, Perl seems to convert the UTF-8 encoded
      data back into Latin-1, probably within the HTTP parsing code. It's
      clear, at least, that it's in Latin-1 throughout the ASP processing.
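
One way to check which form a string is in at each hop is to dump its code points along with Perl's internal UTF-8 flag. This is only a minimal diagnostic sketch: the describe() helper and the sample strings are hypothetical, and it assumes the Encode module bundled with Perl 5.8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: report a string's UTF-8 flag and its code points in hex.
sub describe {
    my ($label, $str) = @_;
    my $flag = Encode::is_utf8($str) ? 'on' : 'off';
    my $hex  = join ' ', map { sprintf '%02X', ord($_) } split //, $str;
    return "$label: utf8 flag $flag, code points [$hex]";
}

# "cafe" with an acute e: one byte in Latin-1, two bytes once encoded as UTF-8.
print describe('latin1 bytes', "caf\xE9"), "\n";
print describe('utf8 bytes',   "caf\xC3\xA9"), "\n";
```

Logging a line like this on both sides of the HTTP hop should show whether the bytes on the wire are the one-byte or two-byte form of an accented character.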

      The data finally seems to be converted back to UTF-8 by Apache before
      sending it off to the browser. Presumably this is because modern
      browsers advertise UTF-8 support.

      Right now, we're coping okay with these conversions. The only concern
      is that the conversion from UTF-8 back to Latin-1 is unnecessary. Some
      day, we might decide to go in and force things to maintain the data in
      UTF-8 all the way through the chain beyond that first conversion, for
      efficiency. Does anyone know how we can force Perl to keep the data in
      UTF-8 format, even when the LANG variable isn't set?
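
For illustration, one approach that should not depend on LANG is to decode explicitly at the input boundary and encode explicitly at the output boundary with the Encode module. The following is a sketch under assumptions, not a confirmed fix for the setup described above: the sample data stands in for a database field, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "cafe" with an acute e as raw Latin-1 bytes,
# standing in for a field read from the database.
my $latin1_bytes = "caf\xE9";

# Decode once, at the input boundary, into a Perl character string.
my $text = decode('ISO-8859-1', $latin1_bytes);

# Encode once, at the output boundary, into UTF-8 bytes.
my $utf8_bytes = encode('UTF-8', $text);

# Four characters in, five bytes out: the accented e becomes 0xC3 0xA9.
printf "%d chars, %d bytes\n", length($text), length($utf8_bytes);
```

Because the encoding names are spelled out in the calls, the result is the same whether or not the process runs under a UTF-8 locale. Whether the intermediate HTTP layers then pass those bytes through untouched is a separate question that this sketch does not address.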

      Incidentally, we see a different conversion chain on Red Hat 7.2, which
      uses Perl 5.6.1 and Apache 1.3. The data seems to stay in Latin-1 until
      sometime within the ASP code, where it's converted to UTF-8. Very
      strange, but since the last conversion is the only one that matters to
      our code, it works out for the best in our case. Just FYI, for the mail
      archive diggers. :)

