UTF-8 HOWTO

  • Warren Young
    Message 1 of 2, Feb 23, 2006
      I finally got around to converting our Apache::ASP application so that
      it uses UTF-8 throughout, instead of Latin-1. I learned a few things
      that aren't discussed in the archives, so I'm setting them down here for
      others to find.

      1. It's best if you use newer Perls. 5.8.0 is adequate, but has known
      bugs in its Unicode handling. When run under 5.8.0, our program
      exhibits a double UTF-8 conversion in one circumstance, while the other
      screens show the data correctly. When the same program is run under
      5.8.5, all screens show the correct data. While it's theoretically
      possible to get Perl 5.6.x to cope with UTF-8 data, I don't recommend
      messing with it. A few years ago when I first tried using UTF-8, I was
      using 5.6 and had many problems with Perl smashing my data back to
      Latin-1 incorrectly.

      2. Also use the newest mod_perl you can. There are known Unicode bugs
      in mod_perl 1.99_09 and older.

      3. You must say "use utf8;" at the top of each ASP file. If you use
      $Response->Include(), each included file also has to say "use utf8;".
      The same goes for any Perl modules you use, if you will be passing UTF-8
      strings through them.

      4. mod_perl doesn't set the LANG environment variable unless you ask it
      to. Perls 5.8 and newer use the LANG environment variable (among other
      things) to decide whether to use UTF-8 by default or not. I didn't find
      it to be necessary to ask mod_perl to set this variable in my program,
      but it can't hurt to do it. If nothing else, it's one less thing you
      have to blame if your pages aren't showing the right data. In your
      httpd.conf, right after "PerlModule Apache::ASP", say "PerlPassEnv
      LANG". This will pass your system's default value for LANG through to
      the mod_perl instances, and thus to Apache::ASP.

      5. Ensure that your data source is passing UTF-8 data correctly. In our
      program, the data comes in via an XML path, so we needed to inform the
      XML parser that the data is UTF-8. Otherwise, the XML parser assumes
      it's Latin-1, and you get a double UTF-8 conversion.

      6. Finally, you need to communicate that the data is UTF-8 to the
      browser. This is done with the Content-Type HTTP header, which you can
      set in a number of ways. I like to do it in a <meta> tag at the top of
      each file that will contain UTF-8 data:

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

      Alternatively, if all documents on your server should be treated as UTF-8,
      there's an Apache configuration directive to force all output to be
      declared as UTF-8.
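
      The httpd.conf settings from points 4 and 6 might look like the
      following minimal sketch. PerlModule and PerlPassEnv are the
      directives named in point 4. AddDefaultCharset is Apache's standard
      directive for adding a charset parameter to text/html and text/plain
      responses; that it is the directive meant in point 6 is an
      assumption, since the post does not name it.

```apache
# Point 4: load Apache::ASP, then pass the system's LANG value to mod_perl.
PerlModule Apache::ASP
PerlPassEnv LANG

# Point 6 (assumed directive): declare UTF-8 on text/html and text/plain responses.
AddDefaultCharset utf-8
```

      A server-wide charset is only appropriate when every text document
      the server sends really is UTF-8; otherwise the per-file <meta> tag
      is the safer choice.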

      ---------------------------------------------------------------------
      To unsubscribe, e-mail: asp-unsubscribe@...
      For additional commands, e-mail: asp-help@...
    • Warren Young
      Message 2 of 2, Feb 27, 2006
        Joshua Chamas wrote:
        > Do you know why it is that "use utf8" is needed
        > at the top of each script?

        No, I'm not sure. At this point, I just know that there are pages
        where, if I remove the pragma, the UTF-8 characters get munged. I
        haven't tried to isolate the Perl constructs in which this happens.

        > What precisely were the problems that you were running into without this
        > setting?

        The most common symptom was what looked like double UTF-8 encodings.
        That is, Unicode characters that should have encoded as 2 bytes in UTF-8
        were showing up as 4 bytes. I didn't try to reverse the double
        conversion to make sure this is what was happening, but I can't think of
        a more likely explanation for the symptom.
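
        The symptom described above can be reproduced in isolation. The
        following is a minimal sketch using the core Encode module, not
        the code from the application: it encodes one character to UTF-8,
        misreads the result as Latin-1, and encodes it again. The variable
        names and the undo_double_encoding helper are hypothetical, chosen
        only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+00E9 (e with acute accent) as a one-character Perl string.
my $char = "\x{e9}";

# Correct encoding: one character becomes two UTF-8 bytes.
my $utf8_bytes = encode('UTF-8', $char);

# The mistake: treat those bytes as Latin-1 text, then encode to UTF-8 again.
my $misread = decode('ISO-8859-1', $utf8_bytes);   # now two characters
my $double  = encode('UTF-8', $misread);           # now four bytes

# Hypothetical helper: peel off one layer by decoding as UTF-8 and
# writing each resulting character back out as a single Latin-1 byte.
sub undo_double_encoding {
    my ($bytes) = @_;
    return encode('ISO-8859-1', decode('UTF-8', $bytes));
}

printf "correct: %d bytes (%s)\n", length($utf8_bytes), unpack('H*', $utf8_bytes);
printf "double:  %d bytes (%s)\n", length($double), unpack('H*', $double);
printf "round trip ok: %s\n",
    undo_double_encoding($double) eq $utf8_bytes ? 'yes' : 'no';
```

        If running the helper over a suspect string yields valid UTF-8 of
        half the length, that is good evidence the data was encoded twice
        somewhere along the path, for example by an XML parser that
        assumed Latin-1 input.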

        > The opportunity here is that we could automatically add something like this
        > to the top of each page.

        I'll consider investigating deeper.
