Re: Output character encoding

  • Josh Chamas
    Message 1 of 9 , Jun 5, 2012
      On 6/5/12 2:02 AM, Arnon Weinberg wrote:
      >
      > How can I set the output character encoding of Apache::ASP output?
      > ...

      Hi Arnon, All,

      I have gone over the thread and been stumped on this for a while. Bottom line
      it looks like Apache::ASP does not play well with Encode, and this seems to me
      to be around the PerlIO interactions and something not quite connecting right on
      a tied file handle. But I do not know how to solve this. :(

      To explain where there is some magic at play:

      Apache::ASP::Response does a "use bytes", which is there to deal with the output
      stream correctly; I believe this is around content length calculations. I think
      this is fine here, and turning it off makes things worse for these examples.

      Apache::ASP::Response is more importantly tied as a file handle when this code
      is run:

      tie *RESPONSE, 'Apache::ASP::Response', $self->{Response};
      select(RESPONSE);

      This is to allow for print to go to $Response->PRINT which aliases to
      $Response->Write. Fundamentally all output is going through $Response->Write at
      the end of the day including the script static content itself.

      What I have found is that this will output the correct bytes in this Apache::ASP
      script:

      <% print STDOUT Encode::decode('ISO-8859-1',"\xE2"); %>

      as it bypasses the tied file handle layer to $Response, so we know perl is
      working at this point!

      but doing this is where we have a problem:

      <% print Encode::decode('ISO-8859-1',"\xE2"); %>

      and immediately in the Apache::ASP::Response::Write() method the data has
      already been converted incorrectly, before any processing occurs. It's as if
      merely going through the tied interface puts the data through some
      conversion process. I have played with various IO settings, as in "open ...",
      and various "use" pragmas to no avail, but I am really shooting blind here on
      what could be going wrong.

      So the way I see it..

      Encoding Magic
      File handle tie Magic <--- data conversion
      Data to $Response->Write

      Encode and perltie seem to have some conflicting bits here.

      If there were some workaround here I would be glad to hear it but I seem to have
      exhausted my ability to troubleshoot this.

      Regards,

      Josh



      > # Latin-1.rasp: #############
      >
      > <%
      > #use open ( ":utf8", ":std" );
      > #binmode ( STDOUT, ":encoding(ISO-8859-1)" );
      >
      > $::Response->{Charset} = "ISO-8859-1";
      >
      > use Encode;
      >
      > print Encode::decode('ISO-8859-1',"\xE2"),
      > Encode::decode('UTF-8',Encode::encode('UTF-8',"\xE2")),
      > "\x{00E2}",
      > chr(0x00E2);
      > %>
      >
      > #############################
      >
      >>asp-perl Latin-1.rasp
      > Content-Type: text/html; charset=ISO-8859-1
      > Content-Length: 6
      > Cache-Control: private
      >
      > ââââ
      >>asp-perl Latin-1.rasp | tail -1 | hexdump
      > 0000000 a2c3 a2c3 e2e2
      > 0000006
      >
      > For some reason, the first 2 test characters are UTF-8 encoded, and the last 2
      > are ISO-8859-1 encoded.
      > How can I get the same results as the CGI script above?
      >
      >

      ---------------------------------------------------------------------
      To unsubscribe, e-mail: asp-unsubscribe@...
      For additional commands, e-mail: asp-help@...
    • Thanos Chatziathanassiou
      Message 2 of 9 , Jun 6, 2012
        Apologies Arnon, I got your original message with the problem
        description after I had sent mine...

        >
        > To explain where there is some magic at play:
        >
        > Apache::ASP::Response does a "use bytes" which is to deal with the
        > output stream correctly I believe this is around content length
        > calculations. I think this is fine here, and turning this off makes
        > things worse for these examples.
        >
        > Apache::ASP::Response is more importantly tied as a file handle when
        > this code is run:
        >
        > tie *RESPONSE, 'Apache::ASP::Response', $self->{Response};
        > select(RESPONSE);
        >
        > This is to allow for print to go to $Response->PRINT which aliases to
        > $Response->Write. Fundamentally all output is going through
        > $Response->Write at the end of the day including the script static
        > content itself.
        >
        > What I have found is that this will output the correct bytes in this
        > Apache::ASP script:
        >
        > <% print STDOUT Encode::decode('ISO-8859-1',"\xE2"); %>
        >
        > as it bypasses the tied file handle layer to $Response, so we know perl
        > is working at this point!
        >
        > but doing this is where we have a problem:
        >
        > <% print Encode::decode('ISO-8859-1',"\xE2"); %>
        >
        > and immediately in the Apache::ASP::Response::Write() method the data
        > has already been converted incorrectly without any processing
        > occurring. Its as if by merely going through the tied interface that
        > data goes through some conversion process. I have played with various
        > IO settings as in "open ..." and various "use" pragmas to no avail but
        > really shooting blind here on what could not be working.
        >
        > So the way I see it..
        >

        That rang a bell for me:
        Read the section ``The UTF8 flag'' in Encode to see the problem.
        ${$Response->{out}} contains a copy of the stuff you're sending to
        $Response->Write(), AKA $Response->WriteRef() but without copying the
        utf-8 flag.
        You can make the example work by simply turning the utf8 flag
        unconditionally on via ``Encode::_utf8_on(${$Response->{out}});''
        after the print statements in Latin-1.rasp.
        Of course, your data should either ALL have the utf8 flag on (eg via
        Encode::decode) or ALL have it off, because ${$Response->{out}} can
        either have it on or off but obviously not both.
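A minimal sketch of that idea, not taken from Apache::ASP itself: the helper `append_chunk` is hypothetical, and it assumes that any chunk with the UTF-8 flag off is already a run of octets in the output charset. Every chunk is reduced to octets before it reaches the buffer, so the buffer never mixes flagged and unflagged data. It is fed the four test strings from Latin-1.rasp.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Apache::ASP): append $chunk to the output
# buffer as octets in $charset, so the buffer never mixes flagged and
# unflagged data.
sub append_chunk {
    my ($buf_ref, $chunk, $charset) = @_;
    if (utf8::is_utf8($chunk)) {
        # Decoded text (UTF-8 flag on): encode it to the output charset.
        $chunk = Encode::encode($charset, $chunk);
    }
    # Flag off: assumed to already be octets in $charset; appended unchanged.
    $$buf_ref .= $chunk;
    return;
}

my $out = '';
append_chunk(\$out, $_, 'ISO-8859-1') for (
    Encode::decode('ISO-8859-1', "\xE2"),                      # flag on
    Encode::decode('UTF-8', Encode::encode('UTF-8', "\xE2")),  # flag on
    "\x{00E2}",                                                # flag off
    chr(0x00E2),                                               # flag off
);

# Show the buffer as hex octets.
print join(' ', map { sprintf('%02x', ord($_)) } split(//, $out)), "\n";
# prints "e2 e2 e2 e2"
```

The weak point is the assumption about unflagged chunks: an unflagged string may be Latin-1 text or raw bytes in some other encoding, and the flag alone cannot tell the two apart.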

        > Encode and perltie seem to have some conflicting bits here.
        >
        > If there were some workaround here I would be glad to hear it but I seem
        > to have exhausted my ability to troubleshoot this.

        I'm not sure there is a generic solution, except perhaps to check
        ``is_utf8($$dataref)'' before appending it to $Response->{out} and
        make sure that only one kind of data (flag either ON or OFF) is ever
        appended to $Response->{out}.
        See below for why this is a problem.

        >
        >> # Latin-1.rasp: #############
        >>
        >> <%
        >> #use open ( ":utf8", ":std" );
        >> #binmode ( STDOUT, ":encoding(ISO-8859-1)" );
        >>
        >> $::Response->{Charset} = "ISO-8859-1";
        >>
        >> use Encode;
        >>
        >> print Encode::decode('ISO-8859-1',"\xE2"),
        >> Encode::decode('UTF-8',Encode::encode('UTF-8',"\xE2")),

        #these will now work if
        #Encode::_utf8_on(${$Response->{out}});
        #is set because they have the flag themselves

        >> "\x{00E2}",
        >> chr(0x00E2);

        #these, on the other hand will not
        #
        #the opposite holds true for
        #Encode::_utf8_off(${$Response->{out}});
        #of course

        >> %>

        I'm sure we can design a ``proper'' solution but not without some
        user-configurable settings and a bit of ugly code.

        Best Regards,
        Thanos Chatziathanassiou



      • Arnon Weinberg
        Message 3 of 9 , Jun 14, 2012
          Thanks very much Josh for investigating this - it saved me some time
          narrowing down the issue. Even still, I did spend quite a lot of time
          working out a solution for my needs, and still I don't think it is
          generalizable as-is. However, in case someone else wants to give it a
          crack, I provide details below.

          On 2012-06-05 19:30, Josh Chamas wrote:
          > doing this is where we have a problem:
          >
          > <% print Encode::decode('ISO-8859-1',"\xE2"); %>
          >
          > and immediately in the Apache::ASP::Response::Write() method the data
          > has already been converted incorrectly

          The fact that such a simple use of Encode causes an issue is a little
          surprising. Surely others are using Apache::ASP in multi-language
          environments - is no one using Encode this way? How are others coping
          with this limitation right now?

          > Its as if by merely going through the tied interface that data goes
          > through some conversion process.

          Not quite, as the same results happen without a tie'd interface. The
          "use bytes" pragma is what causes the conversion (see test script below).

          > Apache::ASP::Response does a "use bytes" which is to deal with the
          > output stream correctly I believe this is around content length
          > calculations.
          > I think this is fine here, and turning this off makes things worse for
          > these examples.

          It looks like "use bytes" is now deprecated and should indeed be
          removed. The documentation doesn't mention any trivial substitute.
          However, this pragma mostly just overrides some built-in functions with
          byte-oriented versions. So I made the following changes to Response.pm:
          - changed use bytes => no bytes (just import the namespace)
          - changed all occurrences of length() => bytes::length()
          This resolved the mixed-encoding issue originally posted, but introduced
          a new (more manageable) issue.
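To illustrate the difference between the two length functions, here is a small sketch (standalone Perl, not the actual Response.pm change). It loads the bytes:: functions with `use bytes ();` so that byte semantics are not enabled for the rest of the code. Note that bytes::length() counts the octets of Perl's internal representation, which is not necessarily the number of octets sent once the string is encoded for output.

```perl
use strict;
use warnings;
use Encode ();
use bytes ();    # load the bytes:: functions without enabling byte semantics

my $text = Encode::decode('ISO-8859-1', "\xE2");   # one character, UTF-8 flag on

my $chars    = length($text);          # characters: 1
my $internal = bytes::length($text);   # octets of the internal form (C3 A2): 2
my $wire     = length(Encode::encode('ISO-8859-1', $text));  # octets as Latin-1: 1

print "chars=$chars internal=$internal wire=$wire\n";
# prints "chars=1 internal=2 wire=1"
```

So a Content-Length taken from bytes::length() on a flagged string only matches the wire size if the output really is UTF-8; measuring the already-encoded octets avoids that assumption.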

          For debugging purposes, I peeked at the "UTF-8 flag" (Perl's internal
          flag that indicates that a string has a known decoding). This flag
          should be transparent in principle, but it helped make sense of the
          behaviour of Apache::ASP.
          Results of testing are summarized as follows:

          1. Testing Perl/CGI, asp-perl, and Apache::ASP, all 3 give the same
          results with the "use bytes" pragma turned on:
          - For any string with the UTF-8 flag off, output is correctly encoded.
          - Any string with the flag on is (double-)encoded as UTF-8, regardless
          of the actual output encoding.
          2. Testing Perl/CGI and asp-perl with "no bytes" produces correct results:
          - The UTF-8 flag does not affect output - it is correctly encoded in
          every case.
          - However, an interesting test case is that of the double-encoding
          problem (see http://ahinea.com/en/tech/perl-unicode-struggle.html). This
          case is indicative of bad code, so is not a concern here, but it
          illustrates how a tie'd filehandle differs from plain STDOUT. In this
          case, a single "wide character" double-encodes the entire output (with
          buffering on, this can be the entire page), instead of just the string.
          - These test cases are demonstrated by the script below.
          3. Testing Apache::ASP with "no bytes" produces different results from
          the command-line (asp-perl) version, as well as different results from
          Perl/CGI running on Apache. This suggests an interaction effect between
          Apache and Apache::ASP (both are required to produce these results).
          - With the UTF-8 flag off, output is correctly encoded as before.
          - However, with "no bytes", Apache::ASP, and the UTF-8 flag on, the
          entire output is double-encoded. This result is similar to the
          double-encoding problem in the previous test case, except that it
          doesn't require a "wide character" - any string with the UTF-8 flag on
          will do.

          This test script demonstrates all but the last test case:

          #!/usr/bin/perl

          use Encode;

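          foreach ( "STDOUT", "tie_use_bytes", "tie_no_bytes" )
          {
              print "$_: ";
              my $STDOUT;                      # previously selected handle
              if ( ! /^S/ )
              {
                  tie *FH, $_;                 # tie FH to the package named in $_
                  $STDOUT = select ( FH );     # send bare print through the tie
              }
              print "\x{263a}",                # wide character (U+263A)
                  Encode::decode('ISO-8859-1',"\xE2"),   # UTF-8 flag on
                  "\xE2";                      # UTF-8 flag off
              print "\n";
              if ( ! /^S/ )
              {
                  close ( FH );                # CLOSE writes the buffer to STDOUT
                  select ( $STDOUT );          # restore the previous handle
              }
          }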

          use strict;

          package tie_use_bytes;
          use bytes;

          sub TIEHANDLE { bless {}, shift; }
          sub PRINT { shift()->{out} .= join ( $,, @_ ); }
          sub CLOSE { print STDOUT delete ( shift()->{out} ); }

          package tie_no_bytes;
          no bytes;

          sub TIEHANDLE { bless {}, shift; }
          sub PRINT { shift()->{out} .= join ( $,, @_ ); }
          sub CLOSE { print STDOUT delete ( shift()->{out} ); }

          # Output: ##################

          Wide character in print at ...
          STDOUT: ☺ââ # STDOUT output is correct in all cases
          tie_use_bytes: ☺ââ # with "use bytes", the UTF-8-flagged 2nd character
                             # is double-encoded
          Wide character in print at ...
          tie_no_bytes: ☺ââ # with "no bytes", the output is correct, but a
                            # "wide character" double-encodes the entire string
                            # because of the way the tie'd file handle is
                            # implemented

          #########################

          By the way, if it's getting difficult to wrap your head around this,
          you're not alone.

          At this point, I peeked at the $Response->{out} data buffer, and could
          see that it was encoded correctly. However, the output from Apache (when
          the UTF-8 flag is on) was not correct, suggesting that Apache is doing
          something to encode the string in this case.
          I decided therefore to address the problem by turning off the UTF-8
          flag. The most fault-tolerant method I managed to come up with to do
          this was the following:

          ${$Response->{BinaryRef}}
              = Encode::encode ( 'ISO-8859-1', ${$Response->{BinaryRef}},
                  sub { Encode::encode ( 'UTF-8', chr ( shift() ) ) } )
              if ! grep ( /^utf8$/, PerlIO::get_layers ( STDOUT ) );

          which can go at the top of the $Response->Flush() method, or in
          global.asa/Script_OnFlush().

          With this solution I can now modify Apache::ASP's output encoding (eg,
          using binmode ( STDOUT );), as originally desired, and the output
          appears correct in all my test cases.
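The fallback behaviour of that Encode::encode() call can be tried outside Apache::ASP. In this sketch the helper `to_latin1_octets` is hypothetical and the input string is made up for the demonstration. Characters that ISO-8859-1 can represent become single octets, and anything else is passed to the callback, which substitutes that character's UTF-8 octets.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode text as ISO-8859-1, falling back to the UTF-8
# octets of any character that ISO-8859-1 cannot represent.
sub to_latin1_octets {
    my ($text) = @_;    # work on a copy of the caller's string
    return Encode::encode(
        'ISO-8859-1', $text,
        # Called with the code point of each unencodable character; must
        # return octets, so the character is encoded as UTF-8 here.
        sub { Encode::encode('UTF-8', chr(shift)) },
    );
}

my $out = to_latin1_octets("\x{263a}" . "\xE2");

# Show the result as hex octets.
print join(' ', map { sprintf('%02x', ord($_)) } split(//, $out)), "\n";
# prints "e2 98 ba e2"
```

The result is a plain octet string, which is the point of the workaround: nothing flagged is left in the buffer for a later layer to re-encode. Output that falls back this way is of course no longer pure ISO-8859-1.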


          --
          -------------------------------------------------------------------------------
          Arnon Weinberg
          www.back2front.ca


        • Warren Young
          Message 4 of 9 , Jul 2, 2012
            On 6/5/2012 5:30 PM, Josh Chamas wrote:
            > On 6/5/12 2:02 AM, Arnon Weinberg wrote:
            >>
            >> How can I set the output character encoding of Apache::ASP output?
            >
            > I have gone over the thread and been stumped on this for a while.

            This answer by Tom Christiansen (yes, the guy who wrote that one book)
            may shed some light: http://goo.gl/miOFU

            Here I thought all the Unicode tweaks after 5.8 were minor things, that
            it was all but finished a decade ago.

            Then later, reading chromatic's Modern Perl, he only grudgingly allows
            that 5.12 might be tolerable for some of his Unicode example code, and
            recommends 5.14 instead.
