UTF-8 problems ....

  • Thanos Chatziathanassiou
    Message 1 of 9 , Mar 11, 2004
      I have an interesting problem with UTF-8 charset and Apache::ASP (possibly):

      I had to construct some xml with Apache::ASP some time ago, but due to a
      switch, I now want the resulting file to be UTF-8 encoded instead of
      ISO-8859-7 that it was until now.
      This is all fine, however the data I'm reading from the database are
      also in ISO-8859-7.
      So I used Script_OnFlush like this:

      $$ref =~ s|([\xB8-\xFE])|chr(ord($1)+0x02D0)|sge;
      in order to convert all Greek characters from ISO-8859-7 to UTF-8 just
      before flushing the output to the client.
      (BTW, I also tried ``use Encode;'' and ``use encoding "iso-8859-7";''
      from perl 5.8, with much the same results.)

      The problem is that although I can verify that $$ref contains what I
      want (I also printed it to a file just to make sure), the output to any
      client (Mozilla, Opera, IE, XML Spy, whatever) is truncated and
      obviously not valid.
      I've actually used Ethereal to sniff the data on the network and they
      seem to be valid and the charset checks out ok.

      Anyone have an idea what might be wrong ?
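
A minimal sketch of the same conversion done with the core Encode module instead of code-point arithmetic. The sample bytes and variable names are assumptions made for illustration; they are not from the original post. It also shows why the character count and the byte count diverge after conversion, and that byte 0xBB (which falls inside the \xB8-\xFE range above) is the » quotation mark in ISO-8859-7 rather than a Greek letter, so adding 0x02D0 to it would not give the right character.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample input: "αβγ" followed by "»" as raw ISO-8859-7 bytes.
my $iso_bytes = "\xE1\xE2\xE3\xBB";

# Step 1: ISO-8859-7 bytes -> Perl character string (e.g. U+03B1 for alpha).
my $chars = decode('iso-8859-7', $iso_bytes);

# Step 2: character string -> UTF-8 encoded bytes, ready to send to a client.
my $utf8_bytes = encode('UTF-8', $chars);

printf "ISO-8859-7 bytes: %d\n", length($iso_bytes);
printf "characters:       %d\n", length($chars);
printf "UTF-8 bytes:      %d\n", length($utf8_bytes);
```

Each Greek letter occupies one byte in ISO-8859-7 but two bytes in UTF-8, so any length computed before the final encoding step will be smaller than what is actually sent.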


      ---------------------------------------------------------------------
      To unsubscribe, e-mail: asp-unsubscribe@...
      For additional commands, e-mail: asp-help@...
    • Josh Chamas
      Message 2 of 9 , Mar 11, 2004
        Thanos Chatziathanassiou wrote:
        > I have an interesting problem with UTF-8 charset and Apache::ASP
        > (possibly):
        >
        > I had to construct some xml with Apache::ASP some time ago, but due to a
        > switch, I now want the resulting file to be UTF-8 encoded instead of
        > ISO-8859-7 that it was until now.
        > This is all fine, however the data I'm reading from the database are
        > also in ISO-8859-7.
        > So I used Script_OnFlush like this:
        >
        > $$ref =~ s|([\xB8-\xFE])|chr(ord($1)+0x02D0)|sge;
        > (BTW I also tried ``use Encode;'' and ``use encoding "iso-8859-7";'' of
        > perl-5.8 with quite the same results)
        > in order to convert all greek characters from iso-8859-7 to utf-8 just
        > before flushing the output to the client.
        >
        > The problem is that although I can verify that $$ref contains what I
        > want (I also printed it to a file just to make sure), the output to any
        > client (Mozilla, Opera, IE, XML Spy, whatever) is truncated and
        > obviously not valid.
        > I've actually used Ethereal to sniff the data on the network and they
        > seem to be valid and the charset checks out ok.
        >

        Maybe the Content-Length field is not calculated correctly by Apache::ASP?
        If it's too short, this could be a problem. To keep Apache::ASP from
        calculating the content length, try flushing when the script first starts.
        That will flush the headers. See if it makes a difference.

        Regards,

        Josh

        ________________________________________________________________________
        Josh Chamas, Founder | NodeWorks - http://www.nodeworks.com
        Chamas Enterprises Inc. | NodeWorks Directory - http://dir.nodeworks.com
        http://www.chamas.com | Apache::ASP - http://www.apache-asp.org



      • Warren Young
        Message 3 of 9 , Mar 11, 2004
          Thanos Chatziathanassiou wrote:
          > I have an interesting problem with UTF-8 charset and Apache::ASP

          The problem is that the ASP code runs with the LANG environment variable
          unset. (I'm not sure if it's Apache or mod_perl doing this.) In this
          situation, the Perl interpreter runs without Unicode support. I posted
          about a similar problem on the 25th of last month, which you may find
          enlightening.

          I'm not sure what the right solution to the problem is. There are a
          number of things that could be done. In order of my preference:

          1. Find out who is unsetting LANG, and make 'em stop it.

          2. Convert your database to UTF-8. In this case, Perl will pass the
          data through unchanged, since it doesn't alter characters above 127
          when running in Unicode-free mode.

          3. Find another way besides the LANG variable to ask Perl to enable its
          Unicode support. Perhaps there's a compile-time option?

          4. Somehow force LANG to be set properly for your locale. I'm not sure
          if you can set this early enough to make the Perl interpreter see it.

          5. Do the charset conversion by hand. The disadvantage to this approach
          is that it's easy to tie your code to one locale, making it nonportable.

        • Josh Chamas
          Message 4 of 9 , Mar 11, 2004
            Warren Young wrote:
            > Thanos Chatziathanassiou wrote:
            >
            >> I have an interesting problem with UTF-8 charset and Apache::ASP
            >
            >
            > The problem is that the ASP code runs with the LANG environment variable
            > unset. (I'm not sure if it's Apache or mod_perl doing this.) In this
            > situation, the Perl interpreter runs without Unicode support. I posted
            > about a similar problem on the 25th of last month, which you may find
            > enlightening.
            >
            > I'm not sure what the right solution to the problem is. There are a
            > number of things that could be done. In order of my preference:
            >
            > 1. Find out who is unsetting LANG, and make 'em stop it.
            >

            mod_perl notoriously does not set up %ENV correctly. In order for
            this to happen, one must use PerlPassEnv LANG, or use PerlSetEnv LANG $LANG
            in the httpd.conf.
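
A sketch of what those two directives might look like in httpd.conf. The value el_GR is taken from later in this thread and is only an example. Pick one directive or the other, and treat this as an assumption-laden fragment rather than a tested configuration.

```apache
# Pass the LANG value from the environment that started Apache
# through to mod_perl's %ENV.
PerlPassEnv LANG

# Alternatively, pin an explicit value (example value only).
# PerlSetEnv LANG el_GR
```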

            Does this resolve the UTF8 issues?

            Regards,

            Josh




          • Thanos Chatziathanassiou
            Message 5 of 9 , Mar 12, 2004
              Josh Chamas wrote:

              > Maybe the Content-Length field is not calculated correctly by
              > Apache::ASP?
              > If its too short, this could be a problem. To have Apache::ASP not
              > calculate
              > the content length, try to flush when the script first starts, that
              > will flush
              > the headers, and see if there is a difference.
              >
              > Regards,
              >
              > Josh
              >
              Sure enough, the Content-Length header was miscalculated. Since I had
              the ethereal capture handy, I could verify it rather easily (although
              had I been careful, I would have been able to figure it out myself -
              thanks for pointing it out).
              A ``$Response->Flush();'' fixed things for me.
              Still, in the beginning I wasn't using Script_OnFlush, I used a regular
              global.asa sub on the database data directly and I still got the same
              problem. Then I switched to Script_OnFlush to save myself the trouble of
              changing some 200+ files one by one.
              As far as the LANG is concerned, it has been correctly set to el_GR
              (meaning iso-8869-7) for as long as I can remember.
              I recall having problems setting it, because I specifically created an
              ASP page that tested whether a few Greek words contain \w characters and
              converted them to upper case, just to see that it worked. I see now that
              I had to set $ENV{'LANG'} in startup.pl to get it to work correctly.

              Thanks again for the quick response.

              Regards,
              Thanos Chatziathanassiou


            • Thanos Chatziathanassiou
              Message 6 of 9 , Mar 12, 2004
                Josh Chamas wrote:

                > Maybe the Content-Length field is not calculated correctly by
                > Apache::ASP?
                > If its too short, this could be a problem. To have Apache::ASP not
                > calculate
                > the content length, try to flush when the script first starts, that
                > will flush
                > the headers, and see if there is a difference.
                >
                > Regards,
                >
                > Josh
                >
                Sure enough, the Content-Length header was miscalculated. Since I had
                the ethereal capture handy, I could verify it rather easily (although
                had I been careful, I would have been able to figure it out myself -
                thanks for pointing it out).
                A ``$Response->Flush();'' fixed things for me.
                Still, in the beginning I wasn't using Script_OnFlush, I used a regular
                global.asa sub on the database data directly and I still got the same
                problem. Then I switched to Script_OnFlush to save myself the trouble of
                changing some 200+ files one by one.
                As far as the LANG is concerned, it has been correctly set to el_GR
                (meaning iso-8869-7)**** for as long as I can remember.
                I recall having problems setting it, because I specifically created an
                ASP page that tested whether a few Greek words contain \w characters and
                converted them to upper case, just to see that it worked. I see now that
                I had to set $ENV{'LANG'} in startup.pl to get it to work correctly.

                Thanks again for the quick response.

                Regards,
                Thanos Chatziathanassiou


                **** sorry, I really meant to say *ISO-8859-7*

              • Warren Young
                Message 7 of 9 , Mar 12, 2004
                  Josh Chamas wrote:

                  > PerlSetEnv LANG $LANG
                  >
                  > Does this resolve the UTF8 issues?

                  It probably will. I really like it that you can keep the same LANG
                  variable as the system uses. I wouldn't like to have hard-coded it.

                  Right now, the stable version of my program has worked around this issue
                  simply by building in understanding of where the conversions between
                  UTF-8 and ISO 8859 occur. In my development version, I had intended to
                  try for keeping data UTF-8 through the entire pipeline, so I will try
                  this. Thanks.

                • Josh Chamas
                  Message 8 of 9 , Mar 12, 2004
                    Warren Young wrote:
                    > Josh Chamas wrote:
                    >
                    >> PerlSetEnv LANG $LANG
                    >>
                    >> Does this resolve the UTF8 issues?
                    >
                    >
                    > It probably will. I really like it that you can keep the same LANG
                    > variable as the system uses. I wouldn't like to have hard-coded it.
                    >

                    I think if you have PerlPassEnv LANG, then you will merely pass what
                    is set at the system level, so you can avoid hard coding it generally.

                    When one develops with Oracle, one quickly finds out about this problem,
                    since Oracle clients require a host of %ENV settings in order to function
                    correctly, including a character set setting, but the standard one is
                    ORACLE_HOME.

                    Regards,

                    Josh



                  • Josh Chamas
                    Message 9 of 9 , Mar 12, 2004
                      Thanos Chatziathanassiou wrote:
                      > Josh Chamas wrote:
                      >
                      >> Maybe the Content-Length field is not calculated correctly by
                      >> Apache::ASP?
                      >> If its too short, this could be a problem. To have Apache::ASP not
                      >> calculate
                      >> the content length, try to flush when the script first starts, that
                      >> will flush
                      >> the headers, and see if there is a difference.
                      >>
                      >> Regards,
                      >>
                      >> Josh
                      >>
                      > Sure enough, the Content-Length header was miscalculated. Since I had
                      > the ethereal capture handy, I could verify it rather easily (although
                      > had I been careful, I would have been able to figure it out myself -
                      > thanks for pointing it out).
                      > A ``$Response->Flush();'' fixed things for me.
                      > Still, in the beginning I wasn't using Script_OnFlush, I used a regular
                      > global.asa sub on the database data directly and I still got the same
                      > problem. Then I switched to Script_OnFlush to save myself the trouble of
                      > changing some 200+ files one by one.

                      Great. The Content-Length header is calculated like this in Response.pm:

                      $self->{headers_out}->set('Content-Length', length($$out));

                      As you can see, it does nothing but call perl's length() function.

                      So I wonder if you are using a perl that is UTF8 aware, like the perl
                      5.8.x series? Otherwise, is there anything you can do to make it aware,
                      like the LANG ENV setting?

                      The only other thing I can think of is maybe we should not trust perl's
                      UTF8 handling generally for the length calculation, and if this is set:

                      $Response->{ContentType} = 'text/html;charset=UTF-8'

                      then we simply do not calculate the Content-Length automatically, leaving
                      it as an exercise for the developer?
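
One plausible reading of the truncation, sketched below: once the output buffer holds characters with code points above 255, length() counts characters, while the client receives UTF-8 bytes, so the header comes out too short. The helper utf8_byte_length is hypothetical and not part of Apache::ASP. It assumes the buffer is a Perl character string that has not been encoded yet; calling it on already-encoded bytes would overcount.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not an Apache::ASP API): number of bytes the body
# occupies once it is encoded as UTF-8.
sub utf8_byte_length {
    my ($body_ref) = @_;
    return length(encode('UTF-8', $$body_ref));
}

# Assumed buffer contents after the substitution from the first message:
# three Greek characters (alpha, beta, gamma) as Perl characters.
my $out = "\x{3B1}\x{3B2}\x{3B3}";

my $char_count = length($out);              # what length($$out) reports
my $byte_count = utf8_byte_length(\$out);   # what is sent over the wire

print "length(): $char_count, bytes on the wire: $byte_count\n";
```

Here length() reports 3 while the encoded body is 6 bytes, which matches the symptom of a client stopping partway through the response.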

                      Regards,

                      Josh



