SOAP and encoding of non-ASCII literals

  • cedric.boufflers
    Message 1 of 6 , Jun 16, 2005
      Hello,

      I am quite confused about how accented characters from different charsets should be handled.

      I have a web service written in Perl. As a convention, I decided that all string literals must arrive encoded as UTF-8, because somewhere in the code I have to do something like "encoder->bytes('UTF-8')->iso_8859_1->data()". If I do not receive UTF-8 data, that method call crashes.

      But a Java client creates a SOAP message with UTF-8 encoded XML and encodes the literals as XML entities (&#x[character code in a given charset];).

      • The Java client we dealt with uses ISO-8859-1 to encode literals, so the message looks like this:

      HTTP header:
      Content-Type : text/xml; charset = utf8;

      In the XML, the literal is encoded as "c&#xE9;dric", which is an ISO-8859-1 XML entity encoding.

      So my web service gets the literal as an ISO-8859-1 encoded string and crashes on the call to the "encoder->bytes('UTF-8')" method.

      This can be solved by forcing the string literal encoding to UTF-8 in Java.
      I would then have "c&#xXX;&#xXX;dric" in the message, which is a correct UTF-8 encoded string and would be seen as UTF-8 in my Perl web service as well.


      • A Perl SOAP::Lite client doesn't use XML entities; it encodes the character as two UTF-8 bytes. But on the server side the XML parser converts it back to ISO-8859-1 for an unknown reason, so the web service crashes in the "encoder" method.

      So I am quite confused about how encoded literals should be handled in SOAP. Is there a standard for that? How can I deal with these encoding problems? How do you deal with encoding in your Perl web services?

      Thanks for any help, or enlightenment on this topic.

      Kind Regards,
      Cédric

      -- 
      ---------------------------------------------------------------------
      BOUFFLERS Cédric : cedric.boufflers@...
      ---------------------------------------------------------------------
      NordNet - 111 Rue de Croix - 59510 Hem - France
      tél : +33 3 20 66 55 55 - fax : +33 3 20 66 55 59
      ---------------------------------------------------------------------
      http://www.securitoo.com/
      http://www.nordnet.fr/
      http://www.lerelaisinternet.com/
      ---------------------------------------------------------------------
      
    • Duncan Cameron
      Message 2 of 6 , Jun 16, 2005
      • Duncan Cameron
        Message 3 of 6 , Jun 16, 2005
          Hi Cédric

          I don't fully understand the problem you are having, but there is a
          basic misunderstanding in your explanation above.

          By definition, XML uses Unicode character codes. ISO-8859-x and
          UTF-8 are ways of encoding physical XML documents, but once a
          document is parsed the result is Unicode. It just happens that Perl
          (sometimes) uses UTF-8 to hold its strings internally, so you
          shouldn't have to do anything special to get strings encoded as UTF-8.
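
The distinction between an encoding (bytes in a document) and the parsed result (Unicode characters) can be sketched in plain Perl. This is only an illustration using the core Encode module; the variable names are made up for the example, and the two byte strings stand in for what a parser would read from an ISO-8859-1 and a UTF-8 document.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same word as two different byte sequences "on the wire".
my $latin1_bytes = "c\xE9dric";        # ISO-8859-1: e-acute is the single byte E9
my $utf8_bytes   = "c\xC3\xA9dric";    # UTF-8: e-acute is the two bytes C3 A9

# Decoding turns bytes into a Perl character string (Unicode code points).
my $from_latin1 = decode('ISO-8859-1', $latin1_bytes);
my $from_utf8   = decode('UTF-8',      $utf8_bytes);

printf "same text after decoding: %s\n", ($from_latin1 eq $from_utf8 ? 'yes' : 'no');
printf "characters: %d\n", length($from_utf8);
printf "second character: U+%04X\n", ord(substr($from_utf8, 1, 1));
```

Both inputs decode to the same six-character string, which is why code that runs after the parser should not need to know which encoding the client used.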


          You say
          >And in the xml the literal is encoded as "c&#xE9;dric" which is an
          >ISO-8859-1 xml entities encoding.

          This is a correct use of a numeric entity to refer to the Unicode
          code point 0x00E9. You would need to use a numeric entity if the
          XML were being constructed in ISO-8859-1, which wouldn't normally be the
          case with Perl.
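
To make this concrete, here is a small sketch showing that a numeric entity names a Unicode code point, independent of the document's encoding. The helper `expand_char_refs` is a toy written for this example (a real XML parser does this work for you), and `hex_bytes` is likewise a hypothetical helper for printing.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Toy helper (hypothetical, illustration only): expands hexadecimal
# numeric character references. A real XML parser handles this itself.
sub expand_char_refs {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;
    return $text;
}

# Hex dump of the octets that a given encoding produces for a string.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $octets = encode($encoding, $string);
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $octets);
}

my $name = expand_char_refs('c&#xE9;dric');

printf "code point: U+%04X\n", ord(substr($name, 1, 1));
print  "ISO-8859-1: ", hex_bytes('ISO-8859-1', $name), "\n";
print  "UTF-8:      ", hex_bytes('UTF-8',      $name), "\n";
```

The reference yields the single character U+00E9; only when that character is written out again does an encoding decide whether it becomes one byte (E9) or two (C3 A9).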

          You say
          >This can be solved by enforcing string literals encoding to UTF-8 in
          >Java.
          >Thus I would have in the trame "c&#xXX;&#xXX;dric", which is a correct
          >UTF-8 encoded string and would be seen as an UTF-8 in my perl
          >webservice as well.

          This does look as if you are misunderstanding the encoding. You seem to
          be trying to write, as entities, the two-byte UTF-8 representation of
          the Unicode code point 0x00E9. In effect this is double encoding, which
          is incorrect.
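
The effect of double encoding can be reproduced at the byte level. The sketch below assumes nothing beyond the core Encode module; `hex_bytes` is a hypothetical helper for printing. Encoding the already-encoded bytes treats C3 and A9 as if they were the characters U+00C3 and U+00A9, which is the same mistake as writing one entity per UTF-8 byte.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex value of every byte in a byte string.
sub hex_bytes { return join ' ', map { sprintf('%02X', ord($_)) } split(//, $_[0]) }

my $text  = "c\x{E9}dric";            # six characters; e-acute is U+00E9
my $once  = encode('UTF-8', $text);   # correct: e-acute becomes the bytes C3 A9
my $twice = encode('UTF-8', $once);   # mistake: bytes C3 and A9 treated as characters

print "encoded once:  ", hex_bytes($once),  "\n";
print "encoded twice: ", hex_bytes($twice), "\n";
```

A receiver that decodes the second result once ends up with seven characters (an A-tilde followed by a copyright sign in place of the accented letter) instead of six.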

          In summary:

          Using the numeric entity &#xE9; is valid for both UTF-8 and ISO-8859-1.

          If you are using UTF-8, then the é character encoded as two bytes is
          correct. Depending on your Perl version, this should be automatic. Your
          Java client would need to convert to UTF-8 explicitly (something like
          str.getBytes("UTF-8")).

          Regards

          Duncan





        • cedric.boufflers
          Message 4 of 6 , Jun 16, 2005
            Hello Duncan,

            So if I understand correctly, I might be trying to double encode the strings.

            What led me to do that, and may have misled me, is the error I was getting from the encoding method.

            My method call was:

            use Encode::Encoder qw/encoder/;
            encoder($string)->bytes('UTF-8')->iso_8859_1;

            This gave me the error "\xE9 does not map to UTF-8", which is why I thought that é was not a valid UTF-8 code.

            But then this might not be a SOAP problem but a problem with the encoder method? Do you have any hint as to why it refuses to read the string as UTF-8?
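
One possible reading of that error, sketched with the plain Encode module rather than Encode::Encoder: the example assumes that `bytes('UTF-8')` amounts to a strict UTF-8 decode of the data, and that the string reaching the method holds é as the single value E9. Under those assumptions the failure can be reproduced as follows.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

my $as_received = "c\xE9dric";   # e-acute held as the single value E9

# Strictly reading this as UTF-8 fails: E9 announces a three-byte
# sequence, but the following byte ('d') is not a continuation byte.
my $decoded = eval { decode('UTF-8', $as_received, FB_CROAK | LEAVE_SRC) };
my $error   = $@;
print defined($decoded) ? "decoded fine\n" : "decode failed: $error";

# Converting the character string directly to ISO-8859-1 works.
my $latin1 = encode('ISO-8859-1', $as_received);
printf "ISO-8859-1 octets: %d\n", length($latin1);
```

If this reading is right, the string is already a character string and the extra "read as UTF-8" step is what breaks, not the SOAP transport.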

            I'm sorry, because the more I learn about encoding the more confused I seem to get ;)

            My actual goal is the following:

            My web service has to write to a database encoded in Latin-1. So I have to convert the UTF-8 string to Latin-1, otherwise the data is not stored correctly in the database. What would be the proper way to ensure that, whatever SOAP client is used (Java, Delphi, Perl, PHP, ...), I get a UTF-8 string that I can convert to Latin-1 in the Perl web service?

            If this problem is not related to SOAP::Lite, do you have any hints about a list where I could get help for it? :)

            Note: Perl is 5.8 and is running under Apache 1.3/mod_perl.

            Thanks a lot for your help and explanations,

            Best Regards,
            Cédric


          • Duncan Cameron
            Message 5 of 6 , Jun 16, 2005
              My understanding is that all the parameters passed to your server class
              will be marked as UTF-8 (because they have been through the XML
              parser), so you should be able to convert a string to ISO-8859-1 in this
              way:

              my $octets = encode("iso-8859-1", $string, 1);

              This should throw an error if $string contains characters that are not
              in ISO-8859-1, so you will need to handle that case within an eval.
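
A minimal sketch of that suggestion, with the eval included. The helper name `to_latin1` and the sample inputs are invented for the example; `FB_CROAK` is the named constant for the check value 1 used above, and `LEAVE_SRC` is added so that encode() does not modify the caller's string.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns ISO-8859-1 octets, or undef when the string
# contains a character that Latin-1 cannot represent.
sub to_latin1 {
    my ($string) = @_;
    # FB_CROAK makes encode() die on unmappable characters;
    # LEAVE_SRC keeps it from modifying $string.
    return scalar eval { encode('iso-8859-1', $string, FB_CROAK | LEAVE_SRC) };
}

# The second input contains a euro sign (U+20AC), which is not in ISO-8859-1.
for my $input ("c\x{E9}dric", "price: 10 \x{20AC}") {
    my $octets = to_latin1($input);
    print defined($octets)
        ? "stored " . length($octets) . " octets\n"
        : "rejected: not representable in ISO-8859-1\n";
}
```

What to do with a rejected value (refuse the request, substitute a placeholder, log it) is a design choice for the service and is left open here.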

              Regards

              Duncan




            • cedric.boufflers
              Message 6 of 6 , Jun 24, 2005
                Hello Duncan and the list readers,

                I have been doing some experiments. I have written a simple web
                service in Perl.

                This is my method:

                sub get_Champs
                {
                    my $class = shift;
                    my $envelope = pop;

                    # Read the "Champ" parameter from the SOAP envelope
                    my $champ = $envelope->valueof("//get_Champs/Champ");

                    use Data::HexDump;

                    # Return a hex dump of the string as received
                    return SOAP::Data->name('result' => HexDump($champ));
                }

                It just returns a hexadecimal dump of the string received.

                I called it from a standard Java client:


                *- First Test*
                System.out.println(wsenc.get_Champs("cédric"));

                The response was:
                          00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF

                00000000  63 E9 64 72 69 63                                  c.dric

                So it seems my accented character is encoded as a single byte here, and
                Perl does not handle the string as UTF-8 in this case.
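
A single E9 in the dump does not necessarily mean the text is wrong. Assuming the dump reflects how Perl happens to store the string, the sketch below shows that the same characters can be held either one byte per character or as UTF-8 internally, and that both forms compare equal and encode identically. The variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $compact = "c\xE9dric";     # held one byte per character
my $wide    = $compact;
utf8::upgrade($wide);          # same characters, held as UTF-8 internally

printf "internally UTF-8? compact=%s wide=%s\n",
    (utf8::is_utf8($compact) ? 'yes' : 'no'),
    (utf8::is_utf8($wide)    ? 'yes' : 'no');
printf "equal as strings: %s\n", ($compact eq $wide ? 'yes' : 'no');

# An explicit encode produces the same octets from either form.
my $same_octets = encode('UTF-8', $compact) eq encode('UTF-8', $wide);
printf "same UTF-8 octets: %s\n", ($same_octets ? 'yes' : 'no');
```

So rather than trying to force a particular internal form, it is usually enough to call encode() with the target encoding at the point where octets are needed, for example just before writing to the database.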


                *- Second Test*
                System.out.println(wsenc.get_Champs(new
                String("cédric".getBytes("UTF-8"))));

                The response was:
                          00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF

                00000000  63 C3 A9 64 72 69 63                               c..dric

                In this case the accented character is encoded as two bytes. Is that
                because I am double encoding here? In this case, though, Perl sees
                the string as UTF-8.
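
Both dumps are consistent with one hypothesis, which the sketch below reproduces. In Java, new String(bytes) without a charset argument decodes with the platform default charset; if that default was a single-byte charset such as ISO-8859-1 (an assumption, since the thread does not say), the second test sent seven characters rather than six. The helper `dump_chars` is hypothetical and simply prints each character's value in hex.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hex value of every character in a string (all are below 0x100 here).
sub dump_chars { return join ' ', map { sprintf('%02X', ord($_)) } split(//, $_[0]) }

my $first = "c\x{E9}dric";    # first test: the six characters of the name

# Second test, under the assumption above: UTF-8 bytes read back as
# ISO-8859-1 characters before the string was sent.
my $second = decode('ISO-8859-1', encode('UTF-8', $first));

print "first:  ", dump_chars($first),  "\n";
print "second: ", dump_chars($second), "\n";
printf "second has %d characters\n", length($second);

# Undoing the extra step recovers the original six characters.
my $repaired = decode('UTF-8', encode('ISO-8859-1', $second));
printf "repaired matches first: %s\n", ($repaired eq $first ? 'yes' : 'no');
```

Under this reading the second test is the double-encoded one, and the first test is the client behaving correctly.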

                How can I force Perl or SOAP::Lite to always handle the string as
                UTF-8?

                I tried adding these lines:

                use POSIX qw(locale_h);
                setlocale(LC_CTYPE, "en_US.UTF-8");

                But this changes nothing. The default locale of the computer is already:
                LANG=en_US.UTF-8
                LANGVAR=en_US.UTF-8

                Best Regards,
                And thank you for your help.

                Cédric

                Note :
                SOAP::Lite is 0.60
                Perl is perl5 (revision 5.0 version 8 subversion 0).


