Re: [soaplite] Soap and encoding of non ASCII literals

  • Duncan Cameron
    Message 1 of 6 , Jun 16, 2005
      Hi Cédric

      I don't fully understand what problem you are having but there is a
      basic misunderstanding in your explanation above.

      By definition XML uses the Unicode character codes. ISO-8859-x and
      UTF-8 are ways of encoding physical XML documents, but when the
      documents are parsed the result will be in Unicode. It just happens
      that perl uses UTF-8 to hold its internal strings (sometimes). So you
      shouldn't have to do anything special to get strings encoded as UTF-8.
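
A minimal Perl sketch of this point, assuming the core Encode module (Perl 5.8 or later). The byte strings and variable names are illustrative only: the same name stored in two physical encodings decodes to the same Unicode characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same name as raw bytes in two different physical encodings.
my $latin1_bytes = "c\xE9dric";        # ISO-8859-1: é is the single byte E9
my $utf8_bytes   = "c\xC3\xA9dric";    # UTF-8: é is the two bytes C3 A9

# Decoding turns bytes into characters (Unicode code points).
my $from_latin1 = decode('ISO-8859-1', $latin1_bytes);
my $from_utf8   = decode('UTF-8',      $utf8_bytes);

printf "same characters: %s\n", ($from_latin1 eq $from_utf8 ? 'yes' : 'no');
printf "second character: U+%04X\n", ord(substr($from_utf8, 1, 1));
```

Both decoded strings are six characters long and compare equal, even though the input byte strings differ in length.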


      You say
      >And in the xml the literal is encoded as "c&#xE9;dric" which is an
      >ISO-8859-1 xml entities encoding.

      This is a correct use of a numeric entity to refer to the Unicode
      character point 0x00E9. You would need to use a numeric entity if the
      XML was being constructed in ISO-8859-1, which wouldn't normally be the
      case with perl.

      You say
      >This can be solved by enforcing string literals encoding to UTF-8 in
      >Java.
      >Thus I would have in the frame "c&#xXX;&#xXX;dric", which is a correct
      >UTF-8 encoded string and would be seen as an UTF-8 in my perl
      >webservice as well.

      This does look as if you are misunderstanding the encoding. You seem to
      be trying to encode the two byte UTF-8 representation of the Unicode
      point 0x00E9. In effect this is double encoding, which is incorrect.
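
Double encoding can be sketched in Perl as follows, assuming the core Encode module. The helper `hex_bytes` and the variable names are hypothetical and only serve the illustration: UTF-8 bytes that are mistaken for Latin-1 characters and then encoded a second time turn one é into four bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Format a byte string as space-separated hex pairs.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $chars = "c\x{E9}dric";                  # six characters, é is U+00E9

my $once    = encode('UTF-8', $chars);      # correct: é becomes C3 A9
my $misread = decode('ISO-8859-1', $once);  # bug: the bytes are taken for Latin-1 characters
my $twice   = encode('UTF-8', $misread);    # each of those two "characters" is encoded again

print "encoded once:  ", hex_bytes($once),  "\n";   # 63 C3 A9 64 72 69 63
print "encoded twice: ", hex_bytes($twice), "\n";   # 63 C3 83 C2 A9 64 72 69 63
```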

      In summary

      using a numeric entity (&#xE9;) is valid for both UTF-8 and ISO-8859-1

      if you are using UTF-8, then the é character encoded as two bytes is
      correct. Depending on your perl version, this should be automatic. Your
      Java client would need to explicitly convert to UTF-8 (something like
      str.getBytes("UTF-8")).

      Regards

      Duncan





    • cedric.boufflers
      Message 2 of 6 , Jun 16, 2005
        Hello Duncan,

        So if I understand correctly, I may be trying to double-encode the strings.

        What led me to do that, and may have misled me, is the error I was getting from the encoding method:

        my method call was :

        use Encode::Encoder qw/encoder/;
        encoder($string)->bytes('UTF-8')->iso_8859_1;

        This gave me the error "\xE9 does not map to UTF-8", which is why I thought that é was not valid UTF-8.

        So this may not be a SOAP problem but a problem with the encoder method. Do you have any hint as to why it refuses to read the string as UTF-8?
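
One possible reading of that error, sketched under the assumption that `$string` held the single byte 0xE9 (the ISO-8859-1 form of é) rather than the UTF-8 pair C3 A9: a lone E9 byte is not a valid UTF-8 sequence, so a strict UTF-8 decode rejects it. The sketch uses the core Encode module directly instead of Encode::Encoder, and the labels and variable names are illustrative.

```perl
use strict;
use warnings;
use Encode;

my %input = (
    'latin1 bytes' => "c\xE9dric",       # lone E9: not a valid UTF-8 sequence
    'utf8 bytes'   => "c\xC3\xA9dric",   # C3 A9: valid UTF-8 for é
);

my %result;
for my $label (sort keys %input) {
    my $copy = $input{$label};   # decode with a check value may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    $result{$label} = defined $text ? 'decoded' : 'rejected';
    print "$label: $result{$label}\n";
}
```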

        I'm sorry, but the more I learn about encoding, the more confused I seem to get ;)

        Actually my goal is the following :

        My Web service has to write to a database encoded in Latin-1, so I have to convert the UTF-8 string to Latin-1; otherwise the data are not stored correctly. What would be the proper way to ensure that, whatever SOAP client is used (Java, Delphi, Perl, PHP, ...), I get a UTF-8 string that I can convert to Latin-1 in the Perl Web service?

        If this problem is not related to SOAP::Lite, can you suggest a list where I could get help with it? :)

        Note : Perl is 5.8 and is running under Apache1.3/mod_perl.

        Thanks a lot for your help and explanations,

        Best Regards,
        Cédric



        -- 
        ---------------------------------------------------------------------
        BOUFFLERS Cédric : cedric.boufflers@...
        ---------------------------------------------------------------------
        NordNet - 111 Rue de Croix - 59510 Hem - France
        tél : +33 3 20 66 55 55 - fax : +33 3 20 66 55 59
        ---------------------------------------------------------------------
        http://www.securitoo.com/
        http://www.nordnet.fr/
        http://www.lerelaisinternet.com/
        ---------------------------------------------------------------------
        
      • Duncan Cameron
        Message 3 of 6 , Jun 16, 2005
          My understanding is that all the parameters passed to your server class
          will be marked as UTF-8 (because they have been through the XML
          parser), so you should be able to convert a string to 8859-1 in this
          way:

          use Encode;
          my $octets = encode("iso-8859-1", $string, 1);   # 1 is Encode::FB_CROAK

          this should throw an error if $string contains characters that are not
          in 8859-1, so you will need to handle that event within an eval.
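
A minimal sketch of that suggestion, assuming the core Encode module. The wrapper name `to_latin1` and the sample strings are hypothetical; how a failed conversion should be reported is left to the service.

```perl
use strict;
use warnings;
use Encode;

# Convert a character string to ISO-8859-1 octets.
# Returns undef if some character has no Latin-1 equivalent.
sub to_latin1 {
    my ($string) = @_;
    my $copy = $string;    # encode with a check value may modify its argument
    my $octets = eval { encode('iso-8859-1', $copy, Encode::FB_CROAK()) };
    return $octets;        # undef when encode died inside the eval
}

my $ok  = to_latin1("c\x{E9}dric");   # é exists in Latin-1
my $bad = to_latin1("\x{20AC}5");     # the euro sign U+20AC does not

print defined $ok  ? "first value converted\n"  : "first value rejected\n";
print defined $bad ? "second value converted\n" : "second value rejected\n";
```

Returning undef rather than dying keeps the decision about the SOAP fault in the calling method; the error text is still available in `$@` immediately after the eval if it needs to be logged.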

          Regards

          Duncan




        • cedric.boufflers
          Message 4 of 6 , Jun 24, 2005
            Hello Duncan and the list readers,

            I have been doing some experiments. I have written a simple Web
            service in Perl:

            This is my method :

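            use Data::HexDump;

            # Return a hex dump of the string received in the Champ parameter.
            sub get_Champs
            {
                my $class    = shift;
                my $envelope = pop;    # the envelope is expected as the last argument

                my $champ = $envelope->valueof("//get_Champs/Champ");

                return SOAP::Data->name('result' => HexDump($champ));
            }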

            It just returns a hexadecimal dump of the string received.

            I called it with a standard Java client:


            *- First Test*
            System.out.println(wsenc.get_Champs("cédric"));

            In response I had :
            00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F 0123456789ABCDEF

            00000000 63 E9 64 72 69 63 c.dric

            So it seems the accented character is encoded as a single byte here,
            and Perl does not treat the string as UTF-8 in this case.


            *- Second Test*
            System.out.println(wsenc.get_Champs(new String("cédric".getBytes("UTF-8"))));

            In response I had :
            00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F 0123456789ABCDEF

            00000000 63 C3 A9 64 72 69 63 c..dric

            In this case the accented character is encoded as two bytes. Is that
            because I am double-encoding here? In this case, though, Perl does
            see the string as UTF-8.
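
One hedged reading of the two dumps, which the thread itself does not confirm: in Java, `new String(byte[])` decodes the bytes with the platform default charset, so on a Latin-1 platform the second test would send two characters (Ã and ©) instead of one é. A small diagnostic like the sketch below, where the helper name `describe` and the sample strings are hypothetical, reports characters rather than bytes and makes the two cases easy to tell apart on the server.

```perl
use strict;
use warnings;

# Describe a string by its characters, independent of how Perl stores it internally.
sub describe {
    my ($string) = @_;
    my @points = map { sprintf 'U+%04X', ord($_) } split //, $string;
    return sprintf '%d chars: %s', length($string), join(' ', @points);
}

my $single  = "c\x{E9}dric";         # é as one character, U+00E9
my $garbled = "c\x{C3}\x{A9}dric";   # two characters, Ã and ©: UTF-8 bytes misread as Latin-1

print describe($single),  "\n";
print describe($garbled), "\n";
```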

            How could I force Perl or SOAP::Lite to always handle the string as
            UTF-8?

            I have tried adding these lines:

            use POSIX qw(locale_h);
            setlocale(LC_CTYPE, "en_US.UTF-8");

            But this changes nothing. The default locale of the computer is
            already:
            LANG=en_US.UTF-8
            LANGVAR=en_US.UTF-8

            Best Regards,
            And thank you for your help.

            Cédric

            Note :
            SOAP::Lite is 0.60
            Perl is perl5 (revision 5.0 version 8 subversion 0).




