Encoding of strings for XML and for URIs

  • Misha Wolf
    Message 1 of 4 , Mar 25, 2009
      Hi Martin and Richard,
       
      It's a long time since we've been in touch -- I hope you and yours are well.
       
      Please could you advise or point me at suitable resources ...
       
      We (the IPTC) are struggling to specify how and when one should escape problematic chars when going from the logical string to XML and then to URIs.  Let us say we have the Code "B&F003000000RU" in the Scheme "ISLC".  We construct the QCode (rather like a QName) "ISLC:B&F003000000RU".  On the logical level we then have:
       
          <subject qcode="ISLC:B&F003000000RU" />
       
      Presumably, this should become:
       
          <subject qcode="ISLC:B&amp;F003000000RU" />
       
      in XML.
       
      Our rules say that we then append the Code to the Scheme URI, to create the Code URI.  If the Scheme URI is:
       
          http://www.example.org#
       
      should the Code URI be:
       
          http://www.example.org#B%26F003000000RU
       
      ?
        
      Many thanks,
      Misha
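
A minimal sketch of the two transforms asked about above, assuming Python's standard library. The Scheme URI is the one shown in the quoted reply later in this thread; the function names and the choice of `safe=""` (encode every reserved character) are illustrative assumptions, not IPTC rules.

```python
from urllib.parse import quote
from xml.sax.saxutils import quoteattr

# Scheme URI as it appears in the quoted reply; used here for illustration only.
SCHEME_URI = "http://www.example.org#"


def xml_attribute(qcode: str) -> str:
    """Serialize a logical QCode as a quoted XML attribute value."""
    return quoteattr(qcode)


def code_uri(scheme_uri: str, code: str) -> str:
    """Append the percent-encoded Code to the Scheme URI."""
    # safe="" is an assumption: every reserved character gets percent-encoded.
    return scheme_uri + quote(code, safe="")


code = "B&F003000000RU"
print("<subject qcode=%s />" % xml_attribute("ISLC:" + code))
print(code_uri(SCHEME_URI, code))
```

Both functions start from the same logical string; neither takes the other's output as input.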
       

      From: Misha Wolf
      Sent: 25 March 2009 13:24
      To: 'iptc-news-architecture-dev@yahoogroups.com'; Dave Compton
      Subject: RE: [IPTC-NAR-dev] QCode encoding

      It would be very bad if the IPTC invented its own way of doing these things.  And it would also be very bad if various members of the IPTC did these things differently.  AFAICS, we have here a protocol stack:
       
          top: a sequence of characters in the logical tokens
       
          mid: that character sequence transformed to fit the rules of XML
       
          lowest: that character sequence transformed to fit the rules of URIs
       
      This protocol stack is used all over the place.  Please let's not invent another one.
       
      Regards,
      Misha


      From: iptc-news-architecture-dev@yahoogroups.com [mailto:iptc-news-architecture-dev@yahoogroups.com] On Behalf Of Michael Steidl (IPTC)
      Sent: 25 March 2009 10:40
      To: Dave Compton; iptc-news-architecture-dev@yahoogroups.com
      Subject: RE: [IPTC-NAR-dev] QCode encoding

      Dave

      my view on this, after the discussions in January and February, is:

      a) a QCode is only a special format of the Concept URI: the scheme URI part is replaced  by the scheme alias. Conclusion: any encoding required for a concept URI must be reflected by the QCode.

      b) it is the provider's responsibility to apply the correct encoding.

      c) lexical comparison should be done without the encoding:

      Initial QCode =  ISLC:COIMBATORE, TAMIL NADU, INDIA 

      Encoded QCode = ISLC:COIMBATORE,%20TAMIL%20NADU,%20INDIA

      Decoded QCode = ISLC:COIMBATORE, TAMIL NADU, INDIA

      The crucial issue is the concept URI, assuming ISLC stands for http://cv.reuters.com/g2-cv/, the URL would be:

      without encoding: http://cv.reuters.com/g2-cv/islc/COIMBATORE, TAMIL NADU, INDIA  = invalid!!

      with encoding: http://cv.reuters.com/g2-cv/islc/COIMBATORE,%20TAMIL%20NADU,%20INDIA = valid !!

      d) this leads to a background issue, where I know that you/TR have a slightly different approach than Laurent and others, including me: should the lexical comparison be done at the level of QCodes or at the level of Concept URIs? I recall you preferred the QCode level; others preferred the only really unique level, the concept URI.

      e) there is an issue with your approach below:

      TR delivers for a concept: R:0#.FTSE

      The user has implemented lexical comparison at the URI level, so the receiver has to decide whether or not to encode the #. This is exactly what violates rule b) above. The only task thrown at the receiver should be to decode percent-encoding.

      f) finally: test:A/B%20C&amp;D is wrong; &amp; is an invalid encoding, because by the URI RFC **only** percent-encoding is valid.

      Michael
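
Points a) to c) above could be sketched roughly as follows, assuming Python's standard library. The alias table is hypothetical; its single entry is chosen only so that the result reproduces the "with encoding" URL in the message. The function names are illustrative.

```python
from urllib.parse import unquote

# Hypothetical alias table; the value is picked to match the URL in the message.
SCHEME_ALIASES = {"ISLC": "http://cv.reuters.com/g2-cv/islc/"}


def same_concept(qcode_a: str, qcode_b: str) -> bool:
    """Point c): compare QCodes lexically after removing percent-encoding."""
    return unquote(qcode_a) == unquote(qcode_b)


def concept_uri(encoded_qcode: str, aliases=SCHEME_ALIASES) -> str:
    """Point a): replace the scheme alias with the scheme URI.

    The code part is assumed to be percent-encoded already by the
    provider (point b), so it is appended unchanged.
    """
    alias, encoded_code = encoded_qcode.split(":", 1)
    return aliases[alias] + encoded_code


encoded = "ISLC:COIMBATORE,%20TAMIL%20NADU,%20INDIA"
print(same_concept(encoded, "ISLC:COIMBATORE, TAMIL NADU, INDIA"))
print(concept_uri(encoded))
```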

      --------------------------------------------------
      On 25 Mar 2009 at 10:18  Dave Compton wrote:

      > a/ R:0%23.FTSE (!! # is a delimiter, must be escaped!)
      I can understand that the above is needed when using the value in the related URI, but I thought the process we discussed was:  
      Values:  
      R:0#.FTSE  
      test:A/B C&D  
      To use as a QCode value: apply whitespace encoding and handle & etc.  

      R:0#.FTSE   : no change  

      test:A/B%20C&amp;D    : only whitespace and & in this case  

      To use in a URI: apply RFC 3986:  

      R:0%23.FTSE
      test:A%2FB%20C%26D  

      Rgds
      DC


      From: Michael Steidl (IPTC) [mailto:mdirector@...]
      Sent: 25 March 2009 08:59
      To: Dave Compton; iptc-news-architecture-dev@yahoogroups.com
      Subject: RE: [IPTC-NAR-dev] QCode encoding
      Dave

      quote from the Conf Call notes of 24 Feb 2009:

      <quote>
      *** Escaping of characters in concept URIs.

      The answer is in the URI RFC 3986:

      2.1. Percent-Encoding

        A percent-encoding mechanism is used to represent a data octet in a
        component when that octet's corresponding character is outside the
        allowed set or is being used as a delimiter of, or within, the
        component.

      </quote>

      Therefore your QCodes become:

      a/ R:0%23.FTSE (!! # is a delimiter, must be escaped!)
      b/ ISLC:B%26F003000000RU
      c/ ISLC:COIMBATORE,%20TAMIL%20NADU,%20INDIA
      d/ NI:MET%2FOPTION  (!! slash is a delimiter, must be escaped!)

      Michael
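
As a hedged illustration, the four encoded QCodes listed above can be reproduced with `urllib.parse.quote` from Python's standard library. Leaving the comma unencoded (`safe=","`) is an assumption made here to match example c/; `encode_qcode` is an illustrative name, not part of any IPTC tooling.

```python
from urllib.parse import quote


def encode_qcode(alias: str, code: str, safe: str = ",") -> str:
    """Percent-encode the code part of a QCode, leaving the alias as is.

    safe="," keeps the comma literal, which is an assumption made to
    match the examples in the message above.
    """
    return alias + ":" + quote(code, safe=safe)


samples = [
    ("R", "0#.FTSE"),
    ("ISLC", "B&F003000000RU"),
    ("ISLC", "COIMBATORE, TAMIL NADU, INDIA"),
    ("NI", "MET/OPTION"),
]
for alias, code in samples:
    print(encode_qcode(alias, code))
```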

      --------------------------------------------------
      On 25 Mar 2009 at 8:16  Dave Compton wrote:

      Please remind me where we got to re encoding QCode values - when including in the XML, as opposed to forming URIs.   

      I need to implement the encoding required for (at least) the following real codes asap:   

      a/ R:0#.FTSE   
      b/ ISLC:B&F003000000RU   
      c/ ISLC:COIMBATORE, TAMIL NADU, INDIA   
      d/ NI:MET/OPTION   

      Do these result in the following in the XML?   
      a/ R:0#.FTSE   
      b/ ISLC:B&amp;F003000000RU   
      c/ ISLC:COIMBATORE,&#x20;TAMIL&#x20;NADU,&#x20;INDIA   
      d/ NI:MET/OPTION   


      Rgds   
      DC   
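
One way to check what the XML layer does with these four candidates is to run them through an XML parser. The sketch below assumes Python's `xml.etree.ElementTree`; the helper name is illustrative. It suggests that `&#x20;` and a literal space arrive as the same attribute value, that `#` and `/` need no XML escaping, and that only the bare `&` makes the document ill-formed.

```python
import xml.etree.ElementTree as ET


def qcode_from_xml(fragment: str) -> str:
    """Parse a <subject> element and return its qcode attribute value."""
    return ET.fromstring(fragment).get("qcode")


def is_well_formed(fragment: str) -> bool:
    """Report whether the fragment parses as XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


with_char_refs = '<subject qcode="ISLC:COIMBATORE,&#x20;TAMIL&#x20;NADU,&#x20;INDIA" />'
with_spaces = '<subject qcode="ISLC:COIMBATORE, TAMIL NADU, INDIA" />'

print(qcode_from_xml(with_char_refs) == qcode_from_xml(with_spaces))
print(qcode_from_xml('<subject qcode="ISLC:B&amp;F003000000RU" />'))
print(is_well_formed('<subject qcode="ISLC:B&F003000000RU" />'))
```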


      From: iptc-news-architecture-dev@yahoogroups.com [mailto:iptc-news-architecture-dev@yahoogroups.com] On Behalf Of Michael Steidl (IPTC)
      Sent: 13 February 2009 13:55
      To: Misha Wolf; iptc-news-architecture-dev@yahoogroups.com
      Subject: RE: [IPTC-NAR-dev] QCode encoding
       Misha

      we have two "latter" options.

      As I remember, we agreed that QCodes are "URL encoded", which means all characters not allowed in a URL have to be %(hex-value) encoded.

      Michael

      On 12 Feb 2009 at 16:01  Misha Wolf wrote:

      >   
      > The latter.   
      >   
      > Rgds,   
      > Misha   
      >   
      >   
      > From: iptc-news-architecture-dev@yahoogroups.com [mailto:iptc-news-architecture-dev@yahoogroups.com] On Behalf Of Dave Compton   
      > Sent: 12 February 2009 15:12   
      > To: iptc-news-architecture-dev@yahoogroups.com   
      > Subject: [IPTC-NAR-dev] QCode encoding   
      > We've agreed that whitespace needs to be % encoded.   
      >   
      > What about other 'reserved' chars such as '&'.   
      >   
      > Actual example:   
      > Does   
      > <subject qcode="ISLC:B&F003000000RU" />   
      > become:   
      > <subject qcode="ISLC:B%26F003000000RU" />   
      > or should it be:   
      > <subject qcode="ISLC:B&amp;F003000000RU" />   
      >   
      >   
      > Rgds   
      > DC   
      > This email was sent to you by Thomson Reuters, the global news and   
      > information company.   
      > Any views expressed in this message are those of the individual sender,   
      > except where the sender   
      > specifically states them to be the views of Thomson Reuters.   
      >   
      >   


      ==================================================
      Sent by:
      Michael Steidl
      Managing Director of the IPTC <mdirector@...>
      International Press Telecommunications Council
      "Information Technology for News"
      Visit us on the web at  http://www.iptc.org   


    • Raphaël Troncy
      Message 2 of 4 , Mar 25, 2009
        Dear Misha,

        [snip]

        > We (the IPTC) are struggling to specify how and when one should escape
        > problematic chars when going from the logical string to XML and then to
        > URIs. Let us say we have the Code "B&F003000000RU" in the Scheme
        > "ISLC". We construct the QCode (rather like a QName)
        > "ISLC:B&F003000000RU". On the logical level we then have:
        >
        > <subject qcode="ISLC:B&F003000000RU" />
        >
        > Our rules say that we then append the Code to the Scheme URI, to create
        > the Code URI. If the Scheme URI is:
        >
        > http://www.example.org#
        >
        > should the Code URI be:
        >
        > http://www.example.org#B%26F003000000RU


        Yes, to be a legal URI, you would need to %-escape the ampersand. So
        your construction is the correct one, per RFC3986 [1]. You might also
        want to use IRIs per RFC3987 [2] for a non-%-escaped version.
        Cheers.

        Raphaël

        [1] http://tools.ietf.org/html/rfc3986
        [2] http://tools.ietf.org/html/rfc3987

        --
        Raphaël Troncy
        CWI (Centre for Mathematics and Computer Science),
        Science Park 123, 1098 XG Amsterdam, The Netherlands
        e-mail: raphael.troncy@... & raphael.troncy@...
        Tel: +31 (0)20 - 592 4093
        Fax: +31 (0)20 - 592 4312
        Web: http://www.cwi.nl/~troncy/
      • Misha Wolf
        Message 3 of 4 , Mar 25, 2009
          Hi Raphaël,

          I'm sure I used to know all this but the relevant brain
          cells must be offline at the moment ...

          Given:
          A&B -> A&amp;B

          Which of these happens next:
          A&B -> A&amp;B -> A%26B
          or:
          A&B -> A&amp;B -> A%26amp;B

          Can you think of somewhere where both sets of transforms
          (from logical string to XML and from XML to URI) are
          explained and illustrated?

          Thanks,
          Misha


          -----Original Message-----
          From: newsml-g2@yahoogroups.com [mailto:newsml-g2@yahoogroups.com] On Behalf Of Raphaël Troncy
          Sent: 25 March 2009 13:49
          To: newsml-g2@yahoogroups.com
          Cc: duerst@...; ishida@...
          Subject: Re: [newsml-g2] Encoding of strings for XML and for URIs

          ------------------------------------

          Any member of this IPTC moderated Yahoo group must comply with the Intellectual Property Policy of the IPTC, available at http://www.iptc.org/goto/ipp. Any posting is assumed to be submitted under the conditions of this IPTC IP Policy.
        • Raphaël Troncy
          Message 4 of 4 , Mar 25, 2009
            Misha,

            In the context of a fragment, i.e. what is behind the hash in the URI,
            the relevant section of the RFC is
            http://tools.ietf.org/html/rfc3986#section-3.5
            The reserved characters are in
            http://tools.ietf.org/html/rfc3986#section-2.2

            > Given:
            > A&B -> A&amp;B
            >
            > Which of these happens next:
            > A&B -> A&amp;B -> A%26B

            YES: http://www.example.org#A%26B is a valid URI

            > or:
            > A&B -> A&amp;B -> A%26amp;B

            NO, that makes a different URI, and the ';' needs to be also escaped
            anyway.

            > Can you think of somewhere where both sets of transforms
            > (from logical string to XML and from XML to URI) are
            > explained and illustrated?

            RFC :-)
            More seriously, I will have to look; I'll come back if I find something.
            Cheers.

            Raphaël
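
The YES/NO answers above come down to the order of operations: percent-encoding applies to the logical string that the XML parser hands back, not to the serialized XML text. A small sketch of both orders, assuming Python's standard library and an illustrative helper name, with `safe=""` again an assumption:

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

BASE = "http://www.example.org#"


def fragment_uri_from_xml(xml_text: str, base: str = BASE) -> str:
    """Undo the XML escaping first, then percent-encode for the URI."""
    logical = ET.fromstring(xml_text).get("qcode")  # "A&amp;B" is read as "A&B"
    return base + quote(logical, safe="")


xml_text = '<subject qcode="A&amp;B" />'

# Intended order: XML text -> logical string -> URI
print(fragment_uri_from_xml(xml_text))

# Skipping the XML decoding step encodes the entity itself, ';' included,
# and so names a different URI.
print(BASE + quote("A&amp;B", safe=""))
```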
