
RE: Towards a TAG consideration of CURIEs

  • Misha Wolf
    Message 1 of 5 , Mar 30, 2007
      To add another element to the mix, the IPTC has defined, and will be
      using, QCodes (Qualified Codes) for the forthcoming News Architecture
      and all G2 standards based upon it: NewsML-G2, NITF-G2, SportsML-G2,
      EventsML-G2, etc.

      Some of you will recall my presentation to the W3C AC last year about
      this subject. We were then still considering the use of CURIEs but
      have since concluded that what we need are QNames without the
      restriction on the leading character of the right-hand side. So we
      created QCodes, which don't have this limitation.

      Henry pointed out at the Edinburgh AC meeting that if we used simple
      concatenation to give people access to information about terms in
      our taxonomies, we could end up with illegal fragment IDs. So:

      If:
      <subject qcode="iptc:123456"/>
      and if:
      iptc -> http://www.iptc.org/NewsCodes#
      and if we used simple concatenation, we'd get:
      iptc -> http://www.iptc.org/NewsCodes#123456

      There is, of course, the other option:
      iptc -> http://www.iptc.org/NewsCodes/
      then if we used simple concatenation, we'd get:
      iptc -> http://www.iptc.org/NewsCodes/123456

      We've decided to side-step this by specifying that the concatenation
      rules are taxonomy-specific and are up to the provider of each
      taxonomy.
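A minimal sketch of the simple-concatenation case described above, written in Python. It assumes that "illegal fragment ID" refers to the XML rule that an ID must be a name starting with a letter or underscore; the regular expression is an ASCII-only approximation of NCName, and the function names are illustrative only, not part of any IPTC specification.

```python
import re

# ASCII-only approximation of the XML NCName rule (assumption: the real
# production also admits many non-ASCII characters).
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def expand(prefix_map, qcode):
    """Expand 'prefix:code' by simple concatenation of the mapped URI and the code."""
    prefix, code = qcode.split(":", 1)
    return prefix_map[prefix] + code


def fragment_is_xml_name(uri):
    """Return True if the URI has no fragment, or its fragment looks like an NCName."""
    _, sep, fragment = uri.partition("#")
    return not sep or NCNAME_ASCII.match(fragment) is not None


hash_uri = expand({"iptc": "http://www.iptc.org/NewsCodes#"}, "iptc:123456")
slash_uri = expand({"iptc": "http://www.iptc.org/NewsCodes/"}, "iptc:123456")

print(hash_uri, fragment_is_xml_name(hash_uri))
print(slash_uri, fragment_is_xml_name(slash_uri))
```

Under these assumptions the "#" form yields a fragment that starts with a digit and fails the name check, while the "/" form has no fragment to check at all.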

      For us the bottom line is, as I said in Edinburgh, that we require
      tuples without some of the constraints that QNames took from XML.
      The construction of a URI pointing to useful info (and usable for
      RDF) we see as icing on the cake.

      We think that some Semantic Web specs and tools may gag on QCodes
      but this is where theory meets the real World.

      Misha Wolf
      News Standards Manager, Reuters
      http://www.iptc.org/ | http://www.iptc.org/NAR/


      -----Original Message-----
      From: www-tag-request@... [mailto:www-tag-request@...] On Behalf
      Of Henry S. Thompson
      Sent: 30 March 2007 12:06
      To: www-tag@...
      Subject: Towards a TAG consideration of CURIEs


      -----BEGIN PGP SIGNED MESSAGE-----
      Hash: SHA1

      I took an action at the last TAG telcon (minutes forthcoming) to try
      to draft a statement of where the current CURIE draft [1] (actually
      quotes below are from a Member-only editors' draft [1a], which
      contains some minor changes to the syntax) raises architectural
      issues.

      *Executive Summary*

      Do the expected benefits of CURIEs outweigh the potential costs in
      introducing a _third_ syntax for identifiers into the languages of the
      Web?

      *Background*

      XML Namespaces introduced the notion of expanded names, that is, names
      in the form of a pair of namespace name (possibly empty) and local
      name. It further introduced an abbreviation mechanism, involving
      prefixes and namespace declarations. The word 'QName' has come to be
      used for both the syntactic form such abbreviations take (i.e.
      (NCName ':')? NCName
      ) and the two-part name such abbreviations stand for. As such, QNames
      are clearly distinct from URIs. Their use as identifiers, however,
      immediately raises the question of their relationship with URIs.

      The TAG considered the use of QNames as shorthand for URIs in issue
      rdfmsQnameUriMapping-6 [2], and issued a finding on the related
      subject of using QNames as names for things other than XML elements
      and attributes, called "Using Qualified Names (QNames) as Identifiers
      in XML Content" [3].

      In that finding, we find

      "We observe also that there is an overlap in the lexical space of
      QNames and URIs.

      "Specifications that use QNames to represent {URI, local-name}
      pairs SHOULD NOT allow both forms in attribute values or element
      content where they would be indistinguishable."

      and also

      "Where there is a compelling reason to use QNames instead of URIs
      for identification, it is imperative that specifications provide a
      mapping between QNames and URIs, if such a mapping is possible."

      The Architecture of the World Wide Web summarises these points in its
      section on QNames [4].

      *CURIEs*

      Unlike QNames, CURIEs are explicitly intended as abbreviations for
      URIs. None-the-less they use an extension of the syntax of QNames,
      namely
      NCName ':' [pretty unconstrained string]
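The difference between the two syntactic forms can be sketched in Python as follows. This is an illustration only: the NCName pattern is an ASCII-only approximation, and the "CURIE-like" pattern simply takes "pretty unconstrained string" literally rather than reproducing the production in the CURIE draft.

```python
import re

# ASCII-only approximation of NCName (assumption, see lead-in).
NCNAME = r"[A-Za-z_][A-Za-z0-9._-]*"

# QName syntax: (NCName ':')? NCName
QNAME = re.compile(rf"(?:{NCNAME}:)?{NCNAME}\Z")

# CURIE-like syntax: NCName ':' followed by an unconstrained string.
CURIE_LIKE = re.compile(rf"{NCNAME}:.*\Z", re.DOTALL)


def is_qname(text):
    """Return True if text matches the approximate QName syntax."""
    return QNAME.match(text) is not None


def is_curie_like(text):
    """Return True if text is a prefix, a colon, and any remaining string."""
    return CURIE_LIKE.match(text) is not None


for candidate in ("iptc:subject", "iptc:123456", "subject"):
    print(candidate, is_qname(candidate), is_curie_like(candidate))
```

A value such as "iptc:123456" is rejected by the QName pattern because the local part starts with a digit, but is accepted by the looser pattern.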

      *Architectural issues*

      If the CURIE WD is eventually adopted, we will have three related
      forms of identification:

      1) URIs themselves;
      2) CURIEs as abbreviations of (absolute) URIs;
      3) QNames as abbreviations for expanded names, which in _some_
      circumstances are mapped by convention or explicit algorithm to
      URIs.

      In [3] and [4] the potential confusions of overlapping syntax and
      function arising from the use of QNames as identifiers and even as
      abbreviations for URIs are accepted as a fact with good historical and
      pragmatic motivations.

      The fundamental architectural question raised by the CURIE
      specification is then whether the expected benefits outweigh the
      potential costs in introducing a _third_ syntax for identifiers into
      the languages of the Web.

      A subsidiary question depends on exactly what the intended scope of
      application of this specification is -- if it is as widely targeted
      as it appears to be, would it not be better to consider an addendum to
      the relevant RFCs, e.g. 3986 and 3987 [5] [6]?

      And finally, the question of how CURIE would integrate with the typing
      and typed-data manipulation facilities provided by W3C XML Schema and
      XPath 2.0/XSLT 2.0/XQuery also needs careful consideration.

      ht

      [1] http://www.w3.org/TR/2007/WD-curie-20070307/
      [1a] http://www.w3.org/MarkUp/Group/2007/ED-curie-20070322/
      [2] http://www.w3.org/2001/tag/issues.html?type=1#rdfmsQnameUriMapping-6
      [3] http://www.w3.org/2001/tag/doc/qnameids.html
      [4] http://www.w3.org/TR/webarch/#xml-qnames
      [5] http://www.ietf.org/rfc/rfc3986.txt
      [6] http://www.ietf.org/rfc/rfc3987.txt
      - --
      Henry S. Thompson, HCRC Language Technology Group, University of
      Edinburgh
      Half-time member of W3C Team
      2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
      Fax: (44) 131 650-4587, e-mail: ht@...
      URL: http://www.ltg.ed.ac.uk/~ht/
      [mail really from me _always_ has this .sig -- mail without it is forged
      spam]
      -----BEGIN PGP SIGNATURE-----
      Version: GnuPG v1.2.6 (GNU/Linux)

      iD8DBQFGDO8ckjnJixAXWBoRAguVAJ9bRp253y37UMuZwxyTQ07o+60NswCePQpe
      1r3jL/PRXK5J7Gaz63u1diA=
      =c22I
      -----END PGP SIGNATURE-----



      This email was sent to you by Reuters, the global news and information company.
      To find out more about Reuters visit www.about.reuters.com

      Any views expressed in this message are those of the individual sender,
      except where the sender specifically states them to be the views of Reuters Limited.

      Reuters Limited is part of the Reuters Group of companies, of which Reuters Group PLC is the ultimate parent company.
      Reuters Group PLC - Registered office address: The Reuters Building, South Colonnade, Canary Wharf, London E14 5EP, United Kingdom
      Registered No: 3296375
      Registered in England and Wales
    • Misha Wolf
      Message 2 of 5 , Apr 6 4:32 PM
        Hi David,

        The situation is as follows.

        The IPTC's first priority is B2B News interchange. Support for
        painless discovery of additional information about Taxonomies and
        for the integration of News Taxonomies with the Semantic Web are
        desirable goals but, for the IPTC, they come second. This
        prioritisation relates both to the importance we attach to each
        aspect, and to the order in which we are tackling them.

        Now, the whole business of URI construction from tuples is a bit
        of a mess. XML Namespaces don't mandate such a mechanism. RDF does
        require it, but the situation is difficult if many of the codes
        start with a digit.

        Given a code such as "123456", and given that we refuse to change
        the code to, say, "_123456", the main legal choices before us
        appear to be:

        1. Simple concatenation using "/" as the delimiter
        "http://www.iptc.org/NewsCodes/" & "123456" ->
        "http://www.iptc.org/NewsCodes/123456"

        2. Simple concatenation using "#_" as the delimiter
        "http://www.iptc.org/NewsCodes#_" & "123456" ->
        "http://www.iptc.org/NewsCodes#_123456"

        3. Concatenation using "#_" as the delimiter, where the "_" is
        glue, mandated by the relevant specification
        "http://www.iptc.org/NewsCodes#" & "_" & "123456" ->
        "http://www.iptc.org/NewsCodes#_123456"

        4. Concatenation using "#_" as the delimiter, where the "#_" is
        glue, mandated by the relevant specification
        "http://www.iptc.org/NewsCodes" & "#_" & "123456" ->
        "http://www.iptc.org/NewsCodes#_123456"

        As we would very strongly prefer to end up with a Web page per
        Taxonomy, containing a descriptive entry per concept, where the
        constructed URI results in the relevant entry, we are not
        enthusiastic about option 1.
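The four options listed above can be compared with a short Python sketch. The helper `build_uri` and its `glue` parameter are hypothetical names used for illustration; they are not taken from any IPTC or W3C specification.

```python
def build_uri(taxonomy_uri, code, glue=""):
    """Join a taxonomy URI and a code, inserting specification-mandated glue if any."""
    return taxonomy_uri + glue + code


CODE = "123456"

options = {
    # 1. Simple concatenation, taxonomy URI ends in "/"
    1: build_uri("http://www.iptc.org/NewsCodes/", CODE),
    # 2. Simple concatenation, taxonomy URI ends in "#_"
    2: build_uri("http://www.iptc.org/NewsCodes#_", CODE),
    # 3. Taxonomy URI ends in "#", the specification mandates "_" as glue
    3: build_uri("http://www.iptc.org/NewsCodes#", CODE, glue="_"),
    # 4. Taxonomy URI has no delimiter, the specification mandates "#_" as glue
    4: build_uri("http://www.iptc.org/NewsCodes", CODE, glue="#_"),
}

for number, uri in options.items():
    print(number, uri)
```

Options 2, 3 and 4 produce the same URI. They differ only in whether the "_" (or "#_") is carried by the declared taxonomy URI or injected by the consuming software.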

        That appears to leave options 2, 3 and 4. We have felt uneasy
        about choosing between them without considered advice from the
        SemWeb community.

        As we are freezing the XML Schema for the News Architecture for our
        G2 Standards next week, and hope to ratify it at our AGM in Tokyo
        next month, this is an excellent time to consider and resolve the
        question of how to build URIs for Taxonomies used in News.

        We would very much welcome your input.

        Misha Wolf
        News Standards Manager, Reuters, http://www.reuters.com/
        Vice Chair, News Architecture WP, IPTC, http://www.iptc.org/


        -----Original Message-----
        From: Booth, David (HP Software - Boston) [mailto:dbooth@...]
        Sent: 06 April 2007 17:12
        To: Misha Wolf; Henry S. Thompson; www-tag@...;
        newsml-g2@yahoogroups.com
        Subject: RE: Towards a TAG consideration of CURIEs

        > From: www-tag-request@... [mailto:www-tag-request@...]
        > . . .
        > Henry pointed out at the Edinburgh AC meeting that if we used simple
        > concatenation to give people access to information about terms in
        > our taxonomies, we could end up with illegal fragment IDs. So:
        >
        > If:
        > <subject qcode="iptc:123456"/>
        > and if:
        > iptc -> http://www.iptc.org/NewsCodes#
        > and if we used simple concatenation, we'd get:
        > iptc -> http://www.iptc.org/NewsCodes#123456
        >
        > There is, of course, the other option:
        > iptc -> http://www.iptc.org/NewsCodes/
        > then if we used simple concatenation, we'd get:
        > iptc -> http://www.iptc.org/NewsCodes/123456
        >
        > We've decided to side-step this by specifying that the concatenation
        > rules are taxonomy-specific and are up to the provider of each
        > taxonomy.

        So any URI-based program using multiple taxonomies must have special
        concatenation rules built in for *each* taxonomy? That sounds awful.
        Was there some reason why the group could at least recommend that the
        namespace part end with either "/" or "#" (along with corresponding
        constraints on the local part)?

        David Booth, Ph.D.
        HP Software
        +1 617 629 8881 office | dbooth@...
        http://www.hp.com/go/software


      • John Cowan
        Message 3 of 5 , Apr 6 4:46 PM
          Misha Wolf scripsit:

          > As we would very strongly prefer to end up with a Web page per
          > Taxonomy,

          Is that really sensible when taxonomies are very large? Consider
          SNOMED-CT, with upwards of 300,000 terms. I should think that
          the choice of / vs. # should be allowed to depend on the taxonomy
          in use.

          --
          John Cowan http://ccil.org/~cowan cowan@...
          Economists were put on this planet to make astrologers look good.
          --Leo McGarry
        • Misha Wolf
          Message 4 of 5 , Apr 7 8:17 AM
            John Cowan wrote:

            > Misha Wolf scripsit:
            >
            > > As we would very strongly prefer to end up with a Web page per
            > > Taxonomy,
            >
            > Is that really sensible when taxonomies are very large? Consider
            > SNOMED-CT, with upwards of 300,000 terms. I should think that
            > the choice of / vs. # should be allowed to depend on the taxonomy
            > in use.

            Well, there are two options for URI construction:

            a) use simple concatenation of taxonomy URI and code,

            b) require that a specified string be injected between the taxonomy
            URI and the code.

            I agree with David Booth that consuming programs shouldn't have to
            contain hardwired knowledge of the rules for each taxonomy. I'm not
            sure, though, that there exists a viable mechanism for telling a
            program which of the above to do, for each of the hundreds of
            taxonomies used for News. I haven't looked at GRDDL for some time,
            but I seem to recall that it is designed for interpreting document
            instances, so is probably not the right tool for specifying how to
            handle a taxonomy that will be used by millions of documents. I
            also don't recall such a capability in RDDL, though I haven't looked
            at it for quite some time either.

            So if we limited ourselves to one rule only, and if we wanted to
            support the use of both "#" and "/", we would probably have to go
            for simple concatenation and specify that in cases where any of the
            codes would not be legal fragment IDs, the taxonomy URI must end
            with a character which will sanitise the code. This approach is
            illustrated by choices 1 and 2 in my previous mail:

            1. Simple concatenation using "/" as the delimiter
            "http://www.iptc.org/NewsCodes/" & "123456" ->
            "http://www.iptc.org/NewsCodes/123456"

            2. Simple concatenation using "#_" as the delimiter
            "http://www.iptc.org/NewsCodes#_" & "123456" ->
            "http://www.iptc.org/NewsCodes#_123456"

            One of the disadvantages is that a number of RDF tools can't cope
            with choice 2. At any rate, this seemed to be the case when I last
            looked into this matter.

            Misha Wolf
            News Standards Manager, Reuters, http://www.reuters.com/
            Vice Chair, News Architecture WP, IPTC, http://www.iptc.org/

          • John Cowan
            Message 5 of 5 , Apr 12 9:34 PM
              Booth, David (HP Software - Boston) scripsit:

              > Would it be feasible to mandate a particular prefix as part of all
              > taxonomy IDs, such as "code:"? For example:

              Or for that matter just "_". I *never* understood why that was
              such a problem.

              > I know you (or someone else) mentioned that publishers do not want to
              > modify their existing codes, but something like this would be easy for
              > both human and machine to syntactically distinguish from the original
              > codes ("12345" or "foo"). In that sense the prefix seems conceptually
              > no different from other XML syntax that surrounds the original codes and
              > must be parsed away to retrieve the original codes.

              +1

              --
              Where the wombat has walked, John Cowan <cowan@...>
              it will inevitably walk again. http://www.ccil.org/~cowan