
data-types-url

  • Sam Ruby
    Message 1 of 8 , Feb 28, 2006
      To avoid the necessity of chasing around multiple specifications and
      interpreting things based on what isn't said, a simple statement should
      be added to the data-types-url section of the spec:

      IRIs MUST be converted to URIs before being included in an RSS 2.0
      document.

      Perhaps with a hypertext link to
      http://www.apps.ietf.org/rfc/rfc3987.html#sec-3.2

      Background: domain names in URIs have always been restricted to the
      US-centric ASCII character set. Understandably, there has been a
      growing demand for domain names that include characters used in
      non-English languages. From RFC 3987:

      The characters in URIs are frequently used for representing words of
      natural languages. This usage has many advantages: Such URIs are
      easier to memorize, easier to interpret, easier to transcribe, easier
      to create, and easier to guess. For most languages other than
      English, however, the natural script uses characters other than A -
      Z. For many people, handling Latin characters is as difficult as
      handling the characters of other scripts is for those who use only
      the Latin alphabet. Many languages with non-Latin scripts are
      transcribed with Latin letters. These transcriptions are now often
      used in URIs, but they introduce additional ambiguities.

      As an example, see:

      http://www.atemschutzunfälle.de/asu.rdf

      Despite the rdf extension, this actually is a valid RSS 0.93 feed.

      Because of concerns about breaking existing software, this was
      approached in two phases. RFC 3490 specifies a backwards-compatible
      mechanism for Internationalizing Domain Names in Applications (IDNA).
      It involves encoding the non-ASCII characters in a special way. The
      domain name above, which contains an umlaut, gets encoded thus:
      www.xn--atemschutzunflle-7nb.de
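
The encoding of the domain name described above can be reproduced with a minimal sketch like the following. It assumes Python's built-in "idna" codec, which implements the older IDNA 2003 rules; the function name host_to_ascii is hypothetical and not part of any spec discussed in this thread.

```python
def host_to_ascii(host: str) -> str:
    """Map a host name with non-ASCII labels to its ASCII ("xn--") form.

    Relies on Python's built-in "idna" codec (IDNA 2003). The helper
    name is an illustration only.
    """
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    # "\u00e4" is the a-umlaut from the example domain in this message.
    print(host_to_ascii("www.atemschutzunf\u00e4lle.de"))
    # prints: www.xn--atemschutzunflle-7nb.de
```

Only the label containing the umlaut changes; the "www" and "de" labels are already ASCII and pass through untouched.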

      The other way forward was captured in RFC 3987, and it allows such
      characters to be included directly into IRIs. Quoting from that RFC:

      a. A protocol or format element should be explicitly designated to
      be able to carry IRIs. The intent is not to introduce IRIs into
      contexts that are not defined to accept them. For example, XML
      schema [XMLSchema] has an explicit type "anyURI" that includes
      IRIs and IRI references. Therefore, IRIs and IRI references can
      be in attributes and elements of type "anyURI". On the other
      hand, in the HTTP protocol [RFC2616], the Request URI is defined
      as a URI, which means that direct use of IRIs is not allowed in
      HTTP requests.

      Including IRIs as the url attribute of enclosure elements would quite
      likely break existing software. As that was not the intent of IRIs, any
      IRI needs to be mapped to a URI first.

      Again, I don't think all this background needs to be included in the
      spec, but a simple statement like the one suggested above would be
      appropriate.

      Test cases:

      http://feedvalidator.org/testcases/rss20/data-types-url/

      - Sam Ruby
    • Andy Henderson
      Message 2 of 8 , Feb 28, 2006
        Sam Ruby wrote:
        > Again, I don't think all this background needs to be included in the
        > spec, but a simple statement like the one suggested above would be
        > appropriate.

        I would say the background is very important. The simple statement would
        have meant nothing to me.

        Andy
      • rcade
        Message 3 of 8 , Feb 28, 2006
          --- In rss-public@yahoogroups.com, Sam Ruby <rubys@...> wrote:
          > Again, I don't think all this background needs to be included in the
          > spec, but a simple statement like the one suggested above would be
          > appropriate.

          Is there a workaround that RSS publishers in other languages are using
          so that they may use IRIs as URLs in RSS, or are they simply forced to
          employ URLs with the anglicized character set?
        • Sam Ruby
          Message 4 of 8 , Feb 28, 2006
            rcade wrote:
            > --- In rss-public@yahoogroups.com, Sam Ruby <rubys@...> wrote:
            >
            >>Again, I don't think all this background needs to be included in the
            >>spec, but a simple statement like the one suggested above would be
            >>appropriate.
            >
            > Is there a workaround that RSS publishers in other languages are using
            > so that they may use IRIs as URLs in RSS, or are they simply forced to
            > employ URLs with the anglicized character set?

            Unless you are setting out to change what RSS 2.0 is, one need only look
            at the baseline 2.0.1-rv-6 (a.k.a. "Harvard") spec for the enclosure
            element, which says quite simply and clearly "The url must be an http
            url". Given this, one could say that "they simply [are] forced to
            employ URLs with the anglicized character set"(*).

            While that SOUNDS bad, in practice it is not. There is a clear and
            reversible (for all but some pesky edge cases of no consequence) mapping
            from IRIs to URIs. And all this is handled transparently by some browsers.

            Try entering either http://www.atemschutzunfälle.de/asu.rdf or
            http://www.xn--atemschutzunflle-7nb.de/asu.rdf in the Feed Validator.
            Either way, you will get the same results. In the validation results,
            you will see the "human friendly" version in the input field. If you
            look at the text link at the bottom of the page, you will see the
            internal or "IDNA" version, one that is completely acceptable to all
            HTTP stacks, and conforms to the RSS 2.0 specification.

            - Sam Ruby

            (*) Note that I am talking about the "host" portion of the URI here.
            Non-ASCII characters may be percent encoded and included in other
            portions of the URI, for example, inside a query string.
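
The mapping described in this message (IDNA for the host, percent-encoding for the other portions) could be sketched roughly as below. This is a simplified illustration, not the full procedure from section 3.2 of RFC 3987 and not how the Feed Validator implements it: the function name iri_to_uri is hypothetical, userinfo is ignored, and the lists of "safe" characters are an assumption chosen so that existing percent-escapes and common delimiters are left alone.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Assumption: characters left unescaped, so that an input that is
# already a URI passes through unchanged.
_SAFE_PATH = "/%:@!$&'()*+,;=~-._"
_SAFE_QUERY = _SAFE_PATH + "?"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an IRI to an ASCII-only URI."""
    parts = urlsplit(iri)

    # Host portion: IDNA ("xn--") encoding via Python's built-in codec.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"

    # Other portions: percent-encode the UTF-8 bytes of non-ASCII characters.
    path = quote(parts.path, safe=_SAFE_PATH)
    query = quote(parts.query, safe=_SAFE_QUERY)
    fragment = quote(parts.fragment, safe=_SAFE_QUERY)

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://www.atemschutzunf\u00e4lle.de/asu.rdf"))
    # prints: http://www.xn--atemschutzunflle-7nb.de/asu.rdf
```

A publisher could run a url value through a function like this before writing it into the feed, so that the document itself only ever contains ASCII URLs.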
          • rcade
            Message 5 of 8 , Mar 1, 2006
              --- In rss-public@yahoogroups.com, Sam Ruby <rubys@...> wrote:
              >IRIs MUST be converted to URIs before being included in an RSS 2.0
              >document.

              I think that the following sentence in data-types-urls serves the same
              purpose without sounding like a new requirement for RSS implementers:

              These elements MUST NOT contain IRIs.

              The word "IRIs" could link to http://www.apps.ietf.org/rfc/rfc3987.html.

              Implementers who are conversant with IRIs would know this means a
              conversion to URLs is necessary in order to be compliant with Really
              Simple Syndication.

              This wouldn't be a change because 2.0.1-rv-6 requires URLs, and IRIs
              are not URLs.
            • Sam Ruby
              Message 6 of 8 , Mar 1, 2006
                rcade wrote:
                > --- In rss-public@yahoogroups.com, Sam Ruby <rubys@...> wrote:
                >
                >>IRIs MUST be converted to URIs before being included in an RSS 2.0
                >>document.
                >
                > I think that the following sentence in data-types-urls serves the same
                > purpose without sounding like a new requirement for RSS implementers:
                >
                > These elements MUST NOT contain IRIs.
                >
                > The word "IRIs" could link to http://www.apps.ietf.org/rfc/rfc3987.html.
                >
                > Implementers who are conversant with IRIs would know this means a
                > conversion to URLs is necessary in order to be compliant with Really
                > Simple Syndication.
                >
                > This wouldn't be a change because 2.0.1-rv-6 requires URLs, and IRIs
                > are not URLs.

                I am fine with that wording. Now let's look at how these two
                suggestions can be combined.

                These elements MUST NOT contain IRIs. IRIs MUST be converted to
                URIs before being included in an RSS 2.0 document.

                The first sentence sounds like "note to non-English people: you are
                screwed". The second sentence says "no you are not, here's a path
                forward, complete with a helpful link to section 3.2 of RFC 3987 which
                tells you what you need to do".

                But however you choose to word it is fine with me.

                - Sam Ruby
              • Sam Ruby
                Message 7 of 8 , Mar 1, 2006
                  Sam Ruby wrote:
                  > rcade wrote:
                  >
                  >>--- In rss-public@yahoogroups.com, Sam Ruby <rubys@...> wrote:
                  >>
                  >>>IRIs MUST be converted to URIs before being included in an RSS 2.0
                  >>>document.
                  >>
                  >>I think that the following sentence in data-types-urls serves the same
                  >>purpose without sounding like a new requirement for RSS implementers:
                  >>
                  >>These elements MUST NOT contain IRIs.
                  >
                  > I am fine with that wording,

                  Upon further reflection, that sentence is misleading.

                  The set of valid IRIs is a proper superset of the set of valid
                  URIs. So disallowing IRIs would disallow URIs.

                  The process defined for converting an IRI which is already a URI
                  to a URI is a no-op.

                  - Sam Ruby
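
The superset relationship above suggests a simple check: a value that contains only ASCII characters needs no mapping, and only values with non-ASCII characters are IRIs that are not already URIs. A minimal sketch follows, with the hypothetical function name needs_iri_mapping; it assumes Python 3.7 or later for str.isascii and does not validate URI syntax.

```python
def needs_iri_mapping(value: str) -> bool:
    """Hypothetical check: True if the value contains non-ASCII characters
    and so would have to be mapped to a URI before going into a feed.

    This only looks at the character repertoire; it does not check that
    the value is a syntactically valid IRI or URI.
    """
    return not value.isascii()


if __name__ == "__main__":
    # Already a URI: the conversion would be a no-op.
    print(needs_iri_mapping("http://www.xn--atemschutzunflle-7nb.de/asu.rdf"))
    # Contains an a-umlaut: must be mapped first.
    print(needs_iri_mapping("http://www.atemschutzunf\u00e4lle.de/asu.rdf"))
```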
                • A. Pagaltzis
                  Message 8 of 8 , Mar 1, 2006
                    * Sam Ruby <rubys@...> [2006-03-01 13:15]:
                    >The set of valid IRIs is a proper superset of the set of
                    >valid URIs. So disallowing IRIs would disallow URIs.

                    I think the correct wording for the spec would be that IRIs with
                    non-ASCII characters MUST be given in their punycode-encoded URI
                    representation.

                    Regards,
                    --
                    Aristotle Pagaltzis // <http://plasmasturm.org/>