Good example of runaway double encoding

  • Bill Kearney
    Message 1 of 5 , Oct 8, 2002
      Hi folks,

      This Radio feed is a fine example of utterly run-away encoding problems:
      http://www.syndic8.com/feedinfo.php?FeedID=9052

      Double-encoding, triple-encoding, use of HTML entities, ugh, it's quite a mess.

      Doubtless the user doesn't know what's being done here. Can Radio take
      steps to fix this sort of thing? I've made this request /countless/ times
      before. I know this is the dirty grunt work nobody wants to do but it NEEDS
      attention.

      -Bill Kearney
    • Mark Paschal
      Message 2 of 5 , Oct 9, 2002
        > Double-encoding, triple-encoding, use of HTML
        > entities, ugh, it's quite a mess.

        That feed looks fine to me. The HTML display is:

        "Open the html file as text and copy the Chinese symbols (they look like this:
        倚天屠龙记) into the Radio editor."

        so the HTML code is:

        "Open the html file as text and copy the Chinese symbols (they look like this:
        倚天屠龙记) into the Radio
        editor."

        so encoding the HTML CDATA into PCDATA yields:

        "Open the html file as text and copy the Chinese symbols (they look like this:
        倚天屠龙记)
        into the Radio editor."

        which is what appears in the RSS file. Feel free to argue HTML CDATA is
        undesirable, but that particular feed is correct RSS 0.92 as far as I can
        tell.
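The layering described above can be sketched in a few lines of Python. This is only an illustration, not what Radio does internally: it assumes the HTML source writes the Chinese characters as decimal numeric character references, and the names `display_text`, `html_source`, and `rss_pcdata` are made up for the example.

```python
import html

# What the reader should finally see in a browser.
display_text = "they look like this: 倚天屠龙记"

# One possible HTML source (assumption): non-ASCII characters written
# as decimal numeric character references.
html_source = "".join(
    ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in display_text
)

# Carrying that HTML source as PCDATA in an RSS element escapes the
# markup characters one more time, so each "&" becomes "&amp;".
rss_pcdata = html.escape(html_source, quote=False)

print(html_source)
print(rss_pcdata)

# An XML parser undoes the outer layer; an HTML renderer undoes the inner one.
assert html.unescape(rss_pcdata) == html_source
assert html.unescape(html_source) == display_text
```

Seen this way, the "&amp;#...;" sequences in the feed are one layer of HTML inside one layer of XML. Whether that counts as double encoding depends on whether the element is meant to carry HTML source or plain text.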



        --
        Mark Paschal
        http://markpasc.org/blog/
        markpasc@...
      • Bill Kearney
        Message 3 of 5 , Oct 9, 2002
          Gotta disagree with you here Mark.

          Look at lines 106-108 of that XML.
          http://www.syndic8.com/feedinfo.php?FeedID=9052&Section=xml

          Double encoding of HTML entities is bad. There IS a situation where doing so is
          needed. That's to express actual markup as text. Most users never need to do
          this.

          There's also the issue of using HTML entities inside an XML document without
          making note of such in the declaration. That's a whole other train wreck. One
          I'll avoid mentioning again.

          As to your examples, unless you NEED to show the encoding there's NO need to
          double-encode it. The question is whether you want to paste those into a
          WYSIWYG editor or a source editor. If you pasted that into a WYSIWYG editor
          you'd end up with the mess you describe. If you pasted it into the source
          editor, Radio still mangles it. Radio's unfriendly treatment of non-English
          languages notwithstanding, it's wrong to double-encode, and it puts an extra
          burden on the eventual display system. Not everything has a "clean up after
          user mistakes" display handler for HTML.

          As for wrapping in CDATA, there's no need to entity encode at all. It'd be
          doubly stupid to encode AND wrap in CDATA blocks.
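To make the CDATA point concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The `<description>` snippets are invented for illustration and are not taken from the feed under discussion.

```python
import xml.etree.ElementTree as ET

# The same HTML fragment carried three ways inside an XML element.
escaped = "<description>&lt;b&gt;Hello&lt;/b&gt; &amp; welcome</description>"
cdata = "<description><![CDATA[<b>Hello</b> & welcome]]></description>"
# Entity-encoded AND wrapped in a CDATA section: the escapes survive parsing.
both = "<description><![CDATA[&lt;b&gt;Hello&lt;/b&gt; &amp; welcome]]></description>"

text_escaped = ET.fromstring(escaped).text
text_cdata = ET.fromstring(cdata).text
text_both = ET.fromstring(both).text

# Escaping once or using CDATA once gives the consumer the same string.
assert text_escaped == text_cdata == "<b>Hello</b> & welcome"

# Doing both leaves a layer of escapes for the consumer to clean up.
assert text_both == "&lt;b&gt;Hello&lt;/b&gt; &amp; welcome"
```

Either mechanism alone delivers the HTML intact. Combining them hands the display system text that is still escaped, which is the extra burden described above.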

          The argument here is that while it's possible to depend on browsers cleaning up
          the mistakes, there's no need to force them to do so if you can avoid mangling
          it at the source.

          -Bill Kearney

        • Mark Paschal
          Message 4 of 5 , Oct 9, 2002
            > Look at lines 106-108 of that XML.

            Ah, yeah, it was the other strange construction I saw last time. Sorry about
            that.


            > Double encoding of HTML entities is bad.

            I sure won't argue there!


            > There's also the issue of using HTML entities
            > inside an XML document without making note of
            > such in the declaration. That's a whole other
            > train wreck. One I'll avoid mentioning again.

            But isn't that the point of lines 106 through 108? The "&" in "&nbsp;" should
            be escaped because the author intends the tag's contents after parsing to be
            ampersand n b s p semicolon. The only way to have "&nbsp;" as PCDATA and still
            have the intended HTML after parsing would be to define the XML entity
            "&nbsp;" as expanding to "&amp;nbsp;". I wouldn't call it double encoding just
            because XML and HTML's character entities use the same syntax.
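The difference between the HTML entity and what an XML parser accepts can be checked directly. The sketch below uses Python's standard `xml.etree.ElementTree`. The sample documents and the helper `parse_text` are made up for illustration, and the DOCTYPE variant declares `nbsp` as the no-break space character, which is one possible declaration rather than the one discussed above.

```python
import xml.etree.ElementTree as ET

def parse_text(xml_text):
    """Return the element text, or None if the document is not well-formed."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None

# "nbsp" is an HTML entity; XML predefines only lt, gt, amp, apos and quot.
bare = "<description>a&nbsp;b</description>"
# Escaping the ampersand delivers the literal characters &nbsp; to the consumer.
escaped = "<description>a&amp;nbsp;b</description>"
# A numeric character reference needs no declaration at all.
numeric = "<description>a&#160;b</description>"
# Declaring the entity in an internal DTD subset also makes the bare form parse.
declared = (
    '<!DOCTYPE description [<!ENTITY nbsp "&#160;">]>'
    "<description>a&nbsp;b</description>"
)

assert parse_text(bare) is None
assert parse_text(escaped) == "a&nbsp;b"
assert parse_text(numeric) == "a\u00a0b"
assert parse_text(declared) == "a\u00a0b"
```

So an undeclared "&nbsp;" is simply not well-formed XML, while "&amp;nbsp;" parses and leaves HTML entity syntax for a later HTML renderer to resolve.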


            > As for wrapping in CDATA, there's no need to
            > entity encode at all. It'd be doubly stupid to
            > encode AND wrap in blocks.

            I agree. I thought this might be unclear after I reread my message (and read
            your similar post to rss-dev), so just to clarify: both in my previous message
            and above, by "CDATA" I mean PCDATA that's been parsed, not necessarily text
            inside a CDATA section as defined in section 2.7 of the XML spec. I thought
            "CDATA" was the name for that, as DTDs use the term as the appropriate
            alternative to PCDATA. Is there a different name I should be using instead?



            --
            Mark Paschal
            http://markpasc.org/blog/
            markpasc@...
          • Bill Kearney
            Message 5 of 5 , Oct 9, 2002
              ok, now we're both completely confused. <grin/>

              The issue of /properly/ entity encoding in XML is a whole other can of worms.
              It's /proper/ to declare use of HTML entities in the doctype. That's a whole
              other headache. I'm not specifically addressing that mess here.

              I'm simply stating that to put &nbsp; is wrong. To do &amp;nbsp; is
              even worse. To mangle the UTF-encoded ones is even more heinous. Radio quite
              merrily commits all three of these sins.
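A feed consumer could spot this kind of runaway encoding with a rough heuristic like the one below. It is only a sketch: `encoding_depth` is a hypothetical helper, it relies on Python's `html.unescape`, and it will also flag text that deliberately shows markup as text, which is the one legitimate case for an extra layer.

```python
import html

def encoding_depth(text, limit=10):
    """Count the unescape passes needed before the text stops changing."""
    depth = 0
    while depth < limit:
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
        depth += 1
    return depth

# Already-parsed element text with zero, one, two and three layers left in it.
assert encoding_depth("a b") == 0
assert encoding_depth("a&nbsp;b") == 1
assert encoding_depth("a&amp;nbsp;b") == 2
assert encoding_depth("a&amp;amp;nbsp;b") == 3
```

Applied to the text an XML parser has already decoded, a depth above one suggests the producer escaped the content more often than the format calls for.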

              Encoding of entities isn't all that hard once you grasp it. It's the sort of
              grunt work that many programmers never seem to get around to doing PROPERLY.
              This is indicative of many things inside Radio. But in this case that laziness
              makes it harder for many other programs to decipher Radio's gibberish.

              -Bill Kearney
