HTML Encoding BooHoo...

  • Morbus Iff
    Message 1 of 7 , May 23, 2001
      Ok. I'd like some direction from the community:

      Is there any *singular* way (ie. across all versions of
      RSS/scriptingNews) that one should accept/encode HTML?

      This comes because of the following which happened today:

      In "The XML Cover Pages" RSS feed, there's a &lt;XML&gt;
      bit in the title of one of his items. No big deal, at
      face value, that's the correct way to do it.

      In some other places, however, where HTML is valid, you have things like:

      Hey! &lt;b&gt;This is neat!&lt;/b&gt;

      Now, in AmphetaDesk, if I see code like the "This is neat" above, I revert
      the entities to the HTML state, so that they'll be properly displayed to
      the browser:

      Hey! <b>This is neat!</b>

      At this point in time, I *do not* want to get into a discussion of the
      morality of HTML in RSS feeds and how the world is going to end. My point
      of view is that I flipping hate HTML in RSS feeds, but I've got to cope,
      just like everyone else. Moving on...

      The same "reversion of entities" affects the &lt;XML&gt; as well, making it
      <XML> as the final code to be sent for browser display. This, as some can
      guess, causes problems. Specifically, in IE 6b for Windows, it stops the
      browser display cold - IE thinks an XML document is on its way.

      Now, the "reversion of entities" code in my RSS reader doesn't know about
      HTML - it just blindly reverts &lt; to < and so forth. Is the only solution
      to my problem to make the code understand all the possible HTML entities?
      Or is there something else?

      Blah. Thanks for listening.


      Morbus Iff
      .sig on other machine.
      http://www.disobey.com/
      http://www.gamegrene.com/
    • Julian Bond
      Message 2 of 7 , May 23, 2001
        In article <5.1.0.14.2.20010523113347.00a82050@...>,
        >The same "reversion of entities" affects the &lt;XML&gt; as well, making it
        ><XML> as the final code to be sent for browser display. This, as some can
        >guess, causes problems. Specifically, in IE 6b for Windows, it stops the
        >browser display cold - IE thinks an XML document is on its way.
        >
        >Now, the "reversion of entities" code in my RSS reader doesn't know about
        >HTML - it just blindly reverts &lt; to < and so forth. Is the only solution
        >to my problem to make the code understand all the possible HTML entities?
        >Or is there something else?

        There's a fair bit of code around that removes all tags except a subset
        of "Allowable html". PHP even has this as a function built into the
        scripting language.

        I think the correct way to deal with this is for feed producers to:-
        - Only include html in <description>, not <title>
        - Escape all reserved characters in <description>
        For people who turn feeds into displayable code
        - Unescape all escaped reserved characters
        - Trim the tags to a sub set that you feel comfortable with for your
        display purposes.
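The consumer-side steps above (unescape the reserved characters, then trim the tags to a subset) could be sketched roughly as follows. This is only an illustration in Python using the standard library; the `ALLOWED` set, the `TagTrimmer` class, and `trim_tags` are hypothetical names, and dropping every attribute is an assumption made to keep the sketch short.

```python
import html
from html.parser import HTMLParser

# Hypothetical allowlist; pick whatever subset you are comfortable displaying.
ALLOWED = {"b", "i", "em", "strong", "p", "br"}


class TagTrimmer(HTMLParser):
    """Re-emit text and allowlisted tags; silently drop every other tag."""

    def __init__(self):
        # Keep entity references intact instead of decoding them a second time.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(f"<{tag}>")  # attributes are dropped (assumption)

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(f"<{tag}/>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def trim_tags(escaped_description: str) -> str:
    """Unescape one level of entities, then keep only allowlisted tags."""
    parser = TagTrimmer()
    parser.feed(html.unescape(escaped_description))
    parser.close()
    return "".join(parser.out)
```

With this sketch, an escaped `&lt;blink&gt;` loses its tags while an escaped `&lt;b&gt;` survives as real markup. Text that was escaped twice in the feed stays escaped once, so it displays literally.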

        And that's it. There's some strange problem with "&" or is that "&amp;" or
        "&amp;amp;" which I'll ignore for the moment.

        It's a SMOP. (Simple Matter Of Programming)

        --
        Julian Bond eMail: julian@...
        HomeURL: http://www.shockwav.demon.co.uk/
        WorkURL: http://www.netmarketseurope.com/
        WebLog: http://roguemoon.manilasites.com/
        M: +44 (0)77 5907 2173 T: +44 (0)20 7420 4363
        ICQ:33679668 tag:So many words, so little time
      • Morbus Iff
        Message 3 of 7 , May 23, 2001
          >>Now, the "reversion of entities" code in my RSS reader doesn't know about
          >>HTML - it just blindly reverts &lt; to < and so forth. Is the only solution
          >>to my problem to make the code understand all the possible HTML entities?
          >>Or is there something else?
          >
          >There's a fair bit of code around that removes all tags except a subset
          >of "Allowable html". PHP even has this as a function built into the
          >scripting language.

          Yes, but that wouldn't solve my above problem (** and see earlier message).
          In this case, &lt;XML&gt; wasn't a tag, it was part of the actual <title>.
          Removing all HTML tags wouldn't affect the <XML>, cos that's not a valid
          HTML tag anyways... Right now, my reader:

          - loads in an XML file.
          - converts any encoded &lt;/&gt;'s to </> (to cover encoded HTML).
          this is a mass replacement, which causes the above problem.

          Ultimately, I don't want to remove tags (that's not a decision I'm willing
          to make for the users of my program, but it will be an option that they can
          choose from).

          In this case, it's not even an issue of allowable tags or not - it's an
          issue of preparing for people correctly encoding HTML (&lt;b&gt;) and not
          encoding HTML (<b>).

          ** I eventually tracked the culprit to nothing in my code, but rather the
          XML::Simple perl module, which seems to magick &lt;XML&gt; into <XML> all
          by itself. I'm still investigating, but seeing the file encoded, and then
          loading it through XML::Simple and Data::Dump[ing] it shows that it's
          autoconverted. Why, I'm not sure...


          Morbus Iff
          .sig on other machine.
          http://www.disobey.com/
          http://www.gamegrene.com/
        • Dan Lyke
          Message 4 of 7 , May 23, 2001
            Morbus Iff writes:
            > At this point in time, I *do not* want to get into a discussion of the
            > morality of HTML in RSS feeds and how the world is going to end. My point
            > of view is that I flipping hate HTML in RSS feeds, but I've got to cope,
            > just like everyone else.

            I'd treat it just like untrusted user-entered text: Do the entity
            encoding, then run through and find all unknown tags or tags with
            unacceptable attributes and convert them back to &lt;xml&gt; like
            escapes (HTML4 strict is a good list of things that won't let 'em
            screw up your display too badly), do the clean-up of the unclosed tags.

            You should probably also be looking out for common HTML coding
            mistakes and handling them.
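One way to read the approach above is "escape everything first, then restore only the tags you recognise". A minimal Python sketch under that reading follows. The `ALLOWED_TAGS` set and the `sanitize` function are hypothetical, tags carrying any attribute are simply left escaped, and the clean-up of unclosed tags is not attempted here.

```python
import html
import re

# Hypothetical allowlist of attribute-free tags that may pass through.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br"}

# Matches an escaped tag without attributes, e.g. "&lt;b&gt;" or "&lt;/b&gt;".
ESCAPED_TAG = re.compile(r"&lt;(/?)([A-Za-z][A-Za-z0-9]*)\s*(/?)&gt;")


def sanitize(text: str) -> str:
    """Entity-encode all markup, then un-escape only allowlisted tags."""
    escaped = html.escape(text, quote=False)

    def restore(match: re.Match) -> str:
        closing, name, self_closing = match.groups()
        if name.lower() in ALLOWED_TAGS:
            return f"<{closing}{name}{self_closing}>"
        return match.group(0)  # unknown tag: leave it escaped

    return ESCAPED_TAG.sub(restore, escaped)
```

Because unknown tags stay escaped rather than being removed, something like `<XML>` ends up displayed as literal text instead of being handed to the browser as markup.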

            I've got Perl code to do this for my user comments if you want to
            steal from it.

            Dan
          • Mark Nottingham
            Message 5 of 7 , May 23, 2001
              On Wed, May 23, 2001 at 12:57:37PM -0400, Morbus Iff wrote:
              > ** I eventually tracked the culprit to nothing in my code, but rather the
              > XML::Simple perl module, which seems to magick &lt;XML&gt; into <XML> all
              > by itself. I'm still investigating, but seeing the file encoded, and then
              > loading it through XML::Simple and Data::Dump[ing] it shows that it's
              > autoconverted. Why, I'm not sure...

              It does that because that's what it's supposed to do; XML processors
              must resolve entities automagically. If they want "<XML>" to be
              rendered by the final browser, it should be encoded in the RSS feed
              as:

              &amp;lt;XML&amp;gt;
              so that it will come out of the XML parser as:
              &lt;XML&gt;
              which will be rendered by the HTML parser as:
              <XML>

              Practically, the best thing to do is probably scan and allow a
              pre-determined subset of HTML, and entity-encode everything else (as
              is suggested by Dan).
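The layering described above can be checked with a few lines of Python. This is just an illustrative sketch using the standard library's `html` and `xml.etree.ElementTree` modules, not the Perl XML::Simple code discussed in the thread. The variable names and the tiny one-item feed are made up for the example.

```python
import html
import xml.etree.ElementTree as ET

literal_text = "<XML>"                 # what the reader should finally see
for_html = html.escape(literal_text)   # one level: safe to embed in HTML
for_feed = html.escape(for_html)       # two levels: safe to embed in the feed

# A made-up one-item feed fragment carrying the doubly escaped title.
feed = f"<item><title>{for_feed}</title></item>"

# The XML parser resolves exactly one level of entities.
parsed_title = ET.fromstring(feed).find("title").text

# The browser's HTML parser would resolve the remaining level on display.
displayed = html.unescape(parsed_title)
```

Each parser in the chain removes one layer of escaping, so text meant to be shown literally needs one layer per parser it passes through.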

              Cheers,



              --
              Mark Nottingham
              http://www.mnot.net/
            • dave.cantrell@gunter.af.mil
              Message 6 of 7 , May 23, 2001
                >It does that because that's what it's supposed to do; XML processors
                >must resolve entities automagically. If they want "<XML>" to be
                >rendered by the final browser, it should be encoded in the RSS feed
                >as:
                >
                > &amp;lt;XML&amp;gt;
                >so that it will come out of the XML parser as:
                > &lt;XML&gt;
                >which will be rendered by the HTML parser as:
                > <XML>


                Maybe I'm missing something here, but...

                Any reason why you can't scan the feed as plain text before passing it to
                the perl module, and replace all instances of &lt;XML&gt; with
                &amp;lt;XML&amp;gt; etc? Variations can be done ad nauseam to get the
                effect you need.
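As a rough illustration of that kind of pre-scan, here is a Python sketch. It assumes the idea is to add one extra level of escaping to a known troublemaker before the XML parser sees the feed. The function name `protect_escaped_xml_tag` and the sample feed are hypothetical, and Python's `xml.etree.ElementTree` stands in for the Perl module.

```python
import xml.etree.ElementTree as ET


def protect_escaped_xml_tag(raw_feed: str) -> str:
    """Add one extra level of escaping to the escaped <XML> token only."""
    return raw_feed.replace("&lt;XML&gt;", "&amp;lt;XML&amp;gt;")


# Hypothetical feed fragment, treated as plain text before parsing.
raw_feed = "<item><title>Notes on &lt;XML&gt;</title></item>"

# After parsing, the title still carries one level of escaping,
# so a browser would show the text rather than treat it as a tag.
title = ET.fromstring(protect_escaped_xml_tag(raw_feed)).find("title").text
```

This only covers the one token it is told about, which fits the "simplest thing that could possibly work" spirit but does nothing for other escaped non-HTML tags.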

                Seems everything is working for you except this one problem -- I'd focus on
                working around it before creating my own parser to handle HTML in the RSS
                feed. Then again, I don't really like the idea of HTML in the feed to begin
                with...

                Ref: http://c2.com/cgi/wiki?DoTheSimplestThingThatCouldPossiblyWork

                ;)

                ________________________________________________________
                SSgt Dave Cantrell, USAF
                Web Developer, Logistics Information Systems
                [DSN] 596.6277 [COM] 334.416.6277
                dave.cantrell@...
                https://web2.ssg.gunter.af.mil/IL (.mil/.gov only)
                --------------------------------------------------------
                We have the enemy surrounded. We are dug in and
                have overwhelming numbers. But enemy airpower is
                mauling us badly. We will have to withdraw.
                -- A Japanese infantry commander's
                situation report to HQ
                Burma, WWII
                --------------------------------------------------------
                This e-mail does not constitute endorsement of any
                product by the U.S. Air Force, nor can it be used to
                obligate the U.S. Air Force in any legal, financial,
                or contractual arrangement.
              • Morbus Iff
                Message 7 of 7 , May 23, 2001
                  >Maybe I'm missing something here, but...

                  No, not really :) ...

                  >Any reason why you can't scan the feed as plain text before passing it to
                  >the perl module, and replace all instances of &lt;XML&gt; with
                  >&amp;lt;XML&amp;gt; etc? Variations can be done ad nauseam to get the
                  >effect you need.

                  That is ultimately what I planned to do, but that's a few versions away - I
                  was hoping for a "quick fix cos in the last four months, this is the first
                  time it's happened" sort of thing. But, in the future, there'd be three
                  choices of browser display: "show as intended", "show html
                  entities/encoded" and "strip html"... To do something like that, I'd have
                  to preparse it before handing it off to the XML parser. That's my plan, at
                  least.

                  I guess, at this point, the quick fix is to handle the &lt;XML&gt; thing only -
                  which (almost) in my head seems too quick of a fix <g>. But off the top of
                  my head, I can't think of anything else that would cause the same sort of
                  show-stopping problem that is currently happening with IE ("oop! hey! an
                  XML document! stop showing HTML and start parsing XML!")...

                  >Seems everything is working for you except this one problem -- I'd focus on

                  It seems that way - as mentioned, this is the first time the parser ran
                  across any show stopping error (another minor error includes problems when
                  an unencoded HTML tag is up against an XML tag, like "<title><b>This is
                  a</b> title</title>" - this isn't really an error as opposed to "too
                  difficult to care about" and it only seems to happen with The Register's
                  feed every so often)...


                  Morbus Iff
                  .sig on other machine.
                  http://www.disobey.com/
                  http://www.gamegrene.com/