encoding / decoding RSS

  • Chris Nandor
    Message 1 of 1 , Feb 1, 2001
      OK, first, my problem: taking data from a Slash database and encoding it so
      it will be 1. legal XML, and 2. easily decoded back into something
      resembling the original data. This is compounded by the problem that some
      people may put illegal data into the database to begin with (e.g., a lone
      "&" should not ever be in a title or description).
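
      To make the lone-ampersand case concrete, here is a minimal sketch in
      core Perl. The sub name fix_lone_amps and the sample strings are made up
      for illustration, and the pattern only checks that something is shaped
      like an entity (&name; or &#123;), not that the name is actually defined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape any "&" that does not start something
# shaped like an entity (&name; or &#123; or &#x3c;).
sub fix_lone_amps {
    my ($text) = @_;
    $text =~ s/&(?!#?[a-zA-Z0-9]+;)/&amp;/g;
    return $text;
}

print fix_lone_amps('AT&T'), "\n";                # AT&amp;T
print fix_lone_amps('R&eacute;sum&#233;'), "\n";  # unchanged
print fix_lone_amps('a && b'), "\n";              # a &amp;&amp; b
```

      Existing entities pass through untouched, so running the fix twice gives
      the same result as running it once.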

      Then, we want to decode the data reasonably.

      All of the above also assumes that we are encoding from and decoding to
      HTML. If a user of our RSS file wants to then run something like
      HTML::Entities::decode_entities() on the result, they can get a non-HTML
      version of it.

      The short of it is the program below. It will take some data, encode it
      for inclusion in an RSS file, then decode it to see what it would be on
      output. For example:

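      Original:
      <em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> && a
      &#x3c;R&eacute;sumé!&#x3E;</em>

      Encoded:
      &lt;em&gt;I've &amp;quot;a&amp;quot; &lt;a
      href="bio.html"&gt;"Bio"&lt;/a&gt; &amp;amp;&amp;amp; a
      &amp;#x3c;R&amp;eacute;sum&#233;!&amp;#x3E;&lt;/em&gt;

      Decoded:
      <em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> &amp;&amp; a
      &#x3c;R&eacute;sum&#233;!&#x3E;</em>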

      Note that in the original, we have a character (e with an acute accent)
      that we want to have encoded. We want to preserve the < and >, but we
      don't want the &#x3c; to become <, or the &#x3E; to become >.
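
      The order of the decoding steps is what keeps an already-encoded bracket
      from turning into a real one. The sketch below, in core Perl, decodes the
      same string twice: once with the ampersand step last, once with it first.
      The helper names decode_brackets and decode_amp and the sample string are
      made up for illustration; the patterns are hand-written here rather than
      generated in a loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Encoded form of the fragment "<b>&#x3c;</b>": the tag brackets were
# literal, the inner "<" was already an entity in the source data.
my $encoded = '&lt;b&gt;&amp;#x3c;&lt;/b&gt;';

# Hypothetical helper: undo &lt; and &gt; (named, decimal, or hex form).
sub decode_brackets {
    my ($t) = @_;
    $t =~ s/&(?:lt|#60|#[xX]3[cC]);/</g;
    $t =~ s/&(?:gt|#62|#[xX]3[eE]);/>/g;
    return $t;
}

# Hypothetical helper: undo &amp; only where it shields an entity.
sub decode_amp {
    my ($t) = @_;
    $t =~ s/&(?:amp|#38|#[xX]26);(?=#?[a-zA-Z0-9]+;)/&/g;
    return $t;
}

my $amp_last  = decode_amp(decode_brackets($encoded));
my $amp_first = decode_brackets(decode_amp($encoded));

print "amp last:  $amp_last\n";    # <b>&#x3c;</b>
print "amp first: $amp_first\n";   # <b><</b>
```

      With the ampersand step first, the shielded &#x3c; is exposed early and
      then swallowed by the bracket step, which is exactly the result we are
      trying to avoid.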

      Anyway, if you can, please follow the code and let me know any problems you
      have with our methods here. I realize I might not be very clear; it's been
      a long day. Let me know if I can clarify anything for you.

      Thanks,

      --Chris


      #!/usr/bin/perl -wl

      use strict;
      use XML::RSS; # includes XML::Parser::Expat

      my $text = <<EOT;
      <em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> && a
      &#x3c;R&eacute;sumé!&#x3E;</em>
      EOT

      sub encode_text {
          my($text) = @_;

          # if there is an & that is not part of an entity, convert it
          # to &amp;
          $text =~ s/&(?!#?[a-zA-Z0-9]+;)/&amp;/g;

          # convert & < > to XML entities
          $text = XML::Parser::Expat->xml_escape($text, ">");

          # convert non-ASCII and non-printable characters to numeric entities
          $text =~ s/([^\s\040-\176])/ "&#" . ord($1) . ";" /ge;

          return $text;
      }

      {
          # for all following chars but &, convert entities back to
          # the actual character

          # for &, convert &amp; back to &, but only if it is the
          # beginning of an entity (like "&amp;#32;")

          # precompile these so we only do it once

          my %e = qw(< lt > gt " quot ' apos & amp);
          for my $chr (keys %e) {
              my $word = $e{$chr};
              my $ord = ord $chr;
              my $hex = sprintf "%x", $ord;
              $hex =~ s/([a-f])/[$1\U$1]/g;
              my $regex = qq/&(?:$word|#$ord|#[xX]$hex);/;
              $regex .= qq/(?=#?[a-zA-Z0-9]+;)/ if $chr eq "&";
              $e{$chr} = qr/$regex/;
          }

          sub decode_text {
              my($text) = @_;

              # do & only _after_ the others
              for my $chr ( (grep !/^&$/, keys %e), "&") {
                  $text =~ s/$e{$chr}/$chr/g;
              }

              return $text;
          }
      }

      print $text;
      print $text = encode_text($text);
      print $text = decode_text($text);


      __END__

      --
      Chris Nandor pudge@... http://pudge.net/
      Open Source Development Network pudge@... http://osdn.com/