syndication and i18n

  • Mark Nottingham
    Message 1 of 8 , May 21, 2001
      So, I've been spending spare moments here and there putting together
      an aggregator for some time, in Python. I've never written an
      internationalized app before, and, wanting to do The Right Thing, I
      thought I'd give it a try, especially seeing as how Python 2.x
      supports Unicode.

      I may have bitten off more than I can chew.

      It seems that the permutations of:
      - source XML charset declaration,
      - actual character content of the XML, and
      - browser's desired charset
      are overwhelming.

      Many feeds occasionally have characters that pop through unescaped,
      such as single-quotes from Windows, etc.

      Currently, my strategy is to .encode('utf-8') EVERYTHING that comes
      in, and write that out (if you mix encodings in certain ways, Python
      doesn't like it). This works, but it doesn't seem too friendly to
      double-byte feeds or users, who I assume would be out of luck.

      Questions:
      - Should I emit 'utf-8' in the appropriate HTTP headers to make
      browsers do the right thing?

      - In Python, are there ways to:
        - determine what encoding an XML document uses (from SAX)
        - determine what encoding an arbitrary string is in

      - Does the above strategy doom double-byte users?

      - How does one deal with creating an HTML page from XML feeds which
      have potentially radically different charsets (e.g., ASCII and
      double-byte Chinese on the same page)?

      - Does anybody know of some Cantonese RSS feeds for testing? ;)

      - How does one catch and deal with illegal characters in the XML
      source (SAX2)?
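
On the two "determine what encoding" questions, here is a minimal sketch, assuming Python 3 (where the Unicode type is `str`) rather than the Python 2.x discussed in this thread. The helper names `declared_encoding` and `first_decodable` are hypothetical, not part of any library. The first reads the encoding named in the XML declaration of an ASCII-compatible document; it does not handle UTF-16. The second only reports the first candidate codec that decodes without error, which is a guess and not proof of the real encoding.

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data, default="utf-8"):
    """Return the encoding an XML byte string declares (lowercased).

    Falls back to `default` when there is no declaration; XML itself
    defaults to UTF-8 (or UTF-16 with a byte order mark).
    """
    if data.startswith(b"\xef\xbb\xbf"):  # UTF-8 byte order mark
        return "utf-8"
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else default


def first_decodable(data, candidates=("utf-8", "shift_jis", "iso-8859-1")):
    """Return (codec, text) for the first candidate that decodes cleanly.

    ISO-8859-1 accepts every byte sequence, so it must come last.
    """
    for codec in candidates:
        try:
            return codec, data.decode(codec)
        except UnicodeDecodeError:
            continue
    return None, None
```

Once the bytes are decoded, the resulting string no longer has a source encoding at all; the encoding only matters again when the text is written out.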

      Regards,

      --
      Mark Nottingham
      http://www.mnot.net/
    • James Carlyle
      Message 2 of 8 , May 22, 2001
        Mark

        > - Does anybody know of some Cantonese RSS feeds for testing? ;)

        Don't know of any Cantonese RSS, but here are some Japanese ones in
        different encodings:

        http://www.hyuki.com/tf/tf.xml
        <?xml version="1.0" encoding="Shift_JIS"?>

        Here are a number of feeds:
        http://www.nsg.co.jp/nbb/ss/joy/memordf.html

        e.g. http://www.nsg.co.jp/nbb/ss/joy/idxxml.rss0.9.xml
        <?xml version="1.0" encoding="UTF-8" standalone="yes"?>


        Regards,

        James Carlyle
        http://xmltree.com
      • Aaron Swartz
        Message 3 of 8 , May 22, 2001
          Mark Nottingham <mnot@...> wrote:

          > - should I emit 'utf-8' in the appropriate HTTP headers to make
          > browsers do the right thing?

          Why wouldn't you? I'd also stick it in the <?xml?> declaration.

          > - How does one deal with creating an HTML page from XML feeds which
          > have potentially radically different charsets (i.e., ASCII and
          > double-byte chinese on the same page)?

          Pick a superset of all the encodings (UCS-2?).

          --
          [ Aaron Swartz | me@... | http://www.aaronsw.com ]
        • hpyle@agora.co.uk
          Message 4 of 8 , May 22, 2001
            > > - How does one deal with creating an HTML page from XML feeds which
            > >  have potentially radically different charsets (i.e., ASCII and
            > >  double-byte chinese on the same page)?
            >

            > Pick a superset of all the encodings (UCS-2?).

            UTF-8 will handle everything, mixed languages and such.  That's one of the reasons all XML parsers have to handle UTF-8 properly.

            Since HTML doesn't have XML's <?xml?> declaration, I think you probably have to say it's UTF-8 in the headers.  (Is that right?)
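
The "one page, many source charsets" case can be sketched as follows, assuming Python 3 and assuming each item arrives as raw bytes together with the encoding its feed declared. `render_page` is a hypothetical helper, not an existing API: it decodes every item to Unicode first, then encodes the whole page once as UTF-8 and labels it in both the HTTP header and the markup.

```python
from html import escape


def render_page(titles):
    """Build one UTF-8 HTML page from (raw_bytes, source_encoding) pairs."""
    items = "".join(
        "<li>%s</li>" % escape(raw.decode(encoding))
        for raw, encoding in titles
    )
    page = (
        '<html><head><meta charset="utf-8"><title>Feeds</title></head>'
        "<body><ul>%s</ul></body></html>" % items
    )
    # The header is the authoritative label; the meta tag is a fallback
    # for pages saved to disk or served without a charset.
    headers = {"Content-Type": "text/html; charset=utf-8"}
    return headers, page.encode("utf-8")


headers, body = render_page([
    ("\u65e5\u672c\u8a9e".encode("shift_jis"), "shift_jis"),
    ("Philippe Le H\u00e9garet".encode("iso-8859-1"), "iso-8859-1"),
])
```

Because every item passes through Unicode, Japanese and Western European text can share the page without either feed's original encoding leaking into the output.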

            My take: use a decent XML parser and you'll have all the parse-side encoding issues completely handled for you, and your Python code will just see Unicode.  It might mean you end up with a stricter aggregator than some (eg. you won't be able to accept <item>stuff<img src="something"></item> because it's badly formed), but IMHO that's not a bad thing.
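
A minimal sketch of that stricter approach, assuming Python 3 and the standard library's `xml.sax` module (the thread itself uses PyXML). `TitleCollector` and `parse_feed` are hypothetical names. Feeds that are not well-formed, including ones containing bytes that are invalid in the declared encoding, are reported as errors instead of being partially accepted.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TitleCollector(ContentHandler):
    """Collect the text of every <title> element as Unicode strings."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


def parse_feed(data):
    """Return (titles, None) on success or (None, error_message) on failure."""
    handler = TitleCollector()
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException as exc:
        return None, "line %d: %s" % (exc.getLineNumber(), exc.getMessage())
    return handler.titles, None
```

An aggregator built this way could log the error message and skip the feed until its publisher fixes it.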


            -Hugh

            hpyle@...       | +44 (0)20 8783 3592
            http://www.agora.co.uk/ | http://groovelog.agora.co.uk/  | http://rendezvoo.net/

          • Mark Nottingham
            Message 5 of 8 , May 22, 2001
              On Tue, May 22, 2001 at 05:01:31PM +0100, hpyle@... wrote:
              >
              > My take: use a decent XML parser and you'll have all the parse-side
              > encoding issues completely handled for you, and your Python code will just
              > see Unicode. It might mean you end up with a stricter aggregator than
              > some (eg. you won't be able to accept <item>stuff<img
              > src="something"></item> because it's badly formed), but IMHO that's not a
              > bad thing.

              That's what I'm already doing; unfortunately, it's not that easy in
              practice, because unicode handling (in Python, at least) isn't that
              transparent. For example, there are non-ASCII characters in both the
              Standard and the W3C's RSS feeds right now, which cause Python to
              raise an error unless I .encode('utf-8') them into strings.
              Parse-side isn't a problem; it's doing something with the output that
              is.

              For those interested in the minute details...

              In the W3C feed, the source HTML (the home page) is charset=us-ascii,
              and the offending bit of markup is encoded:
              Philippe Le Hégaret
              which renders fine in Mozilla.

              In the XML RSS file, the XML has an encoding of 'utf-8', and the
              offending markup is:
              Philippe Le Hégaret

              So, PyXML will spit this out as unicode. If I try to print that to
              anything, or combine it with other strings in certain ways, I get
              UnicodeError: ASCII encoding error: ordinal not in range(128)
              unless I .encode('utf-8') it, in which case I get something that
              prints in ascii as
              Philippe Le Hégaraet

              which seems to render correctly, as long as I set the charset to
              utf-8. Fine.
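
The behaviour described above can be reproduced in a small sketch, assuming Python 3, where the implicit ASCII conversion of Python 2 is gone and the error surfaces only when an ASCII encode is requested explicitly. The variable names are illustrative. Wrapping the byte stream in a writer with a stated encoding keeps the program working in Unicode and confines the `encode` step to the output boundary.

```python
import io

name = "Philippe Le H\u00e9garet"  # Unicode text as a parser would return it

# Encoding to ASCII fails because U+00E9 is outside the 7-bit range.
try:
    name.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# A text wrapper with an explicit encoding converts on the way out.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8")
out.write(name)
out.flush()
utf8_bytes = raw.getvalue()
```

Here `utf8_bytes` holds the two-byte UTF-8 form of the accented letter, which is why the text looks garbled when the bytes are viewed as single-byte characters.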

              The Standard's feed has encoding="ISO-8859-1". The offending markup
              is
              Net 21 <96> The Survivors
              which, as a Python unicode string, looks like
              u'Net 21 \x96 The Survivors'

              If I .encode('utf-8') it, I get
              'Net 21 \xc2\x96 The Survivors'
              which doesn't look correct at all (it's supposed to be an en
              dash) when rendered in Mozilla with utf-8. If I change the charset to
              8859-1, the original renders correctly, but the UTF-8-encoded
              string does not (it has an extra character prepended, understandably).

              I think the root of the problem is that I have no apparent way to
              determine the encoding of a unicode string coming out of the XML
              parser, or a way to consolidate several different encodings into one
              document (although I thought this was what unicode was supposed to
              enable).
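
One likely explanation for the 0x96 case, offered as an assumption about that feed and not a confirmed diagnosis: the document is labelled ISO-8859-1 but was written in Windows-1252. In ISO-8859-1 the byte 0x96 is a C1 control character (U+0096); in Windows-1252 it is the en dash (U+2013). The sketch below, assuming Python 3, undoes the mislabelling after parsing. `fix_mislabeled_latin1` is a hypothetical helper.

```python
def fix_mislabeled_latin1(text):
    """Reinterpret C1 control characters as Windows-1252 punctuation.

    Returns the text unchanged when it has no characters in U+0080..U+009F
    or when it cannot be round-tripped through the two codecs.
    """
    if not any(0x80 <= ord(ch) <= 0x9F for ch in text):
        return text
    try:
        # ISO-8859-1 maps code points 0-255 straight back to the same bytes.
        return text.encode("latin-1").decode("cp1252")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


parsed = "Net 21 \x96 The Survivors"  # what the parser returned
repaired = fix_mislabeled_latin1(parsed)
```

After the repair, encoding to UTF-8 yields the three-byte sequence for the en dash instead of `\xc2\x96`, and the page renders as intended under a utf-8 charset.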

              I should probably take this to the Python XML group...

              --
              Mark Nottingham
              http://www.mnot.net/
            • dave.cantrell@gunter.af.mil
              Message 6 of 8 , May 22, 2001
                >In the W3C feed(...)

                Just out of curiosity, where is the W3C RSS feed? I searched their site with
                no luck. Is it a live feed or an example?

                ________________________________________________________
                SSgt Dave Cantrell, USAF
                Web Developer, Logistics Information Systems
                [DSN] 596.6277 [COM] 334.416.6277
                dave.cantrell@...
                https://web2.ssg.gunter.af.mil/IL (.mil/.gov only)
                --------------------------------------------------------
                We have the enemy surrounded. We are dug in and
                have overwhelming numbers. But enemy airpower is
                mauling us badly. We will have to withdraw.
                -- A Japanese infantry commander's
                situation report to HQ
                Burma, WWII
                --------------------------------------------------------
                This e-mail does not constitute endorsement of any
                product by the U.S. Air Force, nor can it be used to
                obligate the U.S. Air Force in any legal, financial,
                or contractual arrangement.
              • Rael Dornfest
                Message 7 of 8 , May 22, 2001
                  Howdy,

                  It's at:

                  http://www.w3.org/2000/08/w3c-synd/home.rss

                  but it seems to be horribly broken at this moment, spewing Java exceptions.

                  Rael

                • Jeff Barr
                  Message 8 of 8 , May 22, 2001
                    It is working now.

                    Jeff;
