Invalid XML characters in bills

  • Francis
    Message 1 of 2 , Apr 27, 2012
      At least one bill's XML has invalid characters:
      http://www.govtrack.us/data/us/112/bills/h4517.xml

      The cause is that the source on THOMAS (maybe ultimately at CRS) contains invalid characters. So far the only one I've seen is the SUB (0x1A) control character, which XML 1.0 does not allow.

      Other not-yet-scraped bills have this issue too (these are just the ones I have identified):
      http://hdl.loc.gov/loc.uscongress/legislation.112hr4715
      http://hdl.loc.gov/loc.uscongress/legislation.112hr4716

      I discovered this issue through the scraper at washingtonwatch.

      At the very least the scraper needs to strip out invalid characters, but ultimately the upstream source of the bad characters needs to be fixed as well. Does anyone know who to contact for that?
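
      For the stripping step, here is a minimal Python sketch of what a scraper could do before writing or parsing the XML. It is not the GovTrack or washingtonwatch code. The function names and the sample document are made up for illustration, and it assumes the scraped text has already been decoded to a string.

```python
import re
import xml.etree.ElementTree as ET

# Characters that XML 1.0 does not allow: C0 controls other than
# tab (0x09), LF (0x0A) and CR (0x0D), plus surrogates, U+FFFE and U+FFFF.
INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, hex code) pairs for each disallowed character."""
    return [(m.start(), hex(ord(m.group())))
            for m in INVALID_XML_CHARS.finditer(text)]


def strip_invalid_xml_chars(text):
    """Return text with every disallowed character removed."""
    return INVALID_XML_CHARS.sub("", text)


# Hypothetical sample: a title containing the SUB (0x1A) character.
sample = "<bill><title>Sample\x1a title</title></bill>"

try:
    ET.fromstring(sample)
    parsed_raw = True
except ET.ParseError:
    # The parser rejects the raw text because of the control character.
    parsed_raw = False

cleaned = strip_invalid_xml_chars(sample)
title = ET.fromstring(cleaned).find("title").text
```

      Calling find_invalid_xml_chars first and logging the result would also give a list of affected bills to report upstream. Dropping the character silently is only a stopgap, since whatever it replaced in the source text is still lost.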
    • Josh Tauberer
      Message 2 of 2 , Apr 28, 2012
        Thanks, Francis.

        I corrected the XML file and the scraper yesterday afternoon, so
        hopefully the data is now good.

        I let my LOC contact know about HR4517, and I'll also pass on the
        other two. Thanks for those links.

        - Josh Tauberer (@JoshData)

        http://razor.occams.info

        On 04/27/2012 12:19 PM, Francis wrote:
        > At least one bill's xml has invalid characters:
        > http://www.govtrack.us/data/us/112/bills/h4517.xml
        >
        > The reason for this is that the source on thomas (maybe ultimately at CRS) has invalid characters. So far what I've been seeing is the SUB (0x1A) control character.
        >
        > Other not-yet-scraped bills have this issue too (these are just the ones I have identified):
        > http://hdl.loc.gov/loc.uscongress/legislation.112hr4715
        > http://hdl.loc.gov/loc.uscongress/legislation.112hr4716
        >
        > I discovered this issue through the scraper at washingtonwatch
        >
        > At the very least the scraper needs to strip out invalid characters, but ultimately the upstream source of the bad characters needs to be fixed as well. Anyone know who to contact for that?