
Re: [govtrack] Invalid xml characters in bills

  • Josh Tauberer
    Message 1 of 2, Apr 28 5:42 AM
      Thanks, Francis.

      I corrected the XML file and the scraper yesterday afternoon, so
      hopefully the data is now good.

      And I let my LOC contact know about HR4517 but I'll also pass on the
      other two --- thanks for those links.

      - Josh Tauberer (@JoshData)

      http://razor.occams.info

      On 04/27/2012 12:19 PM, Francis wrote:
      > At least one bill's XML has invalid characters:
      > http://www.govtrack.us/data/us/112/bills/h4517.xml
      >
      > The reason for this is that the source on THOMAS (maybe ultimately at CRS) has invalid characters. So far what I've been seeing is the SUB (0x1A) control character, which is not allowed in XML 1.0.
      >
      > Other not-yet-scraped bills have this issue too (these are just the ones I have identified):
      > http://hdl.loc.gov/loc.uscongress/legislation.112hr4715
      > http://hdl.loc.gov/loc.uscongress/legislation.112hr4716
      >
      > I discovered this issue through the scraper at washingtonwatch.
      >
      > At the very least the scraper needs to strip out invalid characters, but ultimately the upstream source of the bad characters needs to be fixed as well. Anyone know who to contact for that?
      >
      >
      >
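For anyone running their own scraper against the same sources, here is a minimal sketch of the "strip out invalid characters" step Francis describes. It is not the fix that was applied to the GovTrack scraper; the function name `strip_invalid_xml_chars` and the sample string are made up for illustration. It assumes the scraped text is already decoded to a Python 3 `str`, and it removes anything outside the character range that XML 1.0 allows (which includes SUB, 0x1A).

```python
import re
import xml.etree.ElementTree as ET

# Characters allowed by XML 1.0: tab, LF, CR, and the ranges
# U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
# The negated class below matches everything else.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that XML 1.0 forbids removed.

    Hypothetical helper for illustration only.
    """
    return _INVALID_XML_CHARS.sub("", text)


# Made-up sample containing the SUB (0x1A) control character.
scraped_title = "To amend the Act\x1a, and for other purposes."
clean_title = strip_invalid_xml_chars(scraped_title)

# Build the element from the cleaned text, then check that the
# serialized output parses again.
bill = ET.Element("title")
bill.text = clean_title
reparsed = ET.fromstring(ET.tostring(bill, encoding="unicode"))
```

Stripping is the simplest option; a scraper could instead replace the bad character with a space or log the bill number so the upstream source can be reported, as was done for HR4517.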