Re: [govtrack] Data scrubbing?

  • Jeremy Dunck
    Feb 11, 2005
      Well, that just makes govtrack all the more impressive. I'm not a
      perl-monger, but I know python OK...

      How long do you think it will take before the gov gets a clue about
      the modern mandate for electronic openness?

      I knew about Wikipedia's Special:Export, but they have some
      limitations on how often you can call it which would hamper my goal.

      http://en.wikipedia.org/robots.txt

      Specifically:
      "
      Crawl-delay: 1
      "
      Which, if I wanted to do all pages for all languages, would not keep
      up with update rates, or would be too close for comfort.
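To put a rough number on that worry, here is a minimal Python sketch that reads the Crawl-delay value with the standard library's robots.txt parser and estimates how long one full pass would take. The `User-agent: *` line, the helper names, and the one-million-page figure are assumptions for illustration only; the real page count and update rate are not given here.

```python
from urllib.robotparser import RobotFileParser

# Only the "Crawl-delay: 1" line is quoted in the message above; the
# "User-agent: *" line is an assumption so the parser has a section to read.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 1
"""

SECONDS_PER_DAY = 24 * 60 * 60


def crawl_delay_seconds(robots_txt, user_agent="*"):
    """Return the Crawl-delay that applies to user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def days_for_full_pass(page_count, delay_seconds):
    """Days needed to request every page once, one request per delay."""
    return page_count * delay_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    delay = crawl_delay_seconds(ROBOTS_TXT)
    hypothetical_pages = 1_000_000  # made-up figure, not a real page count
    days = days_for_full_pass(hypothetical_pages, delay)
    print(f"Crawl-delay: {delay}s, about {days:.1f} days per full pass")
```

At one request per second a bot can make at most 86,400 requests a day, so a polite crawler only keeps up if fewer pages than that change daily across everything it tracks.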

      Also, that says "
      #
      # Friendly, low-speed bots are welcome viewing article pages, but not
      # dynamically-generated pages please.
      "

      Special:Export clearly is a dynamic page.

      I'm planning on downloading the whole mess, and my real concern is
      dealing with garbage within the DB.

      Thanks for the response, tho.

      On Fri, 11 Feb 2005 18:46:44 -0500, Joshua Tauberer / GovTrack
      <tauberer@...> wrote:
      >
      >
      > Jeremy Dunck wrote:
      > > I'm thinking of doing a similarly large-scale data reporting service,
      > > and was wondering if you have any tips on getting good data loaded?
      > >
      > > Specifically, I'm thinking about providing a statistics and reporting
      > > service for Wikipedia activity.
      > >
      > > What tools do you use?
      >
      > Hi, Jeremy.
      >
      > GovTrack uses Perl scripts to screen-scrape websites. It's a
      > last-resort type of method. The scripts dump the information into XML
      > files, and I do a little bit of processing on the files (in Perl) to
      > index the data by topic, generate stats, etc.
      >
      > Wikipedia will let you get XML versions of entries
      > (http://en.wikipedia.org/wiki/Special:Export), so you could skip the
      > screen-scraping step.
      >
      > I'm not sure if that really answers your question, though.
      >
      > --
      > - Joshua Tauberer
      >
      > http://taubz.for.net
      >
      > ** Nothing Unreal Exists **