Re: [govtrack] Data scrubbing?

  • Jeremy Dunck
    Message 1 of 4, Feb 11, 2005
      Well, that just makes govtrack all the more impressive. I'm not a
      perl-monger, but I know python OK...

      How long do you think before the gov gets a clue about the modern
      mandate for electronic openness?

      I knew about Wikipedia's Special:Export, but they have some
      limitations on how often you can call it which would hamper my goal.

      http://en.wikipedia.org/robots.txt

      Specifically:
      "
      Crawl-delay: 1
      "
      At that rate, doing all pages for all languages would not keep up
      with the update rates, or would cut it too close for comfort.
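
      (Back of the envelope: one request per second is about 86,400 pages
      a day, and with over a million pages across all the languages, a
      single full pass would take a couple of weeks before I even start
      re-fetching anything for updates.)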

      Also, that says "
      #
      # Friendly, low-speed bots are welcome viewing article pages, but not
      # dynamically-generated pages please.
      "

      Special:Export clearly is a dynamic page.

      I'm planning on downloading the whole mess, and my real concern is
      dealing with garbage within the DB.
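
      For the scrubbing itself, I'm picturing a streaming pass over the
      dump XML along these lines (in python, since that's what I know; the
      schema details here are guesses on my part, so treat it as a sketch):

          import xml.etree.ElementTree as ET

          def local(tag):
              # Strip any XML namespace prefix from the tag name.
              return tag.rsplit('}', 1)[-1]

          def clean_pages(dump_path):
              # Stream <page> records out of an export-style XML dump,
              # skipping entries with a missing title or empty text.
              for _event, elem in ET.iterparse(dump_path):
                  if local(elem.tag) != 'page':
                      continue
                  title, text = None, None
                  for child in elem.iter():
                      if local(child.tag) == 'title':
                          title = (child.text or '').strip()
                      elif local(child.tag) == 'text':
                          text = child.text or ''
                  if title and text:
                      yield title, text
                  elem.clear()  # keep memory flat; the full dump is huge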

      Thanks for the response, tho.

      On Fri, 11 Feb 2005 18:46:44 -0500, Joshua Tauberer / GovTrack
      <tauberer@...> wrote:
      >
      >
      > Jeremy Dunck wrote:
      > > I'm thinking of doing a similarly large-scale data reporting service,
      > > and was wondering if you have any tips on getting good data loaded?
      > >
      > > Specifically, I'm thinking about providing a statistics and reporting
      > > service for Wikipedia activity.
      > >
      > > What tools do you use?
      >
      > Hi, Jeremy.
      >
      > GovTrack uses Perl scripts to screen-scrape websites. It's a
      > last-resort type of method. The scripts dump the information into XML
      > files, and I do a little bit of processing on the files (in Perl) to
      > index the data by topic, generate stats, etc.
      >
      > Wikipedia will let you get XML versions of entries
      > (http://en.wikipedia.org/wiki/Special:Export), so you could skip the
      > screen-scraping step.
      >
      > I'm not sure if that really answers your question, though.
      >
      > --
      > - Joshua Tauberer
      >
      > http://taubz.for.net
      >
      > ** Nothing Unreal Exists **
    • Joshua Tauberer / GovTrack
      Message 2 of 4, Feb 12, 2005
        Jeremy Dunck wrote:
        > Well, that just makes govtrack all the more impressive.

        Heh heh, thanks.

        > How long do you think before the gov gets a clue about the modern
        > mandate for electronic openness?

        When they see existing, practical uses for the information, they might
        start to get the idea.

        > I knew about Wikipedia's Special:Export, but they have some
        > limitations on how often you can call it which would hamper my goal.

        You might see if you can request a way to download their entire database
        to bootstrap the process, and then just fetch the changed pages in the
        future, if you really want the data.
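
        Just as a sketch of the steady-state loop (assuming you can pull a
        list of changed titles from somewhere, like a recent-changes feed;
        the Special:Export URL form below is the standard MediaWiki one):

            import time
            import urllib.parse
            import urllib.request

            EXPORT = 'http://en.wikipedia.org/wiki/Special:Export/'

            def fetch_changed(titles, delay=1.0):
                # Fetch each changed page as XML via Special:Export,
                # sleeping between requests to honor their Crawl-delay: 1.
                for title in titles:
                    url = EXPORT + urllib.parse.quote(title.replace(' ', '_'))
                    with urllib.request.urlopen(url) as resp:
                        yield title, resp.read()
                    time.sleep(delay)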

        Good luck.

        --
        - Joshua Tauberer

        http://taubz.for.net

        ** Nothing Unreal Exists **