Re: [govtrack] Data scrubbing?
Well, that just makes GovTrack all the more impressive. I'm not a
perl-monger, but I know Python OK...
How long do you think before the gov gets a clue about the modern
mandate for electronic openness?
I knew about Wikipedia's Special:Export, but they have some
limitations on how often you can call it, which would hamper my goal:
if I wanted to fetch all pages in all languages, I either couldn't
keep up with the update rate, or it would be too close for comfort.
Also, their robots.txt says:

# Friendly, low-speed bots are welcome viewing article pages, but not
# dynamically-generated pages please.

Special:Export is clearly a dynamically-generated page.
I'm planning on downloading the whole mess, and my real concern is
dealing with garbage within the DB.
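Roughly what I have in mind for the scrubbing pass (just a sketch,
assuming the dump arrives in the same XML shape as Special:Export
output -- the element names and the scrub rules here are my guesses,
not anything Wikipedia documents):

```python
import re
import xml.etree.ElementTree as ET

# Control characters that have no business in article text.
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def scrub(text):
    """Normalize line endings, then drop stray control characters."""
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    return CONTROL_CHARS.sub('', text)

def pages(xml_stream):
    """Stream (title, scrubbed_text) pairs from an export-style dump,
    so the whole thing never has to sit in memory at once."""
    for _, elem in ET.iterparse(xml_stream):
        if elem.tag.endswith('page'):
            title = elem.findtext('title')
            text = elem.findtext('revision/text') or ''
            yield title, scrub(text)
            elem.clear()  # free the element once we've processed it
```

The streaming part matters more than the exact scrub rules; "garbage"
will probably turn out to be whatever the real dump teaches me it is.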
Thanks for the response, tho.
On Fri, 11 Feb 2005 18:46:44 -0500, Joshua Tauberer / GovTrack wrote:
> Jeremy Dunck wrote:
> > I'm thinking of doing a similarly large-scale data reporting service,
> > and was wondering if you have any tips on getting good data loaded?
> > Specifically, I'm thinking about providing a statistics and reporting
> > service for Wikipedia activity.
> > What tools do you use?
> Hi, Jeremy.
> GovTrack uses Perl scripts to screen-scrape websites. It's a
> last-resort type of method. The scripts dump the information into XML
> files, and I do a little bit of processing on the files (in Perl) to
> index the data by topic, generate stats, etc.
> Wikipedia will let you get XML versions of entries
> (http://en.wikipedia.org/wiki/Special:Export), so you could skip the
> screen-scraping step.
> I'm not sure if that really answers your question, though.
> - Joshua Tauberer
> ** Nothing Unreal Exists **
Jeremy Dunck wrote:
> Well, that just makes govtrack all the more impressive.

Heh heh, thanks.

> How long do you think the gov gets a clue about the modern mandate for
> electronic openness?

When they see existing, practical uses for the information, they might
start to get the idea.

> I knew about Wikipedia's Special:Export, but they have some
> limitations on how often you can call it which would hamper my goal.

You might see if you can request a way to download their entire database
to bootstrap the process, and then just fetch the changed pages in the
future, if you really want the data.
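Something along these lines, in Python since you know it (the
Special:Export URL shape and the delay are just my assumptions --
check what they actually allow before running this against their
servers):

```python
import time
import urllib.request
from urllib.parse import quote

def export_url(title):
    # Assumed URL form for Special:Export; verify against the live site.
    return "https://en.wikipedia.org/wiki/Special:Export/" + quote(title)

def fetch_changed(titles, delay=5.0):
    """Pull each changed page one at a time, pausing between requests
    to stay in the 'friendly, low-speed bot' category."""
    for title in titles:
        with urllib.request.urlopen(export_url(title)) as resp:
            yield title, resp.read()
        time.sleep(delay)
```

The point is that after the one-time bulk download, the per-page rate
limit only has to keep up with the change rate, not the whole database.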
- Joshua Tauberer
** Nothing Unreal Exists **