No problem. You can download the whole CONGRESS file (or any selected portion thereof) from http://pythia.progressivenation.net/modules/tinycontent3/index.php?id=15
There's a menu to the left where the schemas and docs live.
Scraping BioGuide was a simple but rather tedious manual scrape -- I figured for a one-shot to populate the initial file it wouldn't matter as long as the framework got built and the schema shaped itself. The XML that comes from the query is a fairly accurate representation of the post-relational recordset, wherein there is no concept of "edge-tables" -- repeated data is repeated naturally. Thus, some fields are mutually dependent: Session controls the number of values in Position, etc. The XML output is built to allow a complete personal record on output, but it will be up to the consumer how to deal with the internally-nested "tables". (Post-relational is sometimes called "nested relational" or post-first-normal-form -- or epithets not fit to post :LOL: but it does allow for rapid response to changes and deals well with massive strings)
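To make the nested shape concrete, here's a minimal sketch of consuming that kind of post-relational XML; the element names (`Member`, `Session`, `Position`) are invented for illustration and aren't the actual CONGRESS schema, but they show how repeated data nests in place rather than living in edge-tables:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the nested, post-relational shape described
# above: repeated data is repeated in place, so each <Session> carries
# its own <Position> values instead of pointing at an edge-table.
doc = """
<Member>
  <Name>Jane Doe</Name>
  <Session number="107">
    <Position>Representative</Position>
  </Session>
  <Session number="108">
    <Position>Representative</Position>
    <Position>Committee Chair</Position>
  </Session>
</Member>
"""

root = ET.fromstring(doc)

# The consumer decides how to deal with the internally-nested "tables" --
# here, flattening to one (session, position) row per repeated value.
rows = [(s.get("number"), p.text)
        for s in root.findall("Session")
        for p in s.findall("Position")]
```

Note how Session controls the number of Position values: the 108th session yields two rows, the 107th only one.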
As you'll see, there are holes in the information, where BioGuide information just didn't fill out the mbr107.xml information. I normalized where I could and constructed the records to allow for any sort of cross-sectioning retrieval. The PN consumers love that slice-and-dice; the construction makes direct retrieval to spreadsheets a no-brainer.
The CONGRESS file I expect to be reasonably static. I'm not sure whether publishing the entirety (or any portion) using RDF would be appropriate, except for publishing updates. RDF is great for publishing changes, but it's really rotten for query-and-retrieval. Our best tack probably is to publish updates via RDF while still allowing direct query to XML or HTML. Just thinking out loud...
I've seen that OGDEX is having a hard day or two. Oh, the joys of getting a site running :-) Do you need server space with MySQL? I have room on my server if you do. There's about 1/2 a terabyte of space allocated for GovTrack/Pythia/PN projects left, so I think we'll have a ways to go before filling it :-) I'm using a 64-bit AMD 3200 box with a couple of gigs of memory... horsepower and space I've got and am willing to donate to the project.
As for exchanging data: as I get data into the system, I try to get scripts out on the web side so that people can get at it. The "XML Data" section on Pythia has examples, links, and sample query forms so you can look at results before coding (try it before you buy it <smile>). The DTD's are shipped internally in the results, but if you are using Access or Excel, you can grab a schema from the Schemas link and run with it. It's a several-step process for me to publish the DTD's and schemas, but emancipating data from the site is a lot simpler if I provide the tools to understand its structures. Then again, they won't change that much and are easy to adapt.
If there's a call for it, I can also produce datasets in tab-delimited or natural non-first-normal-form format. From where I sit, each is a script-twiddle to accomplish. Last night I had to add "OutputAs=HTML" to the CONGRESS script since it was pretty clear that my OpenOffice was NOT going to open XML directly from the web, but it would happily parse an HTML table into a reasonable spreadsheet. These are things we'll run into as we go. Each output is "pure" XML/HTML/flat-text -- no frills, mostly because the raw data returned from a query isn't yet suitable for publication as RSS or RDF -- they're just inappropriate structures for the purpose. But for consumer-level publication, would it be a thought for you to point a wget/curl/whatever at my stores, query out what you want, then publish RDF from there? That way you can select and publish the slices of interest without having to download the whole 70 gigs of related stores. Again, just thinking out loud...
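The fetch-then-republish idea might look something like this sketch. The URL, element names, and predicate URIs below are all made up for illustration (the real query scripts and schemas are the ones on the Pythia site); it just shows the shape of "query XML out, publish N-Triples from there":

```python
import xml.etree.ElementTree as ET

# Sketch of "query out what you want, then publish RDF from there".
# Element names and the example.org URIs are invented for illustration;
# a real consumer would use the store's documented schema and vocabulary.
def rows_to_ntriples(xml_text):
    """Turn a small XML query result into N-Triples lines."""
    root = ET.fromstring(xml_text)
    triples = []
    for member in root.findall("Member"):
        subj = "<http://example.org/member/%s>" % member.get("id")
        for child in member:
            pred = "<http://example.org/vocab#%s>" % child.tag
            triples.append('%s %s "%s" .' % (subj, pred, child.text))
    return "\n".join(triples)

# In practice you'd wget/curl the query URL first and feed the response
# body in; a literal stands in for the fetched result here.
sample = '<Results><Member id="F000123"><Name>Jane Doe</Name></Member></Results>'
nt = rows_to_ntriples(sample)
```

That way the RDF publisher only ever handles the slice it selected, never the full store.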
Would it be an idea for you to let ParticipatoryPolitics know that you have your hands on a second working store at Pythia? Where each of us is sitting individually on a project might actually be further down the pike than it looks from the wider scope. I'd hate to be duplicating something someone else has already done -- like a typical consumer, I'd rather query out the pieces I'm missing from the opposite store when/as/if I need them.
We might all do best to think of the entire project as a widely-networked but closely-normalized data cube... the details living at one URI or another for any particular segment. I think this view will allow better and more diverse analysis of any particular stratum of interest, letting the consumers of one portal frame their mining operations one way, whilst consumers of another site frame theirs somewhat differently -- but everyone relies on known normalized stores. The only thing we'd need is a queryable location to find URI's of interest. OGDEX would fit that bill perfectly (in your copious spare time, right? (chuckle))
Of course, if you'd like to visit by phone for questions that don't fit easily into the list format, you're welcome. E me a number and an appropriate time to call (I love m'broadband phone, no LD). I'm on east-coast time and am usually available after 5:30pm.
Bill Farrell wrote:
> A couple of weeks ago I discovered the Thomas trove and set about combining several data sources into a comprehensive data store on my Pythia site. I had to scrape Thomas and BioGuide to come up with a complete-ish dataset since apparently none of the information is in any ONE place :-P

Neat. Bioguide was on my list of things to eventually scrape. Were you
able to get everything out of bioguide? I'd be interested in seeing
all of that data.
I'm in the process of setting up all of my data for GovTrack in RDF.
Right now my server is trying to get it all (roughly 3 million
'statements') into MySQL... it takes a while. But, the wonders of RDF
don't begin until someone else uses some of the same RDF vocabularies to
describe other related info. If you're interested in working on
exporting some of your information in RDF (at the very least the list of
IDs that the House is using), let's talk more about that.
> Being the nearly newest in the govtrack e-group, I'm not sure where everyone else is in their projects

Scott was going to work on California historical election data (and
possibly current legislative info). Have you gotten started on that, Scott?
I don't know if there are any other parallel projects that got anywhere yet.
> if any interchange agreements have been established.

Not that I know of. ParticipatoryPolitics.org is actively working on a
site that will use GovTrack's data, but of course anyone is welcome to
do the same. All of this discussion is really just beginning.
> These are the things I'm here to learn about the group members: who in the group has what kinds of information, who's authoritative on a given data store, what makes them authoritative (by source or by agreement),

These are good questions that don't have answers yet (in part
because there is only a very small number of sources of data). I hope
OGDEX.com (which isn't working for me at the moment) will become the
place with the answers to those types of questions.
> who can/does mirror components, how are updates published and applied (timeliness), when is a store considered stale, if it's stale should it offer a non-authoritative answer, or none?

More good questions that need to be worked out. Something to think
about is whether we need a new format to specify, for a data source, how
to retrieve its data, how often it's updated, who owns and creates the
data, etc. Something that will make it easy to gather and mirror data
from an array of sources.
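The kind of per-source descriptor being described could start as small as this sketch; every field name below is invented for illustration, not a format anyone has proposed or agreed on:

```python
# A guess at a per-source descriptor: how to retrieve the data, how often
# it's updated, who owns it, whether it's the authoritative copy.
# All field names here are invented for illustration only.
source = {
    "name": "CONGRESS",
    "retrieve": {"url": "http://example.org/query", "format": "XML"},
    "update_frequency_days": 7,
    "owner": "Pythia / ProgressiveNation",
    "authoritative": False,  # a mirror, not the original source
}

def is_stale(last_fetch_days_ago, descriptor):
    """A mirror is stale once the source has likely updated since our fetch."""
    return last_fetch_days_ago > descriptor["update_frequency_days"]
```

With something like this published per store, a harvester could answer the staleness and mirroring questions above mechanically.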
- Joshua Tauberer
** Nothing Unreal Exists **
Bill Farrell wrote:
> No problem. You can download the whole CONGRESS file (or any selected portion thereof) from http://pythia.progressivenation.net/modules/tinycontent3/index.php?id=15

Great, thanks. I'll look over that in a bit.
> I'm not sure whether publishing the entirety (or any portion) using RDF would be appropriate, except for publishing updates. RDF is great for publishing changes, but it's really rotten for query-and-retrieval.

So... in fact, RDF probably isn't what you think it is (I'm guessing you
have RSS feeds in mind), although the two are often related.
To me, RDF is the technical solution to the problem that's brought us to
this mail list. But, RDF is complicated and not well known or
understood (I'm still learning a lot day by day). Worse yet, I haven't
come across any websites that do a good job of explaining what RDF is
and where it should be used over other formats.
As a result, I'm working on writing a good (I hope) explanation of RDF.
Should be done in a few days, and I'll post it then.
> I've seen that OGDEX is having a hard day or two. Oh, the joys of getting a site running :-) Do you need server space with MySQL?

OGDEX.com seems to be back. Thanks for the offer, but actually
DemocracyInAction hosts the site and I don't think they have any
shortage of resources.
> I'd hate to be duplicating something someone else has already done

Yeah, and I've done a lot at the federal level. :)
> We all might best to think of the entire project as a widely-networked but closely-normalized data cube... the details being in one URI or another for any particular segment. I think this view will allow better and more diverse analysis of any particular stratum of interest, allowing the consumers of one particular portal to frame their mining operations one way, whilst consumers of another site might frame their mining operations somewhat differently -- but everyone relies on known normalized stores. The only thing we'd need would be a queryable location to find URI's of interest.

You're describing exactly what RDF is meant to do. Stay tuned. :)
- Joshua Tauberer
** Nothing Unreal Exists **