Re: [govtrack] Proposed URIs for Congressional Things
- I'd answer in-line, but my webmail insists on sending the original message as an attachment regardless of settings, grrrr...
I've gotten all of the BioGuide scraped and in the system. The next chorelet (a pretty easy one, since I've done it before) is to meld the mbr107.xml into the store. I've revamped the file layout quite a bit, having gathered neat tables of every political party and official position/role. Those side tables are great for normalization and are useful on their own. Matter of fact, I wound up with several new support tables. I'm documenting as I go and will have examples ready for the group in a day or three.
Luckily, those tables slide right into Excel 2003 politely and come out as a nice text file. Piece of cake to parse -- literally a two-step. Now that I look at the flat rendition, I think I'll add an E-Contact field to house the contact-form URLs in addition to the Email field in CONGRESS. We can decide how to handle them at query time later.
The rep-to-district mapping is rather a nice find, since it relates the Congress file directly to the FIPS and Census Places2K files already on hand. That is, if you know a rep's surname, for instance, you can find all the cities, towns, etc. in his/her district. Handy stuff to know, particularly if one wants to follow money trails. I've also got the rep-to-committee mapping and the complete phone list. The Senate phone list is in PDF, but it looks as if that will pose little problem.
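To make the payoff concrete, here's a rough sketch of the surname-to-places walk. All table and field names here are made up for illustration -- the real dictionaries differ -- but the chain is the one described above: CONGRESS gives you the district, and the FIPS/Places side tables give you the places in it.

```python
# Illustrative sketch of the rep -> district -> places walk.
# Table and field names are hypothetical, not the real dictionary names.

# CONGRESS-style record, keyed by uppercase surname for lookup
congress = {
    "MILLENDER-MCDONALD": {"state": "CA", "district": "37"},
}

# FIPS/Places2K-style side table, keyed by (state, district)
places = {
    ("CA", "37"): ["Carson", "Compton", "Long Beach (part)"],
}

def places_for_rep(surname):
    """Given a rep's surname, return the places in his/her district."""
    rec = congress[surname.upper()]
    return places[(rec["state"], rec["district"])]

print(places_for_rep("Millender-McDonald"))
```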
BTW, I've been using a very handy little free tool, pdf2html, for 3 or 4 years now. PDFs that are mostly text move over quite nicely, and HTML is a heckuva lot easier to parse. If anyone wants a copy, I think I have the original tgz file; I'll go look for it when I get home tonight. It's written in Perl, so it should work for just about everyone on the list.
The only counter to the don't-normalize proposition I can offer is that Millender-McDonald is absolutely not the same as MILLENDER-MCDONALD. Now that I think about it, though, making an upper-case virtual field for searches is zero problem -- it doesn't cost the system anything extra. Piece o' cake. I'll rerun the file-builder and leave case alone. Since the nice folks running BioGuide did the hard work of pretty-printing, we'll let them keep doing it :-D
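In code terms, the virtual-field trick amounts to something like this (a sketch only -- the record list and function names are invented, not the real dictionary mechanism): the store keeps the pretty-printed original, and matching happens against an uppercased shadow of it.

```python
# Sketch of a case-preserving store with an uppercase "virtual field"
# used only for matching. Names here are illustrative.

records = ["Millender-McDonald", "Tauberer", "Farrell"]

def search(query):
    """Match against the uppercased virtual value; return the
    pretty-printed original for display."""
    q = query.upper()
    return [r for r in records if r.upper() == q]

print(search("MILLENDER-MCDONALD"))  # finds the mixed-case original
```

So the search is case-blind, but the nicely cased BioGuide spelling is what comes back for display.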
Insofar as nested classes go, I've flattened the main CONGRESS file somewhat. CONGRESS now only contains biographic/demographic information. There is a new file, CONGRESS_ROLES, which IS multivalued, though -- there's little way around that. The CONGRESS_ROLES file is keyed to the CONGRESS file... same item-ids. For each legislator there is a single record with two dependent attributes: SESSION and ROLE. For every value in the SESSION attribute there is an equally-ranked value in ROLE. With one read, you have each legislator's whole history.
If it were expressed in XML (actually an accurate representation of the internals), it would look something like:
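A sketch of that rendering (the element and attribute names here are illustrative, not the actual dictionary output -- the point is one record per legislator, with equally-ranked SESSION and ROLE values pairing up position by position):

```xml
<!-- One CONGRESS_ROLES record; item-id matches the CONGRESS key.
     Element names are illustrative. -->
<congress_roles item-id="F000123">
  <session>105</session> <role>Representative</role>
  <session>106</session> <role>Representative</role>
  <session>107</session> <role>Representative</role>
</congress_roles>
```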
Of course, a virtual field in the CONGRESS dictionary (currently called ROLES) would return the same information, but nested within a congress_record structure. A class within a class. Each legislator's set of roles would be one class (like above), but the same information can be delivered with a query to CONGRESS, which would also be a class of data.
Again, it doesn't matter where the data physically live -- it's more of a matter of describing what effects we wish to produce against what kinds of queries. The whole reason for normalization is to ensure that any set of data can relate absolutely and unerringly to any other. As currently designed, one can walk easily into and out of our bioguide to geolocation data, to census data, etc. Any of the other data can be presented concomitantly with ANY query. Call it a bigole cloud, but one in which you know the exact location, shape, and value of any single droplet within it.
As I said, I'm multilingual in human and computer languages... but learning RDF is a lot like learning the same-old same-old post-relational data concepts (ex-instructor, know 'em by heart) all over again. But in Mandarin. Which I don't speak :-) Right now I'm concentrating on getting the data IN... I'll worry about getting it out in RDF as we get to it.
I'm working on a universal query front-end, which I expect to have done in a few days. Then we can begin to look at the treasure trove and evaluate how we want to express the contents of the stores. This project is going alongside the BioGuide/Congress project. There's enough of the web interface available that we can all look at the dictionary structures to see what kinds of data are available and how we'd like to package them. All the files in the lineup are inter-related, such that a query on any one can as easily return data from any other(s). I've opened it up for examination: http://www.progressivenation.net -- "Choose a Research Area". For now, you can see a table layout for each of the published files. The query link works through to the final page, but if you push the "Report" button, you'll not be really happy just yet. It's a place to start discussion, though.
That's where I am at the moment. Full plate and counting.
- Thanks, Bill, for working on the bioguide scraping. It looks good.
Bill Farrell wrote:
> The only counter to the don't-normalize proposition I can offer is
> that Millender-McDonald is absolutely not the same as
> MILLENDER-MCDONALD.

I was thinking more about displaying the field to users down the line. If all you have is the uppercase version, you can't display it in normal case to a user. The normal-cased version has to stick around if you ever want it displayed.
This goes to the larger issue of normalization. Normalization is exactly the process of removing information that is extraneous to a particular task. But removing information is exactly the opposite of what we want to do: bring together lots and lots of information.
- Joshua Tauberer
** Nothing Unreal Exists **