
Re: [govtrack] Proposed URIs for Congressional Things

  • Bill Farrell
    Message 1 of 11, Mar 17, 2005
      I'd answer in-line, but my webmail insists on sending the original message as an attachment regardless of settings, grrrr...

      I've gotten all of the BioGuide scraped and in the system. The next chorelet (a pretty easy one, since I've done it before) is to meld the mbr107.xml into the store. I've revamped the file layout quite a bit, having gathered neat tables of every political party and official position/role. Those side tables are great for normalization and are useful on their own. Matter of fact, I wound up with several new support tables. I'm documenting as I go and will have examples ready for the group in a day or three.
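
      To make that concrete, here's the gist in Python terms (a rough sketch -- the codes and layouts below are made up for illustration, not the actual file definitions):

      # Side tables keep one row per political party and per official
      # position/role; the main CONGRESS record carries only short codes.
      PARTY = {"D": "Democrat", "R": "Republican", "I": "Independent"}
      POSITION = {"REP": "Representative", "SEN": "Senator", "DEL": "Delegate"}

      member = {"id": "D000443", "party": "D", "position": "DEL"}
      print(PARTY[member["party"]], POSITION[member["position"]])
      # -> Democrat Delegate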

      I "disremembered" the javascript contact forms. Those are on http://www.senate.gov/general/contact_information/senators_cfm.cfm - a perfectly beastly page to scrape. I'll that puppy in while I'm re-casing the CONGRESS file. Same deal goes for the representative-to-district mapping.

      Luckily, those slide right into Excel 2003 politely and come out as a nice text file. Piece of cake to parse. Literally a two-step. Now that I look at the flat rendition, I think I'll add an E-Contact field to house the contact form URLs in addition to the Email field in CONGRESS. We can decide how to react on query later.
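
      The parse step, roughly, in Python (a minimal sketch -- the file name and column headings are hypothetical, assuming a tab-delimited export):

      import csv

      # Read the tab-delimited export and carry both the Email field and the
      # new E-Contact field (the web contact-form URL).
      with open("senators_export.txt", newline="") as fh:
          for row in csv.DictReader(fh, delimiter="\t"):
              record = {
                  "name": row["Name"],
                  "email": row.get("Email", ""),
                  "e_contact": row.get("Contact Form", ""),
              }
              print(record)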

      The rep-to-district mapping is rather a nice find, since that relates the Congress file directly to the FIPS and the Census Places2K files already on-hand. That is, if you know a rep's surname, for instance, you can find all the cities, towns, etc. in his or her district. Handy stuff to know, particularly if one wants to find money-trails. I've also got the rep-to-committee mapping and the complete phone list. The Senate phone list is in PDF, but it looks as if that will pose little problem.
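
      In Python terms, the walk goes something like this (table names and sample values invented for illustration):

      # Surname -> district via the rep-to-district mapping, then
      # district -> places via the FIPS / Census Places2K tables.
      REP_TO_DISTRICT = {"Millender-McDonald": "CA-37"}
      DISTRICT_TO_PLACES = {"CA-37": ["Carson", "Compton", "Long Beach"]}

      def places_for(surname):
          return DISTRICT_TO_PLACES.get(REP_TO_DISTRICT[surname], [])

      print(places_for("Millender-McDonald"))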

      BTW, I've been using a very handy little free tool, pdf2html, for 3 or 4 years now. PDFs that are mostly text move over quite nicely, and HTML is a heckuva lot easier to parse out. If anyone wants a copy, I think I have the original tgz file. I'll go look for it when I get home tonight. It's written in Perl, so it should work for just about everyone on the list.
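
      Once a PDF is over in HTML, pulling the text back out is trivial. A quick sketch (Python 3 here rather than Perl, the file name is hypothetical, and I'm assuming the converter emits plain, simple HTML):

      from html.parser import HTMLParser

      # Strip the markup from the converted phone list and keep the text lines.
      class TextGrabber(HTMLParser):
          def __init__(self):
              super().__init__()
              self.lines = []

          def handle_data(self, data):
              if data.strip():
                  self.lines.append(data.strip())

      grabber = TextGrabber()
      with open("senate_phone_list.html") as fh:
          grabber.feed(fh.read())
      for line in grabber.lines:
          print(line)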

      The only counter to the don't-normalize proposition I can offer is that Millender-McDonald is absolutely not the same as MILLENDER-MCDONALD. Now that I think about it, making an upper-case virtual field for searches is zero problem. It doesn't cost the system any more or less to use a virtual field for searches. Piece o' cake. I'll rerun the file-builder and leave case alone. Since the nice folks running BioGuide did the hard work of pretty-printing, we'll continue to let them :-D
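
      The gist of the virtual-field approach, sketched in Python (the name list is just for show):

      # Stored names keep BioGuide's pretty-printing; the uppercase form
      # exists only at search time, so no display information is thrown away.
      members = ["Millender-McDonald", "Tauberer", "Farrell"]

      def search(query):
          q = query.upper()
          return [name for name in members if name.upper() == q]

      print(search("MILLENDER-MCDONALD"))   # -> ['Millender-McDonald']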

      Insofar as nested classes go, I've flattened the main CONGRESS file somewhat. CONGRESS now only contains biographic/demographic information. There is a new file, CONGRESS_ROLES, which IS multivalued, though -- there's little way out of that. The CONGRESS_ROLES file is keyed to the CONGRESS file... same item-ids. For each legislator there is a single record with two dependent attributes: SESSION and ROLE. For every value in the SESSION attribute there is an equally-ranked value in ROLE. With one read, you have each legislator's history.

      If it were expressed in XML (actually an accurate representation of the internals), it would look something like:

      <congress_roles>
        <congress_roles_record id="D000443">
          <roles>
            <role sessionid="25">DELEGATE</role>
            <role sessionid="26">DELEGATE</role>
            <role sessionid="31">REPRESENTATIVE</role>
            <role sessionid="32">REPRESENTATIVE</role>
          </roles>
        </congress_roles_record>
        <congress_roles_record id="L000402">
          <roles>
            <role sessionid="24">REPRESENTATIVE</role>
            <role sessionid="25">REPRESENTATIVE</role>
          </roles>
        </congress_roles_record>

        (yaddayadda)

      </congress_roles>
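
      The same record sketched in Python terms (data taken straight from the XML above) -- one item per legislator, keyed the same way as CONGRESS, with SESSION and ROLE as equally-ranked parallel values:

      CONGRESS_ROLES = {
          "D000443": {"SESSION": [25, 26, 31, 32],
                      "ROLE": ["DELEGATE", "DELEGATE",
                               "REPRESENTATIVE", "REPRESENTATIVE"]},
          "L000402": {"SESSION": [24, 25],
                      "ROLE": ["REPRESENTATIVE", "REPRESENTATIVE"]},
      }

      def history(member_id):
          rec = CONGRESS_ROLES[member_id]        # one read per legislator
          return list(zip(rec["SESSION"], rec["ROLE"]))

      print(history("D000443"))
      # [(25, 'DELEGATE'), (26, 'DELEGATE'), (31, 'REPRESENTATIVE'), (32, 'REPRESENTATIVE')]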

      Of course, a virtual field in the CONGRESS dictionary (currently called ROLES) would return the same information, but nested within a congress_record structure. A class within a class. Each legislator's set of roles would be one class (like above), but the same information can be delivered with a query to CONGRESS, which would also be a class of data.
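
      In other words, a query against CONGRESS might come back shaped like this (the biographic fields shown here are placeholders, not the real dictionary entries):

      <congress_record id="D000443">
        <surname>...</surname>
        <!-- ...other biographic fields from CONGRESS... -->
        <roles>
          <!-- virtual ROLES field, assembled from CONGRESS_ROLES -->
          <role sessionid="25">DELEGATE</role>
          <role sessionid="26">DELEGATE</role>
          <role sessionid="31">REPRESENTATIVE</role>
          <role sessionid="32">REPRESENTATIVE</role>
        </roles>
      </congress_record>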

      Again, it doesn't matter where the data physically live -- it's more a matter of describing what effects we wish to produce against what kinds of queries. The whole reason for normalization is to ensure that any set of data can relate absolutely and unerringly to any other. As currently designed, one can walk easily into and out of our bioguide data, to geolocation data, to census data, etc. Any of the other data can be presented concomitantly with ANY query. Call it a big ole cloud, but one in which you know the exact location, shape, and value of any single droplet within it.

      As I said, I'm multilingual in human and computer languages... but learning RDF is a lot like learning the same-old same-old post-relational data concepts (ex-instructor, know 'em by heart) all over again. But in Mandarin. Which I don't speak :-) Right now I'm concentrating on getting the data IN... I'll worry about getting it out in RDF as we get to it.

      I'm working on a universal query front-end, which I expect to have done in a few days. We can begin to look at the treasure-trove and evaluate how we want to express the contents of the stores. This project is going alongside the BioGuide/Congress project. There's enough of the web interface available so that we can all look at the dictionary structures to see what kinds of data are available and how we'd like to package it. All the files in the lineup are inter-related, such that a query on any one can as easily return data from any other(s).

      I've opened it up for examination at http://www.progressivenation.net under "Choose a Research Area". For now, you can see a table layout for each of the published files. The query link works through to the final page -- if you push the "Report" button, you'll not be really happy just yet. It's a place to start discussion, though.

      That's where I am at the moment. Full plate and counting.

      Best regards,
      Bill
    • Joshua Tauberer / GovTrack
      Message 2 of 11, Mar 18, 2005
        Thanks, Bill, for working on the bioguide scraping. It looks good.

        Bill Farrell wrote:

        > The only counter to the don't-normalize proposition I can offer is
        > that Millender-McDonald is absolutely not the same as
        > MILLENDER-MCDONALD.

        I was thinking more about displaying the field to users down the line.
        If all you have is the uppercase version, you can't display it in normal
        case to a user. The normal-cased version has to stick around if you
        ever want it displayed.

        This goes to the larger issue of normalization. Normalization is
        exactly the process of removing information that is extraneous to a
        particular task. But, removing information is exactly the opposite of
        what we want to do: bring together lots and lots of information.

        --
        - Joshua Tauberer

        http://taubz.for.net

        ** Nothing Unreal Exists **