Re: [govtrack] xml versions of some bills on Thomas
- Hey Scott,
I run into a lot of the same problems doing genealogy -- there's a *LOT* of handscrawl that comes out of the pre-information age. Shoot, looking around the net, there isn't a lot better at the moment :-) But one of the things I do to put dindin on the table is to take a load of scrawled "just-stuff", lay it out, pick it apart, find the relationships and make a retrievable database out of it.
It's also clear that so far HAVA is a thoroughgoing mess with haphazard implementations patchworked across the country. The largest part of the problem is that while Congress mandated uniform voting procedures, the funding that was supposed to enable that to happen never appeared (can we say, "No child left behind"??). At the local level we're up against a lot of "this is the way we've always done it".
I'm in contact with some precinct captains, BOE chairpersons and volunteers -- the one theme that resounds nationally is that there are no standards for anything. The way the current HAVA law is laid out, there aren't likely to be any, any time soon. With this in mind, there are a lot of grass-roots efforts where local boards are comparing notes and procedures with other boards to try and create some.
Even at that, there will be a long time before much of anything below state level is either accessible or uniform :-( This is where we have an opportunity to support the xml.house.gov technical group, to meet with and gather ideas from our local BOE's, then to work out the best sorts of interchange amongst the entirety.
Being the nearly newest in the govtrack e-group, I'm not sure where everyone else is in their projects, what kinds of data we're scraping/normalizing/publishing, what we have on hand because someone sat down and keyed it in, or if any interchange agreements have been established. I don't know how much I don't know yet :LOL:
A little about why I'm excited about this particular project. Before it was "IT" and somewhere after the Dark Ages (DP or Data Processing) there was MIS (Management Information Systems). That's where I grew up. That was the Huge Room Where It All Came Together. MIS had consolidated and normalized all your company's disparate "stuff" together where anyone could draw out the appropriate "stuff", mined in an appropriate way. Enter Steve Jobs and Bill Gates. MIS was slowly dismantled, the larger jobs were "outsourced" and the smaller jobs scattered to desktops. In the 80's outsourcing craze, many companies lost the ability to integrate and reanalyze their stores in flexible ways. But everyone had a PC on their desk, b'damn, even if they had no idea what to do with it.
A generation has forgotten how to consolidate and reanalyze; our government's IT is clearly in a mess; "public information" is nearly an oxymoron; and we're repeating history by moving back from desktop DP to management information. The system is a mess, but I can see on this list a way to Do Something Substantial.
Much of the work I did in the late 70's and early 80's was to catch a chunk of data that pooted out of the end of one process, then figure out how to make it palatable to another process that could use pieces of the first to flesh-out or add new ways of looking at its own stores. To me, finding, scraping, normalizing and integrating the raw sources are the child's play part. Publishing/Interchange comes a bit more tricky.
These are the things I'm here to learn about the group members: who in the group has what kinds of information, who's authoritative on a given data store, what makes them authoritative (by source or by agreement), who can/does mirror components, how are updates published and applied (timeliness), when is a store considered stale, if it's stale should it offer a non-authoritative answer, or none? In return, what may I offer that's of value in moving forward?
----- Original Message -----
From: Scott Beardsley <sc0ttbeardsley@...>
Sent: Sun, 27 Feb 2005 17:20:04 +0000
Subject: Re: [govtrack] xml versions of some bills on Thomas
- Bill Farrell wrote:
> A couple of weeks ago I discovered the Thomas trove and set about combining several data sources into a comprehensive data store on my Pythia site. I had to scrape Thomas and BioGuide to come up with a complete-ish dataset since apparently none of the information is in any ONE place :-P~~~
Neat. Bioguide was on my list of things to eventually scrape. Were you
able to get everything out of bioguide? I'd be interested in seeing
all of that data.
I'm in the process of setting up all of my data for GovTrack in RDF.
Right now my server is trying to get it all (roughly 3 million
'statements') into MySQL... it takes a while. But, the wonders of RDF
don't begin until someone else uses some of the same RDF vocabularies to
describe other related info. If you're interested in working on
exporting some of your information in RDF (even at the least the list of
IDs that the House is using), let's talk more about that.
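
As a rough illustration of what exporting a list of member IDs as RDF statements could involve, here is a minimal sketch that writes N-Triples lines by hand. The example.org namespace, the memberId and name predicates, and the sample ID are all placeholders invented for this sketch; they are not GovTrack's vocabulary or real data.

```python
from urllib.parse import quote


def nt_literal(value):
    """Escape a string as an N-Triples plain literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def member_triples(member_id, name, base="http://example.org/"):
    """Return N-Triples lines for one member.

    `base` and both predicate names are hypothetical placeholders.
    """
    # Percent-encode the ID so it is safe inside the subject URI.
    subject = "<" + base + "people/" + quote(member_id, safe="") + ">"
    return [
        subject + " <" + base + "vocab#memberId> " + nt_literal(member_id) + " .",
        subject + " <" + base + "vocab#name> " + nt_literal(name) + " .",
    ]


if __name__ == "__main__":
    # Made-up ID and name, for illustration only.
    for line in member_triples("Q000001", "Pat Example"):
        print(line)
```

Two datasets only link up if both use the same subject URIs and predicates, so the vocabulary would need to be agreed on before any export like this is useful.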
> Being the nearly newest in the govtrack e-group, I'm not sure where everyone else is in their projects
Scott was going to work on California historical election data (and
possibly current legislative info). Have you gotten started on that, Scott?
I don't know if there are any other parallel projects that got anywhere yet.
> if any interchange agreements have been established.
Not that I know of. ParticipatoryPolitics.org is actively working on a
site that will use GovTrack's data, but of course anyone is welcome to
All of this discussion is really just beginning.
> These are the things I'm here to learn about the group members: who in the group has what kinds of information, who's authoritative on a given data store, what makes them authoritative (by source or by agreement),
These are good questions that we don't have any answers for yet (in part
because there is only a very small number of sources of data). I hope
OGDEX.com (which isn't working for me at the moment) will become the
place with the answers to those types of questions.
> who can/does mirror components, how are updates published and applied (timeliness), when is a store considered stale, if it's stale should it offer a non-authoritative answer, or none?
More good questions that need to be worked out. Something to think
about is whether we need a new format to specify, for a data source, how
to retrieve its data, how often it's updated, who owns and creates the
data, etc. Something that will make it easy to gather and mirror data
from an array of sources.
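
One way to picture such a format is as a small record per data source. The sketch below is only an assumption about what the fields might be (retrieval URL, owner, update interval, whether the copy is authoritative); none of the names come from an existing standard, and the staleness rule is one possible choice among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SourceDescriptor:
    """Hypothetical description of one data source."""
    name: str
    retrieve_url: str           # where to fetch the data
    owner: str                  # who owns and creates the data
    authoritative: bool         # False for a mirror
    update_interval: timedelta  # how often the source is updated

    def is_stale(self, last_fetched, now):
        # A copy counts as stale once it is older than one update interval.
        return now - last_fetched > self.update_interval


def describe_copy(desc, last_fetched, now):
    """Label an answer served from a local copy of the source."""
    if desc.is_stale(last_fetched, now):
        return "stale"
    return "authoritative" if desc.authoritative else "mirror"


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    desc = SourceDescriptor(
        name="demo-source",
        retrieve_url="http://example.org/data.xml",
        owner="Example Owner",
        authoritative=False,
        update_interval=timedelta(days=1),
    )
    fetched = datetime(2005, 3, 1, tzinfo=timezone.utc)
    print(describe_copy(desc, fetched, fetched + timedelta(hours=2)))
```

Whether a stale copy should still answer, flagged as non-authoritative, or return nothing is a policy question the descriptor leaves to the consumer.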
- Joshua Tauberer
** Nothing Unreal Exists **
- --- Joshua Tauberer <tauberer@...> wrote:
> Scott was going to work on California historical
> election data (and
> possibly current legislative info). Have you gotten
> started on that, Scott?
I was originally working on parsing bill data into xml
but the discovery of aroundthecapitol.com put the
brakes on that work. I've yet to hear back from the
site's creator (the other Scott) about licensing and
other details so I might revisit this later if he is
I'm working on manually digitizing (i.e. Hard Copy ->
Digital Photo -> Spreadsheet -> XML) California
election data now. I've gathered the last 5 years of
CA elections into a spreadsheet for each election
(I've found normal spreadsheet apps to be much faster
than going directly to xml). I'm almost done with a
perl script to translate those spreadsheets into xml.
I wanted to get a few years of data to fully
understand what type of data I'm working with. I've
found that I can finish a full election in about 20-30
hours so I estimate all of CA's election data should
take one person a year of full time data entry.
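
For anyone who wants to try the spreadsheet-to-XML step without Perl, here is a minimal Python sketch of the same idea, working from a CSV export. The column names (contest, candidate, votes), the element names, and the sample rows are assumptions made up for illustration; they are not the layout of the actual California spreadsheets or of the Perl script mentioned above.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(csv_text, election_id):
    """Convert CSV rows (contest, candidate, votes) into an XML string."""
    root = ET.Element("election", {"id": election_id})
    contests = {}  # contest name -> its <contest> element
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["contest"].strip()
        contest = contests.get(name)
        if contest is None:
            contest = ET.SubElement(root, "contest", {"name": name})
            contests[name] = contest
        # int() fails loudly on a mistyped vote count instead of passing it on.
        votes = str(int(row["votes"]))
        candidate = ET.SubElement(contest, "candidate", {"votes": votes})
        candidate.text = row["candidate"].strip()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder rows, not real election results.
    sample = (
        "contest,candidate,votes\n"
        "Governor,A. Example,100\n"
        "Governor,B. Sample,90\n"
    )
    print(rows_to_xml(sample, "demo-election"))
```

Keeping data entry in a spreadsheet and generating the XML afterwards means the schema can change later without re-keying anything.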
It may be possible to use OCR software to automate
some of this. I'm taking digital photos for older
elections. I'll send out a Flickr set link when I have
> I don't know if there are any other parallel
> projects that got anywhere yet.
For California: aroundthecapitol.com but Scott (the
other one) hasn't shown any interest in joining
- --- Scott Beardsley <sc0ttbeardsley@...> wrote:
> I've got an email into
> xml-bill-comments@... for more info about
I'm still not sure if there is an underlying source of
the name-id but it seems they are becoming
standardized on whatever the bioguide is using.
From "Carmel, Joe" <joe.carmel@...>:
For the House, the name-id is the Member's id from
http://bioguide.congress.gov This provides a unique
identification for each Member of Congress for all
The ids are unique and you should not assume anything
about their numbering; if anything you should assume
the numbering is random (although unique). Do not
assume that a given name will begin with a specific
letter because they don't.
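
In practice that advice means using the id only as an opaque lookup key when joining datasets, never parsing it. A minimal sketch, assuming hypothetical record layouts (an "id" and "name" field on member records, a "sponsor_id" field on bill records) and made-up placeholder ids:

```python
def merge_by_member_id(bill_rows, member_rows):
    """Join bill rows to member rows using the id purely as an opaque key."""
    members_by_id = {}
    for row in member_rows:
        if row["id"] in members_by_id:
            # The ids are supposed to be unique; fail loudly if they are not.
            raise ValueError("duplicate member id: %r" % row["id"])
        members_by_id[row["id"]] = row

    merged, unmatched = [], []
    for row in bill_rows:
        # Exact-match lookup only: no slicing, sorting, or guessing from the id.
        member = members_by_id.get(row["sponsor_id"])
        if member is None:
            unmatched.append(row)
        else:
            merged.append({**row, "sponsor_name": member["name"]})
    return merged, unmatched


if __name__ == "__main__":
    # Placeholder records, not real members or bills.
    members = [{"id": "Q000001", "name": "Pat Example"}]
    bills = [
        {"bill": "demo-1", "sponsor_id": "Q000001"},
        {"bill": "demo-2", "sponsor_id": "Z000002"},
    ]
    print(merge_by_member_id(bills, members))
```

Rows whose id has no match are returned separately rather than dropped, so gaps between the two sources stay visible.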