Re: [govtrack] The What and Why of RDF
- Bill Farrell wrote:
> The more I read, the sweeter RDF becomes. For me to produce RDF
> output, it will be a bolt-on script or two at the most in coding
And since you mentioned you're tied to XML for your data consumers, I
want to throw in that the reverse is true also. Going from RDF back to
XML isn't too bad either.
> Engineering the proper community-wide environment requires a
> bit more thought and discussion.
Yes, exactly. Compared to an XML-based community where you're
engineering a format, what we have to do is determine the best way to
represent more abstract information.
> The link Joshua provided in yesterday's post was the most helpful by
> far. Thank you very much.
I'm very glad it was helpful.
> While *most* of the original field names remain the
> similar to the mbr107.xml example, the BioGuide scrape entailed the
> invention of some more field names. That is, I "just made stuff up"
> as I went because it didn't previously exist. That doesn't mean that
> a retrieval would immediately be understood by mankind or machine
> correctly, simply described as XML. In fact, it almost guarantees
> the opposite. RDF should fix that (if I begin to understand the
> proper constructions).
Of course, applications won't know what to do with predicates that you
make up, but, exactly, with RDF, making up new predicates doesn't mess
things up as it would with XML/DTD/Schema.
Another way to look at it, though, is that you're free to use other
existing predicates wherever you like. So, for instance, if we didn't
anticipate using the existing XYZ predicate but you see how it could be
useful to describe some data, you can go ahead and use the XYZ
predicate. In this case, applications that do already know the XYZ
predicate will immediately understand your use of it.
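
To make that concrete, here is a rough Python sketch of a consumer that only knows some predicates. It treats RDF statements as plain (subject, predicate, object) tuples; the `ex:` and `xyz:` names and all the values are made up for illustration and aren't from any real vocabulary.

```python
# Statements as (subject, predicate, object) tuples. All names and values
# below are illustrative assumptions, not real GovTrack data.
TRIPLES = [
    ("ex:akaka", "foaf:name", "Akaka"),
    ("ex:akaka", "ex:bioguideNote", "field invented during a scrape"),
    ("ex:akaka", "xyz:knownThing", "some value"),
]


def describe(triples, known_predicates):
    """Split triples into those a consumer understands and those it skips."""
    understood = [t for t in triples if t[1] in known_predicates]
    skipped = [t for t in triples if t[1] not in known_predicates]
    return understood, skipped


# This consumer knows foaf:name and the (hypothetical) xyz:knownThing,
# but has never heard of ex:bioguideNote.
understood, skipped = describe(TRIPLES, {"foaf:name", "xyz:knownThing"})
```

The made-up predicate just lands in `skipped`; nothing about the other statements breaks, which is the point of the comparison with XML/DTD/Schema.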
> Further, the combined CONGRESS record is physically in
> nested-relational (post-relational or NF2, if you prefer) format.
> However, it is the most efficient way of storing, searching and
> retrieving a legislator's position and history. Apparently RDF
> doesn't care and can handle it -- if the proper vocabulary is
> constructed and employed. (more, way below)
Efficiency is an interesting thing to think about, and I didn't give it
any mention in the thing I wrote. Using N3 format, storing is pretty
efficient. You just choose the right namespace abbreviations.
Searching and retrieving is another story. Having a Congress-specific
search-and-retriever will always be more efficient than a generic RDF
I'm not too concerned about this, though. When you need efficiency, you
can always take existing RDF and transform it into a more custom,
specific format that's more efficient for your needs. In fact, that's
basically what GovTrack does now. It runs off of some custom XML
formats, because it's easier for me to program the site that way and
because I can do some custom indexing to make searching fast. But, for
the purposes of sharing the data, I (will) use RDF.
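
As a rough illustration of that "transform it into something custom" step (not how GovTrack actually does it), here is a Python sketch that turns generic triples into simple lookup indexes. The `ex:` predicates and the second legislator are hypothetical placeholders.

```python
from collections import defaultdict

# Generic statements as (subject, predicate, object) tuples.
# The ex: names are illustrative assumptions.
TRIPLES = [
    ("ex:akaka", "ex:state", "HI"),
    ("ex:akaka", "ex:chamber", "senate"),
    ("ex:someone", "ex:state", "HI"),
    ("ex:someone", "ex:chamber", "house"),
]


def build_index(triples, predicate):
    """Map each object value of `predicate` to the set of subjects carrying it."""
    index = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            index[o].add(s)
    return index


# Build the application-specific indexes once, then answer queries with
# set intersections instead of scanning every triple each time.
by_state = build_index(TRIPLES, "ex:state")
by_chamber = build_index(TRIPLES, "ex:chamber")
hawaii_senators = by_state["HI"] & by_chamber["senate"]
```

The shared data stays generic; only the indexes are tuned to the questions a particular site needs to answer quickly.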
As you noted, with the right vocabulary RDF can describe anything, so
you can always use RDF as a public format separate from your internal
> The <Sessions /> section in XML does not physically exist in the
> CONGRESS file, but is necessary to describe the relationship of a
> legislator's role for each session of Congress in which s/he
> served. (Do I understand this correctly to be a "blank node" in RDF?)
Yes. I have the same type of nodes, as blank nodes, in my people.rdf
file. Here's an abbreviated example:
RDF/XML gets to be difficult to read when you embed nodes like this.
pol:role is a predicate which I used twice to relate Akaka to a pol:Term
entity (a blank node, no URI). Each pol:Term is an abstract
representation of basically an election he won giving him a term in
office. The pol:office predicates relate those pol:Terms to the
pol:Office entities that Akaka fills in virtue of having those
pol:Terms. I happened to structure it so that in virtue of his winning
a senate term, he fills three offices, one for each two-year session of
Congress during his term as a senator. In this example it's not
specified that those offices themselves have starting dates and ending
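
The structure described above can be sketched in Python with plain tuples. This is only a model of the shape, not the contents of people.rdf: the `_:bN` labels stand in for blank nodes, and the `ex:` office identifiers are placeholders I've invented for the illustration.

```python
import itertools

_counter = itertools.count(1)


def blank_node():
    """Return a fresh local label, standing in for a node that has no URI."""
    return f"_:b{next(_counter)}"


def add_term(triples, person, offices):
    """Attach one pol:Term blank node to `person`, linked to each office."""
    term = blank_node()
    triples.append((person, "pol:role", term))
    triples.append((term, "rdf:type", "pol:Term"))
    for office in offices:
        triples.append((term, "pol:office", office))
    return term


def offices_held(triples, person):
    """Follow person -> pol:role -> term -> pol:office."""
    terms = [o for s, p, o in triples if s == person and p == "pol:role"]
    return [o for s, p, o in triples if s in terms and p == "pol:office"]


triples = []
# Two terms, each covering three two-year sessions (placeholder office names).
first = add_term(triples, "ex:akaka", ["ex:office-1a", "ex:office-1b", "ex:office-1c"])
second = add_term(triples, "ex:akaka", ["ex:office-2a", "ex:office-2b", "ex:office-2c"])
all_offices = offices_held(triples, "ex:akaka")
```

The terms never need a public name; they exist only to group the offices that belong to one election win.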
> I'm reading in the primer that there are containers for such things,
> but I don't YET see how either the bag or seq containers describe
> this situation adequately. (It may or may not even matter--I'm a bit
> ignorant of the subtleties of RDF yet.)
I haven't worked with those containers much yet. I'm not sure they have
a particular use here.
> For example, how did I know that SESSION was the controlling
> attribute for the nested subtable?
This is another shortcoming of XML and databases, compared to RDF. I
only mentioned it at the end of the thing I wrote, but RDF can be
self-describing. The 'ontology' that describes the pol:* predicates and
classes I used above is at http://www.govtrack.us/share/politico.rdf.
(View source to see the RDF.) And, that relies on other ontologies
(FOAF, for instance).
There is *a lot* to be learned in the realm of RDF ontologies.
> Second part: establishing our common vocabularies.
Whew. I'm mentally exhausted just from part one...
> A good part of RDF rests on the adoption of common vocabularies.
Once again, exactly right.
> I'm just yet a bit hazy on the proper way to construct the vocabularies
> we'll likely need or even if we should compile new and specifically
> descriptive vocabularies at all. (Something tells me that we should.)
For sure we will need to construct vocabularies. I've obviously already
begun this, as an experiment to see what's involved. (See the other
files vote.rdf and usbill.rdf in http://www.govtrack.us/share. The
other files are downloaded from elsewhere.) There are very few
vocabularies out there, and as far as I know, none that describe the
complex government-related things we're talking about.
> For Pythia, every named locality is member of the FIPS55 table
> (Federal Information Processing Standard) and/or one of its
> derivates. By using FIPS and only FIPS as a means of determining the
> correct rendition of a city, area, township, county, blahblah I can
> make sure that GNIS and Census data will absolutely interrelate
> (which they don't "quite" otherwise when they're scraped). This gives
> me strict, system-wide normalization.
I was looking at census data this morning.
Note that you don't have to use *only* FIPS. There can be many
predicates relating a resource to a normalized code. E.g., in pseudo-N3:
new_york ogdex:fips55 "1234"
new_york ogdex:usps "NY"
new_york ogdex:census 22
Where new_york is the URI for the state of New York.
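
Here is the same idea as a small Python sketch, assuming the statements are held as plain tuples. The code values are the same placeholders as in the pseudo-N3 above (not real FIPS or Census codes), and `ex:new_york` is a stand-in for whatever URI the state ends up with.

```python
# (subject, predicate, object) tuples mirroring the pseudo-N3 above.
# Values are placeholders, not real codes.
TRIPLES = [
    ("ex:new_york", "ogdex:fips55", "1234"),
    ("ex:new_york", "ogdex:usps", "NY"),
    ("ex:new_york", "ogdex:census", 22),
]


def find_by_code(triples, predicate, code):
    """Return every resource carrying `code` under the given coding scheme."""
    return [s for s, p, o in triples if p == predicate and o == code]


def codes_for(triples, resource):
    """Return all known codes for one resource, keyed by predicate."""
    return {p: o for s, p, o in triples if s == resource}


by_usps = find_by_code(TRIPLES, "ogdex:usps", "NY")
all_codes = codes_for(TRIPLES, "ex:new_york")
```

Whichever code a data source happens to use, it resolves to the same resource, so each coding scheme can sit alongside the others rather than replacing them.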
> The joy is that I wouldn't have to store any bill or action text, nor
> would Joshua necessarily have to store the entire CONGRESS file or
> any of the "foundation" type files (like FIPS and the derivations
> therefrom) that Pythia holds. Simply by the descriptions in the RDF
> we'd know exactly where to obtain atomic information IF a given
> retrieval required it. (Am I getting this right?)
Okay, this might be the only thing that you've jumped the gun on. :)
RDF doesn't indicate where actually to get content. However, we could
create/find an RDF vocabulary to describe such things. It's a minor
implementation detail, but it's something RDF itself doesn't address.
> Rather than for me sit here and "make stuff up" to complete the
> CONGRESS example that Joshua suggested, it might be an idea to
> discuss our implementation of a vocabulary first so that not only are
> names for common objects identical throughout the community; the
> points of origin (or authoritativeness) would also become well-known
> and described.
That's a good place to start, but I need a mental break before I suggest
exactly how to begin on that.
Thanks, Bill, for going over these issues in such great detail. It's a
big help to get everyone on the same page and to get a plan of action
- Joshua Tauberer
** Nothing Unreal Exists **