The What and Why of RDF
- Hi, guys.
As I've mentioned a bunch of times, I'm convinced the way to approach
data sharing is using RDF. If you don't know much about RDF or if you
disagree, please read:
I hope it's pretty clear, but if you have any thoughts about how I might
improve it, or if you want to see something added to it, or if you're
not convinced by it, please let me know.
- Joshua Tauberer
** Nothing Unreal Exists **
- The more I read, the sweeter RDF becomes. For me to produce RDF output, it will be a bolt-on script or two at the most in coding effort. Engineering the proper community-wide environment requires a bit more thought and discussion.
(I'm on a bit of a learning curve, having taken on XSL, RDF and gypsy-style violin full-on in the same week. At least now I can hum a sprightly hora or two to keep my eyes open as I wade through the docs :-)
The link Joshua provided in yesterday's post was the most helpful by far. Thank you very much. I'm still reading it over. I'm multilingual in human and machine languages, but I tend to grasp data processing ideas in data processing terms. While Joshua thinks in terms of semantics, I think in terms of engineering, normalization, storage and retrieval. We are products of our disparate professional backgrounds. Seeing some practical examples has been most helpful, but I think I need to stop short for a moment and throw some things out to the list in order to stir discussion and increase my level of understanding of what the goals and parameters are.
Feel free to tell me where I'm being an id10t.
While I'm sold on the idea of using RDF as a means of interchange, it does entail some planning and agreement to make its use practical. It would be helpful to me to work through an example of our own devising. I'm throwing this out as a means of stirring discussion and to obtain some input on conventions and practices, as well as to learn proper RDF design.
Part one: BioGuide and the construction of a web-retrievable image
Joshua's suggestion, the BioGuide example, is a fine place to start, since that's something we all know something about and will use nearly immediately. With the group's indulgence, let's use that as our example.
At Joshua's suggestion, I've been exploring the W3C RDF Primer (actually quite good). I did see that it allows encapsulation of existing XML data, but that's exactly what I want to get away from. For now, I'm chained to XML at the member-user level, since many Pythia consumers are using Excel, OOCalc, and Access to grab chunks of data directly into their app. None of those apps yet support RDF directly (shoot, OOCalc barely supports XML, forcing me to offer HTML renditions as well), but I'm banking on "The Day They Do". Until then, there's plenty of time to get the implementation of RDF rock-solid.
(NB: this is largely an exercise -- the view of the CONGRESS file can look like anything we decide it does, including any information from any other file in Pythia.)
My version of the BioGuide listing (the CONGRESS file) combines attributes from the mbr107.xml example from xml.house.gov with attributes obtained from scraping the BioGuide pages, thus rendering a somewhat more complete picture of a legislator's personal information. While *most* of the original field names remain similar to the mbr107.xml example, the BioGuide scrape entailed the invention of some more field names. That is, I "just made stuff up" as I went because it didn't previously exist. That doesn't mean that a retrieval would immediately be understood by mankind or machine correctly simply by being described as XML. In fact, it almost guarantees the opposite. RDF should fix that (if I begin to understand the proper constructions).
Further, the combined CONGRESS record is physically in nested-relational (post-relational or NF2, if you prefer) format. (You've probably seen the XML that is returned from Pythia for a CONGRESS record by now.) This is the natural rendition of the record within the UniVerse system, but without adequate description can be a bit confusing to those still in the relational database world. However, it is the most efficient way of storing, searching and retrieving a legislator's position and history. Apparently RDF doesn't care and can handle it -- if the proper vocabulary is constructed and employed. (more, way below)
CONGRESS has three dependent attributes: SESSION, POSITION, and SH. Each of these attributes can concomitantly hold zero or more values in parallel. That is, for each value in the controlling attribute (SESSION), there will be a related value in an equally-ranked position in the other two attributes. This creates, in effect, a wholly-contained table nested within the CONGRESS record. In the current XML, this relationship is accurately described as:
<OfficialName>JONES, WALTER BEAMAN, JR.</OfficialName>
<FormalName>MR. JONES OF NORTH CAROLINA</FormalName>
The <Sessions /> section in XML does not physically exist in the CONGRESS file, but is necessary to describe the relationship of a legislator's role for each session of Congress in which s/he served. (Do I understand this correctly to be a "blank node" in RDF?) Zero intervening steps are required to transform a CONGRESS XML record for post-relational databases or commonly-used desktop applications such as Access, OpenOffice Calc, and Excel as well as Dot Net XML parsers. These already naturally break this section into related edge tables, automatically performing the necessary transformation. (NOTE: the goal should always be to have zero transformation steps between the RDF and the target app, although RDF does not guarantee it. It's nice to help the user out as much as possible.)
Those of us who also use MySQL, PostgreSQL, DB2, etc might require a bit more description within the text (which RDF seems to allow for quite nicely) in order to accomplish the transformation. In practice, post-relational records are a lot like PHP or Perl arrays, where given vectors within the array may in turn be arrays, to whatever depth is required to describe the complete object. While this is a natural condition in UniVerse, Perl or PHP (and a convenient practice for grouping properties of a singular object), relational databases need some assistance.
I'm reading in the primer that there are containers for such things, but I don't YET see how either the bag or seq containers describe this situation adequately. (It may or may not even matter--I'm a bit ignorant of the subtleties of RDF yet.)
In my current CONGRESS/BioGuide XML rendition, "Sessions" is a container (albeit artificial) that holds a group of exactly three individual and interdependent attributes. Thus, "Sessions" exists as a description of the relationship of these multivalued attributes and only at the time that at least one record exists in the output. The relationship further dictates that for each "Session" there will be zero or more values for each vector of the Sessions array: A Senate/House flag, the Position held (dependent attributes), and the name of the Congressional session (also known as the controlling attribute). In the current XML, the controlling attribute value is co-opted for use as a key to the nested table's row.
Thus, for each value in the controlling attribute (the session), there must exist a related and equally-positioned value in the dependent attributes. If that value is nil, then the placeholding value mark (or RDF/XML tag for that attribute) must still exist in order to ensure the equal ranking of values across attributes.
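If I follow the primer correctly, each row of that nested Sessions table might come out as a blank node hanging off the legislator. Here's a rough N3 sketch, strictly my own guess at the proper construction -- the ogdex/pythia namespaces and property names are all invented for illustration:

```n3
@prefix ogdex:  <http://ogdex.example/terms/> .
@prefix people: <http://pythia.example/congress/> .

# Each nested row becomes a blank node [ ... ] that keeps the
# controlling attribute (the session) together with its equally-ranked
# dependents (position and the Senate/House flag).
people:J000255
    ogdex:officialName "JONES, WALTER BEAMAN, JR." ;
    ogdex:session [ ogdex:congress "107" ;
                    ogdex:position "REPRESENTATIVE" ;
                    ogdex:sh       "H" ] ;
    ogdex:session [ ogdex:congress "108" ;
                    ogdex:position "REPRESENTATIVE" ;
                    ogdex:sh       "H" ] .
```

With the grouping carried by the blank node itself, there would seem to be no need for placeholder value marks at all -- an empty dependent value would simply be an absent property on that node.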
Here is a picture of a typical CONGRESS record:
>CT CONGRESS J000255
J000255
0001 JONES, WALTER BEAMAN, JR.
0002 MR. JONES OF NORTH CAROLINA
Attributes 17, 18, and 19 are the origin of the <Sessions> XML construct. We see that in the 107th Congress, Mr. Jones was a REPRESENTATIVE in the House, again in the 108th, etc. But unless something about the description of the "Sessions" property with its subproperties tells you that these multivalued attributes are scaled together, how would one know?
Not that in any case, anyone would *care* about the original construction, so long as the relationships are correctly described and preserved. How someone else would interpret or store it would be up to that consumer. The structure is eye-apparent, but not quite apparent to a flat, relational database.
For example, how did I know that SESSION was the controlling attribute for the nested subtable? I created the table, that's how... a p----poor reason indeed. That doesn't tell another soul why the arrangement is the way it is. The primer falls just a bit short of covering nested relational attributes with dependent multivalues.
Indeed, this exercise is much more of an interesting exploration into the possibilities of fully describing an output rather than "this is what I intend to do". At this point I can only assume that virtual fields (actually a greater percentage of the Pythia system) can exist in the vocabulary simply because "we agree" that's what a given result would be called. I haven't thought that far into it yet.
Second part: establishing our common vocabularies.
A good part of RDF rests on the adoption of common vocabularies. I'm just yet a bit hazy on the proper way to construct the vocabularies we'll likely need or even if we should compile new and specifically descriptive vocabularies at all. (Something tells me that we should.) Since we have the chance to be completely unambiguous in the nature of any data delivery (and from any source in the OGDEX community), now would be the perfect opportunity to allow for complete and faithful reconstructions of any delivery into a local database by the proper use of vocabularies. To wit:
The vocabularies we employ should remain constant throughout the OGDEX contributing community in order to be useful. Common objects, like CITY or STATE, are all but self-naming and self-describing, but there are toMAYto/toMAHto differences in individual databases (ST or STATE; County, Co, Cty, etc).
For Pythia, every named locality is a member of the FIPS55 table (Federal Information Processing Standard) and/or one of its derivatives. By using FIPS and only FIPS as a means of determining the correct rendition of a city, area, township, county, blahblah, I can make sure that GNIS and Census data will absolutely interrelate (which they don't "quite" otherwise when they're scraped). This gives me strict, system-wide normalization.
For example, anyone in the OGDEX community who wanted to look up a list of counties for a given state and have the list of correct spellings, zip ranges, etc, returned could query Pythia because they know where that file is always maintained. These are exactly the same values that may be used within the GNIS, Census, or FCC files. (Lame example, but a readily handy one.)
Knowing the URI is a part of the trick that RDF seems to handle very well. But if Pythia says "COUNTY" or "County" (semantically identical, but mechanically dissimilar), and GovTrack says "Cty", Steve's site knows "CO" and Neal's site knows "cty", we break the rules of normalization. That's the only point of breakdown we could potentially have BUT can neatly and smoothly avoid.
By adopting a common vocabulary at the outset, we give ourselves the immediate capability to map any datum from any of our disparate stores into a common interchange vocabulary. For example, in my DB, I call two-character state id's "ST" whilst state names are "STATE". It wouldn't matter if someone else called their state abbreviations and names "FRED" and "JOE" respectively -- if the common vocabulary is in place.
Ideally, I'd like to see that as a resource on OGDEX even before we start publishing data; possibly as a first step. That is, if we see a description "ogdex:st" (to pick an example), regardless of whose physical system the desired record resides on, we know and agree that ogdex:st holds a FIPS55 2-character state abbreviation, ogdex:state holds a FIPS55 state name, etc etc. I can still call it FRED on my system so long as the external description matches the common vocabulary. (Joshua?)
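Concretely, here's how I picture it in N3 -- the ogdex namespace URI is invented, and the subject URI is just a stand-in. My internal FRED/JOE column names never show; what goes over the wire uses the agreed names:

```n3
@prefix ogdex: <http://ogdex.example/terms/> .

# Internally Pythia stores FRED and JOE; externally every site
# publishes the same agreed predicates, so any consumer can join on
# them without knowing anyone's column names.
<http://pythia.example/state/NC>
    ogdex:st    "NC" ;              # FIPS55 2-character abbreviation
    ogdex:state "NORTH CAROLINA" .  # FIPS55 state name
```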
With that in place, if, say, I retrieve an action item from GovTrack, any time I see a description called "ogdex:st", I could take that same field description and value to Pythia to retrieve a list of counties in that state with census statistics for an impact study. Or go to Steve's system for something that neither Joshua nor I currently house but Steve does store. Or Neal's... it wouldn't matter where the resource datum might be - it would always be named the same and should return an identical result. (Satisfying NF3 non-loss decomposition, an entry condition to full post-normal form.)
Similarly, I could retrieve a bill from GovTrack and see that the sponsor is identified as person-id "J000255"; I could then retrieve that legislator's current and historical information against Pythia's CONGRESS store because we've agreed that that particular field name will be employed at all member sites.
The joy is that I wouldn't have to store any bill or action text, nor would Joshua necessarily have to store the entire CONGRESS file or any of the "foundation" type files (like FIPS and the derivations therefrom) that Pythia holds. Simply by the descriptions in the RDF we'd know exactly where to obtain atomic information IF a given retrieval required it. (Am I getting this right?)
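In sketch form (every URI and the bill id below are invented for illustration; only the BioGuide id J000255 is real), the cross-site join would cost nothing but agreement on the name:

```n3
@prefix ogdex: <http://ogdex.example/terms/> .

# GovTrack's copy of the bill points at a shared person URI built
# from the BioGuide id...
<http://govtrack.example/bill/hr1234>
    ogdex:sponsor <http://ogdex.example/person/J000255> .

# ...and Pythia's CONGRESS store makes its statements about the very
# same URI, so neither site has to mirror the other's data.
<http://ogdex.example/person/J000255>
    ogdex:officialName "JONES, WALTER BEAMAN, JR." .
```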
Rather than sit here and "make stuff up" to complete the CONGRESS example that Joshua suggested, it might be an idea to discuss our implementation of a vocabulary first, so that not only are names for common objects identical throughout the community, but the points of origin (or authoritativeness) also become well-known and described.
----- Original Message -----
From: Joshua Tauberer <tauberer@...>
Sent: Wed, 2 Mar 2005 20:57:24 +0000
Subject: [govtrack] The What and Why of RDF
- Bill Farrell wrote:
> The more I read, the sweeter RDF becomes. For me to produce RDF
> output, it will be a bolt-on script or two at the most in coding
> effort.

And since you mentioned you're tied to XML for your data consumers, I
want to throw in that the reverse is true also. Going from RDF back to
XML isn't too bad either.
> Engineering the proper community-wide environment requires a
> bit more thought and discussion.

Yes, exactly. Compared to an XML-based community where you're
engineering a format, what we have to do is determine the best way to
represent more abstract information.
> The link Joshua provided in yesterday's post was the most helpful by
> far. Thank you very much.

I'm very glad it was helpful.
> While *most* of the original field names remain similar to the
> mbr107.xml example, the BioGuide scrape entailed the invention of
> some more field names. That is, I "just made stuff up" as I went
> because it didn't previously exist. That doesn't mean that a
> retrieval would immediately be understood by mankind or machine
> correctly, simply described as XML. In fact, it almost guarantees
> the opposite. RDF should fix that (if I begin to understand the
> proper constructions).

Of course, applications won't know what to do with predicates that you
make up, but, exactly, with RDF making up new predicates doesn't mess
things up as it would with XML/DTD/Schema.
Another way to look at it, though, is that you're free to use other
existing predicates wherever you like. So, for instance, if we didn't
anticipate using the existing XYZ predicate but you see how it could be
useful to describe some data, you can go ahead and use the XYZ
predicate. In this case, applications that do already know the XYZ
predicate will immediately understand your use of it.
> Further, the combined CONGRESS record is physically in
> nested-relational (post-relational or NF2, if you prefer) format.
> However, it is the most efficient way of storing, searching and
> retrieving a legislator's position and history. Apparently RDF
> doesn't care and can handle it -- if the proper vocabulary is
> constructed and employed. (more, way below)

Efficiency is an interesting thing to think about, and I didn't give it
any mention in the thing I wrote. Using N3 format, storing is pretty
efficient. You just choose the right namespace abbreviations.

Searching and retrieving is another story. Having a congress-specific
search-and-retriever will always be more efficient than a generic RDF
store.
I'm not too concerned about this, though. When you need efficiency, you
can always take existing RDF and transform it into a more custom,
specific format that's more efficient for your needs. In fact, that's
basically what GovTrack does now. It runs off of some custom XML
formats, because it's easier for me to program the site that way and
because I can do some custom indexing to make searching fast. But, for
the purposes of sharing the data, I (will) use RDF.
As you noted, with the right vocabulary RDF can describe anything, so
you can always use RDF as a public format separate from your internal
formats.
> The <Sessions /> section in XML does not physically exist in the
> CONGRESS file, but is necessary to describe the relationship of a
> legislator's role for each session of Congress in which s/he
> served. (Do I understand this correctly to be a "blank node" in RDF?)

Yes. I have the same type of nodes, as blank nodes, in my people.rdf
file. Here's an abbreviated example:
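A sketch of the shape it takes (URIs and dates elided; the pol: namespace URI here is assumed, not copied from people.rdf):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:pol="http://www.govtrack.us/share/politico.rdf#">
  <rdf:Description rdf:about="...Akaka's URI...">
    <pol:role>
      <pol:Term>  <!-- no rdf:about attribute: a blank node -->
        <pol:office rdf:resource="...Office, first session..."/>
        <pol:office rdf:resource="...Office, second session..."/>
        <pol:office rdf:resource="...Office, third session..."/>
      </pol:Term>
    </pol:role>
    <pol:role>
      <pol:Term>
        <!-- ...offices for his other term... -->
      </pol:Term>
    </pol:role>
  </rdf:Description>
</rdf:RDF>
```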
RDF/XML gets to be difficult to read when you embed nodes like this.
pol:role is a predicate which I used twice to relate Akaka to a pol:Term
entity (a blank node, no URI). Each pol:Term is an abstract
representation of basically an election he won giving him a term in
office. The pol:office predicates relate those pol:Terms to the
pol:Office entities that Akaka fills in virtue of having those
pol:Terms. I happened to structure it so that in virtue of his winning
a senate term, he fills three offices, one for each two-year session of
Congress during his term as a senator. In this example it's not
specified that those offices themselves have starting and ending dates.
> I'm reading in the primer that there are containers for such things,
> but I don't YET see how either the bag or seq containers describe
> this situation adequately. (It may or may not even matter--I'm a bit
> ignorant of the subtleties of RDF yet.)

I haven't worked with those containers much yet. I'm not sure they have
a particular use here.
> For example, how did I know that SESSION was the controlling
> attribute for the nested subtable?

This is another shortcoming of XML and databases, compared to RDF. I
only mentioned it in the end of the thing I wrote, but RDF can be
self-describing. The 'ontology' that describes the pol:* predicates and
classes I used above is at http://www.govtrack.us/share/politico.rdf.
(View source to see the RDF.) And, that relies on other ontologies
(FOAF, for instance).
There is *a lot* to be learned in the realm of RDF ontologies.
> Second part: establishing our common vocabularies.

Whew. I'm mentally exhausted just from part one...
> A good part of RDF rests on the adoption of common vocabularies.

Once again, exactly right.
> I'm just yet a bit hazy on the proper way to construct the vocabularies
> we'll likely need or even if we should compile new and specifically
> descriptive vocabularies at all. (Something tells me that we should.)

For sure we will need to construct vocabularies. I've obviously already
begun this, as an experiment to see what's involved. (See the other
files vote.rdf and usbill.rdf in http://www.govtrack.us/share. The
other files are downloaded from elsewhere.) There are very few
vocabularies out there, and as far as I know, none that describe the
complex government-related things we're talking about.
> For Pythia, every named locality is a member of the FIPS55 table
> (Federal Information Processing Standard) and/or one of its
> derivatives. By using FIPS and only FIPS as a means of determining the
> correct rendition of a city, area, township, county, blahblah I can
> make sure that GNIS and Census data will absolutely interrelate
> (which they don't "quite" otherwise when they're scraped). This gives
> me strict, system-wide normalization.

I was looking at census data this morning.
Note that you don't have to use *only* FIPS. There can be many
predicates relating a resource to a normalized code. E.g., in pseudo-N3:
new_york ogdex:fips55 "1234"
new_york ogdex:usps "NY"
new_york ogdex:census 22
Where new_york is the URI for the state of New York.
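Spelled out as actual N3 that a parser would accept (the namespace URI is invented, and the codes are just the placeholders from above):

```n3
@prefix ogdex: <http://ogdex.example/terms/> .

# Several normalized codes can hang off the same resource; a consumer
# picks whichever coding system it understands.
<http://ogdex.example/state/new_york>
    ogdex:fips55 "1234" ;
    ogdex:usps   "NY" ;
    ogdex:census 22 .
```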
> The joy is that I wouldn't have to store any bill or action text, nor
> would Joshua necessarily have to store the entire CONGRESS file or
> any of the "foundation" type files (like FIPS and the derivations
> therefrom) that Pythia holds. Simply by the descriptions in the RDF
> we'd know exactly where to obtain atomic information IF a given
> retrieval required it. (Am I getting this right?)

Okay, this might be the only thing that you've jumped the gun on. :)
RDF doesn't indicate where actually to get content. However, we could
create/find an RDF vocabulary to describe such things. It's a minor
implementation detail, but it's something RDF itself doesn't address.
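FOAF files, for instance, already do something like this with rdfs:seeAlso, pointing at another document that holds more statements about a resource. Roughly (both URIs below are invented):

```n3
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# "More statements about this resource can be fetched over there."
<http://ogdex.example/person/J000255>
    rdfs:seeAlso <http://pythia.example/congress.rdf> .
```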
> Rather than for me sit here and "make stuff up" to complete the
> CONGRESS example that Joshua suggested, it might be an idea to
> discuss our implementation of a vocabulary first so that not only are
> names for common objects identical throughout the community; the
> points of origin (or authoritativeness) would also become well-known
> and described.

That's a good place to start, but I need a mental break before I suggest
exactly how to begin on that.
Thanks, Bill, for going over these issues in such great detail. It's a
big help to get everyone on the same page and to get a plan of action.
- Joshua Tauberer
** Nothing Unreal Exists **