Re: [govtrack] The What and Why of RDF

  • Joshua Tauberer / GovTrack
    Message 1 of 3 , Mar 3, 2005
      Bill Farrell wrote:
      > The more I read, the sweeter RDF becomes. For me to produce RDF
      > output, it will be a bolt-on script or two at the most in coding
      > effort.

      And since you mentioned you're tied to XML for your data consumers, I
      want to throw in that the reverse is true also. Going from RDF back to
      XML isn't too bad either.
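
A minimal sketch of that reverse direction, assuming the RDF has already been loaded as plain (subject, predicate, object) tuples. The element names `people` and `person`, the sample triples, and the function name are made up for illustration; a real converter would follow whatever XML format the data consumers expect.

```python
import xml.etree.ElementTree as ET

# Hypothetical triples (subject, predicate, object); names are illustrative only.
TRIPLES = [
    ("akaka", "name", "Daniel Akaka"),
    ("akaka", "homepage", "http://akaka.senate.gov"),
]

def triples_to_xml(triples, root_tag="people", item_tag="person"):
    """Group triples by subject and emit one XML element per subject."""
    root = ET.Element(root_tag)
    by_subject = {}
    for subject, predicate, obj in triples:
        elem = by_subject.get(subject)
        if elem is None:
            # First time we see this subject: create its element.
            elem = ET.SubElement(root, item_tag, id=subject)
            by_subject[subject] = elem
        # Each predicate becomes a child element holding the object value.
        ET.SubElement(elem, predicate).text = obj
    return root

xml_text = ET.tostring(triples_to_xml(TRIPLES), encoding="unicode")
print(xml_text)
```

Grouping by subject is the only design decision here: RDF is a flat set of statements, so the nesting that an XML consumer expects has to be chosen by the converter.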

      > Engineering the proper community-wide environment requires a
      > bit more thought and discussion.

      Yes, exactly. Compared to an XML-based community where you're
      engineering a format, what we have to do is determine the best way to
      represent more abstract information.

      > The link Joshua provided in yesterday's post was the most helpful by
      > far. Thank you very much.

      I'm very glad it was helpful.

      > While *most* of the original field names remain the
      > similar to the mbr107.xml example, the BioGuide scrape entailed the
      > invention of some more field names. That is, I "just made stuff up"
      > as I went because it didn't previously exist. That doesn't mean that
      > a retrieval would immediately be understood by mankind or machine
      > correctly, simply described as XML. In fact, it almost guarantees
      > the opposite. RDF should fix that (if I begin to understand the
      > proper constructions).

      Of course, applications won't know what to do with predicates that you
      make up. But, exactly: with RDF, making up new predicates doesn't mess
      things up the way it would with XML/DTD/Schema.

      Another way to look at it, though, is that you're free to use other
      existing predicates wherever you like. So, for instance, if we didn't
      anticipate using the existing XYZ predicate but you see how it could be
      useful to describe some data, you can go ahead and use the XYZ
      predicate. In this case, applications that do already know the XYZ
      predicate will immediately understand your use of it.
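
To make the point concrete, here is a small sketch of an application that understands one well-known predicate (FOAF's `name`) and simply sets aside anything it does not recognize. The made-up predicate URI, the subject URI, and the handler table are assumptions for illustration, not part of any real vocabulary or tool.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
# Hypothetical, made-up predicate that no existing application knows about.
MADE_UP = "http://example.org/bioguide#birthplaceNote"

TRIPLES = [
    ("urn:example:akaka", FOAF_NAME, "Daniel Akaka"),
    ("urn:example:akaka", MADE_UP, "Honolulu"),
]

# Handlers for the predicates this hypothetical application already knows.
HANDLERS = {
    FOAF_NAME: lambda subject, obj: f"{subject} is named {obj}",
}

def interpret(triples):
    """Apply known handlers; collect statements with unknown predicates."""
    understood, skipped = [], []
    for subject, predicate, obj in triples:
        handler = HANDLERS.get(predicate)
        if handler is not None:
            understood.append(handler(subject, obj))
        else:
            # Unknown predicate: the statement is kept, not treated as an error.
            skipped.append((subject, predicate, obj))
    return understood, skipped

understood, skipped = interpret(TRIPLES)
print(understood)
print(skipped)
```

The made-up predicate does no harm to the statement the application does understand, which is the contrast with a schema-validated XML document.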

      > Further, the combined CONGRESS record is physically in
      > nested-relational (post-relational or NF2, if you prefer) format.
      > ...
      > However, it is the most efficient way of storing, searching and
      > retrieving a legislator's position and history. Apparently RDF
      > doesn't care and can handle it -- if the proper vocabulary is
      > constructed and employed. (more, way below)

      Efficiency is an interesting thing to think about, and I didn't give it
      any mention in the thing I wrote. Using N3 format, storing is pretty
      efficient. You just choose the right namespace abbreviations.
      Searching and retrieving is another story. Having a congress-specific
      search-and-retriever will always be more efficient than a generic RDF
      query tool.

      I'm not too concerned about this, though. When you need efficiency, you
      can always take existing RDF and transform it into a more custom,
      specific format that's more efficient for your needs. In fact, that's
      basically what GovTrack does now. It runs off of some custom XML
      formats, because it's easier for me to program the site that way and
      because I can do some custom indexing to make searching fast. But, for
      the purposes of sharing the data, I (will) use RDF.

      As you noted, with the right vocabulary RDF can describe anything, so
      you can always use RDF as a public format separate from your internal
      format.
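
As a rough sketch of that kind of transformation, the following builds a purpose-specific lookup table from generic triples. The triples, the `state` predicate, and the function name are illustrative assumptions; this is not how GovTrack's own indexing works.

```python
from collections import defaultdict

# Illustrative sample triples (subject, predicate, object).
TRIPLES = [
    ("akaka", "state", "HI"),
    ("inouye", "state", "HI"),
    ("schumer", "state", "NY"),
]

def build_index(triples, predicate):
    """Map each object value of `predicate` to the sorted subjects that have it."""
    index = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            index[obj].add(subject)
    # Freeze into a plain dict of sorted lists for predictable lookups.
    return {value: sorted(subjects) for value, subjects in index.items()}

by_state = build_index(TRIPLES, "state")
print(by_state["HI"])
```

A generic query tool would scan every statement for each question; the index is built once and then answers this one kind of question with a dictionary lookup.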

      > The <Sessions /> section in XML does not physically exist in the
      > CONGRESS file, but is necessary to describe the relationship of a
      > legislator's role for each session of Congress in which s/he
      > served. (Do I understand this correctly to be a "blank node" in RDF?)

      Yes. I have the same type of nodes, as blank nodes, in my people.rdf
      file. Here's an abbreviated example:

      <rdf:RDF>
        <pol:Politician rdf:about="urn://govshare.info/data/us/congress/people/1995/akaka">
          <foaf:name>Daniel Akaka</foaf:name>
          <foaf:homepage>http://akaka.senate.gov</foaf:homepage>
          <pol:role>
            <pol:Term>
              <pol:begin>2001-01-01</pol:begin>
              <pol:end>2006-12-31</pol:end>
              <pol:office rdf:resource="urn://govshare.info/data/us/congress/107/HI"/>
              <pol:office rdf:resource="urn://govshare.info/data/us/congress/108/HI"/>
              <pol:office rdf:resource="urn://govshare.info/data/us/congress/109/HI"/>
            </pol:Term>
          </pol:role>
          <pol:role>
            <pol:Term>
              <pol:begin>1995-01-01</pol:begin>
              <pol:end>2000-12-31</pol:end>
              <pol:office rdf:resource="urn://govshare.info/data/us/congress/104/HI"/>
              <pol:office rdf:resource="urn://govshare.info/data/us/congress/105/HI"/>
              <pol:office rdf:resource="urn://govshare.info/data/us/congress/106/HI"/>
            </pol:Term>
          </pol:role>
        </pol:Politician>
      </rdf:RDF>

      RDF/XML gets to be difficult to read when you embed nodes like this.
      pol:role is a predicate which I used twice to relate Akaka to a pol:Term
      entity (a blank node, no URI). Each pol:Term is an abstract
      representation of basically an election he won giving him a term in
      office. The pol:office predicates relate those pol:Terms to the
      pol:Office entities that Akaka fills in virtue of having those
      pol:Terms. I happened to structure it so that in virtue of his winning
      a senate term, he fills three offices, one for each two-year session of
      Congress during his term as a senator. In this example it's not
      specified that those offices themselves have starting dates and ending
      dates.
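
The structure described above can also be written out as flat triples, which may be easier to follow than the nested RDF/XML. This is a minimal sketch: the `_:b1`-style blank node labels and the helper functions are assumptions made for illustration, while the URIs, dates, and predicates are taken from the example.

```python
import itertools

_counter = itertools.count(1)

def new_blank_node():
    """Return a fresh local label for a node that has no URI."""
    return f"_:b{next(_counter)}"

BASE = "urn://govshare.info/data/us/congress/"
akaka = BASE + "people/1995/akaka"

triples = []
for begin, end, sessions in [
    ("2001-01-01", "2006-12-31", (107, 108, 109)),
    ("1995-01-01", "2000-12-31", (104, 105, 106)),
]:
    term = new_blank_node()  # the pol:Term entity, a blank node
    triples.append((akaka, "pol:role", term))
    triples.append((term, "rdf:type", "pol:Term"))
    triples.append((term, "pol:begin", begin))
    triples.append((term, "pol:end", end))
    for session in sessions:
        triples.append((term, "pol:office", f"{BASE}{session}/HI"))

def offices_of(person, triples):
    """Follow pol:role to each term, then pol:office to each office."""
    terms = [o for s, p, o in triples if s == person and p == "pol:role"]
    return [o for s, p, o in triples if s in terms and p == "pol:office"]

for office in offices_of(akaka, triples):
    print(office)
```

The blank node labels only have meaning inside this one data set; they exist so that the begin date, end date, and offices of one term stay grouped together.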

      > I'm reading in the primer that there are containers for such things,
      > but I don't YET see how either the bag or seq containers describe
      > this situation adequately. (It may or may not even matter--I'm a bit
      > ignorant of the subtleties of RDF yet.)

      I haven't worked with those containers much yet. I'm not sure they have
      a particular use here.

      > For example, how did I know that SESSION was the controlling
      > attribute for the nested subtable?

      This is another shortcoming of XML and databases, compared to RDF. I
      only mentioned it in the end of the thing I wrote, but RDF can be
      self-describing. The 'ontology' that describes the pol:* predicates and
      classes I used above is at http://www.govtrack.us/share/politico.rdf.
      (View source to see the RDF.) And, that relies on other ontologies
      (FOAF, for instance).

      There is *a lot* to be learned in the realm of RDF ontologies.

      > Second part: establishing our common vocabularies.

      Whew. I'm mentally exhausted just from part one...

      > A good part of RDF rests on the adoption of common vocabularies.

      Once again, exactly right.

      > I'm just yet a bit hazy on the proper way to construct the vocabularies
      > we'll likely need or even if we should compile new and specifically
      > descriptive vocabularies at all. (Something tells me that we should.)

      For sure we will need to construct vocabularies. I've obviously already
      begun this, as an experiment to see what's involved. (See the other
      files vote.rdf and usbill.rdf in http://www.govtrack.us/share. The
      other files are downloaded from elsewhere.) There are very few
      vocabularies out there, and as far as I know, none that describe the
      complex government-related things we're talking about.

      > For Pythia, every named locality is member of the FIPS55 table
      > (Federal Information Processing Standard) and/or one of its
      > derivates. By using FIPS and only FIPS as a means of determining the
      > correct rendition of a city, area, township, county, blahblah I can
      > make sure that GNIS and Census data will absolutely interrelate
      > (which they don't "quite" otherwise when they're scraped). This gives
      > me strict, system-wide normalization.

      I was looking at census data this morning.

      Note that you don't have to use *only* FIPS. There can be many
      predicates relating a resource to a normalized code. E.g., in pseudo-N3
      format:

      new_york ogdex:fips55 "1234"
      new_york ogdex:usps "NY"
      new_york ogdex:census 22

      Where new_york is the URI for the state of New York.
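
A small sketch of how several code systems can sit side by side on one resource and be queried in either direction. The code values are the placeholders from the pseudo-N3 above, not real codes, and the URI and function names are hypothetical.

```python
NEW_YORK = "urn:example:new_york"  # hypothetical URI for the state of New York

# Placeholder code values copied from the pseudo-N3 example; not real codes.
TRIPLES = [
    (NEW_YORK, "ogdex:fips55", "1234"),
    (NEW_YORK, "ogdex:usps", "NY"),
    (NEW_YORK, "ogdex:census", 22),
]

def find_by_code(triples, predicate, code):
    """Return every resource carrying `code` under the given code system."""
    return [s for s, p, o in triples if p == predicate and o == code]

def codes_for(triples, resource):
    """Return all normalized codes known for one resource."""
    return {p: o for s, p, o in triples if s == resource}

print(find_by_code(TRIPLES, "ogdex:usps", "NY"))
print(codes_for(TRIPLES, NEW_YORK))
```

Because each code system is its own predicate, a data set keyed on one system can be joined to a data set keyed on another through the shared resource.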

      > The joy is that I wouldn't have to store any bill or action text, nor
      > would Joshua necessarily have to store the entire CONGRESS file or
      > any of the "foundation" type files (like FIPS and the derivations
      > therefrom) that Pythia holds. Simply by the descriptions in the RDF
      > we'd know exactly where to obtain atomic information IF a given
      > retrieval required it. (Am I getting this right?)

      Okay, this might be the only thing that you've jumped the gun on. :)

      RDF doesn't indicate where to actually get content. However, we could
      create/find an RDF vocabulary to describe such things. It's a minor
      implementation detail, but it's something RDF itself doesn't address.

      > Rather than for me sit here and "make stuff up" to complete the
      > CONGRESS example that Joshua suggested, it might be an idea to
      > discuss our implementation of a vocabulary first so that not only are
      > names for common objects identical throughout the community; the
      > points of origin (or authoritativeness) would also become well-known
      > and described.

      That's a good place to start, but I need a mental break before I suggest
      exactly how to begin on that.

      Thanks, Bill, for going over these issues in such great detail. It's a
      big help to get everyone on the same page and to get a plan of action
      started.

      --
      - Joshua Tauberer

      http://taubz.for.net

      ** Nothing Unreal Exists **