govtrack · GovTrack.us Discussion List

Group Information

  • Members: 462
  • Category: United States
  • Founded: Nov 3, 2004
  • Language: English

Messages 59 - 88 of 1188

#59 From: Scott Beardsley <sc0ttbeardsley@...>
Date: Tue Mar 1, 2005 3:20 pm
Subject: Name-id's
sc0ttbeardsley
 
--- Scott Beardsley <sc0ttbeardsley@...> wrote:

> I've got an email into
> xml-bill-comments@... for more info about
> name-id.

I'm still not sure if there is an underlying source of
the name-id but it seems they are becoming
standardized on whatever the bioguide is using.

Scott

From "Carmel, Joe" <joe.carmel@...>:

For the House, the name-id is the Member's id from
http://bioguide.congress.gov. This provides a unique
identification for each Member of Congress for all
time.

The ids are unique and you should not assume anything
about their numbering; if anything you should assume
the numbering is random (although unique).  Do not
assume that a given name will begin with a specific
letter because they don't.






#60 From: Bill Farrell <jwwf@...>
Date: Tue Mar 1, 2005 4:03 pm
Subject: Re: Mixing Facts with Speculation/Gossip
idawannanoe
 
Hi Scott,

You've raised some really good points.  I think you're absolutely right that
pure data mining in the same site with an open forum that has a decided purpose
of shaping public opinion is something we'd REALLY like to avoid.  That was the
reason I separated Pythia from WWW on ProgressiveNation.net.  The idea: let
visitors mine what they need to mine, the way they need to mine it -- on the
mining site.  Once they've drawn their conclusions or are ready to publish their
studies, they must go back to the main site.

As we're scraping and conditioning the raw data, we might think about a common
format for citing the source, perhaps briefly describing the normalization
procedures, and state outright that no opinion has been made on any of the
contents.  I'm not familiar with other members' sites, so I'm not sure what sort
of mining and publishing operations are already underway.

Joshua, would OGDEX support such a resource registration/citation form?  This
kind of thing is more common in the genealogy world, where citation of source is
EVERYTHING. (Genealogists are even more cynical than political researchers
:chuckle:)  That would give us an idea of the resources we've got on hand, so we
can begin building the inter-site communications.  (More on that in another
post).

I think by registering our resources on OGDEX and/or GovTrack we can lend some
assurance that we've at least done a peer-review and have agreed that the data
(wherever it may currently lie) is coherent, standardized, and as complete as
currently possible, given the technologies we're forced to employ.  (The
expression "BFH" [Big Friendly Hammer] springs to mind.)

Resource registration isn't the whole solution, but it would be a start.

Best!
Bill

----- Original Message -----
From: Scott Beardsley <sc0ttbeardsley@...>
To: govtrack@yahoogroups.com
Sent: Tue,  1 Mar 2005 00:34:50 +0000
Subject: [govtrack] Mixing Facts with Speculation/Gossip


>

Aroundthecapitol.com seems to be impacted by
intermingling with tables of data taken from official
sources. He had a section for anonymous discussions
and speculation about politicians and bills including
who may/may not be running in future elections.
Apparently he was threatened with legal action for
something someone said. See his blog entry:

http://www.aroundthecapitol.com/blog/archives/000016.html

What does everyone think about this? I understand the
legal implications of publishing anonymous comments,
(and I think he could probably win if he had the
resources to fight it) but I think there is an
underlying issue regarding data source integrity.

I'm all for free speech and public discourse of the
issues, but is a data source site the proper place for
such discussions? Does a public forum put the data
source's, or the entities maintaining the data source,
integrity into question?

Another questionable area is the fact that the site's
creator is a lobbyist himself and has not ruled out
running for public office. As a user of the website
how can I be assured that the data has not been
tampered with?

I guess what I'm really looking for is how do I make
sure my own data gathering efforts are not
discredited? Process translucence perhaps?

Scott






#61 From: "Ken Colburn" <ken.colburn@...>
Date: Tue Mar 1, 2005 6:21 pm
Subject: House Committee/Subcomm Votes
njinibi
 
I want to raise the issue for the group's consideration of displaying U.S.
House committee and subcommittee votes on the Web in readily accessible
format.  Committee and subcommittee votes are critical as committees do
much of the work of the House, and there are far fewer limitations on
offering amendments than on the House floor.  Beyond the details below,
the issue is how to get the House to put committee and subcommittee votes
on the Web so they can be pulled readily and to do so shortly after each
vote.

You can see the problem by going to any committee's web site (with the
exception of Education and the Workforce) and looking for the tally of
votes in committee (try http://financialservices.house.gov - but since
there haven't been recorded votes in the 109th, try
http://financialservices.house.gov/legis.asp?formmode=print&congress=8).

Most committee votes are in individual committee reports in pdf format.
With one exception, every committee I surveyed refuses to provide a
copy of its tally sheets and instead requires a personal visit to the committee
office to view, but not copy, the tally sheet.  Subcommittee votes are
published at the end of the two-year Congress in a single House volume.
You can see the policy of the Financial Services Committee in an exchange I
had in 2002 at http://www.techpolitics.org/finservs.htm  Click on the link
to the vote on H.R. 1261 to see that the Education and the Workforce
Committee does publish votes in accessible format, which undercuts the
response I received from Financial Services.

There are other sources for the committee votes, but they also pose
problems for ready access:  Thomas has committee reports

http://thomas.loc.gov/cgi-bin/cpquery/23?&sid=cp108bBxzw&item=23&hd_count=49&xform_type=3&r_t=a&dbname=cp108&r_n=hr697.108&sel=TOC_7719&

but in some (many? most?) cases the vote is missing, e.g.

http://thomas.loc.gov/cgi-bin/cpquery/23?&sid=cp108bBxzw&item=23&hd_count=49&xform_type=3&r_t=a&dbname=cp108&r_n=hr697.108&sel=TOC_7719&

The Government Printing Office puts up committee reports with votes in
ascii, but timing of posting is not certain (it certainly isn't immediate)
and I understand it would be very difficult to scrape the votes.  To find
House committee reports on the GPO site go to:

http://www.gpoaccess.gov/serialset/creports/index.html

-- Check the box next to House Reports and put "financial services" in
quotation marks under Search and submit.
-- Click on Text under number 10- "Financial Services Regulatory Relief
Act of 2003."
-- Then search page for "committee votes"  The second occurrence of these
words is the middle of the page - scroll down and you'll see a vote.

I have compiled committee votes in the past, e.g.
http://www.techpolitics.org/finservs.htm (before I was using mysql/php),
but this is time-consuming to produce.  Your thoughts on getting at the
issue are welcomed.

Ken Colburn

#62 From: Joshua Tauberer / GovTrack <tauberer@...>
Date: Tue Mar 1, 2005 8:40 pm
Subject: Re: Mixing Facts with Speculation/Gossip
tauberer
 
Bill Farrell wrote:
> As we're scraping and conditioning the raw data, we might think about
> a common format for citing the source, perhaps briefly describing the
> normalization procedures, and state outright that no opinion has been
> made on any of the contents. [snip]
>
> Joshua, would OGDEX support such a resource registration/citation
> form?

OGDEX was started as a community effort with no particular leadership,
so don't look at me.  :)  Its mission is to serve as a hub for efforts
along these lines, and I agree that we will need a system for describing
data sources, and I also think OGDEX would work well as a central place
to list sources in that way.

I have lots of ideas (as per usual) about how to go about describing
data sources, but before I present them 1) I need to get everyone to
agree that RDF is the way to approach this (otherwise we're going to be
debating XML formats forever), and 2) we need to actually get different
data sources available on the web.

The ideal way to get all of this going is for someone that has data
(e.g. you, Bill) to pick out a slice of their data that is related to
what's on GovTrack but doesn't overlap with it, and then for you and
GovTrack to export that data in a common format (e.g. RDF).  These days
I'm just waiting for this to happen.

Take the bioguide IDs, for instance.  It's related to GovTrack's data in
that it's about the same people that GovTrack has data about, but it
doesn't overlap because I don't have that info.  The common format I
hope to convince you of is RDF (pending my finishing the explanation of
RDF), and then I can work with you on getting the data exported in that
format...

--
- Joshua Tauberer

http://taubz.for.net

** Nothing Unreal Exists **

#63 From: Scott Beardsley <sc0ttbeardsley@...>
Date: Tue Mar 1, 2005 8:41 pm
Subject: Re: House Committee/Subcomm Votes
sc0ttbeardsley
 
--- Ken Colburn <ken.colburn@...> wrote:

> Most committee votes are in individual committee
> reports in pdf format.

How do blind people access this information? There
might be a way to fight these difficult policies via
sec 508. I'm repeatedly amazed at the resistance to
opening data like this.

Kinda ironic that they have a "text-only version" link
on each page of their website (with just filler text
anyway) but not for the pdf's (which have the really
important text).

Scott




#64 From: Joshua Tauberer <tauberer@...>
Date: Wed Mar 2, 2005 8:57 pm
Subject: The What and Why of RDF
tauberer@...
 
Hi, guys.

As I've mentioned a bunch of times, I'm convinced the way to approach
data sharing is using RDF.  If you don't know much about RDF or if you
disagree, please read:

    http://www.govtrack.us/articles/20050302rdf.xpd

I hope it's pretty clear, but if you have any thoughts about how I might
improve it, or if you want to see something added to it, or if you're
not convinced by it, please let me know.

--
- Joshua Tauberer

http://taubz.for.net

** Nothing Unreal Exists **

#65 From: "Ken Colburn" <ken.colburn@...>
Date: Thu Mar 3, 2005 2:41 pm
Subject: Re: House Committee/Subcomm Votes
njinibi
 
Interesting thought, however this would likely have to be fought on its
merits rather than law enforcement as Congress generally exempts itself
from civil rights and employment laws.  Another soft point is the
Congressional Internet Caucus:
http://www.netcaucus.org/members/  A congressional body devoted to
increasing use of the Internet should at the least be willing to support
making committee/subcommittee votes accessible on the Internet.  My
experience is that press on the Hill and interest groups involved with
openness issues (e.g. http://www.cdt.org/righttoknow/10mostwanted/) are
aware of the problem but see it as a lesser priority than a lot of other
issues.  The Govtrack group, focused on making government information
accessible, could play an important role in pursuing the issue.

Ken

*********
On Tue, 1 Mar 2005 12:41:35 -0800 (PST), Scott Beardsley
<sc0ttbeardsley@...> wrote:

>
> --- Ken Colburn <ken.colburn@...> wrote:
>
>> Most committee votes are in individual committee
>> reports in pdf format.
>
> How do blind people access this information? There
> might be a way to fight these difficult policies via
> sec 508. I'm repeatedly amazed at the resistance to
> opening data like this.
>
> Kinda ironic that they have a "text-only version" link
> on each page of their website (with just filler text
> anyway) but not for the pdf's (which have the really
> important text).
>
> Scott
>

#66 From: Neal McBurnett <neal@...>
Date: Thu Mar 3, 2005 8:31 pm
Subject: scraping javascript sites, colorado example
emergent27
 
One topic I haven't seen much discussion of here is examples or
techniques for scraping documents.

E.g. I'm interested in Colorado legislation.  The site
(http://www.leg.state.co.us/) is pretty unfriendly from an automated
scraping standpoint, in my experience.  The legislation is in pdf,
which is a pain, but pdftotext seems to produce moderately scrapable
text.

Model software (in python?)  for cleaning up the output of pdftotext
would be welcomed.
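
A minimal sketch of such a cleanup pass, in Python. It assumes the text has
already been produced by pdftotext, and that the bill pages carry line numbers
in the left margin and form-feed page breaks; both are assumptions about the
Colorado PDFs, not something verified here. The function name clean_pdftotext
is made up for illustration.

```python
import re

def clean_pdftotext(raw: str) -> str:
    """Tidy raw pdftotext output into paragraphs of plain text."""
    # pdftotext separates pages with form feeds; treat them as line breaks.
    lines = raw.replace("\f", "\n").splitlines()
    cleaned = []
    for line in lines:
        # Drop a leading margin line number such as "12 " (assumed layout).
        line = re.sub(r"^\s*\d{1,3}\s+(?=\S)", "", line)
        # Collapse runs of spaces and tabs left over from column layout.
        line = re.sub(r"[ \t]+", " ", line).strip()
        cleaned.append(line)
    text = "\n".join(cleaned)
    # Rejoin words hyphenated across a line break: "legis-\nlation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Join wrapped lines inside a paragraph, keeping blank lines as breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Squeeze three or more newlines down to one paragraph break.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

The margin-number rule will also eat a legitimate number at the start of a
line, so it would need tightening against real bill text before being trusted.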


But finding the pdfs is tricky.  E.g. I start at the list of bills,
which is designed to use form submission to select successive sets of
50 bills:

http://www.leg.state.co.us/Clics2005a/csl.nsf/BillFoldersSenate?openFrameset

Having found the right bill, say I want to look at the original
version of the bill (with legislative summary).  First we need to go
to the "All versions" link.  A typical URL, for SB05-079, is

 
http://www.leg.state.co.us/clics2005a/csl.nsf/fsbillcont2/51B91106B515902487256F5E0078CF5A?Open

So they seem to intentionally introduce a hash to obfuscate the
URL.

Going to that URL gets us to the "Introduced Bill" name: 079_01.pdf
with an invisible (white) "n" appended to it, visible only when I
select the area for cut-and-paste.  Who knows why.

Clicking on that goes to a different place than advertised by "Copy
Link Location" in firefox, via a javascript _doClick() function:

  _doClick('87256EE50072C919.8975551e51fa01d087256dd30080e1d5/$Body/0.390C'...)

Ending up (at least for Firefox) at another useless intermediate web
page, which has more javascript that automatically downloads the pdf
(though I'm not sure why, and might be confused by the frame structure).

It also has a link to the "Bill" as
http://www.leg.state.co.us/clics2005a/csl.nsf/fsbillcont3/51B91106B515902487256F5E0078CF5A?open&file=079_01.pdf
but again that is misleading.

When the pdf is downloading, the Firefox dialog box which asks if I
want to browse it or download it contains this url:

 
http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/51B91106B515902487256F5E0078CF5A/$FILE/

But if I try to load that via wget I get "400 Bad Request"
so they may be playing around with cookies or other magic.

The Firefox "Page Info" Links section points to this for the "Current
PDF Document":

http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/51B91106B515902487256F5E0078CF5A?OpenDocument

which again just produces the html page.

Can someone find a URL that just loads the pdf?
Is this sort of run-around common?  Any good insights or tools to deal with
it?  Anyone want to scrape, for a start, all the legislative summaries
for all the bills as introduced :-) ?
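
One avenue that might be worth trying, offered as a guess rather than a tested
answer: URLs ending in .nsf are typically served by Lotus Domino, where the
32-hex-digit string is a document identifier rather than a deliberate
obfuscation, and attachments are conventionally addressed as
.../$FILE/<filename>. The URL shown in the download dialog ends at $FILE/ with
no filename, so appending the attachment name (079_01.pdf) is a plausible
candidate. The sketch below only builds that candidate URL; whether the
Colorado server accepts it is not verified here, and the helper name is
invented for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# A Domino document identifier is assumed to be 32 hexadecimal digits.
UNID_RE = re.compile(r"/([0-9A-Fa-f]{32})(?=[/?]|$)")

def candidate_pdf_url(page_url: str, filename: str,
                      view: str = "billcontainers") -> str:
    """Build a Domino-style attachment URL from a bill page URL (untested guess)."""
    parts = urlsplit(page_url)
    match = UNID_RE.search(parts.path)
    if match is None:
        raise ValueError("no 32-hex-digit document id in " + page_url)
    unid = match.group(1)
    # Keep everything up to and including the .nsf database name.
    db_path = parts.path[: parts.path.lower().index(".nsf") + len(".nsf")]
    path = f"{db_path}/{view}/{unid}/$FILE/{filename}"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Called with the "All versions" URL for SB05-079 and the filename 079_01.pdf,
this yields a URL under billcontainers/.../$FILE/079_01.pdf that could then be
tried with wget.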

Thanks,

Neal McBurnett                 http://bcn.boulder.co.us/~neal/
Signed and/or sealed mail encouraged.  GPG/PGP Keyid: 2C9EBA60

#67 From: Bill Farrell <jwwf@...>
Date: Thu Mar 3, 2005 9:03 pm
Subject: Re: scraping javascript sites, colorado example
idawannanoe
 
Yes, that kind of runaround is ALL too common.  The more I scrape, the more I
find that "public information" isn't. You may have read my diatribes on FCC et
al. :-)

Would it be an idea to use the --save-cookies {file} option to capture the
cookies as they're tossed?  If you can sleuth-out which (or which set) would be
the magic, it's possible to toss the cookies back by using the --load-cookies
{file} .
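
For anyone scripting this rather than driving wget by hand, the same
capture-and-replay idea can be sketched with Python's standard library.
MozillaCookieJar reads and writes the cookies.txt format that wget's
--save-cookies and --load-cookies use. The function names are invented for
illustration, and nothing here has been run against the Colorado site.

```python
import http.cookiejar
import urllib.request

def make_cookie_opener(cookie_file: str):
    """Return (opener, jar); the jar persists cookies in cookies.txt format."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    try:
        # Reuse cookies captured on an earlier run, like wget --load-cookies.
        jar.load(ignore_discard=True)
    except FileNotFoundError:
        pass  # first run: nothing saved yet
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def fetch(opener, jar, url: str) -> bytes:
    """Fetch url with the shared jar, then save the cookies it collected."""
    with opener.open(url) as response:
        body = response.read()
    # ignore_discard also keeps session cookies (compare wget's
    # --keep-session-cookies), which are often the ones that matter.
    jar.save(ignore_discard=True)
    return body
```

Walking the same page sequence a browser would (bill list, "All versions",
then the PDF link) through one opener lets each response's cookies ride along
on the next request.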

Best!
Bill


#68 From: Bill Farrell <jwwf@...>
Date: Thu Mar 3, 2005 10:26 pm
Subject: Re: The What and Why of RDF
idawannanoe
 
The more I read, the sweeter RDF becomes.  For me to produce RDF output, it will
be a bolt-on script or two at the most in coding effort.  Engineering the proper
community-wide environment requires a bit more thought and discussion.

(I'm in a bit of a learning curve, having taken on XSL, RDF and gypsy-style
violin full-on in the same week.  At least now I can hum a sprightly hora or
two to keep my eyes open as I wade through the docs :-)

The link Joshua provided in yesterday's post was the most helpful by far. Thank
you very much. I'm still reading it over.  I'm multilingual in human and
machine languages, but I tend to grasp data processing ideas in data processing
terms.  While Joshua thinks in terms of semantics, I think in terms of
engineering, normalization, storage and retrieval.  We are products of our
disparate professional backgrounds.  Seeing some practical examples has been
most helpful, but I think I need to stop short for a moment and throw some
things out to the list in order to stir discussion and increase my level of
understanding of what the goals and parameters are.

Feel free to tell me where I'm being an id10t.

While I'm sold on the idea of using RDF as a means of interchange, it does
entail some planning and agreement to make its use practical.  It would be
helpful to me to work through an example of our own devising. I'm throwing this
out as a means of stirring discussion and to obtain some input on conventions
and practices, as well as to learn proper RDF design.

Part one: BioGuide and the construction of a web-retrievable image

Joshua's suggestion, the BioGuide example, is a fine place to start, since
that's something we all know something about and will use nearly immediately. 
With the group's indulgence, let's use that for an example if I may.

At Joshua's suggestion, I've been exploring the W3C RDF Primer (actually quite
good).  I did see that it allows encapsulation of existing XML data, but that's
exactly what I want to get away from.  For now, I'm chained to XML at the
member-user level, since many Pythia consumers are using Excel, OOCalc, and
Access to grab chunks of data directly into their app.  None of those apps yet
support RDF directly (shoot, OOCalc barely supports XML, forcing me to offer
HTML renditions as well), but I'm banking on "The Day They Do".  Until then,
there's plenty of time to get the implementation of RDF rock-solid.

First exercise:

(NB:  this is largely an exercise -- the view of the CONGRESS file can look like
anything we decide it does, including any information from any other file in
Pythia.)

My version of the BioGuide listing (the CONGRESS file) combines attributes from
the mbr107.xml example from xml.house.gov with attributes obtained from scraping
the BioGuide pages, thus rendering a somewhat more complete picture of a
legislator's personal information. While *most* of the original field names
remain similar to the mbr107.xml example, the BioGuide scrape entailed the
invention of some more field names.  That is, I "just made stuff up" as I went
because it didn't previously exist.  That doesn't mean that a retrieval would
immediately be understood by mankind or machine correctly, simply described as
XML.  In fact, it almost guarantees the opposite.  RDF should fix that (if I
begin to understand the proper constructions).

Further, the combined CONGRESS record is physically in nested-relational
(post-relational or NF2, if you prefer) format. (You've probably seen the XML
that is returned from Pythia for a CONGRESS record by now.) This is the natural
rendition of the record within the UniVerse system, but without adequate
description can be a bit confusing to those still in the relational database
world.  However, it is the most efficient way of storing, searching and
retrieving a legislator's position and history.  Apparently RDF doesn't care and
can handle it -- if the proper vocabulary is constructed and employed. (more,
way below)

CONGRESS has three interdependent multivalued attributes: SESSION, POSITION, and
SH.  Each of these attributes can hold zero or more values in parallel.  That
is, for each value in the controlling attribute (SESSION), there will be a
related value in an equally-ranked position in the other two attributes.  This
creates, in effect, a wholly-contained table nested within the CONGRESS record.
In the current XML, this relationship is accurately described as:

<Congress>
  <Congress_record id="J000255">
    <OfficialName>JONES, WALTER BEAMAN, JR.</OfficialName>
    <FormalName>MR. JONES OF NORTH CAROLINA</FormalName>
    <FormerFormalName/>
    <LOCSearchName>JONES+WALTER</LOCSearchName>
    <State>NC</State>
    <District/>
    <Hometown/>
    <Party>REPUBLICAN</Party>
    <Phone/>
    <Room/>
    <ZipSuffix/>
    <Website>http://jones.house.gov/</Website>
    <Email/>
    <Vacancy/>
    <VacancyReason/>
    <VacancyDate/>
    <BirthYear>1943</BirthYear>
    <DeathYear/>
    <Sessions>
      <Session session_number="107">
        <Position>REPRESENTATIVE</Position>
        <SH>H</SH>
      </Session>
      <Session session_number="108">
        <Position>REPRESENTATIVE</Position>
        <SH>H</SH>
      </Session>
      <Session session_number="109">
        <Position>REPRESENTATIVE</Position>
        <SH>H</SH>
      </Session>
    </Sessions>
  </Congress_record>
</Congress>
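
To make the nested table concrete, here is a small sketch of how a consumer
might flatten the Sessions section of this XML into rows, using only Python's
standard library. The sample is an abbreviated copy of the record above, and
the function name session_rows is invented for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Congress>
  <Congress_record id="J000255">
    <OfficialName>JONES, WALTER BEAMAN, JR.</OfficialName>
    <State>NC</State>
    <Sessions>
      <Session session_number="107">
        <Position>REPRESENTATIVE</Position>
        <SH>H</SH>
      </Session>
      <Session session_number="108">
        <Position>REPRESENTATIVE</Position>
        <SH>H</SH>
      </Session>
    </Sessions>
  </Congress_record>
</Congress>"""

def session_rows(xml_text: str):
    """Flatten each <Session> into a (member id, session, position, chamber) row."""
    rows = []
    root = ET.fromstring(xml_text)
    for record in root.iter("Congress_record"):
        member_id = record.get("id")
        for session in record.iter("Session"):
            rows.append((
                member_id,
                int(session.get("session_number")),
                # An empty tag such as <Position/> comes back as "".
                session.findtext("Position", default=""),
                session.findtext("SH", default=""),
            ))
    return rows
```

Each row carries the member id alongside the session values, which is the
shape a flat relational table would need.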

The <Sessions /> section in XML does not physically exist in the CONGRESS file,
but is necessary to describe the relationship of a legislator's role for each
session of Congress in which s/he served. (Do I understand this correctly
to be a "blank node" in RDF?) Zero intervening steps are required to transform a
CONGRESS XML record for post-relational databases or commonly-used desktop
applications such as Access, OpenOffice Calc, and Excel as well as Dot Net XML
parsers.  These already naturally break this section into related edge tables,
automatically performing the necessary transformation.  (NOTE:  the goal should
always be to have zero transformation steps between the RDF and the target app,
although RDF does not guarantee it.  It's nice to help the user out as much as
possible.)
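
On the blank-node question, one plausible RDF reading, offered as a sketch and
not as a settled OGDEX design, is that each Session becomes a blank node
hanging off the member and carrying the session number, position and chamber as
its own properties. The Python below writes that shape out as N-Triples. The
namespace http://example.org/ogdex#, the member URI prefix and the property
names are all placeholders invented for illustration.

```python
OGDEX = "http://example.org/ogdex#"       # placeholder namespace (assumption)
MEMBER = "http://example.org/member/"     # placeholder URI prefix (assumption)

def member_sessions_ntriples(member_id, sessions):
    """Emit N-Triples in which every session is a blank node of the member.

    sessions is a list of (session_number, position, chamber) tuples.
    Values are assumed to contain no quotes or backslashes, so no
    literal escaping is done here.
    """
    subject = f"<{MEMBER}{member_id}>"
    lines = []
    for number, position, chamber in sessions:
        # Blank node: identifies the session row without a global URI.
        node = f"_:{member_id}s{number}"
        lines.append(f"{subject} <{OGDEX}servedIn> {node} .")
        lines.append(f'{node} <{OGDEX}sessionNumber> "{number}" .')
        lines.append(f'{node} <{OGDEX}position> "{position}" .')
        lines.append(f'{node} <{OGDEX}chamber> "{chamber}" .')
    return "\n".join(lines)
```

Because the three values hang off the same node, the fact that they belong
together is stated in the data itself and does not depend on their position.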

Those of us who also use MySQL, PostgreSQL, DB2, etc might require a bit more
description within the text (which RDF seems to allow for quite nicely) in order
to accomplish the transformation.  In practice, post-relational records are a
lot like PHP or Perl arrays, where given vectors within the array may in turn be
arrays, to whatever depth is required to describe the complete object.  While
this is a natural condition in UniVerse, Perl or PHP (and a convenient practice
for grouping properties of a singular object), relational databases need some
assistance.

I'm reading in the primer that there are containers for such things, but I don't
YET see how either the bag or seq containers describe this situation adequately.
(It may or may not even matter--I'm a bit ignorant of the subtleties of RDF
yet.)

In my current CONGRESS/BioGuide XML rendition, "Sessions" is a container (albeit
artificial) that holds a group of exactly three individual and interdependent
attributes.  Thus, "Sessions" exists as a description of the relationship of
these multivalued attributes and only at the time that at least one record
exists in the output.  The relationship further dictates that for each "Session"
there will be zero or more values for each vector of the Sessions array:  A
Senate/House flag, the Position held (dependent attributes), and the name of the
Congressional session (also known as the controlling attribute). In the current
XML, the controlling attribute value is co-opted for use as a key to the nested
table's row.

Thus, for each value in the controlling attribute (the session), there must
exist a related and equally-positioned value in the dependent attributes.  If
that value is nil, then the placeholding value mark (or RDF/XML tag for that
attribute) must still exist in order to ensure the equal ranking of values
across attributes.

Here is a picture of a typical CONGRESS record:
>CT CONGRESS J000255

      J000255
0001 JONES, WALTER BEAMAN, JR.
0002 MR. JONES OF NORTH CAROLINA
0003
0004 JONES+WALTER
0005 NC
0006
0007
0008 REPUBLICAN
0009
0010
0011
0012 http://jones.house.gov/
0013
0014
0015
0016
0017 107ý108ý109
0018 REPRESENTATIVEýREPRESENTATIVEýREPRESENTATIVE
0019 1943
0020
0021 HýHýH
>

Attributes 17, 18, and 21 are the origin of the <Sessions> XML construct.  We
see that in the 107th Congress, Mr. Jones was a REPRESENTATIVE in the House,
again in the 108th, etc.  But unless something about the description of the
"Sessions" property with its subproperties tells you that these multivalued
attributes are scaled together, how would one know?
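
The association can be spelled out mechanically. This Python sketch splits the
multivalued attributes from the record dump above (0017, 0018 and 0021) on the
value mark and pairs them up by position. It assumes the "ý" in the dump is
the UniVerse value mark, CHAR(253); the function name associate is invented
for illustration.

```python
VALUE_MARK = "\u00fd"  # UniVerse value mark, CHAR(253), displayed as "ý"

def associate(controlling: str, *dependents: str):
    """Zip multivalued attributes into rows keyed by the controlling attribute."""
    keys = controlling.split(VALUE_MARK) if controlling else []
    columns = []
    for attribute in dependents:
        values = attribute.split(VALUE_MARK) if attribute else []
        # A missing trailing value is treated as nil so positions stay aligned.
        values += [""] * (len(keys) - len(values))
        columns.append(values)
    return [(key, *(column[i] for column in columns))
            for i, key in enumerate(keys)]
```

Passing SESSION as the controlling attribute and POSITION and SH as the
dependents gives one row per session, which is the knowledge a flat consumer
would otherwise have to be told out of band.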

Not that in any case, anyone would *care* about the original construction, so
long as the relationships are correctly described and preserved.  How someone
else would interpret or store it would be up to that consumer.  The structure is
eye-apparent, but not quite apparent to a flat, relational database.

For example, how did I know that SESSION was the controlling attribute for the
nested subtable?  I created the table, that's how...a p----poor reason indeed. 
That doesn't tell another soul why the arrangement is the way it is.  The primer
falls just a bit short of nested relational attributes with dependent
multivalues.

Indeed, this exercise is much more of an interesting exploration into the
possibilities of fully describing an output rather than "this is what I intend
to do". At this point I can only assume that virtual fields (actually a greater
percentage of the Pythia system) can exist in the vocabulary simply because "we
agree" that's what a given result would be called.  I haven't thought that far
into it yet.

Second part:  establishing our common vocabularies.

A good part of RDF rests on the adoption of common vocabularies. I'm just yet a
bit hazy on the proper way to construct the vocabularies we'll likely need or
even if we should compile new and specifically descriptive vocabularies at all.
(Something tells me that we should.) Since we have the chance to be completely
unambiguous in the nature of any data delivery (and from any source in the OGDEX
community), now would be the perfect opportunity to allow for complete and faithful
reconstructions of any delivery into a local database by the proper use of
vocabularies. To wit:

The vocabularies we employ should remain constant throughout the OGDEX
contributing community in order to be useful.  Common objects, like CITY or
STATE are all-but self-naming and self-describing, but there are toMAYto/toMAHto
differences in individual databases (ST or STATE; County, Co, Cty, etc).

For Pythia, every named locality is a member of the FIPS55 table (Federal
Information Processing Standard) and/or one of its derivatives.  By using FIPS and
only FIPS as a means of determining the correct rendition of a city, area,
township, county, blahblah I can make sure that GNIS and Census data will
absolutely interrelate (which they don't "quite" otherwise when they're
scraped). This gives me strict, system-wide normalization.

For example, anyone in the OGDEX community who wanted to look up a list of
counties for a given state and have the list of correct spellings, zip ranges,
etc, returned could query Pythia because they know where that file is always
maintained.  These are exactly the same values that may be used within the GNIS,
Census, or FCC files. (Lame example, but a readily handy one.)

Knowing the URI is a part of the trick that RDF seems to handle very well.  But
if Pythia says "COUNTY" or "County" (semantically identical, but mechanically
dissimilar), and GovTrack says "Cty", Steve's site knows "CO" and Neal's site
knows "cty", we break the rules of normalization.  That's the only point of
breakdown we could potentially have BUT can neatly and smoothly avoid.

By adopting a common vocabulary at the outset, we give ourselves the immediate
capability to map any datum from any of our disparate stores into a common
interchange vocabulary.  For example in my DB, I call two-character state id's
"ST" whilst state names are "STATE".  It wouldn't matter if someone else called
their state abbreviations and names "FRED" and "JOE" respectively -- if the
common vocabulary is in place.

Ideally, I'd like to see that as a resource on OGDEX even before we start
publishing data; possibly as a first step.  That is, if we see a description
"ogdex:st" (or pick an example), regardless of whose physical system the desired
record resides on, we know and agree that ogdex:st holds a FIPS55 2-character state
abbreviation, ogdex:state holds a FIPS55 state name, etc etc.  I can still call
it FRED on my system so long as the external description matches the common
vocabulary.  (Joshua?)
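
A minimal sketch of that mapping idea, in Python. The local names ST/STATE and FRED/JOE and the terms ogdex:st and ogdex:state are the examples from this message; the dictionaries, the to_common function and the sample rows are hypothetical and only illustrate how each site could keep its own names internally while exporting the shared ones.

```python
# Hypothetical per-site maps from local field names to shared terms.
# "ogdex:st" and "ogdex:state" are the example terms from the message;
# nothing here is an agreed vocabulary yet.
PYTHIA_FIELDS = {"ST": "ogdex:st", "STATE": "ogdex:state"}
OTHER_SITE_FIELDS = {"FRED": "ogdex:st", "JOE": "ogdex:state"}


def to_common(record, field_map):
    """Rename a local record's fields to the shared vocabulary.

    Fields with no agreed common name are left out of the export.
    """
    return {field_map[name]: value
            for name, value in record.items()
            if name in field_map}


# Made-up sample rows from two sites that name their columns differently.
pythia_row = {"ST": "HI", "STATE": "HAWAII", "INTERNAL_KEY": 42}
other_row = {"FRED": "HI", "JOE": "HAWAII"}

print(to_common(pythia_row, PYTHIA_FIELDS))
print(to_common(other_row, OTHER_SITE_FIELDS))
```

Both calls produce the same exported record, which is the point: the internal names never leave the site that uses them.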

With that in place, if say, I retrieve an action item from GovTrack, any time I
see a description called "ogdex:st", I could take that same field description
and value to Pythia to retrieve a list of counties in that state with census
statistics for an impact study.  Or go to Steve's system for something that
neither Joshua nor I currently house but Steve does store.  Or Neal's... it
wouldn't matter where the resource datum might be - it would always be named the
same and should return an identical result.  (Satisfying NF3 non-loss
decomposition, an entree condition to full post-normal form.)

Similarly, I could retrieve a bill from GovTrack and see that the sponsor is
identified as person-id "J000255"; I could then retrieve that legislator's
current and historical information against Pythia's CONGRESS store because we've
agreed that that particular field name will be employed at all member sites.

The joy is that I wouldn't have to store any bill or action text, nor would
Joshua necessarily have to store the entire CONGRESS file or any of the
"foundation" type files (like FIPS and the derivations therefrom) that Pythia
holds. Simply by the descriptions in the RDF we'd know exactly where to obtain
atomic information IF a given retrieval required it.  (Am I getting this right?)

Rather than have me sit here and "make stuff up" to complete the CONGRESS example
that Joshua suggested, it might be an idea to discuss our implementation of a
vocabulary first so that not only are names for common objects identical
throughout the community; the points of origin (or authoritativeness) would also
become well-known and described.

Best regards,
Bill

----- Original Message -----
From: Joshua Tauberer <tauberer@...>
To: govtrack@yahoogroups.com
Sent: Wed,  2 Mar 2005 20:57:24 +0000
Subject: [govtrack] The What and Why of RDF


>

Hi, guys.

As I've mentioned a bunch of times, I'm convinced the way to approach
data sharing is using RDF.  If you don't know much about RDF or if you
disagree, please read:

   http://www.govtrack.us/articles/20050302rdf.xpd

I hope it's pretty clear, but if you have any thoughts about how I might
improve it, or if you want to see something added to it, or if you're
not convinced by it, please let me know.

--
- Joshua Tauberer

http://taubz.for.net

** Nothing Unreal Exists **







#69 From: Joshua Tauberer / GovTrack <tauberer@...>
Date: Fri Mar 4, 2005 12:27 am
Subject: Re: The What and Why of RDF
tauberer
 
Bill Farrell wrote:
> The more I read, the sweeter RDF becomes.  For me to produce RDF
> output, it will be a bolt-on script or two at the most in coding
> effort.

And since you mentioned you're tied to XML for your data consumers, I
want to throw in that the reverse is true also.  Going from RDF back to
XML isn't too bad either.

> Engineering the proper community-wide environment requires a
> bit more thought and discussion.

Yes, exactly.  Compared to an XML-based community where you're
engineering a format, what we have to do is determine the best way to
represent more abstract information.

> The link Joshua provided in yesterday's post was the most helpful by
> far. Thank you very much.

I'm very glad it was helpful.

> While *most* of the original field names remain
> similar to the mbr107.xml example, the BioGuide scrape entailed the
> invention of some more field names.  That is, I "just made stuff up"
> as I went because it didn't previously exist.  That doesn't mean that
> a retrieval would immediately be understood by mankind or machine
> correctly, simply described as XML.  In fact, it almost guarantees
> the opposite.  RDF should fix that (if I begin to understand the
> proper constructions).

Of course, applications won't know what to do with predicates that you
make up, but, exactly, with RDF making up new predicates doesn't mess
things up as it would with XML/DTD/Schema.

Another way to look at it, though, is that you're free to use other
existing predicates wherever you like.  So, for instance, if we didn't
anticipate using the existing XYZ predicate but you see how it could be
useful to describe some data, you can go ahead and use the XYZ
predicate.  In this case, applications that do already know the XYZ
predicate will immediately understand your use of it.

> Further, the combined CONGRESS record is physically in
> nested-relational (post-relational or NF2, if you prefer) format.
> ...
> However, it is the most efficient way of storing, searching and
> retrieving a legislator's position and history.  Apparently RDF
> doesn't care and can handle it -- if the proper vocabulary is
> constructed and employed. (more, way below)

Efficiency is an interesting thing to think about, and I didn't give it
any mention in the thing I wrote.  Using N3 format, storing is pretty
efficient.  You just choose the right namespace abbreviations.
Searching and retrieving is another story.  Having a congress-specific
search-and-retriever will always be more efficient than a generic RDF
query tool.

I'm not too concerned about this, though.  When you need efficiency, you
can always take existing RDF and transform it into a more custom,
specific format that's more efficient for your needs.  In fact, that's
basically what GovTrack does now.  It runs off of some custom XML
formats, because it's easier for me to program the site that way and
because I can do some custom indexing to make searching fast.  But, for
the purposes of sharing the data, I (will) use RDF.

As you noted, with the right vocabulary RDF can describe anything, so
you can always use RDF as a public format separate from your internal
format.

> The <Sessions /> section in XML does not physically exist in the
> CONGRESS file, but is necessary to describe the relationship of a
> legislator's role for each session of Congress in which s/he
> served. (Do I understand this correctly to be a "blank node" in RDF?)

Yes.  I have the same type of nodes, as blank nodes, in my people.rdf
file.  Here's an abbreviated example:

<rdf:RDF>
    <pol:Politician
rdf:about="urn://govshare.info/data/us/congress/people/1995/akaka">
      <foaf:name>Daniel Akaka</foaf:name>
      <foaf:homepage>http://akaka.senate.gov</foaf:homepage>
      <pol:role>
        <pol:Term>
          <pol:begin>2001-01-01</pol:begin>
          <pol:end>2006-12-31</pol:end>
          <pol:office
rdf:resource="urn://govshare.info/data/us/congress/107/HI"/>
          <pol:office
rdf:resource="urn://govshare.info/data/us/congress/108/HI"/>
          <pol:office
rdf:resource="urn://govshare.info/data/us/congress/109/HI"/>
        </pol:Term>
      </pol:role>
      <pol:role>
        <pol:Term>
          <pol:begin>1995-01-01</pol:begin>
          <pol:end>2000-12-31</pol:end>
          <pol:office
rdf:resource="urn://govshare.info/data/us/congress/104/HI"/>
          <pol:office
rdf:resource="urn://govshare.info/data/us/congress/105/HI"/>
          <pol:office
rdf:resource="urn://govshare.info/data/us/congress/106/HI"/>
        </pol:Term>
      </pol:role>
    </pol:Politician>
</rdf:RDF>

RDF/XML gets to be difficult to read when you embed nodes like this.
pol:role is a predicate which I used twice to relate Akaka to a pol:Term
entity (a blank node, no URI).  Each pol:Term is an abstract
representation of basically an election he won giving him a term in
office.  The pol:office predicates relate those pol:Terms to the
pol:Office entities that Akaka fills in virtue of having those
pol:Terms.  I happened to structure it so that in virtue of his winning
a senate term, he fills three offices, one for each two-year session of
Congress during his term as a senator.  In this example it's not
specified that those offices themselves have starting dates and ending
dates.

> I'm reading in the primer that there are containers for such things,
> but I don't YET see how either the bag or seq containers describe
> this situation adequately. (It may or may not even matter--I'm a bit
> ignorant of the subtleties of RDF yet.)

I haven't worked with those containers much yet.  I'm not sure they have
a particular use here.

> For example, how did I know that SESSION was the controlling
> attribute for the nested subtable?

This is another shortcoming of XML and databases, compared to RDF.  I
only mentioned it in the end of the thing I wrote, but RDF can be
self-describing.  The 'ontology' that describes the pol:* predicates and
classes I used above is at http://www.govtrack.us/share/politico.rdf.
(View source to see the RDF.)  And, that relies on other ontologies
(FOAF, for instance).

There is *a lot* to be learned in the realm of RDF ontologies.

> Second part:  establishing our common vocabularies.

Whew.  I'm mentally exhausted just from part one...

> A good part of RDF rests on the adoption of common vocabularies.

Once again, exactly right.

> I'm just yet a bit hazy on the proper way to construct the vocabularies
> we'll likely need or even if we should compile new and specifically
> descriptive vocabularies at all. (Something tells me that we should.)

For sure we will need to construct vocabularies.  I've obviously already
begun this, as an experiment to see what's involved.  (See the other
files vote.rdf and usbill.rdf in http://www.govtrack.us/share.  The
other files are downloaded from elsewhere.)  There are very few
vocabularies out there, and as far as I know, none that describe the
complex government-related things we're talking about.

> For Pythia, every named locality is a member of the FIPS55 table
> (Federal Information Processing Standard) and/or one of its
> derivatives.  By using FIPS and only FIPS as a means of determining the
> correct rendition of a city, area, township, county, blahblah I can
> make sure that GNIS and Census data will absolutely interrelate
> (which they don't "quite" otherwise when they're scraped). This gives
> me strict, system-wide normalization.

I was looking at census data this morning.

Note that you don't have to use *only* FIPS.  There can be many
predicates relating a resource to a normalized code.  E.g., in pseudo-N3
format:

new_york   ogdex:fips55   "1234"
new_york   ogdex:usps     "NY"
new_york   ogdex:census   22

Where new_york is the URI for the state of New York.
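
A minimal sketch of how several code systems can hang off one resource, in Python. The three triples are copied from the pseudo-N3 above, placeholder values included; "new_york" stands in for whatever URI ends up identifying the state, and the two helper functions are hypothetical.

```python
# Triples copied from the pseudo-N3 above. The code values are the
# placeholders used in the message, not real FIPS or census codes.
TRIPLES = [
    ("new_york", "ogdex:fips55", "1234"),
    ("new_york", "ogdex:usps", "NY"),
    ("new_york", "ogdex:census", 22),
]


def find_resource(triples, predicate, code):
    """Return the subjects that carry `code` under `predicate`."""
    return [s for s, p, o in triples if p == predicate and o == code]


def codes_for(triples, subject):
    """Return every predicate/code pair attached to `subject`."""
    return {p: o for s, p, o in triples if s == subject}


# Look the state up by one code system, then read off the others.
state = find_resource(TRIPLES, "ogdex:usps", "NY")[0]
print(state, codes_for(TRIPLES, state))
```

A site that only knows the USPS abbreviation can still reach the same resource as a site that only knows the FIPS code, as long as both predicates are published.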

> The joy is that I wouldn't have to store any bill or action text, nor
> would Joshua necessarily have to store the entire CONGRESS file or
> any of the "foundation" type files (like FIPS and the derivations
> therefrom) that Pythia holds. Simply by the descriptions in the RDF
> we'd know exactly where to obtain atomic information IF a given
> retrieval required it.  (Am I getting this right?)

Okay, this might be the only thing that you've jumped the gun on.  :)

RDF doesn't indicate where actually to get content.  However, we could
create/find an RDF vocabulary to describe such things.  It's a minor
implementation detail, but it's something RDF itself doesn't address.

> Rather than have me sit here and "make stuff up" to complete the
> CONGRESS example that Joshua suggested, it might be an idea to
> discuss our implementation of a vocabulary first so that not only are
> names for common objects identical throughout the community; the
> points of origin (or authoritativeness) would also become well-known
> and described.

That's a good place to start, but I need a mental break before I suggest
exactly how to begin on that.

Thanks, Bill, for going over these issues in such great detail.  It's a
big help to get everyone on the same page and to get a plan of action
started.

--
- Joshua Tauberer

http://taubz.for.net

** Nothing Unreal Exists **

#70 From: Neal McBurnett <neal@...>
Date: Fri Mar 4, 2005 6:09 am
Subject: Re: scraping javascript sites, colorado example
emergent27
 
Here are some spidering/scraping resources I've stumbled upon via
google.  Feedback on any of them would be welcomed:

Python web-client programming
  http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html

  HTMLForms

  dealing with javascript:
   Java's httpunit from Jython, since it knows some JavaScript
   Mozilla automation & XPCOM / PyXPCOM, Konqueror & DCOP / KParts /
   PyKDE

  ssl: Mozilla plugin:  livehttpheaders.
       Use lynx -trace, and filter out the junk with a script.

  http://linux.duke.edu/projects/urlgrabber/

  mozilla plugin can display HTML form information and HTML table
  structure:
  http://chrispederick.myacen.com/work/firebird/webdeveloper/

HTML Screen Scraping: A How-To Document
  http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
  python, urllib HTMLParser sgrep quixote,

python:  http://www.crummy.com/software/BeautifulSoup/
  http://www.pycs.net/users/0000316/

http://www.oreilly.com/catalog/spiderhks/toc.html
  Very Perl-oriented.
  LWP::Simple
  HTML::TreeBuilder
  WWW::Mechanize
  Template::Extract
  WWW::Yahoo::Groups

XPath


And I did manage to get some colorado legislation pdfs to load
directly, like

  
http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/AB832E748317A60987256F3A006A1217/$FILE/002_01.pdf

so they aren't necessarily as difficult as I thought.  But other
similar URLs don't work, so I'm still puzzled.
I used ethereal/FollowTcpStream to see which URLs my browser was
actually retrieving.

-Neal

> One topic I haven't seen much discussion of here is examples or
> techniques for scraping documents.
>
> E.g. I'm interested in Colorado legislation.  The site
> (http://www.leg.state.co.us/) is pretty unfriendly from an automated
> scraping standpoint, in my experience.  The legislation is in pdf,
> which is a pain, but pdftotext seems to produce moderately scrapable
> text.
>
> Model software (in python?)  for cleaning up the output of pdftotext
> would be welcomed.
>
>
> But finding the pdfs is tricky.  E.g. I start at the list of bills,
> which is designed to use form submission to select successive sets of
> 50 bills:
>
> http://www.leg.state.co.us/Clics2005a/csl.nsf/BillFoldersSenate?openFrameset
>
> Having found the right bill, say I want to look at the original
> version of the bill (with legislative summary).  First we need to go
> to the "All versions" link.  A typical URL, for SB05-079, is
>
> 
> http://www.leg.state.co.us/clics2005a/csl.nsf/fsbillcont2/51B91106B515902487256F5E0078CF5A?Open
>
> So they seem to intentionally introduce a hash to obfuscate the
> URL.
>
> Going to that URL gets us to the "Introduced Bill" name: 079_01.pdf
> with an invisible (white) "n" appended to it, visible only when I
> select the area for cut-and-paste.  Who knows why.
>
> Clicking on that goes to a different place than advertised by "Copy
> Link Location" in firefox, via a javascript _doClick() function:
>
>  _doClick('87256EE50072C919.8975551e51fa01d087256dd30080e1d5/$Body/0.390C'...)
>
> Ending up (at least for Firefox) at another useless intermediate web
> page, which has more javascript that automatically downloads the pdf
> (though I'm not sure why, and might be confused by the frame structure).
>
> It also has a link to the "Bill" as
> http://www.leg.state.co.us/clics2005a/csl.nsf/fsbillcont3/51B91106B515902487256F5E0078CF5A?open&file=079_01.pdf
> but again that is misleading.
>
> When the pdf is downloading, the Firefox dialog box which asks if I
> want to browse it or download it contains this url:
>
> 
> http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/51B91106B515902487256F5E0078CF5A/$FILE/
>
> But if I try to load that via wget I get "400 Bad Request"
> so they may be playing around with cookies or other magic.
>
> The Firefox "Page Info" Links section points to this for the "Current
> PDF Document":
> 
> http://www.leg.state.co.us/clics2005a/csl.nsf/billcontainers/51B91106B515902487256F5E0078CF5A?OpenDocument
>
> which again just produces the html page.
>
> Can someone find a URL that just loads the pdf?
> Is this sort of run-around common?  Any good insights or tools to deal with
> it?  Anyone want to scrape, for a start, all the legislative summaries
> for all the bills as introduced :-) ?
>
> Thanks,
>
> Neal McBurnett                 http://bcn.boulder.co.us/~neal/
> Signed and/or sealed mail encouraged.  GPG/PGP Keyid: 2C9EBA60

#71 From: Bill Farrell <jwwf@...>
Date: Fri Mar 4, 2005 9:23 pm
Subject: A working demo
idawannanoe
 
Here is a PHP API for RDF I found in my travels:
http://www3.wiwiss.fu-berlin.de/custom_test.php

A picture being worth a load of postings, this demo will allow one to input a
blob of RDF and to try various queries on it.  Very handy for experimenting with
models.

There are some useful tutorials here:
http://www.wiwiss.fu-berlin.de/suhl/bizer/rdfapi/tests.html

The code is displayed and it's rather easy to see how everything fits together.

There is another tutorial on RDQL here:
http://phpxmlclasses.sourceforge.net/rdql.html

I haven't gotten as far as to look for Perl APIs, since I largely use PHP as an
interface into the UV DB (strictly as a matter of convenience).  Those would be
well worth a post as well, as well as anyone's experiences with them.

Bill

#72 From: Joshua Tauberer / GovTrack <tauberer@...>
Date: Sat Mar 5, 2005 3:36 pm
Subject: Re: scraping javascript sites, colorado example
tauberer
 
Before I reply to Neal-- I've added a section on RDF schemas/ontologies to the
article I wrote:
   http://www.govtrack.us/articles/20050302rdf.xpd

Neal,

I'm gonna try to step through the process of getting the status of
legislation out of the Colorado site, based on the links you gave.  Here
goes.

First, load the page:
http://www.leg.state.co.us/Clics2005a/csl.nsf/(bf-3)?OpenView&Count=50000

I've changed the Count parameter from how the website has it in the
framed version.

To extract bill history, the only things that are relevant are the
History links, which conveniently are in the form:

<A HREF="[[URL]]" target="Bottom2" target="Bottom2">History</A>

You can pick out those URLs using regular expressions, or even
some simple string manipulation functions.  For each of those links,
load up the URL.

Absurdly enough, this page is a frameset.  So you'll need to do the same
trick of looking for the right URL to load.  In this case, it's the URL
in the line that matches:

    <frame src="[[URL]]" name="File"

Although they have it as a relative URI there, so you'll need to tack on
http://www.leg.state.co.us to the beginning.

Finally I'm at the page with the bill history...  Now you've gotta just
extract the bill number and each action.  What I sometimes do is strip
the HTML out of the page, and then do pattern matching.  So, doing that,
to get to the bill number, you find the line that starts with
"Summarized History for Bill Number " and take whatever follows it on
the line.  To get the actions, just pick out any lines that match the
pattern:
	 DD/DD/DD Whatever...
Which is easy with regular expressions, but, again, also possible by
just testing whether there are digits and slashes in the right indexes
of the string.
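
The steps above could be sketched in Python roughly as follows. The regular expressions follow the markup quoted in this message, so the real pages may well differ; the function names are made up, fetching the pages is left out, and the sample page at the end is invented purely to show the output shape.

```python
import re
from urllib.parse import urljoin

BASE = "http://www.leg.state.co.us"

# Patterns based on the snippets quoted in the message (assumptions).
HISTORY_LINK = re.compile(r'<A HREF="([^"]+)"[^>]*>History</A>', re.IGNORECASE)
FRAME_SRC = re.compile(r'<frame src="([^"]+)" name="File"', re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")
ACTION = re.compile(r"^\s*(\d\d/\d\d/\d\d)\s+(.*\S)\s*$")
BILL_PREFIX = "Summarized History for Bill Number "


def history_urls(index_html):
    """Step 1: collect the URL behind every 'History' link."""
    return HISTORY_LINK.findall(index_html)


def file_frame_url(frameset_html):
    """Step 2: find the 'File' frame and make its relative URL absolute."""
    match = FRAME_SRC.search(frameset_html)
    return urljoin(BASE, match.group(1)) if match else None


def parse_history(page_html):
    """Step 3: strip the HTML, then pick out the bill number and actions."""
    bill, actions = None, []
    for line in TAG.sub("", page_html).splitlines():
        stripped = line.strip()
        if stripped.startswith(BILL_PREFIX):
            bill = stripped[len(BILL_PREFIX):].strip()
            continue
        match = ACTION.match(line)
        if match:
            actions.append((match.group(1), match.group(2)))
    return bill, actions


# Invented sample page, not real site output.
sample = ("<b>Summarized History for Bill Number SB05-079</b>\n"
          "<br>01/12/05 Introduced In Senate\n"
          "<br>Some line that is not an action\n")
print(parse_history(sample))
```

Stripping tags with a regular expression is crude, but it matches the "strip the HTML, then pattern match" approach described here and is usually enough for pages this regular.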

Hope some of this is useful.  If you get stuck somewhere, post more.  :)

--
- Joshua Tauberer

http://taubz.for.net

** Nothing Unreal Exists **

#73 From: John Labovitz <johnl@...>
Date: Sun Mar 6, 2005 5:21 am
Subject: Re: scraping javascript sites, colorado example
johnlabovitz
 
On Mar 3, 2005, at 10:09 PM, Neal McBurnett wrote:

> Here are some spidering/scraping resources I've stumbled upon via
> google.

There are starting to be some good Ruby libraries for screen-scraping,
too.

There's a simple but good version of WWW::Mechanize (you can find it
via the 'gems' Ruby library if you have that installed).  And REXML is
a fantastic XML parsing library, with XPath built in so you don't have
to do so much procedure stuff as you do with some of the Perl modules.

This won't help much with the Javascript mess, though.  (And yes, I've
found similar awful cruft in dealing with scraping financial services
sites.  I think it must be output of some middle-ware app that folks
use to make web sites.  I had to deal with one recently that had *no*
way of navigating via regular HTML; only Javascript links!  Truly
annoying.)

--
John Labovitz
Macintosh support, research, and software development
John Labovitz Consulting, LLC
johnl@... |  +1 503.949.3492 |
www.johnlabovitz.com/consulting

#74 From: Joshua Tauberer / GovTrack <tauberer@...>
Date: Wed Mar 9, 2005 4:56 pm
Subject: My Trip to D.C.
tauberer
 
Hey, all.  Here's how my trip to D.C. went.  I posted this as a news
item on GovTrack.  More to come in a few days.

----

Last night I got back from a two-day trip to D.C. The point of the trip
was to make a presentation about GovTrack and also to start some
collaboration with others on expanding the political information that is
freely and openly available online.

Monday afternoon I presented GovTrack and some ideas about the semantic
web to the people who are responsible for getting some aspects of
legislative information posted online in XML format. Right now GovTrack
gets its information from screen-scraping, which is an inexact and
fragile process of extracting information out of the same HTML pages
that you see when you view web sites. Having data published also in XML
format can greatly improve the accuracy of getting information. What the
people at the clerk of the House have done to date, in terms of getting
bills written in XML and roll call votes posted in XML, has been a great
step forward, although it hasn't been that useful for GovTrack. (One
reason is the Senate hasn't followed suit because, as I understand it,
the clerk of the Senate isn't authorized by the Senate itself to work on
such things.)

I think I may have bored the attendees a bit. I wasn't sure exactly who
was coming and what their backgrounds were. But, at the least they know
who I am now, and that might mean some collaboration can occur in the
future.

Daniel Bennett of DotGov put the session together -- thanks Daniel!
DotGov runs the websites of some Democratic representatives. He and Jeff
Mascott of RightClick Strategies (a pun on 'right' -- it took me a while
to get that) also made presentations, which I enjoyed.

Daniel then took me out to dinner (thanks again!), and Chris and April
from DemocracyInAction came along. DIA is a nonprofit that works with
other nonprofits (on the left) to improve their campaigning
effectiveness, and I was really impressed with how motivated they are.
DIA is gathering legislative information like GovTrack, and we talked
about how we can get our data to relate together, to start building a
network of information that can, for instance, be reused by grassroots
activists to build their websites. (See www.ogdex.info.) Some of the
same type of discussion is happening on the GovTrack mail list.

Yesterday I met with two people from GalleryWatch.com, which provides a
(not-free) service like GovTrack's, but in real time. They constantly
monitor legislative activity and within minutes of an action can update
their clients. I was, again, impressed by how these guys feel strongly
about not just making a business out of this but also giving back, by
providing their service at low cost to some non-commercial entities.
They're interested in taking advantage of semantic web ideas, and I
think they'll help bootstrap the process of building the (free and open)
web.

I think I've met now almost all of the players in the arena of building
this network of political information. Between everyone involved,
including those on the GovTrack mail list, we have enough data and
enthusiasm to get something very unique and useful started.

#75 From: Joshua Tauberer <tauberer@...>
Date: Sat Mar 12, 2005 4:55 pm
Subject: RDF to SQL
tauberer@...
 
This is mainly a follow-up to something Chris from DIA (he's on the mail
list now) and I were talking about Monday, but I think the list at large
would be interested in seeing this.

When I met with Chris on Monday, he raised an important point that
whatever system is used to share data, it should be really easy to use,
in part to encourage people to use it.

Sharing data as raw databases makes it really easy to drop the data into
a website, he suggested.  And, I totally admit that it's much easier
than dealing with RDF.

But, as I responded Monday, once data is in RDF, it's easy to export it
into a database.  Two weeks ago I had been working on an RDF querying
engine (for fun, really, since there are already existing programs to do
this), and this week I added to it an SQL output format.  The result is
the ability to query an RDF data model and output it, more or less, as a
database.

First some background...

GovTrack publishes an RDF version of all of its data in
http://www.govtrack.us/data/rdf/.  You should take a look at the
people.rdf file if you haven't seen it yet to get a general idea for the
structure of the data.

You can browse the data at http://www.govtrack.us/rdfbrowse.xpd.  The
browser program itself knows nothing about the type of data that it's
browsing, which is a good example of the advantage of using RDF.  All of
the different types of information magically just come together, with no
glue specific to each type of data.  (The browser uses the RDF schemas
in http://www.govtrack.us/share/ and some labels present in the RDF
files above to display nice names in place of some URIs.)

RDF can be written in XML or Notation 3, among other formats.  There are
N3 versions of the schemas in the share directory if you want to see
what they look like.  N3 is a much simpler format than RDF/XML.  It's
basically just a list of statements: subject predicate object, followed
by a period.

For the query engine that I wrote, the query itself is written as RDF
(in this case as N3).  You give it an RDF graph with some nodes marked
as variables, and the engine tells you the different ways it can match
up (bind) those variables with entities in the target data model.
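
To make the idea of binding variables concrete, here is a naive sketch in Python. It is not the engine described in this message: the "?name" convention for variables, the function names and the tiny data set are all assumptions for illustration, with predicates borrowed from the people.rdf example earlier in the thread.

```python
def is_var(term):
    """Treat any string starting with '?' as a query variable (assumption)."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` equals `triple`."""
    result = dict(binding)
    for want, have in zip(pattern, triple):
        if is_var(want):
            # A variable must keep the value it was first bound to.
            if result.setdefault(want, have) != have:
                return None
        elif want != have:
            return None
    return result


def query(patterns, triples):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [extended
                    for binding in bindings
                    for triple in triples
                    for extended in [match_pattern(pattern, triple, binding)]
                    if extended is not None]
    return bindings


# Made-up data model; "_:term1" plays the role of a blank node.
DATA = [
    ("akaka", "foaf:name", "Daniel Akaka"),
    ("akaka", "pol:role", "_:term1"),
    ("_:term1", "pol:begin", "2001-01-01"),
]
QUERY = [
    ("?who", "foaf:name", "?name"),
    ("?who", "pol:role", "?term"),
    ("?term", "pol:begin", "?begin"),
]
print(query(QUERY, DATA))
```

Each pattern is checked against every triple, so this is only suitable for toy data; a real engine would index the triples.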

Ok, the example...

At http://www.govtrack.us/rdfquery.cgi you can try it out.  Although,
admittedly it's difficult coming up with valid queries because the
structure of the data isn't all that simple.

The example queries the data model for all representatives currently
serving in an office.  (Since the data model is pretty rich, it's also
possible to write queries to list the population of each state for any
senator that voted Nay on legislation related to Copyright, for instance.)

Anyway, the idea is that once the data is in RDF, we could come up with
some queries to generate database versions of the information, and then
also publish those.
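
As a rough illustration of that last step, here is a Python sketch that loads query results into a SQLite table. It assumes each result arrives as a dictionary from variable name to value; that shape, the load_bindings function, the table name and the sample row are all hypothetical and are not the SQL output format of the engine described above.

```python
import sqlite3


def load_bindings(conn, table, bindings):
    """Create `table` with one text column per variable and fill it.

    Table and column names are interpolated into the SQL, so they are
    assumed to come from a trusted query, not from user input.
    """
    if not bindings:
        raise ValueError("no bindings to load")
    columns = sorted({name.lstrip("?") for row in bindings for name in row})
    col_sql = ", ".join('"%s" TEXT' % c for c in columns)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, col_sql))
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table, placeholders),
        [[row.get("?" + c) for c in columns] for row in bindings],
    )
    return columns


# Made-up result set standing in for the output of an RDF query.
rows = [{"?name": "Daniel Akaka", "?state": "HI"}]
conn = sqlite3.connect(":memory:")
load_bindings(conn, "senators", rows)
print(conn.execute("SELECT name, state FROM senators").fetchall())
```

Once the rows are in a table like this, it can be dumped and published for anyone who would rather drop a database into a website than deal with RDF directly.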

--
- Joshua Tauberer

http://taubz.for.net

** Nothing Unreal Exists **

#76 From: Joshua Tauberer / GovTrack <tauberer@...>
Date: Sun Mar 13, 2005 11:13 pm
Subject: Proposed URIs for Congressional Things
tauberer
 
No matter what system ends up being used for sharing political info
(::cough:: RDF), we'll need some form of common ID system, which we've
talked about on the list before.  Daniel Bennett (from dotgov.info; he's
the one that set up the presentation last week, and he's been working
with government people on XML standards for years) posted some proposed
URIs for Congress-related things on OGDEX (at
http://www.ogdex.info/warehouse/space/Open+Standards).

Here's what he's proposing:

For bills:
    urn:congress.gov:legis-num,109hr2121

For congressional districts:
    urn:congress.gov:cong-dist,109ca14

For legislators:
    urn:congress.gov:legislator,C000191
    C000191 comes from the ID system used by bioguide.congress.gov.

For subject terms that the Library of Congress assigns to bills:
    urn:congress.gov:liv-term,social+security
    This is particularly useful for things like blog aggregation.
    LIV is the Legislative Indexing Vocabulary.
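
A small Python sketch of building and taking apart URIs in the proposed shape. The four kinds and the "urn:congress.gov:" prefix come from the proposal above; the helper names are made up, and since the scheme is only a proposal the exact rules (for example how bill numbers are formatted) are assumptions read off the examples.

```python
PREFIX = "urn:congress.gov:"
# The four kinds listed in the proposal above.
KINDS = {"legis-num", "cong-dist", "legislator", "liv-term"}


def make_urn(kind, value):
    """Build a URN of the proposed form urn:congress.gov:KIND,VALUE."""
    if kind not in KINDS:
        raise ValueError("unknown kind: %r" % kind)
    return "%s%s,%s" % (PREFIX, kind, value)


def parse_urn(urn):
    """Split a proposed-style URN back into (kind, value)."""
    if not urn.startswith(PREFIX):
        raise ValueError("not a congress.gov URN: %r" % urn)
    kind, sep, value = urn[len(PREFIX):].partition(",")
    if not sep or kind not in KINDS:
        raise ValueError("malformed URN: %r" % urn)
    return kind, value


def bill_urn(congress, bill_type, number):
    """Assumed layout: congress number, bill type, bill number, no padding."""
    return make_urn("legis-num", "%d%s%d" % (congress, bill_type, number))


print(bill_urn(109, "hr", 2121))
print(parse_urn("urn:congress.gov:legislator,C000191"))
```

Keeping construction and parsing in one place means that if the proposal changes after the Library of Congress weighs in, only these helpers need updating.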

He's talking to Library of Congress people to see what they think.
Anyway, Daniel is on the list now, so I guess he'll update us all if he
hears from the LOC people.

It seems likely that the Bioguide IDs for legislators will be the basis
of whatever system comes out of this, so it'd be good to get a table
together of all of the information on bioguide.  We'll need that to
match up the bioguide IDs with the IDs some of us use internally.

Bill, the congress file you posted with the bioguide IDs
(http://www.progressivenation.net/modules/tinycontent3/index.php?id=15)
only has current legislators, right?  Would it be easy for you to post a
version with records for everyone in bioguide?  (Also, all of the names
in the file are in uppercase.  Can you preserve the casing from Bioguide?)

--
- Joshua Tauberer

http://taubz.for.net

** Nothing Unreal Exists **

#77 From: Bill Farrell <jwwf@...>
Date: Mon Mar 14, 2005 3:19 am
Subject: Re: Proposed URIs for Congressional Things
idawannanoe
 
----- Original Message -----
From: Joshua Tauberer / GovTrack <tauberer@...>
To: govtrack@yahoogroups.com
Sent: Sun, 13 Mar 2005 23:13:53 +0000
Subject: [govtrack] Proposed URIs for Congressional Things

(snip)
> Bill, the congress file you posted with the bioguide IDs
> (http://www.progressivenation.net/modules/tinycontent3/index.php?id=15)
> only has current legislators, right?  Would it be easy for you to post a
> version with records for everyone in bioguide?  (Also, all of the names
> in the file are in uppercase.  Can you preserve the casing from Bioguide?)

(end snippet)

Hi Joshua,

First, congratulations on your successful venture to DC.  Your efforts are much
appreciated.

So far I only pulled 107th-109th Congress, but it would be no great trick to
scrape the entire thing. I'll be most happy to attend to that this week.

Now, for data presentation:  The short answer is I *can* store and retrieve data
in any fashion.  Here goes the long answer.

You think in terms of individual tables -- I think in terms of holistic mining
systems. Again, products of disparate backgrounds. This, I think, is the strength
of this group: our varied skills, experiences and systems.  Each person and site
doing that which each does best.

When any data goes into the UniVerse system, it's all normalized to upper-case. 
It doesn't *have* to, but here's why.  One of the strengths of a post-relational
dbms is that while data are stored in a completely normalized fashion, it can be
presented ANY way you like on output.  Case is one of the attributes that a cook
might describe as "salt to taste" :-) (PS: That's why adding RDF as an output
type is no big deal, once we settle on a vocabulary/format.)

Data-shaping, this retentive-seeming fashion of storing data, ensures that (1)
relational integrity is always enforced and (2) storage follows the
post-relational rule that ANY datum in the entire system can be related to ANY
other datum or group of data in the entire system, on-the-fly and at any moment.
"System" in this instance can be taken to mean "any physical system in our
community".

BTW, JOIN is an odd and foreign concept to me--from where I sit, any
file/attribute/value/subvalue/text-group is intrinsically "join"ed to any other,
or to any level of nesting. The object is to allow data to be output in any
immediately-useable format.  I'm hoping that the going community will be able
(by dint of common RDF interchange) to relate any portion of any store in the
network to any other portion of any other store in the network.  Now, THAT
becomes a super-useful and very powerful data mining arrangement.

Individual tables *usually* aren't very useful or interesting as stored.  The
XML example I offered is simply the CONGRESS file with none of the possible
relationships -- that's probably a good foundation point. Once we establish a
foundation point, anything can be added at any time, should we so desire. (PS,
I'm still having trouble finding examples of RDF classes that contain nested
classes.  Any help in finding examples would be greatly appreciated!)

Since from where I sit, ALL data are related, it's entirely possible to deliver
maps, geolocation data, county/state census statistics, etc along with any query
to my system.  The file (or table name, if you prefer) in the query is merely a
starting point. (My hope is that it would never matter exactly where a given
chunk of data resides -- the RDF would describe the drilling points.)  At the
end of the day, all that matters is what we want to present for any given
collection of data.

That's really the difference in having big globs of data and having integrated
decision-support systems.  Having big globs of data sounds impressive, but in
practice and application, they're useless and uninteresting.  Things become
interesting when they relate to one another across an entire realm of
possibilities.  It's my experience that big globs of unrelated data languish
untouched, no matter how well-advertised.  We're looking to publish data that
are immediately useful, which is a fair bit different to offering loads of
static tables.

That's also why I chose the system that I did for this kind of work: it's built
for (1) compact data storage, (2) complete flexibility in data relating, (3)
complete flexibility in output formats and (4) doesn't mind big, bulky or
oddly-shaped items.  It's possible for me to meet ANY output format, *if* I know
what it's supposed to look like as a finished product.  Sometimes it takes a bit
of code; usually it's zero or near-zero effort at all... simply specify what you
want.  It's also very low-maintenance -- appealing to a lazy but busy man :-)

I'll turn my attention to completing our CONGRESS/BioGuide combined store, while
we kick around ideas for querying and presentation. RDQL looks promising and
translates neatly into RetrieVe or SQL, but any standard we adopt will work fine
for me.

While I'm on about this task: Any thoughts on what we'd like to do with the
BioGuide email addresses that are actually javascript links to input forms? 
Those, I ignored on input for the nonce, but we could go anywhere we'd like with
them.  Should those addresses be included, but flagged as a URI, not a direct
link?  There again, to me that is an output description, rather than a physical
database flag.  The question would be how best to state that on output.

Best regards,
Bill

----- Original Message -----
No matter what system ends up being used for sharing political info
(::cough:: RDF), we'll need some form of common ID system, which we've
talked about on the list before.  Daniel Bennett (from dotgov.info; he's
the one that set up the presentation last week, and he's been working
with government people on XML standards for years) posted some proposed
URIs for Congress-related things on OGDEX (at
http://www.ogdex.info/warehouse/space/Open+Standards).

Here's what he's proposing:

For bills:
   urn:congress.gov:legis-num,109hr2121

For congressional districts:
   urn:congress.gov:cong-dist,109ca14

For legislators:
   urn:congress.gov:legislator,C000191
   C000191 comes from the ID system used by bioguide.congress.gov.

For subject terms that the Library of Congress assigns to bills:
   urn:congress.gov:liv-term,social+security
   This is particularly useful for things like blog aggregation.
   LIV is the Legislative Indexing Vocabulary.
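
The proposed URNs above all share one shape: a fixed prefix, a kind, a comma, and a value. A minimal Python sketch of splitting and rebuilding them follows. The function and class names are made up for illustration, and the "kind,value" grammar is only inferred from the four examples above, not from any published specification.

```python
from typing import NamedTuple

# Prefix taken from the examples above; the rest of the grammar is an assumption.
PREFIX = "urn:congress.gov:"


class CongressUrn(NamedTuple):
    kind: str   # e.g. "legis-num", "cong-dist", "legislator", "liv-term"
    value: str  # e.g. "109hr2121", "109ca14", "C000191", "social+security"


def parse_congress_urn(urn: str) -> CongressUrn:
    """Split a proposed urn:congress.gov URN into its kind and value."""
    if not urn.startswith(PREFIX):
        raise ValueError(f"not a congress.gov URN: {urn!r}")
    kind, sep, value = urn[len(PREFIX):].partition(",")
    if not sep or not kind or not value:
        raise ValueError(f"expected 'kind,value' after the prefix: {urn!r}")
    return CongressUrn(kind, value)


def build_congress_urn(kind: str, value: str) -> str:
    """Inverse of parse_congress_urn."""
    return f"{PREFIX}{kind},{value}"
```

For example, parse_congress_urn("urn:congress.gov:legislator,C000191") yields the kind "legislator" and the value "C000191", and build_congress_urn puts them back together unchanged.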

He's talking to Library of Congress people to see what they think.
Anyway, Daniel is on the list now, so I guess he'll update us all if he
hears from the LOC people.

It seems likely that the Bioguide IDs for legislators will be the basis
of whatever system comes out of this, so it'd be good to get a table
together of all of the information on bioguide.  We'll need that to
match up the bioguide IDs with the IDs some of us use internally.

Bill, the congress file you posted with the bioguide IDs
(http://www.progressivenation.net/modules/tinycontent3/index.php?id=15)
only has current legislators, right?  Would it be easy for you to post a
version with records for everyone in bioguide?  (Also, all of the names
in the file are in uppercase.  Can you preserve the casing from Bioguide?)

--
- Joshua Tauberer

http://taubz.for.net

** Nothing Unreal Exists **




#78 From: Scott Beardsley <sc0ttbeardsley@...>
Date: Mon Mar 14, 2005 2:29 am
Subject: Re: Proposed URIs for Congressional Things
sc0ttbeardsley
 
--- Joshua Tauberer / GovTrack <tauberer@...>
wrote:

> It seems likely that the Bioguide IDs for
> legislators will be the basis
> of whatever system comes out of this, so it'd be
> good to get a table
> together of all of the information on bioguide.

What about politicians that never actually get voted
into a legislative position? I'd like to see a system
emerge that can handle referencing failed campaigns as
well.

Also, what about politicians that have held multiple
positions in government?

For politician ID's I think we need to use a global id
and not one tied directly to the House or Senate. This
global ID should allow easy state data integration. If
the BioGuide ID is available for a legislator we
should store it but I think we should move away from
relying on it as our unique politician id.

Scott




#79 From: Bill Farrell <jwwf@...>
Date: Mon Mar 14, 2005 6:01 pm
Subject: BioGuide - Step I
idawannanoe
 
Hello All,

Here's step 1 of the next BioGuide project.

Part 1 is a complete list of states and former territories:
http://pythia.progressivenation.net/smxml/states.php . This URI will return an
XML rendition.

The key to the states table is the 2-character FIPS abbreviation, with which
we're all familiar.  The Name attribute is the official spelling.  The
FIPSNumber attribute is the Federal Information Processing Standard number. 
This identifier is key to many, many other government files and indices.  This
number can also be used to select the FIPS55 file and the Places2K file by code,
leading you into county and census information (thence to geolocation data).

Finally, the IsCurrent attribute will contain a "1" (0x31) if the named entity
is currently a state.  This allows us to use official abbreviations that are
also employed in the BioGuide (e.g., "DK" for Dakota Territory).

Part 2 is the list of political parties from BioGuide:
http://pythia.progressivenation.net/smxml/politicalparties.php
Again, an XML rendition will be returned.

I'll publish more normalizing tables as needed until we get the whole BioGuide.

Best regards,
Bill

#80 From: Joshua Tauberer / GovTrack <tauberer@...>
Date: Mon Mar 14, 2005 11:35 pm
Subject: Re: Proposed URIs for Congressional Things
tauberer
 
Scott Beardsley wrote:
> --- Joshua Tauberer / GovTrack <tauberer@...>
> wrote:
>>It seems likely that the Bioguide IDs for
>>legislators will be the basis
>>of whatever system comes out of this
> ...
> For politician ID's I think we need to use a global id
> and not one tied directly to the House or Senate. This
> global ID should allow easy state data integration.

That's a good point.  Something to keep in mind is that other schemes
can be used to label people not in Congress, but the problem with that
is how to label someone, as you said, that's had two positions in
government.

Or, another way to look at that problem, is how do we indicate that two
IDs refer to the same person?

I thought about this a bit this morning.  For sure we're not going to
come up with a unified system for identifying every individual in the
universe, which means at some point someone is going to get two IDs, and
we need a table saying which IDs are equal if we want the data to work
together.  But, ideally, this table should be as decentralized as the
original naming system, and that means applications will need to know
which tables to trust.  An arbitrary person shouldn't be able to post a
file that says URI X = URI Y and cause, e.g., all blog aggregators to
work accordingly.
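
One way to picture the "which tables to trust" idea is a merge step that only honours equivalence assertions from publishers the application has chosen to trust. The Python below is a rough sketch under that assumption. The publisher names, the second URI, and the function name are all hypothetical, and a union-find structure is just one convenient way to group IDs; nothing here is an agreed OGDEX mechanism.

```python
def merge_trusted_equivalences(tables, trusted_publishers):
    """Group IDs that trusted publishers assert to be the same person.

    tables: dict mapping publisher name -> list of (id_a, id_b) pairs.
    trusted_publishers: set of publisher names this application accepts.
    Returns a dict mapping every mentioned ID to a canonical representative.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for publisher, pairs in tables.items():
        if publisher not in trusted_publishers:
            continue  # assertions from untrusted sources are ignored entirely
        for a, b in pairs:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller string as the root so results are deterministic.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    return {x: find(x) for x in list(parent)}


# Hypothetical example: one trusted table, one arbitrary poster.
tables = {
    "trusted-site": [("urn:congress.gov:legislator,C000191", "urn:example:state-id-42")],
    "arbitrary-poster": [("urn:congress.gov:legislator,C000191", "urn:example:someone-else")],
}
canonical = merge_trusted_equivalences(tables, {"trusted-site"})
```

Here the two IDs from the trusted table end up with the same representative, while the pair posted by the untrusted source has no effect.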

I posted a message yesterday on the W3's Semantic Web mail list
introducing GovTrack's RDF data, this mail list, and OGDEX.
(http://lists.w3.org/Archives/Public/semantic-web/2005Mar/0065.html)  I
may post another message there asking for comments on this issue of
naming things.

My inkling is that the way to go is to adopt any system now for IDing
things, and eventually there will be a framework for tying IDs together.

--
- Joshua Tauberer

http://taubz.for.net

** Nothing Unreal Exists **

#81 From: Bill Farrell <jwwf@...>
Date: Mon Mar 14, 2005 11:46 pm
Subject: Re: Proposed URIs for Congressional Things
idawannanoe
 
----- Original Message -----
From: Joshua Tauberer / GovTrack <tauberer@...>
To: govtrack@yahoogroups.com
Sent: Mon, 14 Mar 2005 23:35:30 +0000
Subject: Re: [govtrack] Proposed URIs for Congressional Things


Here is an example of proper normalization.  BioGuide should probably be a good
starting-point.  If we come to a common RDF vocabulary, there should be no
reason not to enforce data integrity at least within the OGDEX community.  Isn't
that why we're bothering?  In a properly normalized and integrity-enforced
system, there shouldn't be dual ID's unless a state or local data producer has
identified a person differently to LOC.  In that case, creating a cross-index
STILL should be no trick.  Just thinkin' out loud...

Speaking of which, I've got all the BioGuide info downloaded and am rethinking
the original CONGRESS file a bit, opening up a couple attributes for queries
that should satisfy any sort of search for a person ID.  I've also grouped the
dependent attributes together, just for neatness' sake ( {Session, SH, Position}
are now in attributes 19-21, not that it will ever matter on output).

The "Vacancy" attributes should probably be dropped, since they really refer
more to a "seat" than a person.  They're not used in the current version of
xml.house.gov's rendition.  Perhaps a separate table that describes specific
Congressional positions?  Thoughts?

I should have the complete BioGuide ready late tonight or tomorrow.

Best!
Bill

#82 From: Bill Farrell <jwwf@...>
Date: Tue Mar 15, 2005 12:04 am
Subject: Re: BioGuide - Step I
idawannanoe
 
DOH!  Left one out:  POLITICAL_ROLES

http://www.progressivenation.net/smxml/politicalroles.php

#83 From: Scott Beardsley <sc0ttbeardsley@...>
Date: Tue Mar 15, 2005 12:38 am
Subject: Re: Proposed URIs for Congressional Things
sc0ttbeardsley
 
--- Bill Farrell <jwwf@...> wrote:

> In a properly
> normalized and integrity-enforced system, there
> shouldn't be dual ID's unless
> a state or local data producer has identified a
> person differently to
> LOC.  In that case, creating a cross-index STILL
> should be no trick.

So for my state election results (that reference both
failed and successful campaigns at multiple levels of
gov) I should be creating my own ID system or no?

I'm currently using something similar to (for all
people I track):

ogdex:us/ca/election/20041102/us/senate/boxer

which will later be tied (for successful campaigns)
to:

ogdex:us/senate/109/person/B000711

I'm not against using the BioGuide I'm just trying to
figure out how I am going to link my data (for which
I'm getting close to this issue).

> Perhaps a separate table
> that describes specific Congressional positions?
> Thoughts?

I've already started something similar for my stuff. I
think it's a good idea.

Scott




#84 From: Bill Farrell <jwwf@...>
Date: Tue Mar 15, 2005 1:12 am
Subject: Re: Proposed URIs for Congressional Things
idawannanoe
 
Hey, Scott,

If there is no other resort, my thought is that you would have the authoritative
table for CA state- and local-level results. If CA isn't identifying their
legislators the way BioGuide does for federal, you may be "it" -- the authority.
I think for federal level, we may as well take advantage of BioGuide's IDs,
since they point to a known quantity in a known place
(URI->http://bioguide.congress.gov/scripts/biodisplay.pl?index=XXXXXX for
example), and they also tie to Joshua's federal-level scrapings.
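
As a small illustration of the "known quantity in a known place" point, the Python sketch below resolves a local ID to a BioGuide page through a cross-index. The URL pattern is the one quoted above. The cross-index contents reuse the Boxer example from Scott's message, and the dictionary and function names are hypothetical.

```python
from urllib.parse import urlencode

# URL pattern as quoted in the message above.
BIOGUIDE_DISPLAY = "http://bioguide.congress.gov/scripts/biodisplay.pl"

# Hypothetical cross-index: local (state-level) ID -> BioGuide ID.
# The single entry reuses the example IDs from Scott's message.
CROSS_INDEX = {
    "ogdex:us/ca/election/20041102/us/senate/boxer": "B000711",
}


def bioguide_url(bioguide_id):
    """Build the BioGuide display URL for a federal-level ID."""
    return f"{BIOGUIDE_DISPLAY}?{urlencode({'index': bioguide_id})}"


def resolve(local_id, cross_index=CROSS_INDEX):
    """Return the BioGuide URL for a local ID, or None if it has no federal ID."""
    bioguide_id = cross_index.get(local_id)
    return bioguide_url(bioguide_id) if bioguide_id is not None else None
```

A local ID with no entry in the cross-index (a failed campaign, say) simply resolves to None, so the state-level table stays authoritative for people BioGuide never lists.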

Here's where RDF interchange gets beautiful.  When you publish your state- and
local-level IDs table, we'll all know exactly how to get at it and extract
information. Boom, OGDEX expands its mining capabilities as a whole -- very,
very useful indeed.  The local IDs would be one gateway into your system, much
as the federal ID's would be one gateway into Joshua's.

If we do this right, a researcher should be able to start at ANY point in the
community and continue drilling to any other point, regardless of the actual
residence of the data.

It might be messy to try to integrate the state and local ID's into the BioGuide
system, so we might want to kick around some ideas on how we should describe
governmental tiering in RDF.

Best!
Bill


#85 From: "DD" <citizencontact@...>
Date: Wed Mar 16, 2005 3:43 pm
Subject: Re: Proposed URIs for Congressional Things
citizencontact
 
Much of the discussion regarding the URN's and Bioguide ID's has to
do with whether they can track people in their roles elsewhere
in life or government. Having taken part in conversations with various
organizations and in the formulation of digital
certificates/signatures rules, I would like to propose a way of
thinking about the problems of tracking people in organizations.

First, tracking organizational structure is often an integral part of
established organizations. In the case of Congress, there is a 200+
year history, with rules being passed every two years. The Clerk of
the House and the Secretary of the Senate are responsible for
capturing that information for the organization itself. The role and
conflict of elected legislators and their role as representative of a
specific geographical area makes for a complex interaction. Within
Congress it is possible (and sometimes inevitable) for a person to
represent more than one congressional district (esp. due to census
redraws). For this reason it is important to track both the district
in time and the people separately.

The Senate has its own idiosyncrasies. So it is very important to
allow such organizations to create their own tracking and
naming conventions naturally. And it would be out of their scope to track what
other organizations do to track the same people in their roles in
business, NGO's, and state governments.

So I would always give Congress and other well-established
organizations the benefit of the doubt, and the "job" of creating their own
systems. It would be OGDEX's mission, perhaps, to allow citizens to
track legislators to their other roles.

Another huge advantage to opting for the Congress's own naming
convention is that it will make plumbing their other information
stores easier and more logical. Just imagine if instead of tracking
Congressional District numbers, we used the zip+4 or GIS information
and had to do constant imperfect translations to get at data.

I will get back with feedback to the suggested URN's from various
congressional groups.
Daniel

#86 From: "DD" <citizencontact@...>
Date: Wed Mar 16, 2005 4:00 pm
Subject: URN's and other things
citizencontact
 
In investigating the dozens of RFC's regarding URN's I am realizing
the potential advantages and some of the pitfalls.

Based on a suggestion from a tech guy at an aggregator company, I have
been investigating the use of the <cite> tag to allow for citations
for congressional legislation. I had been working on a
tagging/namespace system for legislation previously. Examples below:

<uscongress:legis-num>
  <uscongress:congress>109</uscongress:congress>
  <uscongress:bill-type>H.R.</uscongress:bill-type>
  <uscongress:bill-num>1000</uscongress:bill-num>
  <uscongress:bill-status>ih</uscongress:bill-status>
</uscongress:legis-num>


<cite id="urn:congress.gov:legisnum:109hr1000ih">H.R. 1000</cite>

I have found it relatively trivial to convert between the two. I have
found some advantages to the <cite> tag in terms of understanding that
the legislation is a document with a name. Either system allows for
making links to things related to the document, which matters since there
is no official online legislation (there is an existential question of what a
bill is: printed versus electronic versus the cartoon version, "I'm
only a bill, sitting here on Capitol Hill").
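
To make the "trivial to convert" remark concrete, here is a hedged Python sketch that goes between the URN in the <cite> example and the four fields of the tagged form. The field names mirror the tags above. The function names, and the assumption that the compact form is always digits, letters, digits, then an optional status suffix, are mine and only fit the one example shown.

```python
import re
from typing import NamedTuple


class LegisNum(NamedTuple):
    congress: int     # <uscongress:congress>
    bill_type: str    # compact lowercase form of <uscongress:bill-type>, e.g. "hr"
    bill_num: int     # <uscongress:bill-num>
    bill_status: str  # <uscongress:bill-status>, e.g. "ih"; may be empty


# Assumed shape: congress digits, type letters, bill digits, optional status letters.
_URN_RE = re.compile(r"urn:congress\.gov:legisnum:(\d+)([a-z]+)(\d+)([a-z]*)")


def compact_bill_type(display):
    """Turn a display form such as 'H.R.' into the compact 'hr'."""
    return display.replace(".", "").replace(" ", "").lower()


def parse_cite_urn(urn):
    """Split the id value of the <cite> example into its four fields."""
    match = _URN_RE.fullmatch(urn)
    if match is None:
        raise ValueError(f"unrecognised legisnum URN: {urn!r}")
    congress, bill_type, bill_num, status = match.groups()
    return LegisNum(int(congress), bill_type, int(bill_num), status)


def to_cite_urn(legis):
    """Inverse of parse_cite_urn."""
    return (f"urn:congress.gov:legisnum:"
            f"{legis.congress}{legis.bill_type}{legis.bill_num}{legis.bill_status}")
```

Going the other way, from "hr" back to the display form "H.R.", would need a lookup table of bill types, which is left out here.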

The use of the ISBN URN for books seems to be coming (and necessary to
stop the onslaught of commercially determined linking through
Microsoft SmartTags and Google Toolbar links).

The URN is likely to show up within Dublin Core tags including
dc:identifier and dc:subject, as well as in id tags for <cite> and
other CSS/JavaScript objects. This is as opposed to VCARD RDF
dc:creator possibilities where more than a name is to be conveyed.

Just some thoughts.
Daniel

#87 From: Joshua Tauberer / GovTrack <tauberer@...>
Date: Thu Mar 17, 2005 4:11 pm
Subject: Re: Proposed URIs for Congressional Things
tauberer
 
Bill Farrell wrote:
> First, congratulations on your successful venture to DC.  Your
> efforts are much appreciated.

Thanks, Bill.

> When any data goes into the UniVerse system, it's all normalized to
> upper-case.  It doesn't *have* to, but here's why.  One of the
> strengths of a post-relational dbms is that while data are stored in
> a completely normalized fashion, it can be presented ANY way you like
> on output.

Except, it's not (always) possible to recover case after it's been
normalized.

What's the rule to go from upper case to regular case?  A first try
would be to keep only the first letter uppercase.  But, consider these names:
    Millender-McDonald  =>  Have to revise the rule: Letters after hyphens
must be capitalized, and "Mc" has to be figured out.
    LaFalce => I challenge you to come up with a rule for this one. :)
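
The problem can be seen in a few lines of Python. This is only an illustrative sketch: the function name is made up, and it implements the revised rule from above (capitalize the first letter and every letter after a hyphen), which still cannot restore the original casing.

```python
def naive_recase(upper_name):
    """Capitalize the first letter of each hyphen-separated part, lowercase the rest."""
    return "-".join(part.capitalize() for part in upper_name.split("-"))


print(naive_recase("MILLENDER-MCDONALD"))  # Millender-Mcdonald, not Millender-McDonald
print(naive_recase("LAFALCE"))             # Lafalce, not LaFalce
```

The information about which interior letters were capitals is simply gone once the name has been upper-cased, so any rule like this can only guess.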

  > (PS, I'm still having
> trouble finding examples of RDF classes that contain nested classes.
> Any help in finding examples would be greatly appreciated!)

Not sure exactly what you mean.

> Things become interesting when they relate to one
> another across an entire realm of possibilities.

Yeah, that's exactly what excites me about all of this.

> While I'm on about this task: Any thoughts on what we'd like to do
> with the BioGuide email addresses that are actually javascript links
> to input forms?

Where do you see that?  I wouldn't worry about it for now.

> There shouldn't be dual ID's unless a state or local data producer has
> identified a person differently to LOC.

For sure this will happen at some point no matter how hard we might try
to prevent it.  State-level politicians, who might be given an ID as
soon as someone notices them, go on to become members of congress, where
they will be given a bioguide ID.

I think we need to go on assuming that the problem of multiple IDs for
the same person will be solved in the future.  Scott -- I'd say continue
with your own naming system, as will GovTrack, until we can figure out
what to do.

--
- Joshua Tauberer

http://taubz.for.net

** Nothing Unreal Exists **

#88 From: Bill Farrell <jwwf@...>
Date: Thu Mar 17, 2005 9:48 pm
Subject: Re: Proposed URIs for Congressional Things
idawannanoe
 
I'd answer in-line, but my webmail insists on sending the original message as an
attachment regardless of settings, grrrr...

I've gotten all of the BioGuide scraped and in the system.  The next chorelet (a
pretty easy one, since I've done it before) is to meld the mbr107.xml into the
store.  I've revamped the file layout quite a bit, having gathered neat tables
of every political party and official position/role.  Those side tables are
great for normalization and are useful on their own.  Matter of fact, I wound up
with several new support tables.  I'm documenting as I go and will have examples
ready for the group in a day or three.

I "disremembered" the javascript contact forms.  Those are on
http://www.senate.gov/general/contact_information/senators_cfm.cfm - a perfectly
beastly page to scrape.  I'll pull that puppy in while I'm re-casing the CONGRESS
file.  Same deal goes for the representative-to-district mapping.

Luckily, those slide right into Excel 2003 politely and come out as a nice text
file.  Piece of cake to parse.  Literally a two-step.  Now that I look at the
flat rendition, I think I'll add an E-Contact field to house the contact form
URLs in addition to the Email field in CONGRESS. We can decide how to react on
query later.

The rep-to-district mapping is rather a nice find, since that relates the
Congress file directly to the FIPS and the Census Places2K files already
on-hand.  That is, if you know a rep's surname, for instance, you can find all
the cities, towns, etc in his/her district.  Handy stuff to know, particularly
if one wants to find money-trails.  I've also got the rep-to-committee mapping
and the complete phone list.  The senate phone list is in PDF, but it looks as
if that will pose little problem.

BTW, I've been using a very handy little free tool, pdf2html for 3 or 4 years
now. PDF's that are mostly text move over quite nicely and HTML is a heckuva lot
easier to parse out.  If anyone wants a copy, I think I have the original tgz
file.  I'll go look for it when I get home tonight.  It's written in Perl, so it
should work for about everyone on the list.

The only counter to the don't-normalize proposition I can offer is that
Millender-McDonald is absolutely not the same as MILLENDER-MCDONALD.  Now that I
think about it, making an upper-case virtual field for searches is zero problem.
It doesn't cost the system any more or less to use a virtual field for searches.
Piece o' cake.  I'll rerun the file-builder and leave case alone.  Since the
nice folks running BioGuide did the hard work of pretty-printing, we'll continue
to let them :-D
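
Outside UniVerse, the same idea (store the name as published, search through an upper-cased derived key) might look roughly like the Python below. It is a sketch only: the class name is invented, the item ID in the usage lines is a placeholder rather than a real BioGuide ID, and a dictionary stands in for whatever a virtual field does internally.

```python
from collections import defaultdict


class NameIndex:
    """Keep names exactly as published; search through an upper-cased key."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add(self, item_id, name):
        # The stored value keeps its original casing; only the lookup key is folded.
        self._by_key[name.upper()].append((item_id, name))

    def search(self, query):
        # Queries are folded the same way, so case never matters when searching.
        return list(self._by_key.get(query.upper(), []))


index = NameIndex()
index.add("X000001", "Millender-McDonald")   # placeholder ID, not a real BioGuide ID
matches = index.search("MILLENDER-MCDONALD")  # finds the record with its casing intact
```

The search key is derived data, so it can be rebuilt at any time, while the pretty-printed name from BioGuide is never altered.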

Insofar as nested classes go, I've flattened the main CONGRESS file somewhat. 
CONGRESS now only contains biographic/demographic information.  There is a new
file, CONGRESS_ROLES which IS multivalued, though -- little way out of that. 
The CONGRESS_ROLES file is keyed to the CONGRESS file... same item-id's.  For
each legislator there is a single record with two dependent attributes: SESSION
and ROLE.  For every value in the SESSION attribute there is an equally-ranked
value in ROLE.  With one read, you have each legislator's history.

If it were expressed in XML (actually an accurate representation of the
internals), it would look something like:

<congress_roles>
   <congress_roles_record id="D000443">
     <roles>
       <role sessionid="25">DELEGATE</role>
       <role sessionid="26">DELEGATE</role>
       <role sessionid="31">REPRESENTATIVE</role>
       <role sessionid="32">REPRESENTATIVE</role>
     </roles>
   </congress_roles_record>
   <congress_roles_record id="L000402">
     <roles>
       <role sessionid="24">REPRESENTATIVE</role>
       <role sessionid="25">REPRESENTATIVE</role>
     </roles>
   </congress_roles_record>

(yaddayadda)

</congress_roles>

Of course, a virtual field in the CONGRESS dictionary (currently called ROLES)
would return the same information, but nested within a congress_record
structure.  A class within a class.  Each legislator's set of roles would be one
class (like above), but the same information can be delivered with a query to
CONGRESS, which would also be a class of data.
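
For readers more at home outside multivalued databases, the pairing of SESSION and ROLE values can be sketched in Python as two parallel lists zipped into the XML shape shown above. The element and attribute names are copied from that example. The function names are hypothetical, and this is not a description of how UniVerse itself produces the output.

```python
import xml.etree.ElementTree as ET


def roles_record_to_xml(item_id, sessions, roles):
    """Pair the Nth SESSION value with the Nth ROLE value for one legislator."""
    if len(sessions) != len(roles):
        raise ValueError("SESSION and ROLE must hold the same number of values")
    record = ET.Element("congress_roles_record", {"id": item_id})
    roles_el = ET.SubElement(record, "roles")
    for session, role in zip(sessions, roles):
        role_el = ET.SubElement(roles_el, "role", {"sessionid": str(session)})
        role_el.text = role
    return record


def history(record):
    """Read a legislator's (session, role) history back out of one record."""
    return [(int(r.get("sessionid")), r.text) for r in record.iter("role")]


# Values taken from the L000402 record in the example above.
record = roles_record_to_xml("L000402", [24, 25], ["REPRESENTATIVE", "REPRESENTATIVE"])
print(ET.tostring(record, encoding="unicode"))
```

Nesting the same element under a larger record is then a matter of appending it to a parent element, which is the "class within a class" arrangement described above.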

Again, it doesn't matter where the data physically live -- it's more of a matter
of describing what effects we wish to produce against what kinds of queries. 
The whole reason for normalization is to ensure that any set of data can relate
absolutely and unerringly to any other.  As currently designed, one can walk
easily into and out of our bioguide to geolocation data, to census data, etc. 
Any of the other data can be presented concomitantly with ANY query.  Call it a
bigole cloud, but one in which you know the exact location, shape, and value of
any single droplet within it.

As I said, I'm multilingual in human and computer languages... but learning RDF
is a lot like learning the same-old same-old post-relational data concepts
(ex-instructor, know 'em by heart) all over again.  But in Mandarin.  Which I
don't speak :-)  Right now I'm concentrating on getting the data IN... I'll
worry about getting it out in RDF as we get to it.

I'm working on a universal query front-end, which I expect to have done in a few
days.  We can begin to look at the treasure-trove and evaluate how we want to
express the contents of the stores.  This project is going along-side the
BioGuide/Congress project.  There's enough of the web interface available so
that we can all look at the dictionary structures to see what kinds of data are
available and how we'd like to package it.  All the files in the lineup are
inter-related, such that a query on any one can as easily return data from any
other(s).  I've opened it up for examination: http://www.progressivenation.net
"Choose a Research Area".  For now, you can see a table layout for each of the
published files.  The query link works through to the final page -- if you push
the "Report" button, you'll not be really happy just yet.  It's a place to start
discussion, though.

There's where I am at the moment.  Full plate and counting.

Best regards,
Bill
