Looking through FDSys' bill collection (which is the official public source for bill XML) for the 105th Congress, it does appear that older bills aren't available in XML.
I've been doing some work on extracting cross references to the US Code out of bills over the past year. I added support to Scout (Sunlight's bill search/alert system) for smart citation search (example) by first making a general-purpose citation extraction tool over at github.com/unitedstates/citation.
That extractor operates on arbitrary text, so my approach is to run the plaintext versions of bills through it and store all the matches and their surrounding excerpts for later reference. Those stored matches are then used to filter search results.
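As a rough illustration of that approach (this is not the unitedstates/citation code or its API, just a minimal Python sketch), the idea is to scan plain text for US Code cites and keep each match along with some surrounding text. The regular expression, the `extract_citations` function, and the field names here are all assumptions made for the example, and the pattern only handles simple cites like "5 U.S.C. 552(a)(1)".

```python
import re

# Simplified, illustrative pattern for cites such as "42 USC 1395x" or
# "5 U.S.C. 552(a)(1)". Real citation formats vary far more than this.
USC_PATTERN = re.compile(
    r"\b(?P<title>\d+)\s+U\.?S\.?C\.?\s+"
    r"(?P<section>\d+[a-z]*)"
    r"(?P<subsections>(?:\([a-zA-Z0-9]+\))*)"
)


def extract_citations(text, excerpt_chars=40):
    """Return each US Code cite found in text, with a surrounding excerpt."""
    results = []
    for m in USC_PATTERN.finditer(text):
        # Clamp the excerpt window to the bounds of the text.
        start = max(0, m.start() - excerpt_chars)
        end = min(len(text), m.end() + excerpt_chars)
        results.append({
            "match": m.group(0),
            "title": m.group("title"),
            "section": m.group("section"),
            # "(a)(1)" becomes ["a", "1"]
            "subsections": re.findall(r"\(([a-zA-Z0-9]+)\)", m.group("subsections")),
            "index": m.start(),
            "excerpt": text[start:end],
        })
    return results


if __name__ == "__main__":
    sample = "Records are available under 5 U.S.C. 552(a)(1) and 42 USC 1395x."
    for cite in extract_citations(sample):
        print(cite["match"], "->", cite["title"], cite["section"], cite["subsections"])
```

Storing the excerpt next to the parsed pieces means search results can be filtered on title and section and still show the reader where the cite appeared.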
More recently, I started working with XML versions of bills in order to turn them into something I could actually integrate and display in Scout. This example (H.R. 1728) is using HTML that I transformed from the original XML.
The XML is transformed using some code at github.com/unitedstates/documents. All that code does right now is swap all the XML tags for divs and spans and remove a bunch of extraneous attributes. It also looks at any external-xref tags, plucks out the pieces of the citations inside them, and stores those pieces in data attributes. Turning those tags into links to other parts of Scout is actually done within Scout, client-side - the unitedstates/documents code is meant to be general purpose.
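To make the transformation concrete, here is a minimal Python sketch of the same idea. It is not the unitedstates/documents code. The list of block-level tags, the `parsable-cite` attribute with a "usc/title/section" shape, and the `data-*` attribute names are all assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Assumed set of tags to render as block-level divs; everything else is a span.
BLOCK_TAGS = {"bill", "legis-body", "section", "subsection", "paragraph"}
# Attributes worth keeping; anything else is treated as extraneous and dropped.
KEEP_ATTRS = {"id"}


def parse_cite(parsable_cite):
    """Split an assumed 'usc/42/1395' style value into data attributes."""
    parts = parsable_cite.split("/")
    if len(parts) >= 3 and parts[0] == "usc":
        return {
            "data-citation-type": "usc",
            "data-title": parts[1],
            "data-section": parts[2],
        }
    return {}


def to_html(elem):
    """Recursively copy an XML element into a div or span."""
    tag = "div" if elem.tag in BLOCK_TAGS else "span"
    # Keep the original tag name as a class so styling can still target it.
    attrs = {"class": elem.tag}
    for name in KEEP_ATTRS:
        if name in elem.attrib:
            attrs[name] = elem.attrib[name]
    if elem.tag == "external-xref":
        attrs.update(parse_cite(elem.get("parsable-cite", "")))
    out = ET.Element(tag, attrs)
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(to_html(child))
    return out


def transform(xml_string):
    root = ET.fromstring(xml_string)
    return ET.tostring(to_html(root), encoding="unicode", method="html")


if __name__ == "__main__":
    sample = (
        '<section id="s1" display-inline="no"><text>See '
        '<external-xref legal-doc="usc" parsable-cite="usc/42/1395">'
        "42 U.S.C. 1395</external-xref>.</text></section>"
    )
    print(transform(sample))
```

Because the citation pieces end up in data attributes rather than in an href, the output stays general purpose, and whatever consumes it can decide client-side where each cite should link.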
Right now, I'm allowing Congress' external-xref tags to pass through and become links, rather than doing the citation detection myself -- however, Congress is apparently not very good at it, and I see a lot of unlinked cites. (This is in contrast to the Federal Register, which does a great job at detection.)
So, at some point soon I'm planning to remove Congress' cites, run the XML through unitedstates/citation myself, and have it find and mark up the cites (it's a new feature I built for dccode.org/browser that I haven't documented yet).
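Since that feature isn't documented, the following is only a guess at the general shape of "find and mark up the cites", written as a small Python sketch and not taken from unitedstates/citation. It assumes Congress' own tags have already been stripped so the input is plain text, and the simplified pattern, the `mark_up_cites` name, and the `data-*` attributes are all hypothetical.

```python
import html
import re

# Simplified, illustrative pattern; it only catches cites like "5 U.S.C. 552".
USC_PATTERN = re.compile(r"\b(\d+)\s+U\.?S\.?C\.?\s+(\d+[a-z]*)")


def mark_up_cites(text):
    """Escape plain text for HTML and wrap each detected cite in a span."""
    pieces = []
    last = 0
    for m in USC_PATTERN.finditer(text):
        # Text between cites is escaped and passed through unchanged.
        pieces.append(html.escape(text[last:m.start()]))
        pieces.append(
            '<span class="citation" data-title="{}" data-section="{}">{}</span>'.format(
                html.escape(m.group(1)),
                html.escape(m.group(2)),
                html.escape(m.group(0)),
            )
        )
        last = m.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(mark_up_cites("See 5 U.S.C. 552 & 42 USC 1395x."))
```

Doing the detection and the markup in one pass keeps the character offsets consistent, which matters once the same text may contain several cites.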