Re: [govtrack] Bill References in Older Congresses
Looking through FDSys' bill collection (which is the official public source for bill XML), for the 105th Congress, it does appear that older bills aren't available in XML.

I've been doing some work on extracting cross references to the US Code out of bills over the past year. I added support to Scout (Sunlight's bill search/alert system) for smart citation search (example) by first making a general-purpose citation extraction tool over at github.com/unitedstates/citation.

That extractor operates on arbitrary text, so my approach is to run the plaintext versions of bills through it and store all the matches and surrounding excerpts for later reference. Those stored matches are used to filter search results.

More recently, I started working with XML versions of bills in order to turn them into something I could actually integrate and display in Scout. In this example (H.R. 1728), it's using HTML that I transformed from the original XML.

The XML is transformed using some code at github.com/unitedstates/documents. All that code does right now is swap all the XML tags for divs and spans, and remove a bunch of extraneous attributes. It also looks at any external-xref tags, plucks out the pieces of the citations in them, and stores them in data attributes. Turning those tags into links to other parts of Scout is actually done within Scout, client-side. The unitedstates/documents code is meant to be general purpose.

Right now, I'm allowing Congress' external-xref tags to pass through and become links, rather than doing the citation detection myself -- however, Congress is apparently not very good at it, and I see a lot of unlinked cites. (This is in contrast to the Federal Register, which does a great job at detection [original].)

So, at some point soon I'm planning to remove Congress' cites, run the XML through unitedstates/citation myself, and have it find and mark up the cites (it's a new feature I built for dccode.org/browser that I haven't documented yet).

-- Eric

On Mon, May 6, 2013 at 4:54 PM, zhynes16 <zhynes16@...> wrote:
Are the clean XML versions of bill full texts currently unavailable for Congresses prior to the 108th Congress? I am trying to pull out the cross-references to different parts of the U.S. Code with the external-xref tags in the XML files, but it appears that the XML files are not in place for the older Congresses.
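For the Congresses where XML does exist, pulling the U.S. Code cross-references out of the external-xref tags might look roughly like the sketch below. It uses only Python's standard library. The sample fragment is made up for illustration, and the attribute names `legal-doc` and `parsable-cite` are an assumption about how the tags are marked up, so check them against a real bill file before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical bill fragment for illustration only. The attribute names
# (legal-doc, parsable-cite) are an assumption about the bill XML markup.
SAMPLE_BILL_XML = """<bill>
  <section>
    <text>Section 1861 of the Social Security Act
      (<external-xref legal-doc="usc" parsable-cite="usc/42/1395x">42 U.S.C. 1395x</external-xref>)
      is amended. See also
      <external-xref legal-doc="public-law" parsable-cite="pl/111/148">Public Law 111-148</external-xref>.
    </text>
  </section>
</bill>"""


def extract_usc_xrefs(xml_text):
    """Return (parsable-cite, display text) pairs for U.S. Code references."""
    root = ET.fromstring(xml_text)
    refs = []
    # iter() walks the whole tree, so nesting depth does not matter.
    for node in root.iter("external-xref"):
        if node.get("legal-doc") == "usc":
            display_text = "".join(node.itertext()).strip()
            refs.append((node.get("parsable-cite"), display_text))
    return refs


xrefs = extract_usc_xrefs(SAMPLE_BILL_XML)
for cite, display_text in xrefs:
    print(cite, "|", display_text)
```

Filtering on `legal-doc` keeps references to public laws and other documents out of the result. Dropping that check would return every external-xref in the file.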
Seems like the references take on a standard form that I can extract with a regex search through the HTML files, but I was wondering if anyone else had already done anything to extract these references and put them in a more convenient form.
Thanks for your help!
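For the older Congresses that only have plaintext or HTML, the regex idea in the question, combined with storing each match together with its surrounding excerpt as described in the reply, could be sketched as follows. This is not the unitedstates/citation implementation. The pattern is a deliberately simplified assumption that only covers cites shaped like "42 U.S.C. 1395x" or "5 USC 552", and `find_usc_cites` is a hypothetical helper name.

```python
import re

# Simplified pattern (an assumption, not the unitedstates/citation grammar):
# a title number, "U.S.C." with optional periods, an optional section sign,
# then a section number that may carry a letter or hyphenated suffix.
USC_CITE = re.compile(
    r"(?P<title>\d+)\s+U\.?\s?S\.?\s?C\.?\s+(?:§+\s*)?(?P<section>\d+[A-Za-z0-9-]*)"
)


def find_usc_cites(text, context=40):
    """Return each cite with its title, section, and a surrounding excerpt."""
    results = []
    for m in USC_CITE.finditer(text):
        # Clamp the excerpt window to the bounds of the text.
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        results.append({
            "title": m.group("title"),
            "section": m.group("section"),
            "match": m.group(0),
            "excerpt": text[start:end],
        })
    return results


sample = "Benefits under 42 U.S.C. 1395x are subject to 5 USC 552 and related rules."
cites = find_usc_cites(sample)
for cite in cites:
    print(cite["title"], cite["section"], "|", cite["excerpt"])
```

Real bills also cite the Code in prose forms such as "section 552 of title 5, United States Code", which this pattern does not attempt to catch.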
--Developer | sunlightfoundation.com