
Re: [govtrack] Bill text problems?

  • Eric Mill
    Message 1 of 9 , Dec 10, 2011
      I've been looking exactly for sitemap files like that! Would you mind
      sharing how we can find the different sitemaps?

      For example, I guessed at the URL for the one for public and private laws:
      http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_PLAW_sitemap.xml

      But that file is very small and doesn't list what you would need to
      effectively spider the PLAW collection without scraping their HTML.
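
For anyone wanting to enumerate documents from a sitemap rather than scrape HTML, here is a minimal sketch of reading one. It assumes the GPO files follow the standard sitemaps.org schema (a `urlset` of `url`/`loc` entries, or a `sitemapindex` pointing at further sitemaps), which I have not verified. The function name `parse_sitemap` and the sample URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol. Assumption: the GPO
# sitemap files declare this namespace like ordinary sitemaps do.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (kind, entries) for a sitemap document.

    kind is "index" for a <sitemapindex> and "urlset" otherwise.
    entries is a list of (loc, lastmod) tuples; lastmod is None if absent.
    """
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    child = "sitemap" if kind == "index" else "url"
    entries = []
    for node in root.findall(SITEMAP_NS + child):
        loc = node.findtext(SITEMAP_NS + "loc")
        lastmod = node.findtext(SITEMAP_NS + "lastmod")
        if loc:
            entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return kind, entries


# Hypothetical sample data, not real GPO content.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/doc/A</loc><lastmod>2011-12-09</lastmod></url>
  <url><loc>http://example.org/doc/B</loc></url>
</urlset>"""

if __name__ == "__main__":
    kind, entries = parse_sitemap(SAMPLE)
    for loc, lastmod in entries:
        print(kind, loc, lastmod)
```

If a file turns out to be an index, the same function could be called again on each listed `loc` after downloading it; the download step is left out here on purpose.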

      As for text of bills -- I actually came to that realization yesterday
      myself, that the GPO .txt files were probably better. I definitely
      would not mind you switching over to them - I can adjust my regular
      expressions (just for sanitization, not extracting data) accordingly.
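
To illustrate the kind of sanitization involved, here is a rough sketch of stripping leading line numbers from pdftotext-style output. The exact layout of those files is an assumption on my part (a one- or two-digit number at the start of a line, followed by whitespace), and `strip_line_numbers` is a hypothetical helper, not code from either project.

```python
import re

# Assumed layout: optional indentation, a 1-2 digit line number, then
# spaces or tabs before the actual bill text. Real files may differ.
LINE_NUMBER = re.compile(r"^[ \t]*\d{1,2}[ \t]+", re.MULTILINE)


def strip_line_numbers(text):
    """Remove a leading line number from every line that has one."""
    return LINE_NUMBER.sub("", text)


if __name__ == "__main__":
    sample = " 1 SECTION 1. SHORT TITLE.\n 2 This Act may be cited\n"
    print(strip_line_numbers(sample))
```

A pattern like this would also eat a legitimate number that happens to start a line, which is one reason the GPO original .txt files, having no line numbers at all, look like the safer input.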

      -- Eric

      On Sat, Dec 10, 2011 at 12:51 PM, Josh Tauberer <tauberer@...> wrote:
      > Hi, everyone.
      >
      > Bill text is updating now.
      >
      > Thanks to whoever here forwarded the problem on to GPO --- I got an email
      > from someone at GPO who pointed me to their sitemap files, e.g.:
      > http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_BILLS_sitemap.xml (warning:
      > BIG file). I'm checking on how bills are split by year (by publication
      > date?), but this seems to be the most helpful way to find them all.
      >
      > Btw, Eric- For indexing bill text, it might be better to use the original
      > text files from GPO. The .txt files on GovTrack are generated using
      > pdftotext and have line numbers, whereas the GPO original .txt files do not
      > (I imagine they are generated from the XML or GPO locator codes files
      > directly).
      >
      > I don't use my own .txt files except to display historical bill text, and
      > unless there's an objection I could replace the pdftotext-generated files
      > with the GPO original .txt files.
      >
      > Any objections from anyone?
      >
      >
      > - Josh Tauberer
      > - GovTrack.us / POPVOX.com
      >
      > http://razor.occams.info | www.govtrack.us | www.popvox.com
      >
      > On 11/29/2011 10:25 AM, Eric Mill wrote:
      >>
      >>
      >>
      >> I use a combination of three files for each bill. Primarily, the .txt,
      >> for the text. I'm only storing the text en masse for full text search,
      >> not storing the semantic hierarchy of the bill. Secondarily, I use the
      >> MODS XML metadata to get what date the bill version was issued on, a
      >> pretty critical piece of data. However, sometimes the MODS file doesn't
      >> exist, and I use the .xml (HTML) version of the bill as a backup source
      >> for the issued date -- which, now that I look at the code, makes use of
      >> the Dublin Core metadata that you add on top of the original bill data.
      >> I don't make use of the PDF.
      >>
      >> My code that does all this is here, btw:
      >>
      >> https://github.com/sunlightlabs/realtimecongress/blob/master/tasks/bill_text_archive/bill_text_archive.rb
      >>
      >> I understand that this is less vital, but I mean it when I say the rsync
      >> is incredibly useful -- so much so that if you left it offline, what I'd
      >> probably do is set up a separate dedicated GPO bulk data mirroring
      >> service for at least bill text, that supported rsync, and use that
      >> internally. That's a lot of work, though! If you're continuing to use
      >> the GPO's bill text files in your own work on POPVOX, you'd do the
      >> community a service by continuing to make that work available.
      >>
      >> -- Eric
      >>
      >> On Tue, Nov 29, 2011 at 10:00 AM, Josh Tauberer <tauberer@...> wrote:
      >>
      >>        the bill text is a less vital service, since you just
      >>        repackage what GPO offers
      >>
      >>
      >>    Exactly. That's why I'm not particularly concerned about dropping
      >>    this since it doesn't do much to begin with and after 5+ years of
      >>    running the bill text scraper it's past time to rethink what's
      >>    useful. (Btw, it does also scrape the HTML bill text on THOMAS,
      >>    which is slightly less trivial, but still pretty trivial.)
      >>
      >>    Do you use the PDFs or HTML (or .txt?)?
      >>
      >>    (Clearly when I said "free loading" I was not referring to what I
      >>    agree is a simple repackaging of PDFs.)
      >>
      >>
      >>    - Josh Tauberer
      >>    - GovTrack.us / POPVOX.com
      >>
      >>    http://razor.occams.info | www.govtrack.us | www.popvox.com
      >>
      >>    On 11/29/2011 09:30 AM, Eric Mill wrote:
      >>
      >>
      >>
      >>        I make use of the bill text that GovTrack provides in Sunlight's
      >>        data services (our Real Time Congress API) and in the apps that
      >>        depend on it (including our Congress app). We load it into
      >>        ElasticSearch (recommended, btw) and we power our search and
      >>        highlighting with it. I'm imminently about to document this full
      >>        text search capability and offer it to the public.
      >>
      >>        Unlike bill metadata, where you've done God's work and scrape
      >>        THOMAS all day every day, the bill text is a less vital service,
      >>        since you just repackage what GPO offers and provide it via rsync.
      >>        This is an incredibly useful way to provide it though! I'd like it
      >>        to stick around.
      >>
      >>        I'm not sure it's possible to "free ride" on free, CC0-licensed,
      >>        repackaged versions of public domain government data. If you feel
      >>        like people have been insufficiently thankful for your work or
      >>        haven't given enough attribution, that is a more valid and specific
      >>        conversation to have than accusing folks who are asking about the
      >>        status of your public data on your public mailing list of competing
      >>        with your business.
      >>
      >>        -- Eric
      >>
      >>        On Tue, Nov 29, 2011 at 8:27 AM, Josh Tauberer <tauberer@...> wrote:
      >>
      >>            I've been meaning to write about this.
      >>
      >>            About two weeks ago GPO stopped updating GPO Access, which was
      >>            their system for publishing documents since the mid 90s. New
      >>            bills and other documents are only being published in FDSys
      >>            now, and GovTrack isn't pulling from FDSys because FDSys didn't
      >>            exist when I wrote the bill text scraper.
      >>
      >>            Since I've been focused on POPVOX lately, I haven't had a
      >>            chance to build a new scraper for GovTrack, although in
      >>            anticipation of this I've been working on reimplementing much
      >>            of the same functionality on POPVOX. I'm not sure what if any
      >>            of that code will be open, though we have an experimental API
      >>            for it now.
      >>
      >>            It would be helpful to know who else, if anyone, is using bill
      >>            text so I can plan the future of GovTrack's bill text
      >>            accordingly.
      >>
      >>            But I will say that folks free riding on my data and using it
      >>            to compete with my business (i.e. POPVOX) get no sympathy from
      >>            me.
      >>
      >>            - Josh Tauberer
      >>            - GovTrack.us / POPVOX.com
      >>
      >>            http://razor.occams.info | www.govtrack.us | www.popvox.com
      >>
      >>
      >>
      >>            On 11/29/2011 02:12 AM, jlundigard wrote:
      >>         > Hey all,
      >>         >
      >>         > We've noticed that we stopped receiving bill text from govtrack.
      >>         > It seems to have stopped around this bill:
      >>         >
      >>         > http://www.govtrack.us/congress/bill.xpd?bill=s112-1788
      >>         >
      >>         > That bill and more recently introduced ones don't have any bill
      >>         > text even though the text exists on the GPO website.
      >>         >
      >>         > Perhaps a scraper is down?
      >>         >
      >>         > Thanks,
      >>         > Andy
      >>         > OpenCongress.org
      >>         >
      >>         >
      >>         >
      >>
      >>
      >>
      >>        --
      >>        Developer | sunlightfoundation.com <http://sunlightfoundation.com>
      >>
      >>
      >>
      >>
      >>
      >>
      >>
      >>
      >> --
      >> Developer | sunlightfoundation.com <http://sunlightfoundation.com>
      >>
      >>
      >>
      >>



      --
      Developer | sunlightfoundation.com