
Re: [govtrack] Bill text problems?

  • Eric Mill
    Message 1 of 9 , Nov 29, 2011
      I use a combination of three files for each bill. Primarily, the .txt, for the text. I'm only storing the text en masse for full text search, not storing the semantic hierarchy of the bill. Secondarily, I use the MODS XML metadata to get what date the bill version was issued on, a pretty critical piece of data. However, sometimes the MODS file doesn't exist, and I use the .xml (HTML) version of the bill as a backup source for the issued date -- which, now that I look at the code, makes use of the Dublin Core metadata that you add on top of the original bill data. I don't make use of the PDF.

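      My code that does all this is here, btw:
      https://github.com/sunlightlabs/realtimecongress/blob/master/tasks/bill_text_archive/bill_text_archive.rb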

      I understand that this is less vital, but I mean it when I say the rsync is incredibly useful -- so much so that if you left it offline, what I'd probably do is set up a separate dedicated GPO bulk data mirroring service for at least bill text, that supported rsync, and use that internally. That's a lot of work, though! If you're continuing to use the GPO's bill text files in your own work on POPVOX, you'd do the community a service by continuing to make that work available.

      -- Eric
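
The date lookup described in this message (MODS first, Dublin Core as the backup) could be sketched roughly as follows. This is not the code from the linked Ruby task; it is a minimal Python illustration. The function name `issued_date` and the sample documents are hypothetical, and it assumes the date sits in a MODS `dateIssued` element and, in the fallback file, in a Dublin Core `dc:date` element.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Standard namespaces for MODS and Dublin Core elements.
MODS_NS = "{http://www.loc.gov/mods/v3}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def issued_date(mods_xml: Optional[str], bill_xml: Optional[str]) -> Optional[str]:
    """Return the issued date from MODS if present, else from Dublin Core."""
    if mods_xml:
        node = ET.fromstring(mods_xml).find(f".//{MODS_NS}dateIssued")
        if node is not None and node.text:
            return node.text.strip()
    if bill_xml:
        # Assumption: the backup file carries the date in a dc:date element.
        node = ET.fromstring(bill_xml).find(f".//{DC_NS}date")
        if node is not None and node.text:
            return node.text.strip()
    return None


# Hypothetical sample documents, made up for illustration only.
SAMPLE_MODS = (
    '<mods xmlns="http://www.loc.gov/mods/v3">'
    "<originInfo><dateIssued>2011-11-02</dateIssued></originInfo></mods>"
)
SAMPLE_BILL = (
    '<bill xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<metadata><dc:date>2011-11-02</dc:date></metadata></bill>"
)

print(issued_date(SAMPLE_MODS, None))
print(issued_date(None, SAMPLE_BILL))
```

Passing `None` for the MODS argument stands in for the case where the MODS file does not exist.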

      On Tue, Nov 29, 2011 at 10:00 AM, Josh Tauberer <tauberer@...> wrote:
      > the bill text is a less vital service, since you just
      > repackage what GPO offers

      Exactly. That's why I'm not particularly concerned about dropping this: it doesn't do much to begin with, and after 5+ years of running the bill text scraper it's past time to rethink what's useful. (Btw, it does also scrape the HTML bill text on THOMAS, which is slightly less trivial, but still pretty trivial.)

      Do you use the PDFs or HTML (or .txt?)?

      (Clearly when I said "free loading" I was not referring to what I agree is a simple repackaging of PDFs.)


      - Josh Tauberer
      - GovTrack.us / POPVOX.com

      http://razor.occams.info | www.govtrack.us | www.popvox.com

      On 11/29/2011 09:30 AM, Eric Mill wrote:


      I make use of the bill text that GovTrack provides in Sunlight's data
      services (our Real Time Congress API) and in the apps that depend on it
      (including our Congress app). We load it into ElasticSearch
      (recommended, btw) and we power our search and highlighting with it. I'm
      imminently about to document this full text search capability and offer
      it to the public.
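
For readers unfamiliar with the search-plus-highlighting setup mentioned here, a request body of roughly this shape is what a full text query with highlights looks like in Elasticsearch's JSON query DSL. This is only a sketch: the field name `text` and the helper `search_body` are hypothetical, the `match` form is the one used by recent Elasticsearch versions, and nothing here reflects the actual Real Time Congress configuration.

```python
import json


def search_body(phrase: str, field: str = "text") -> dict:
    """Build a request body that matches `phrase` and asks for highlighted snippets."""
    return {
        "query": {"match": {field: phrase}},
        # An empty object requests default highlighting for that field.
        "highlight": {"fields": {field: {}}},
    }


# Print the JSON that would be sent as the body of a search request.
print(json.dumps(search_body("clean water"), indent=2))
```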

      Unlike bill metadata, where you've done God's work and scraped THOMAS
      all day every day, the bill text is a less vital service, since you just
      repackage what GPO offers and provide it via rsync. This is an
      incredibly useful way to provide it though! I'd like it to stick around.

      I'm not sure it's possible to "free ride" on free, CC0-licensed,
      repackaged versions of public domain government data. If you feel like
      people have been insufficiently thankful for your work or haven't given
      enough attribution, that is a more valid and specific conversation to
      have than accusing folks who are asking about the status of your public
      data on your public mailing list of competing with your business.

      -- Eric

      On Tue, Nov 29, 2011 at 8:27 AM, Josh Tauberer <tauberer@...> wrote:

         I've been meaning to write about this.

         About two weeks ago GPO stopped updating GPO Access, which was their
         system for publishing documents since the mid 90s. New bills and other
         documents are only being published in FDSys now, and GovTrack isn't
         pulling from FDSys because FDSys didn't exist when I wrote the bill text
         scraper.

         Since I've been focused on POPVOX lately, I haven't had a chance to
         build a new scraper for GovTrack, although in anticipation of this I've
         been working on reimplementing much of the same functionality on POPVOX.
         I'm not sure what if any of that code will be open, though we have an
         experimental API for it now.

         It would be helpful to know who else, if anyone, is using bill text so I
         can plan the future of GovTrack's bill text accordingly.

         But I will say that folks free riding on my data and using it to compete
         with my business (i.e. POPVOX) get no sympathy from me.

         - Josh Tauberer
         - GovTrack.us / POPVOX.com

         http://razor.occams.info | www.govtrack.us | www.popvox.com


         On 11/29/2011 02:12 AM, jlundigard wrote:
          > Hey all,
          >
          > We've noticed that we stopped receiving bill text from govtrack.
          > It seems to have stopped around this bill:
          >
          > http://www.govtrack.us/congress/bill.xpd?bill=s112-1788
          >
          > That bill and more recently introduced ones don't have any bill
          > text even though the text exists on the GPO website.
          >
          > Perhaps a scraper is down?
          >
          > Thanks,
          > Andy
          > OpenCongress.org

         ------------------------------------

         Yahoo! Groups Links

         <*> To visit your group on the web, go to:
         http://groups.yahoo.com/group/govtrack/

      --
      Developer | sunlightfoundation.com

    • Josh Tauberer
      Message 2 of 9 , Dec 10, 2011
        Hi, everyone.

        Bill text is updating now.

        Thanks to whoever here forwarded the problem on to GPO --- I got an
        email from someone at GPO who pointed me to their sitemap files, e.g.:
        http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_BILLS_sitemap.xml
        (warning: BIG file). I'm checking on how bills are split by year (by
        publication date?), but this seems to be the most helpful way to find
        them all.
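
A minimal sketch of reading such a sitemap, assuming the file follows the standard sitemaps.org schema with `<url><loc>` entries. The helper names are hypothetical, the URL pattern in `sitemap_url` is only inferred from the single example above, and the sample document uses made-up example.com addresses.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_url(year: int, collection: str) -> str:
    """Build a per-year sitemap URL (pattern is an assumption from one example)."""
    return f"http://www.gpo.gov/smap/fdsys/sitemap_{year}/{year}_{collection}_sitemap.xml"


def sitemap_locations(xml_text: str) -> List[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Made-up sample document for illustration; not real GPO data.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/doc-one</loc></url>
  <url><loc>http://example.com/doc-two</loc></url>
</urlset>"""

print(sitemap_url(2011, "BILLS"))
print(sitemap_locations(SAMPLE))
```

For a file as large as the one mentioned, `ET.iterparse` would avoid holding the whole tree in memory.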

        Btw, Eric- For indexing bill text, it might be better to use the
        original text files from GPO. The .txt files on GovTrack are generated
        using pdftotext and have line numbers, whereas the GPO original .txt
        files do not (I imagine they are generated from the XML or GPO locator
        codes files directly).

        I don't use my own .txt files except to display historical bill text,
        and unless there's an objection I could replace the pdftotext-generated
        files with the GPO original .txt files.

        Any objections from anyone?

        - Josh Tauberer
        - GovTrack.us / POPVOX.com

        http://razor.occams.info | www.govtrack.us | www.popvox.com

      • Eric Mill
        Message 3 of 9 , Dec 10, 2011
          I've been looking exactly for sitemap files like that! Would you mind
          sharing how we can find the different sitemaps?

          For example, I guessed at the URL for the one for public and private laws:
          http://www.gpo.gov/smap/fdsys/sitemap_2011/2011_PLAW_sitemap.xml

          But that file is very small and doesn't list what you would need to
          effectively spider the PLAW collection without scraping their HTML.

          As for text of bills -- I actually came to that realization yesterday
          myself, that the GPO .txt files were probably better. I definitely
          would not mind you switching over to them - I can adjust my regular
          expressions (just for sanitization, not extracting data) accordingly.
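
To illustrate the kind of sanitization in question: a rough sketch of stripping leading line numbers from pdftotext-style output. It assumes each numbered line begins with optional whitespace, a one- or two-digit number, and more whitespace; the real layout of the GovTrack .txt files may differ, and the name `strip_line_numbers` is hypothetical.

```python
import re

# Optional indent, a 1-2 digit line number, whitespace, then the real text.
LINE_NUMBER = re.compile(r"^[ \t]*\d{1,2}[ \t]+(?=\S)", re.MULTILINE)


def strip_line_numbers(text: str) -> str:
    """Remove a leading line number from every line that has one."""
    return LINE_NUMBER.sub("", text)


# Made-up sample in the assumed layout.
SAMPLE = " 1 SECTION 1. SHORT TITLE.\n 2 This Act may be cited as\n"
print(strip_line_numbers(SAMPLE))
```

One limitation of this approach: a line of bill text that itself starts with a short number followed by a space would also lose that number, which is one more reason the unnumbered GPO originals are easier to index.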

          -- Eric

          --
          Developer | sunlightfoundation.com